## Invented by Naveen Mellempudi, Dipankar Das, Intel Corp

# The Market For Scaling Half-Precision Floating Point Tensors For Training Deep Neural Networks

Neural networks require a range of numbers for training weights, activations (forward pass) and gradients (backpropagation). These require more precision than the standard single-precision FP32 format can provide.

Half-precision floating point tensors offer significant memory savings and increase arithmetic throughput. At present, mixed precision training is the most efficient way to speed up deep neural network training on modern hardware such as NVIDIA GPUs or Cloud TPUs.

## Market Overview

Recently, the market for Scaling half-precision floating point tensors for training deep neural networks has seen a tremendous growth spurt. This growth is being driven by an urgent need to train increasingly complex models in less time and due to more portable implementations that can be applied across numerous applications.

A common approach in deep learning has been using single-precision floating point (FP32) arithmetic to represent parameters. This data type uses 32 bits and offers a wider dynamic range than fixed point numbers can offer; however, it’s expensive and may cause performance issues during inference.

Due to this, more and more applications are transitioning away from FP32 to half-precision arithmetic, which uses only 16 bits. This format has lower memory requirements and can be utilized for larger models or minibatches.

This reduces the total memory required for training by half, eliminating the need to store a master copy of network weights and increasing model speed during back-propagation. Furthermore, activations from each layer are saved and reused during back propagation, eliminating extra memory requirements at this stage.

Though half-precision arithmetic cannot match FP32’s performance, it does provide significant computational speedups during training and evaluation of deep neural networks. This is because these operations are performed on programmable matrix multiplication/accumulate units called Tensor Cores, which on modern NVIDIA GPUs can deliver up to 120 teraFLOP s-1.

Performance of these operations is paramount for training a deep neural network successfully, yet they account for only a small fraction of total work done within a deep neural network compared to non-Tensor Core processes. Therefore, it’s essential to comprehend how these operations contribute to speedups within a DNN, and take steps to optimize other processes for improved overall efficiency.

Though half-precision tensor performance is improving, there remain some obstacles to overcome. One such difficulty lies in arithmetic intensity – how much computational work must be done per kernel per input byte.

## Scaling Floating Point Tensors

Scaling floating point tensors is one of the best methods for cutting back on memory consumption and training time when using deep neural networks. Not only does this save energy, which is great for the environment, but it can speed up various applications like speech recognition, image classification, generative models and industrial recommendation systems as well.

Half-precision floats (FP16 or BFLOAT16) are data types that occupy just 16 bits of memory but can handle wider dynamic ranges than integer or fixed point data types of the same size. This makes them ideal for applications involving complex operations or with large or unknown dynamic ranges.

This feature is essential for many deep learning applications, such as visual and object recognition, speech, text, games and medical imaging. It enables the creation of smaller and more compact models that can be utilized on devices with limited computational resources such as mobile phones or tablets.

FP16 also supports multiple quantization formats, which is advantageous for many AI applications that must convert tensors into their quantized form for faster inference. PyTorch and TensorFlow Lite both support both per-tensor symmetric quantization as well as per channel asymmetric quantization.

These formats can be employed both during training and post-training quantization to optimize model performance in various machine learning frameworks. This approach works particularly well on GPU-accelerated environments like deep learning GPUs.

While most GPUs are optimized for single precision FP32 calculations, mixed precision floats offer more versatility and efficiency when training models. This helps AI researchers and computer programmers balance accuracy against other costs such as memory space or CPU time.

When training deep neural networks, many calculations are performed on the floats representing inputs, activations and weights. Usually these operations use 32-bit floating point floats (FP32 in this article), which require more memory than half-precision floats which require only 16 bits.

The aim is to achieve the accuracy and results desired with minimal memory consumption and processing time. This can be accomplished by selecting a float format with the lowest number of exponent bits possible.

## Mixed Precision Floating Point Tensors

Deep neural networks are becoming a staple for many applications such as computer vision, recommendation engines and language processing. Unfortunately, their larger models require more memory and processing power – necessitating more resources over time.

To combat this issue, researchers have devised techniques for training these models faster. One popular option is mixed precision training, which utilizes half-precision floating point tensors instead of the standard 32-bit FP32 floats used in traditional machine learning applications.

Tensors can reduce memory requirements and accelerate arithmetic on modern GPUs, but they may also lead to errors. Because floating point tensors are not deterministic, errors may occur at any time during training.

Mixed precision in deep learning training can help minimize errors while still delivering high performance during evaluation. This approach is especially advantageous on NVIDIA Volta and Turing architectures, which feature specialized tensor cores.

NVIDIA Volta GPUs such as the Tesla V100 and Quadro V100 feature 640 tensor cores and can deliver up to TFLOPS of mixed FP16-FP32 precision performance. Visit Paperspace GPU Cloud Comparison for a full list of Nvidia Volta-based GPUs along with their performance values.

When creating DNNs, four types of tensors must be taken into consideration: inputs, activations, weights and gradients. From our experience, most of these tensors fall within the range of value magnitudes that can be represented using half-precision format.

For those tensors that do not fit within this range, the simplest solution is to multiply them with a scaling factor (S) before performing a weight update. Doing this allows you to scale gradients up or down at no additional expense.

Loss scaling, a more complex technique for error reduction, is also available to minimize errors. This method takes place during the training step and guarantees that relevant gradient values are recovered.

Loss scaling can either be included in the weight update process or done separately. Either way, gradients must be multiplied by S before being stored in memory.

This technique can enhance model performance by up to three times on modern GPUs and 60% on TPUs, when enabled with the mixed_float16 policy in TensorFlow and PyTorch.

## Floating Point Tensor Formats

The number of floating point tensor formats is vast, yet only a few have become widely used for machine-learning development. Unfortunately, these formats usually need the correct hardware and firmware support in order to run efficiently.

Floating point formats are numerical representations with two bits per digit, typically expressed as bp for binary or 10b+p for decimal. Each bit can be represented either zero or one and all IEEE 754 format types have an emin emax value.

Contrary to popular belief, lower precision computations are not essential for deep learning. In fact, some models can achieve higher accuracy with lower precision calculations. Not only is reduced precision faster and takes up less memory, but it also cuts down on communication time.

Research publications have demonstrated that reduced-precision computations are more efficient than high-precision ones when training and inferring neural networks [3, 4]. Post-training quantization also benefits from this, which is why many AI applications use one of a few integer quantization techniques in floating point formats like Google’s bfloat16 or IEEE’s FP16.

Flexpoint is another promising floating point format, featuring a 16-bit mantissa and 5-bit shared exponent for improved computational efficiency. Not only does this reduce memory and bandwidth requirements, but Flexpoint also uses fixed-point multipliers to simplify training computations.

To reduce overflow, a dequeue is used to store an historical history of G values (the maximum absolute value of each tensor entry) over time. During training, these values are polled periodically to update the exponent. Based on statistical models, forecasted trends in G values can then be estimated and the exponent either increased preemptively to prevent overflow or decreased if projected decline is anticipated.

Flexpoint reduces bandwidth requirements due to the fact that an exponent can be amortized over all entries in a tensor, as shown in Fig. 2. Furthermore, the auto-flex algorithm prevents overflows by monitoring a bounded history of tensor G values and adjusting the exponent accordingly for improved training efficiency.

## The Intel Corp invention works as follows

One embodiment includes a machine learning accelerator device that executes parallel threads in an instruction stream. The multiprocessor includes a compute unit and a set functional units. Each functional unit executes at least one thread of the instruction stream. The compute unit contains compute logic that executes a single instruction to scale an input tensor associated to a layer of a neural net according to a scale factor. The input tensor is stored in floating-point data types, and the compute logic to scale it to allow data distribution of the input data.### Background for Scaling half-precision floating point tensors for training deep neural networks

System Overview

Techniques to Interconnect GPU and Host Processors

Graphics Processing Pipeline.

Low Precision Training for Deep Neural Networks

Scaling Half Precision Floating Point Tensors to Train Deep Neural Networks

Dynamic scaling algorithm:

Machine Learning Overview

GPGPU Machine Learning Acceleration

Machine Learning Neural Network Implementations

Exemplary Machine Learning Apps

Additional Exemplary Grafiks Processing System

Additional Exemplary Grafiks Processing System Overview

Graphics Processing Engine.

Execution Units

Graphics Pipeline

Graphics Pipeline Programming

Graphics Software Architecture.

IP Core Implementations

Exemplary system on a Chip Integrated Circuit.

Click here to view the patent on Google Patents.