TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC Up to 20x


As with all computing, you’ve got to get your math right to do AI well. Because deep learning is a young field, there’s still a lively debate about which types of math are needed, for both training and inferencing.
In November, we explained the differences among popular formats such as single-, double-, half-, multi- and mixed-precision math used in AI and high-performance computing. Today, the NVIDIA Ampere architecture introduces a new approach for improving training performance on the single-precision models widely used for AI.
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations, used at the heart of AI and certain HPC applications. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. Combining TF32 with structured sparsity on the A100 enables performance gains over Volta of up to 20x.
Understanding the New Math
It helps to step back for a second to see how TF32 works and where it fits.
Math formats are like rulers. The number of bits in a format's exponent determines its range: how large an object it can measure. Its precision, how fine the lines on the ruler are, comes from the number of bits used for its mantissa, the part of a floating-point number that follows the radix (decimal) point.
A good format strikes a balance. It should use enough bits to deliver precision without using so many it slows processing and bloats memory.
The chart below shows how TF32 is a hybrid that strikes this balance for tensor operations.
TF32 strikes a balance that delivers performance with range and accuracy. TF32 uses the same 10-bit mantissa as half-precision (FP16) math, which has been shown to provide more than sufficient margin for the precision requirements of AI workloads. And TF32 adopts the same 8-bit exponent as FP32, so it can support the same numeric range.
The combination makes TF32 a great alternative to FP32 for crunching through single-precision math, specifically the massive multiply-accumulate functions at the heart of deep learning and many HPC apps.
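To make the bit layout concrete, here is a minimal Python sketch (not NVIDIA code) that emulates TF32's precision by rounding an FP32 value's 23-bit mantissa down to the 10 bits TF32 keeps, while leaving the 8-bit exponent untouched. The function name and the simple round-half-up scheme are illustrative assumptions; in practice the conversion happens inside the Tensor Cores themselves.

```python
import struct

def round_to_tf32(x: float) -> float:
    """Emulate TF32 precision: keep FP32's 8-bit exponent, but reduce the
    23-bit mantissa to 10 bits using a simple round-half-up."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # reinterpret FP32 as uint32
    bits += 1 << 12                 # add half of the discarded range (rounding offset)
    bits &= ~((1 << 13) - 1)        # clear the 13 low mantissa bits (23 - 10 = 13)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(round_to_tf32(1.0000001))    # 1.0 -- the difference sits below TF32's 10-bit mantissa
print(round_to_tf32(3.141592653))  # 3.140625 -- about 3 decimal digits, since 2**-10 ~ 0.001
```

Values that differ only below the 10th mantissa bit collapse to the same TF32 number, which is exactly the precision trade-off the format accepts in exchange for Tensor Core throughput.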
Applications using NVIDIA libraries enable users to harness the benefits of TF32 with no code change required. TF32 Tensor Cores operate on FP32 inputs and produce results in FP32. Non-matrix operations continue to use FP32.
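As an illustration of what "no code change required" looks like in practice, here is a minimal PyTorch sketch; PyTorch is just one example framework (the post does not name it), and its allow_tf32 flags simply toggle whether FP32 matrix math is routed through TF32 Tensor Cores (the defaults vary by PyTorch version).

```python
import torch

# Allow FP32 matrix multiplies and cuDNN convolutions to run on TF32 Tensor Cores.
# On Ampere GPUs these flags are the only switch; the model code itself is unchanged.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")  # ordinary FP32 tensors
b = torch.randn(4096, 4096, device="cuda")
c = a @ b                                   # executed on TF32 Tensor Cores where available
print(c.dtype)                              # torch.float32 -- inputs and results stay FP32
```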
For maximum performance, the A100 also has enhanced 16-bit math capabilities. It supports both FP16 and Bfloat16 (BF16) at double the rate of TF32. Employing Automatic Mixed Precision, users can get a further 2x higher performance with just a few lines of code.
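As a sketch of those few lines, here is what Automatic Mixed Precision typically looks like with PyTorch's torch.cuda.amp API; the toy model, optimizer, and data below are placeholders, and other frameworks expose AMP through their own hooks.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()                # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # placeholder optimizer
scaler = torch.cuda.amp.GradScaler()                      # scales the loss so 16-bit gradients don't underflow

for _ in range(10):
    inputs = torch.randn(64, 1024, device="cuda")         # placeholder batch
    targets = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                       # eligible ops run in 16-bit, the rest stay FP32
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()                         # backward pass on the scaled loss
    scaler.step(optimizer)                                # unscales gradients, then takes the optimizer step
    scaler.update()                                       # adapts the loss scale for the next iteration
```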
TF32 Is Demonstrating Great Results Today
Compared to FP32, TF32 shows a 6x speedup training BERT, one of today's most demanding conversational AI models. Application-level results on other AI training and HPC apps that rely on matrix...
