New work from Zhu Jun’s team at Tsinghua University: training Transformers with 4-bit integers, with operators up to 2.2 times faster than FP16 and training up to 35.1% faster, accelerating the arrival of AGI!
Quantizing activations, weights, and gradients to 4 bits promises to speed up neural network training.
However, existing 4-bit training methods require custom number formats that are not supported by modern hardware.
Recently, Zhu Jun’s team at Tsinghua University proposed a Transformer training method that implements all matrix multiplications with INT4 arithmetic.
Training at the ultra-low INT4 precision is very challenging. To achieve it, the researchers carefully analyzed the specific structures of activations and gradients in Transformers and proposed dedicated quantizers for them.
For forward propagation, the researchers identified the challenge of outliers and proposed the Hadamard quantizer to suppress outliers.
For backward propagation, they exploit the structural sparsity of gradients by proposing bit splitting and leverage score sampling techniques to quantize gradients accurately.
This new algorithm achieves competitive accuracy on a wide range of tasks, including natural language understanding, machine translation, and image classification.
The prototype linear operator is up to 2.2 times faster than its FP16 counterpart, and it speeds up training by up to 35.1%.
Paper address: https://arxiv.org/abs/2306.11987
Code address: https://github.com/xijiu9/Train_Transformers_with_INT4
Training neural networks is computationally demanding. Training with low-precision arithmetic (fully quantized training, FQT) is expected to improve computational and memory efficiency.
FQT methods add quantizers and dequantizers to the original full-precision computation graph and replace expensive floating-point operations with cheaper low-precision ones.
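The quantize, compute, dequantize pattern described above can be sketched in plain Python. This is only a minimal illustration: the helper names (`quantize`, `int_matmul`, `quantized_linear`) and the single per-tensor scale are assumptions made for this example, not the quantizers used in the paper.

```python
def quantize(matrix, scale, bits=4):
    """Round each entry to a signed integer grid with step size `scale`."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # INT4 range: [-8, 7]
    # Note: Python's round() rounds ties to the nearest even integer.
    return [[max(lo, min(hi, round(v / scale))) for v in row] for row in matrix]


def int_matmul(x_int, w_int):
    """Integer product X W^T, with both operands stored row by row."""
    return [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w_int]
            for x_row in x_int]


def quantized_linear(x, w, scale_x, scale_w):
    """Approximate Y = X W^T: quantize, multiply in integers, dequantize."""
    y_int = int_matmul(quantize(x, scale_x), quantize(w, scale_w))
    return [[v * scale_x * scale_w for v in row] for row in y_int]


x = [[0.5, -1.0], [1.5, 0.25]]
w = [[1.0, 0.5], [-0.5, 2.0]]
y_approx = quantized_linear(x, w, scale_x=0.25, scale_w=0.5)
```

In this toy case every entry lies exactly on the quantization grid, so the result matches the full-precision product. With real activations the rounding and clipping introduce error, which is what the quantizer design has to control.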
Research on FQT aims to reduce the numerical precision of training without sacrificing too much convergence speed or accuracy.
The required numerical precision has been reduced from FP16 to FP8, INT32+INT8, and INT8+INT5.
FP8 training is implemented in the Nvidia H100 GPU with the Transformer Engine, accelerating the training of large-scale Transformers. Recently, the numerical precision of training has been pushed down to 4 bits.
However, these 4-bit training methods cannot be used directly for acceleration because they require custom number formats, which are not supported by modern hardware.
Training at such low precision raises two optimization problems. First, the non-differentiable quantizer in forward propagation makes the loss landscape rugged, so a gradient-based optimizer can easily get stuck in a local optimum.
Second, the gradients are computed only approximately in low precision. Such imprecise gradients can slow down training and even cause it to become unstable or diverge.
In this work, the researchers proposed a novel INT4 training algorithm for Transformers.
All costly linear operations in Transformer training can be written in the form of matrix multiplication (MM).
This MM form makes it possible to design more flexible quantizers that better approximate FP32 matrix multiplication by exploiting the specific structures of the activations, weights, and gradients in Transformers.
The quantizers also draw on advances in randomized numerical linear algebra (RandNLA).
For forward propagation, the researchers found that outliers in the activations are the main cause of accuracy degradation.
To suppress outliers, they proposed the Hadamard quantizer, which quantizes a transformed version of the activation matrix. The transformation is a block-diagonal Hadamard matrix, which spreads the information carried by outliers to nearby matrix entries and thereby narrows the numerical range of the outliers.
For backpropagation, they exploit the structural sparsity of the activation gradients. The researchers found that the gradients of a few tokens are extremely large, while the gradients of most other tokens are very small, even smaller than the quantization residuals of the larger gradients.
Therefore, rather than computing these small gradients, it is better to spend the saved computation on the residuals of the larger gradients.
To take advantage of this sparsity, the researchers proposed bit splitting, which splits the gradient of each token into its higher 4 bits and lower 4 bits.
Then, the most informative gradients are selected through leverage score sampling, an importance sampling technique from RandNLA.
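To make the idea of splitting a gradient value into higher and lower 4 bits concrete, here is a small sketch on already-quantized 8-bit integers. The encoding chosen here (a signed high nibble and an unsigned low nibble) and the function names are assumptions for illustration; the paper's exact representation may differ.

```python
def split_bits(q):
    """Split a signed 8-bit integer into a high and a low 4-bit part."""
    assert -128 <= q <= 127
    high = q >> 4   # arithmetic shift, i.e. floor(q / 16), in [-8, 7]
    low = q & 0xF   # remaining 4 bits, in [0, 15]
    return high, low


def merge_bits(high, low):
    """Recombine the two 4-bit parts into the original 8-bit value."""
    return 16 * high + low


grad_int8 = [100, -37, 5, -128, 127]
parts = [split_bits(q) for q in grad_int8]
```

The high part alone is a coarse 4-bit approximation of the value, and the low part is the residual that refines it. Rows whose high part is already negligible are natural candidates for being skipped.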
Combining the quantization techniques for forward and backward propagation, the researchers proposed an algorithm that uses INT4 MMs for all linear operations in Transformers, and evaluated it by training Transformers on a variety of tasks, including natural language understanding, question answering, machine translation, and image classification.
Their algorithm achieves competitive or higher accuracy compared to existing 4-bit training algorithms.
Additionally, this algorithm is compatible with contemporary hardware such as GPUs, as it does not require custom number formats such as FP4 or logarithmic formats.
Their prototype quantized INT4 MM operator is up to 2.2 times faster than the FP16 MM baseline and speeds up training by up to 35.1%.
Fully quantized training (FQT) methods quantize activations, weights, and gradients to low precision to speed up training, so that the linear and nonlinear operators used during training can be implemented with low-precision arithmetic.
FQT research has designed novel numerical formats and quantization algorithms that can better approximate full-precision tensors.
The current research frontier is 4-bit FQT, which is challenging because of the wide numerical range of gradients and the optimization problems of training a quantized network from scratch.
Because of these challenges, existing 4-bit FQT algorithms still lose 1-2.5% accuracy on some tasks and are not supported by contemporary hardware.
Mixture-of-experts improves model capacity without increasing the training budget.
Structural dropout exploits computationally efficient ways to regularize the model. Efficient attention reduces the quadratic time complexity of computing attention.
Distributed training systems reduce training time by using more computing resources.
The researchers’ work on reducing numerical precision is orthogonal to these directions.
Forward propagation
Neural network training is an iterative optimization process that computes stochastic gradients through forward and backward propagation.
The research team uses 4-bit integer (INT4) arithmetic to accelerate forward and backward propagation.
Forward propagation can be implemented as a combination of linear and nonlinear (GeLU, normalization, softmax, etc.) operators.
During training, all linear operators are accelerated with INT4 arithmetic, and all the computationally cheaper nonlinear operators are kept in 16-bit floating point (FP16) format.
All linear operations in Transformers can be written in the form of matrix multiplication (MM).
For ease of presentation, the paper considers the acceleration of the following simple matrix multiplication:
Y = X W^T
The main use case of this kind of MM is the fully connected layer.
Consider a Transformer whose input shape is (batch size S, sequence length T, dimension D).
The fully connected layer can be expressed by the formula above, where X is the activation of N = ST tokens and W is the weight matrix.
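The shape bookkeeping can be illustrated with a short sketch. It assumes the layer computes Y = X W^T with a weight matrix of shape (C, D); the output dimension C and the helper names are introduced here for illustration only.

```python
def flatten_tokens(batch):
    """Collapse a (S, T, D) nested list into an (N, D) list with N = S * T."""
    return [token for sequence in batch for token in sequence]


def linear(x, w):
    """Y = X W^T, where x has shape (N, D) and w has shape (C, D)."""
    return [[sum(a * b for a, b in zip(x_row, w_row)) for w_row in w]
            for x_row in x]


S, T, D, C = 2, 3, 2, 4  # batch size, sequence length, input dim, output dim
batch = [[[float(s + t + d) for d in range(D)] for t in range(T)]
         for s in range(S)]
w = [[1.0 if c % D == d else 0.0 for d in range(D)] for c in range(C)]

x = flatten_tokens(batch)  # shape (N, D) = (6, 2)
y = linear(x, w)           # shape (N, C) = (6, 4)
```

Every token in the batch becomes one row of X, so a single matrix multiplication covers the whole batch.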
For attention layers, batched matrix multiplications (BMMs) may be required.
The proposed techniques can also be applied to BMMs.
In order to speed up training, integer operations must be used to calculate forward propagation.
The researchers used the learned step size quantizer (LSQ) for this purpose.
LSQ is a static quantization method: its quantization scale does not depend on the input, so it is cheaper than dynamic quantization methods, which need to compute the quantization scale dynamically in every iteration.
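The difference between a static and a dynamic scale can be sketched as follows. This is a simplified illustration, not the LSQ implementation: the class and function names are hypothetical, and the gradient update that LSQ applies to its step size is omitted, so the static scale simply stays at its initial value.

```python
def quantize_int4(values, scale):
    """Map floats to INT4 codes in [-8, 7] using the given step size."""
    return [max(-8, min(7, round(v / scale))) for v in values]


def dynamic_scale(values):
    """Dynamic quantization: derive the step size from the current input."""
    return max(abs(v) for v in values) / 7


class StaticQuantizer:
    """Static quantization: the step size is a stored parameter.

    In LSQ this parameter is learned during training; here it is fixed.
    """

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, values):
        return quantize_int4(values, self.scale)


static_q = StaticQuantizer(scale=0.5)
activations = [0.4, -1.1, 14.0]  # the last entry is an outlier

codes_static = static_q(activations)  # no pass over the data for the scale
codes_dynamic = quantize_int4(activations, dynamic_scale(activations))
```

With the static scale the outlier is clipped to the largest code, while the dynamic scale stretches the grid to cover it at the cost of representing the small entries more coarsely and of an extra pass over the input.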
Activation outliers
Simply applying LSQ to FQT with 4-bit activations and weights can degrade accuracy because of activation outliers.
Picture
As shown in the figure above, the activations have some outlier entries that are much larger in magnitude than the other entries.
Unfortunately, Transformers tend to store information in these outliers, so truncating them to fit the narrow INT4 range can seriously hurt accuracy.
The outlier problem is particularly pronounced when fine-tuning a pre-trained model on new downstream tasks, because a pre-trained model contains more outliers than a randomly initialized one.
The researchers propose Hadamard quantization (HQ) to solve the outlier problem.
The main idea is to quantize the matrices in another linear space that has fewer outliers.
The outliers in the activation matrix form a feature-wise structure.
They are usually concentrated in a few dimensions, that is, only a few columns in X are significantly larger than other columns.
The Hadamard transform is a linear transformation that spreads outliers across other entries.
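The spreading effect can be seen in a small sketch of a block-wise, orthonormal Walsh-Hadamard transform applied to one row of activations. The function names and the block size of 4 are choices made for this example and are not taken from the paper's code.

```python
import math


def hadamard_transform(vec):
    """Orthonormal Walsh-Hadamard transform; length must be a power of two."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def block_hadamard(row, block_size):
    """Transform each block of a row independently (block-diagonal form)."""
    assert len(row) % block_size == 0
    result = []
    for start in range(0, len(row), block_size):
        result.extend(hadamard_transform(row[start:start + block_size]))
    return result


row = [16.0, 0.0, 0.0, 0.0, 1.0, 1.0, -1.0, 1.0]  # first entry is an outlier
spread = block_hadamard(row, block_size=4)
restored = block_hadamard(spread, block_size=4)   # the transform is its own inverse
```

The outlier of magnitude 16 becomes four entries of magnitude 8, so the range a quantizer has to cover is halved. Because the normalized transform is orthogonal, applying it to both X and W leaves the product X W^T unchanged, which is what makes quantizing in the transformed space possible.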
Backpropagation
The next step is to use INT4 operations to speed up the backward propagation of linear layers.
This section discusses the computation of the activation gradient and the weight gradient.
The researchers observed that the gradient matrix is often very sparse during training.
Moreover, the sparsity is structured: a few rows (i.e., tokens) of the gradient matrix have large entries, while most other rows are close to all-zero vectors.
This structural sparsity results from the severe over-parameterization of modern neural networks.
The network operates in the over-parameterized regime for almost the entire training process and, except for a few hard examples, fits most of the training data well.
Therefore, for well-fitted data points, the (activation) gradient will be close to zero.
The researchers found that for pre-training tasks, structural sparsity emerges quickly after only a few training epochs.
For fine-tuning tasks, the gradient is always sparse throughout the training process.
How can a gradient quantizer be designed to exploit this structural sparsity and compute the MMs in backpropagation accurately?
The high-level idea is that many rows of the gradient are so small that they have little impact on the parameter gradients, yet they consume a large share of the computation.
On the other hand, the large rows cannot be represented accurately in INT4.
The approach is to drop some of the small rows and use the saved computation to represent the large rows more accurately.
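One way to drop small rows without biasing the result is importance sampling by row norm, in the spirit of leverage score sampling. The sketch below is an assumption-laden illustration: the function names, the `budget` parameter, and the rule of keeping row i with probability proportional to its norm are choices made here, and the paper's exact scores and sampling rule may differ.

```python
import math
import random


def keep_probabilities(matrix, budget):
    """Probability of keeping each row, proportional to its norm, capped at 1."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in matrix]
    total = sum(norms)
    if total == 0:
        return [0.0] * len(matrix)
    return [min(1.0, budget * n / total) for n in norms]


def sample_rows(matrix, budget, rng):
    """Keep row i with probability p_i and rescale it by 1 / p_i.

    Dropped rows are zeroed. The rescaling keeps the expectation of each
    row equal to the original row, so the estimate is unbiased.
    """
    probs = keep_probabilities(matrix, budget)
    sampled = []
    for row, p in zip(matrix, probs):
        if p > 0 and rng.random() < p:
            sampled.append([v / p for v in row])
        else:
            sampled.append([0.0] * len(row))
    return sampled


grad = [[4.0, 3.0], [0.0, 0.0], [0.03, 0.04], [0.0, 0.0]]
probs = keep_probabilities(grad, budget=1)
sparse_grad = sample_rows(grad, budget=1, rng=random.Random(0))
```

Here the single large row is kept almost surely, while the tiny row is dropped most of the time. The expected number of kept rows is at most `budget`, which is where the saved computation comes from.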
Experiments
The researchers evaluated their INT4 training algorithm on a variety of tasks, including language model fine-tuning, machine translation, and image classification.
They implemented the proposed HQ-MM and LSS-MM algorithms with CUDA and CUTLASS.
They replaced all floating-point linear operators with the INT4 implementation, except that they simply used LSQ for the embedding layers and kept the last classifier layer in full precision.
Finally, the researchers adopted the default architecture, optimizer, scheduler, and hyperparameters for all evaluated models.
The accuracy of the converged models on the various tasks is compared in the table below.
Picture
The compared methods include full-precision training (FP), INT8 training (INT8), FP4 training ("Ultra-low"), 4-bit logarithmic quantization with LSQ for activations and weights (LSQ+LUQ), and the researchers’ algorithm, which uses HQ for forward propagation and LSS for backpropagation (HQ+LSS).
"Ultra-low" has no public implementation, so only the performance reported in its original paper on the machine translation task is listed.
With the exception of the large machine translation task and the large vision Transformer task, each run was repeated three times, and the standard deviation is reported as a subscript in the table.
The researchers did not perform any type of knowledge distillation or data augmentation.
The ablation experiment conducted by the researchers was designed to demonstrate the effectiveness of the forward and backward methods.
To study the effect of different quantizers on forward propagation, backpropagation was left in FP16.
The results are shown below.
Picture
Finally, the researchers demonstrated the potential of their approach to accelerate neural network training by evaluating their prototype implementation.
The implementation is not yet fully optimized.
The researchers also did not fuse the linear operators with nonlinearities and normalizations.
Therefore, the results do not fully reflect the potential of INT4 training algorithms.
A fully optimized implementation requires extensive engineering and is beyond the scope of the paper.
The researchers proposed a hardware-friendly INT4 training method for Transformers.
By analyzing the properties of MMs in Transformers, they proposed the HQ and LSS methods to quantize activations and gradients while maintaining accuracy.
On several important tasks, the method performs comparably to or better than existing INT4 methods.
The work could be extended to other MM-based architectures besides Transformers, such as MLP-Mixer, graph neural networks, and recurrent neural networks.
This is their future research direction.
Wider impact: The researchers’ algorithm can improve efficiency and reduce the energy consumption of training neural networks, which could help reduce the carbon emissions caused by deep learning.
However, efficient training algorithms may also facilitate the development of large language models and malicious artificial intelligence applications that pose risks to human safety, for example models and applications used to generate false content.
Limitations: The main limitation of this work is that it can only accelerate larger models dominated by large matrix multiplications (linear layers) and cannot speed up convolutional layers.
Moreover, the proposed method does not yet work well for very large models such as OPT-175B.
As far as the researchers know, even INT8 training is still an unsolved problem for these very large models.