New work from Zhu Jun's team at Tsinghua University: training Transformers with 4-bit integers, with operators 2.2x faster than FP16 and training 35.1% faster, accelerating the arrival of AGI!

Quantizing activations, weights, and gradients to 4 bits promises to speed up neural network training.

However, existing 4-bit training methods require custom number formats that are not supported by modern hardware.

Recently, Zhu Jun's team at Tsinghua proposed a Transformer training method that implements all matrix multiplications with INT4 arithmetic.

Training at ultra-low INT4 precision is very challenging. To achieve this goal, the researchers carefully analyzed the specific structures of activations and gradients in Transformers and proposed dedicated quantizers for them.

For forward propagation, the researchers identified activation outliers as the main challenge and proposed a Hadamard quantizer to suppress them.

For backward propagation, they exploit the structural sparsity of gradients by proposing bit splitting and use leverage score sampling to quantize gradients accurately.

This new algorithm achieves competitive accuracy on a wide range of tasks, including natural language understanding, machine translation, and image classification.

The prototype linear operators are 2.2 times faster than their FP16 counterparts, and training speed is increased by 35.1%.


Paper address: https://arxiv.org/abs/2306.11987

Code address: https://github.com/xijiu9/Train_Transformers_with_INT4

New INT4 training algorithm

Training neural networks is very computationally demanding. Training with low-precision arithmetic (fully quantized training, FQT) is expected to improve computational and memory efficiency.

FQT methods add quantizers and dequantizers to the original full-precision computation graph and replace the expensive floating-point operations with cheaper low-precision operations.
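To illustrate the pattern (a minimal, hypothetical PyTorch sketch, not the paper's implementation), the snippet below quantizes both inputs of a linear layer to 4-bit integer codes, multiplies the codes, and dequantizes the result:

```python
import torch

def quantize(x: torch.Tensor, num_bits: int = 4):
    """Symmetric per-tensor quantization: integer codes plus a scale (illustrative only)."""
    qmax = 2 ** (num_bits - 1) - 1            # 7 for INT4
    scale = x.abs().max() / qmax + 1e-12
    codes = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return codes, scale

def quantized_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """FQT-style linear layer: quantize -> low-precision MM -> dequantize."""
    qx, sx = quantize(x)
    qw, sw = quantize(w)
    # On real hardware this product would run on integer tensor cores;
    # here it is emulated in floating point for clarity.
    return (qx @ qw) * (sx * sw)

x, w = torch.randn(8, 16), torch.randn(16, 32)
print((quantized_linear(x, w) - x @ w).abs().mean())   # quantization error of the MM
```

A full FQT system applies the same idea to the backward pass as well; this toy version only shows the forward pattern.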

Research on FQT aims to reduce training numerical accuracy without sacrificing too much convergence speed or accuracy.

The required numerical precision has been reduced from FP16 to FP8, INT32+INT8, and INT8+INT5.

FP8 training is implemented on Nvidia's H100 GPU with its Transformer Engine, accelerating the training of large-scale Transformers. More recently, training numerical precision has dropped to 4 bits.

However, these 4-bit training methods cannot be used directly for acceleration because they require custom number formats, which are not supported by modern hardware.

First, the non-differentiable quantizers in forward propagation make the loss landscape rugged, so gradient-based optimizers can easily fall into local optima.

Second, gradients are only computed approximately at low precision. Such imprecise gradients can slow down training and even cause it to become unstable or diverge.

In this work, the researchers proposed a novel INT4 training algorithm for Transformers.


All costly linear operations in Transformer training can be written in the form of matrix multiplication (MM).

This MM form allows the design of more flexible quantizers, which better approximate FP32 matrix multiplication by exploiting the specific structure of activations, weights, and gradients in the Transformer.

These quantizers fully exploit advances in randomized numerical linear algebra (RandNLA).

For forward propagation, the researchers found that outliers in the activations are the main cause of accuracy degradation.

To suppress outliers, they proposed the Hadamard quantizer, which quantizes a transformed version of the activation matrix. The transformation is a block-diagonal Hadamard matrix, which spreads the information carried by the outliers to neighboring entries of the matrix, thereby narrowing the numerical range of the outliers.

For backpropagation, they exploit the structural sparsity of the activation gradients. The researchers found that a few tokens have very large gradients, while the gradients of most other tokens are very small, even smaller than the quantization residuals of the large gradients.


Therefore, rather than computing all of these small gradients, it is better to spend the computational resources on computing the residuals of the larger gradients.

To take advantage of this sparsity, the researchers proposed bit splitting, which splits the gradient of each token into a high 4-bit part and a low 4-bit part.

Then, the most informative gradients are selected through leverage score sampling, an importance sampling technique from RandNLA.


Combining the forward- and backward-propagation quantization techniques, the researchers proposed an algorithm that uses INT4 MM for all linear operations in the Transformer, and evaluated it on a variety of tasks, including natural language understanding, question answering, machine translation, and image classification.

Their algorithm achieves competitive or higher accuracy compared to existing 4-bit training algorithms.

Additionally, this algorithm is compatible with contemporary hardware such as GPUs, as it does not require custom number formats such as FP4 or logarithmic formats.

Their prototype quantized INT4 MM operator is 2.2 times faster than the FP16 MM baseline and increases training speed by 35.1%.

Related Work

Fully Quantized Training

Fully quantized training (FQT) methods quantize activations, weights, and gradients to low precision to speed up training, so that both linear and nonlinear operators during training can be implemented with low-precision arithmetic.

FQT research has designed novel numerical formats and quantization algorithms that can better approximate full-precision tensors.

The current research frontier is 4-bit FQT. FQT is challenging due to the large numerical range of gradients and the optimization problem of training a quantized network from scratch.

Due to these challenges, existing 4-bit FQT algorithms still suffer from a 1-2.5% accuracy loss on some tasks and are not supported by contemporary hardware.


Other efficient training methods

Mixture-of-experts improves model capacity without increasing the training budget.

Structured dropout provides computationally efficient ways to regularize the model. Efficient attention reduces the quadratic time complexity of computing attention.

Distributed training systems reduce training time by using more computing resources.

The researchers' work on reducing numerical precision is orthogonal to these directions.


Forward propagation

Neural network training is an iterative optimization process that computes stochastic gradients through forward and backward propagation.

The research team uses 4-bit integer (INT4) arithmetic to accelerate forward and backward propagation.

Forward propagation can be implemented with a combination of linear and nonlinear (GeLU, normalization, softmax, etc.) operators.

During training, we accelerate all linear operators with INT4 arithmetic and keep all less computationally intensive nonlinear operators in 16-bit floating-point (FP16) format.

All linear operations in Transformer can be written in the form of matrix multiplication (MM).

For ease of exposition, this article considers accelerating a simple matrix multiplication of the following form:

Y = XW

The main use case of this kind of MM is the fully connected layer.

Consider a Transformer whose input shape is (batch size S, sequence length T, dimension D).

The fully connected layer can be expressed in this form, where X is the activation matrix of the N = S*T tokens and W is the weight matrix.

For attention layers, batch matrix multiplications (BMMs) may be required.

The proposed techniques can also be applied to BMMs.
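As a concrete illustration of this MM view (the sizes and variable names below are made up for the sketch, not taken from the paper's code), the (S, T, D) activation tensor is flattened into an N x D matrix with N = S*T before the weight multiplication:

```python
import torch

S, T, D, C = 4, 128, 768, 3072        # illustrative batch, sequence, and hidden sizes

x = torch.randn(S, T, D)              # activations entering a fully connected layer
w = torch.randn(D, C)                 # weight matrix

# Every token becomes one row, so the fully connected layer is a single
# (S*T) x D times D x C matrix multiplication: Y = X W.
x_mat = x.reshape(S * T, D)
y = (x_mat @ w).reshape(S, T, C)
```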

Learned Step Size Quantization

To speed up training, forward propagation must be computed with integer operations.

The researchers use the learned step size quantizer (LSQ) for this purpose.

LSQ is a static quantization method: its quantization scale does not depend on the input, so it is cheaper than dynamic quantization methods, which must recompute the quantization scale in every iteration.
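A minimal LSQ-style module might look like the sketch below (an illustrative PyTorch version under my own assumptions; it omits the gradient-scaling term of the original LSQ paper and is not the authors' code). The step size is a learnable parameter, and a straight-through estimator lets gradients pass through the rounding:

```python
import torch
import torch.nn as nn

class LSQQuantizer(nn.Module):
    """Minimal learned-step-size quantizer sketch for signed INT4."""
    def __init__(self, init_step: float = 0.1, num_bits: int = 4):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(init_step))   # learned quantization scale
        self.qn = -(2 ** (num_bits - 1))                     # -8 for INT4
        self.qp = 2 ** (num_bits - 1) - 1                    #  7 for INT4

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scaled = torch.clamp(x / self.step, self.qn, self.qp)
        # Straight-through estimator: round in the forward pass, identity in the backward pass.
        rounded = scaled + (scaled.round() - scaled).detach()
        return rounded * self.step                           # dequantized output

quant = LSQQuantizer()
x = torch.randn(16, 16, requires_grad=True)
quant(x).sum().backward()    # gradients reach both x and the learned step size
```

Because the scale is learned offline rather than recomputed from each input, no per-batch statistics are needed at run time, which is what makes the quantizer cheap.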

Activation outliers

Simply applying LSQ for 4-bit FQT of activations and weights leads to decreased accuracy because of activation outliers.

[Figure: distribution of activation values, showing a few outlier entries far larger than the rest]

As shown in the figure above, the activations contain a few outlier entries that are much larger than the other entries.

Unfortunately, Transformers tend to store information in these outliers, and truncating them during quantization seriously hurts accuracy.

The outlier problem is particularly pronounced when the training task is fine-tuning a pre-trained model on new downstream tasks, because the pre-trained model contains more outliers than a randomly initialized one.

Hadamard Quantization

We propose Hadamard quantization (HQ) to solve the outlier problem.

The main idea is to quantize the matrix in another linear space that has fewer outliers.

The outliers in the activation matrix form a feature-wise structure.

They are usually concentrated in a few dimensions, that is, only a few columns in X are significantly larger than other columns.

The Hadamard transform is a linear transformation that spreads the outliers across the other entries.
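To make the idea concrete, here is a simplified sketch of the Hadamard trick (my own illustration, not the paper's HQ-MM kernel): each block of k columns of X is multiplied by a normalized k x k Hadamard matrix before quantization, and because the transform is orthogonal, the product can be recovered as (XH)(H^T W) = XW.

```python
import torch

def hadamard(k: int) -> torch.Tensor:
    """Normalized k x k Hadamard matrix (k must be a power of two), Sylvester construction."""
    h = torch.ones(1, 1)
    while h.shape[0] < k:
        h = torch.cat([torch.cat([h, h], dim=1),
                       torch.cat([h, -h], dim=1)], dim=0)
    return h / (k ** 0.5)

def quantize(x: torch.Tensor, num_bits: int = 4):
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax + 1e-12
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax), scale

k, n, d, c = 64, 32, 256, 128          # illustrative block and matrix sizes
x = torch.randn(n, d)
x[:, 3] *= 50.0                        # inject an outlier column
w = torch.randn(d, c)

h = hadamard(k)
# Block-diagonal Hadamard transform: process the columns of X in blocks of size k,
# and apply the matching transposed transform to the rows of W.
xt = (x.reshape(n, d // k, k) @ h).reshape(n, d)        # X H
wt = (h.t() @ w.reshape(d // k, k, c)).reshape(d, c)    # H^T W

# The transform spreads the outlier energy across entries, so 4-bit quantization
# loses less information, while orthogonality keeps (X H)(H^T W) = X W.
qx, sx = quantize(xt)
qw, sw = quantize(wt)
y_hq = (qx @ qw) * (sx * sw)
print((y_hq - x @ w).abs().mean())     # approximation error of the quantized MM
```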

Backpropagation

Now we consider using INT4 operations to speed up the backward propagation of linear layers.

This section discusses how to compute the activation gradients and weight gradients.

Structural sparsity of gradients

We noticed that the gradient matrix is often very sparse during training.

The sparsity has the following structure: a few rows (i.e., tokens) have large entries, while most other rows are close to all-zero vectors.


This structural sparsity results from the severe over-parameterization of modern neural networks.

The network operates in the over-parameterized regime for almost the entire training process, and apart from a few hard examples it fits most of the training data well.

Therefore, for well-fitted data points, the (activation) gradient will be close to zero.

The researchers found that for pre-training tasks, for example, structural sparsity appears quickly after only a few training epochs.

For fine-tuning tasks, the gradient is always sparse throughout the training process.
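One way to see this structure (an illustrative check on synthetic data, not a procedure from the paper) is to look at the per-token row norms of an activation-gradient matrix and ask how much of the total energy the few largest rows carry:

```python
import torch

# Toy activation gradient: most token rows are near zero, a handful are large.
n_tokens, dim = 1024, 768
grad = 1e-4 * torch.randn(n_tokens, dim)
heavy = torch.randperm(n_tokens)[:16]           # the few "difficult" tokens
grad[heavy] += torch.randn(16, dim)

row_energy = grad.norm(dim=1).square()
share = row_energy.topk(16).values.sum() / row_energy.sum()
print(f"top 16 of {n_tokens} rows carry {100 * share:.1f}% of the gradient energy")
```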

Bit Splitting and Leverage Score Sampling

How should a gradient quantizer be designed so that it exploits this structural sparsity and computes the MMs accurately during backpropagation?

The high-level idea is that many rows of the gradient are so small that they have little impact on the parameter gradients, yet they waste a great deal of computation.

On the other hand, large rows cannot be accurately represented by INT4.

We drop some small rows and use the saved computing power to represent large rows more accurately.
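Below is a rough sketch of how the two ideas might fit together (my own simplified illustration with made-up helper names, not the authors' LSS-MM kernel): the gradient is quantized to 8 bits and split exactly into a high 4-bit part and a low 4-bit part, and each part's contribution to the weight gradient dW = dY^T X is estimated by sampling rows with probability proportional to a norm-based proxy for the leverage scores and reweighting them so the estimate stays unbiased.

```python
import torch

def split_into_4bit_halves(g: torch.Tensor):
    """Quantize g to signed INT8, then split each code into high 4 bits and low 4 bits."""
    scale = g.abs().max() / 127 + 1e-12
    q = torch.clamp(torch.round(g / scale), -128, 127)
    high = torch.floor(q / 16)          # top 4 bits, signed, in [-8, 7]
    low = q - 16 * high                 # bottom 4 bits, unsigned, in [0, 15]
    return high, low, scale             # q == 16 * high + low exactly

def sampled_matmul(a: torch.Tensor, b: torch.Tensor, m: int) -> torch.Tensor:
    """Unbiased row-sampled estimate of a.T @ b that keeps only m sampled rows."""
    scores = a.norm(dim=1) * b.norm(dim=1)      # norm-based proxy for leverage scores
    probs = scores / scores.sum()
    idx = torch.multinomial(probs, m, replacement=True)
    weights = 1.0 / (m * probs[idx])            # importance-sampling reweighting
    return (a[idx] * weights[:, None]).t() @ b[idx]

n, d, c = 512, 64, 32
grad_y = 1e-3 * torch.randn(n, c)               # mostly near-zero token gradients
grad_y[:8] += torch.randn(8, c)                 # a few large-gradient tokens
x = torch.randn(n, d)

high, low, s = split_into_4bit_halves(grad_y)
# Weight gradient dW = dY^T X, reassembled from the two 4-bit halves, each estimated
# from a small sample of rows instead of all n tokens.
dw_est = 16 * s * sampled_matmul(high, x, m=64) + s * sampled_matmul(low, x, m=64)
print((dw_est - grad_y.t() @ x).abs().mean())   # error of the sampled, quantized estimate
```

The sampling concentrates the limited INT4 budget on the few rows that actually matter, which is the intuition behind dropping small rows and representing large rows more accurately.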

Experiments

The researchers evaluated the INT4 training algorithm on a variety of tasks, including language model fine-tuning, machine translation, and image classification.

The researchers implemented their proposed HQ-MM and LSS-MM operators with CUDA and CUTLASS.

The researchers replaced all floating-point linear operators with their INT4 implementation, except that the embedding layer is simply quantized with LSQ and the last classifier layer is kept at full precision.

Finally, the researchers adopted the default architecture, optimizer, scheduler, and hyperparameters for all evaluated models.

Converged Model Accuracy

The researchers compared the accuracy of the converged models on various tasks in the table below.

[Table: accuracy of converged models on each task for FP, INT8, "Ultra-low" (FP4), LSQ+LUQ, and HQ+LSS]

The compared methods include full-precision training (FP), INT8 training (INT8), FP4 training ("Ultra-low"), 4-bit training that uses LSQ for activations and weights and logarithmic quantization for gradients (LSQ+LUQ), and the proposed algorithm, which uses HQ for forward propagation and LSS for backpropagation (HQ+LSS).

"Ultra Low" has no public implementation, so we only list its performance on the original paper on the machine translation task.

With the exception of the large machine translation task and the large vision Transformer task, each run is repeated three times and the standard deviation is reported as a subscript in the table.

The researchers did not perform any type of knowledge distillation or data augmentation.

Ablation Experiment

The researchers' ablation experiments are designed to demonstrate the effectiveness of the forward and backward methods separately.

To study the effectiveness of different quantizers for forward propagation, backward propagation is kept in FP16.

The results are shown below.

[Table: ablation results for different forward-propagation quantizers with FP16 backward propagation]

Computational and memory efficiency

Finally, the researchers demonstrate the potential of their approach to accelerate neural network training by evaluating their prototype implementation.

Note, however, that the implementation is not yet fully optimized.

The researchers also did not fuse the linear operators with the nonlinearities and normalization.

Therefore, the results do not fully reflect the potential of the INT4 training algorithm.

A fully optimized implementation requires extensive engineering and is beyond the scope of the paper.

Conclusion

The researchers proposed a hardware-friendly INT4 training method for Transformers.

By analyzing the properties of MMs in Transformers, the researchers proposed the HQ and LSS methods to quantize activations and gradients while maintaining accuracy.

On several important tasks, our method performs equally well or even better than existing INT4 methods.

The researchers' work may be extended to other MM-based architectures besides Transformers, such as MLP-Mixer, graph neural networks, and recurrent neural networks.

This is their future research direction.

Broader impact: the researchers' algorithm can improve efficiency and reduce the energy consumption of training neural networks, which could help reduce the carbon emissions caused by deep learning.

However, efficient training algorithms may also facilitate the development of large language models and of malicious AI applications that pose risks to human safety.

For example, models and applications used to generate false content.

Limitations: the main limitation of this work is that it can only accelerate models dominated by large-scale matrix multiplications (linear layers), and cannot speed up convolutional layers.

Moreover, the proposed method is not well applicable to very large models such as OPT-175B.

As far as we know, even INT8 training is still an unsolved problem for these very large models.
