
GPT-like model training accelerated by 26.5%: Zhu Jun's team at Tsinghua uses INT4 arithmetic to accelerate neural network training

WBOY · 2023-07-02

Quantizing activations, weights, and gradients to 4 bits is very valuable for accelerating neural network training. But existing 4-bit training methods require custom number formats that are not supported by contemporary hardware. In this paper, Zhu Jun and colleagues at Tsinghua propose a Transformer training method that implements all matrix multiplications with INT4 arithmetic.

How fast a model trains is closely tied to the numerical precision used for activations, weights, and gradients.

Neural network training is computationally demanding, and using low-precision arithmetic (fully quantized training, FQT) promises to improve both compute and memory efficiency. FQT adds quantizers and dequantizers to the original full-precision computational graph and replaces expensive floating-point operations with cheap low-precision ones.

Research on FQT aims to reduce the numerical precision of training while sacrificing as little convergence speed and accuracy as possible. The required numerical precision has been reduced from FP16 to FP8, INT32+INT8, and INT8+INT5. FP8 training is implemented with the Transformer Engine on NVIDIA H100 GPUs, enabling impressive acceleration of large-scale Transformer training.

Recently, training numerical precision has been pushed down to 4 bits. Sun et al. successfully trained several contemporary networks with INT4 activations/weights and FP4 gradients; Chmiel et al. proposed a custom 4-bit logarithmic number format that further improved accuracy. However, these 4-bit training methods cannot be used directly for acceleration because they require custom number formats that are not supported by contemporary hardware.

Training at a level as low as 4 bits poses significant optimization challenges. First, the non-differentiable quantizers in forward propagation make the loss landscape rugged, where gradient-based optimizers can easily get stuck in local optima. Second, gradients can only be computed approximately at low precision, and these imprecise gradients slow down training and can even make it unstable or divergent.

This paper proposes a new INT4 training algorithm for Transformers, the dominant neural network architecture. All the expensive linear operations used to train a Transformer can be written as matrix multiplications (MMs). This MM formulation lets the researchers design more flexible quantizers that better approximate FP32 matrix multiplication by exploiting specific structures of the activations, weights, and gradients in Transformers. These quantizers also take advantage of new advances in randomized numerical linear algebra (RandNLA).


Paper address: https://arxiv.org/pdf/2306.11987.pdf

The analysis shows that for forward propagation, the main cause of accuracy loss is outliers in the activations. To suppress these outliers, a Hadamard quantizer is proposed that quantizes a transformed version of the activation matrix. The transformation is a block-diagonal Hadamard matrix, which spreads the information carried by the outliers to nearby matrix entries, thereby narrowing the numerical range of the outliers.

For backpropagation, the work exploits the structural sparsity of activation gradients. The gradients of a few tokens are very large, while the gradients of most other tokens are very uniform, even smaller than the quantization residuals of the larger gradients. Rather than computing these small gradients, it is better to spend the computational resources on the residuals of the larger gradients.

Combining the quantization techniques for forward and backward propagation, the paper proposes an algorithm that uses INT4 MMs for all linear operations in a Transformer. The algorithm is evaluated on training Transformers for a variety of tasks, including natural language understanding, question answering, machine translation, and image classification. It achieves accuracy comparable to or higher than existing 4-bit training methods. Furthermore, the algorithm is compatible with contemporary hardware (such as GPUs) because it does not require custom number formats (such as FP4 or logarithmic formats). The prototype quantized INT4 MM operator is up to 2.2 times faster than the FP16 MM baseline and increases training speed by up to 35.1%.

Forward propagation

During training, the researchers use INT4 arithmetic to accelerate all the compute-intensive linear operators and leave the less compute-intensive nonlinear operators in FP16 format. All linear operators in a Transformer can be written in matrix-multiplication form. For ease of presentation, they consider accelerating the following simple matrix multiplication.

[Equation: the matrix multiplication to be accelerated]

The main use case for this kind of matrix multiplication is the fully connected layer.
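As a concrete illustration, the short PyTorch sketch below (not the paper's code) shows how both the forward and backward passes of a fully connected layer reduce to three matrix multiplications, which are exactly the operations the INT4 kernels accelerate; the shapes are arbitrary.

```python
import torch

# Toy shapes: N tokens, C_in input features, C_out output features.
N, C_in, C_out = 128, 64, 32
X = torch.randn(N, C_in)       # activations
W = torch.randn(C_in, C_out)   # weights
dY = torch.randn(N, C_out)     # gradient of the loss with respect to the output Y

# Forward pass: one matrix multiplication.
Y = X @ W                      # (N, C_out)

# Backward pass: two more matrix multiplications.
dW = X.t() @ dY                # gradient w.r.t. the weights, (C_in, C_out)
dX = dY @ W.t()                # gradient w.r.t. the activations, (N, C_in)
```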

Learned Step Size Quantization

Accelerating training requires computing forward propagation with integer arithmetic, so the researchers adopt the learned step size quantizer (LSQ). LSQ is a static quantization method: its quantization scale does not depend on the input, which makes it cheaper than dynamic quantization methods that must recompute the quantization scale at every iteration.

Given an FP matrix X, LSQ quantizes X to integers via the following formula (2).

int_{s_X}(X) = ⌊clamp(X / s_X, −Q_N, Q_P)⌉   (2)

where s_X is the learned step size, ⌊·⌉ denotes rounding to the nearest integer, and [−Q_N, Q_P] is the representable integer range.
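A minimal sketch of this quantization step is shown below, written in plain PyTorch rather than the paper's CUDA kernels. The signed INT4 range Q_N = 8, Q_P = 7 and the example step size are assumptions made for illustration; the learning of s_X via straight-through gradients is not shown.

```python
import torch

def lsq_quantize(X: torch.Tensor, s_X: torch.Tensor, Q_N: int = 8, Q_P: int = 7):
    """Quantize an FP matrix X to integers with step size s_X.

    Returns the integer matrix; the dequantized approximation is s_X * X_int.
    The range [-Q_N, Q_P] = [-8, 7] assumed here is a signed INT4 range.
    """
    X_int = torch.round(torch.clamp(X / s_X, -Q_N, Q_P))
    return X_int.to(torch.int8)   # INT4 values stored in an int8 container

# Example: in LSQ the step size is a learnable parameter trained with the model.
X = torch.randn(16, 16)
s_X = torch.tensor(0.1)
X_int = lsq_quantize(X, s_X)
X_hat = s_X * X_int.float()       # dequantized approximation of X
```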

Activation outliers

Naively applying LSQ to FQT (fully quantized training) with 4-bit activations/weights leads to accuracy degradation because of activation outliers. As shown in Figure 1(a) below, the activations contain some outlier entries whose magnitude is much larger than that of the other entries.

In this case, the step size s_X represents a trade-off between quantization granularity and the range of representable values. If s_X is large, the outliers can be represented well, at the cost of representing most other entries coarsely. If s_X is small, entries outside the range [−Q_N·s_X, Q_P·s_X] must be clipped.

[Figure 1: activation outliers]

Hadamard Quantization

To solve the outlier problem, the researchers propose the Hadamard quantizer (HQ). Its main idea is to quantize the matrices in another linear space that has fewer outliers.

Outliers in the activation matrix form a feature-wise structure: they tend to be concentrated in a few dimensions, that is, only a few columns of X are significantly larger than the others. As a linear transformation, the Hadamard transform can spread the outliers out over the other entries. Specifically, the Hadamard transform H_k is a 2^k × 2^k matrix.

H_0 = [1],  H_k = (1/√2) [[H_{k−1}, H_{k−1}], [H_{k−1}, −H_{k−1}]]
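To illustrate how the transform spreads outliers, here is a small PyTorch sketch that builds H_k recursively and applies a block-diagonal copy of it to an activation matrix. The block size (2^5 = 32) and the assumption that the feature dimension is a multiple of the block size are choices made for this example, not necessarily the paper's configuration.

```python
import torch

def hadamard(k: int) -> torch.Tensor:
    """Normalized 2^k x 2^k Hadamard matrix, built recursively."""
    H = torch.ones(1, 1)
    for _ in range(k):
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0) / (2 ** 0.5)
    return H

def block_hadamard_transform(X: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Multiply X (N x D) on the right by a block-diagonal Hadamard matrix.

    Each block is H_k (here 32 x 32; D is assumed to be a multiple of 2^k).
    Outliers concentrated in a few columns get spread across their block.
    """
    N, D = X.shape
    b = 2 ** k
    H = hadamard(k).to(X.dtype)
    X_blocks = X.reshape(N, D // b, b)     # split features into blocks of size b
    return (X_blocks @ H).reshape(N, D)    # apply H_k inside each block

# A column of large outliers is flattened out after the transform.
X = torch.randn(8, 64)
X[:, 3] += 100.0                           # one outlier feature dimension
X_t = block_hadamard_transform(X, k=5)
print(X.abs().max().item(), X_t.abs().max().item())
```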

To suppress the outliers, the researchers quantize transformed versions of X and W.

[Equation: quantization of the Hadamard-transformed X and W]

By combining the quantized matrices, the researchers obtain the following.

[Equation: the product of the quantized, transformed matrices]

where the inverse transformations cancel each other out, and MM can be implemented as follows.

[Equation (3): the HQ-MM operator]
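Putting the pieces together, a schematic version of the Hadamard-quantized MM might look like the sketch below. It reuses the hadamard/block_hadamard_transform helpers from the earlier sketch, simulates the integer product in floating point, and picks ad hoc step sizes, so it mirrors only the structure of the method rather than the paper's actual INT4 kernel.

```python
import torch

def quantize(A, s, Q_N=8, Q_P=7):
    """LSQ-style quantization to a simulated INT4 grid (values kept in float)."""
    return torch.round(torch.clamp(A / s, -Q_N, Q_P))

def hq_mm(X, W, s_X, s_W, k=5):
    """Schematic Hadamard-quantized matrix multiplication, Y ≈ X @ W.

    Because the (block-diagonal) Hadamard matrix H is orthogonal and symmetric,
    X @ W = (X H)(Hᵀ W); both transformed factors have fewer outliers and are
    quantized before the (here simulated) integer matrix multiplication.
    """
    XH = block_hadamard_transform(X, k)           # X H
    HW = block_hadamard_transform(W.t(), k).t()   # Hᵀ W
    Xq = quantize(XH, s_X)                        # INT4 operand 1
    Wq = quantize(HW, s_W)                        # INT4 operand 2
    return s_X * s_W * (Xq @ Wq)                  # rescale the integer product

# Rough check against the full-precision product.
X, W = torch.randn(128, 64), torch.randn(64, 32)
s_X = X.abs().mean() / 2                          # ad hoc step sizes for the demo
s_W = W.abs().mean() / 2
err = (hq_mm(X, W, s_X, s_W) - X @ W).norm() / (X @ W).norm()
print(f"relative error: {err:.3f}")
```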

Backpropagation

The researchers use INT4 arithmetic to speed up the backpropagation of linear layers. The linear operator HQ-MM defined in Equation (3) has four inputs: the activation X, the weight W, and the step sizes s_X and s_W. Given the gradient ∇_Y L of the loss function L with respect to the output Y, they need to compute the gradients of these four inputs.

Structural sparsity of gradients

The researchers observe that the gradient matrix ∇_Y is often very sparse during training. The sparsity has a structure: a few rows (i.e., tokens) of ∇_Y have large entries, while most other rows are close to all-zero vectors. They plot a histogram of the per-row norms ∥(∇_Y)_i,:∥ in Figure 2 below.

[Figure 2: histogram of the per-row norms of ∇_Y]
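A quick way to check this structure in one's own training run (a measurement sketch, not part of the paper's method) is to summarize the per-row norms of an output gradient:

```python
import torch

def row_norm_stats(grad_Y: torch.Tensor, large_frac: float = 0.01) -> float:
    """Summarize the per-row (per-token) norms of an output gradient ∇_Y.

    Returns the fraction of the total squared norm carried by the
    `large_frac` largest rows; values near 1.0 indicate that a handful
    of tokens dominate the gradient, as described in the text.
    """
    row_norms = grad_Y.norm(dim=1)                 # ‖(∇_Y)_i,:‖ for each row i
    k = max(1, int(large_frac * grad_Y.shape[0]))
    top = torch.topk(row_norms, k).values
    return (top.square().sum() / row_norms.square().sum()).item()
```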

Bit Splitting and Leverage Score Sampling

The researchers discuss how to design gradient quantizers that exploit this structural sparsity to compute the MMs accurately during backpropagation. The high-level idea is that many rows of the gradient are so small that their effect on the parameter gradient is negligible, so computing them wastes a lot of computation. In addition, the large rows cannot be represented accurately with INT4.

To exploit this sparsity, the researchers propose bit splitting, which splits the gradient of each token into higher 4 bits and lower 4 bits. The most informative gradients are then selected by leverage score sampling, an importance-sampling technique from randomized numerical linear algebra (RandNLA).
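The exact procedure is given in the paper; the sketch below illustrates the two ingredients in a deliberately simplified form. The encoding of the low 4-bit half and the use of the product of row norms as the importance score are assumptions for this example (a common RandNLA-style choice), not the paper's exact leverage scores.

```python
import torch

def bit_split(g_int8: torch.Tensor):
    """Split an INT8 gradient into high and low 4-bit halves: g = 16*g_hi + g_lo.

    In this simplified sketch g_hi lies in [-8, 7] and g_lo in [0, 15]
    (an unsigned 4-bit value); the paper's exact encoding may differ.
    """
    g = g_int8.to(torch.int32)
    g_hi = torch.div(g, 16, rounding_mode="floor")
    g_lo = g - 16 * g_hi
    return g_hi, g_lo

def sample_rows(A: torch.Tensor, B: torch.Tensor, keep: int) -> torch.Tensor:
    """Importance-sample rows to approximate Aᵀ @ B (a RandNLA-style sketch).

    Rows are drawn with probability proportional to ‖a_i‖·‖b_i‖ and rescaled
    by 1/(keep * p_i) so the sampled product is unbiased.
    """
    scores = A.norm(dim=1) * B.norm(dim=1)
    p = scores / scores.sum()
    idx = torch.multinomial(p, keep, replacement=True)
    scale = 1.0 / (keep * p[idx])                  # unbiasedness correction
    return (A[idx] * scale[:, None]).t() @ B[idx]  # ≈ Aᵀ @ B

# Example: approximate Aᵀ @ B using 25% of the rows.
A, B = torch.randn(1024, 64), torch.randn(1024, 32)
approx = sample_rows(A, B, keep=256)
print((approx - A.t() @ B).norm() / (A.t() @ B).norm())
```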

Experimental Results

The INT4 training algorithm is evaluated on a variety of tasks, including language model fine-tuning, machine translation, and image classification. The proposed HQ-MM and LSS-MM operators are implemented with CUDA and CUTLASS. Apart from the embedding layer, which simply uses LSQ, all floating-point linear operators are replaced with their INT4 counterparts, and the last-layer classifier is kept at full precision. Throughout, the researchers adopt the default architectures, optimizers, schedulers, and hyperparameters for all evaluated models.

Converged model accuracy. Table 1 below shows the accuracy of the converged models on each task.

[Table 1: accuracy of the converged models on each task]

Language model fine-tuning. Compared with LSQ+LUQ, the proposed algorithm improves the average accuracy by 5.5% on BERT-base and by 25% on BERT-large.

The research team also reports results on the SQuAD, SQuAD 2.0, Adversarial QA, CoNLL-2003, and SWAG datasets. On all of these tasks, the method outperforms LSQ+LUQ: it achieves improvements of 1.8% and 3.6% on SQuAD and SQuAD 2.0, respectively. On the more difficult Adversarial QA, it improves the F1 score by 6.8%. On SWAG and CoNLL-2003, it improves accuracy by 6.7% and 4.2%, respectively.

Machine translation. The proposed method is also used for pre-training: it trains a Transformer-based [51] model for machine translation on the WMT 14 En-De dataset.

HQ+LSS shows a BLEU degradation of about 1.0%, which is smaller than Ultra-low's 2.1% but higher than the 0.3% reported in the LUQ paper. Nevertheless, HQ+LSS still performs comparably to existing methods on this pre-training task while supporting contemporary hardware.

Image classification. The study loads ViT checkpoints pre-trained on ImageNet21k and fine-tunes them on CIFAR-10, CIFAR-100, and ImageNet1k.

Compared with LSQ+LUQ, the method improves accuracy by 1.1% and 0.2% on ViT-B/32 and ViT-L/32, respectively. On ImageNet1k, it improves accuracy by 2% on ViT-B/32, 2.6% on ViT-L/32, and 0.2% on ViT-L/32 compared with LSQ+LUQ.

The research team further tests the algorithm by pre-training a DeiT-Small model on ImageNet1k, where HQ+LSS still converges to a level of accuracy similar to LSQ+LUQ while being more hardware-friendly.

Ablation Study

The researchers conduct an ablation study on the challenging CoLA dataset to demonstrate the effectiveness of the forward and backward methods independently. To study the effectiveness of different quantizers for forward propagation, backpropagation is kept in FP16. The results are shown in Figure 3(a) below.

For backpropagation, the researchers compare their LSS with a simple minimax quantizer and with LUQ, while forward propagation is kept in FP16. The results are shown in Figure 3(b) below. Although its bit width is higher than 2, LSS achieves results that are comparable to, or even slightly better than, LUQ.

[Figure 3: ablation results for (a) forward quantizers and (b) backward quantizers on CoLA]

Computation and memory efficiency

In Figure 4 below, the researchers compare the throughput of the proposed HQ-MM (HQ), the LSS MM that computes weight gradients (LSSWeight), the LSS MM that computes activation gradients (LSSAct), their average throughput (INT4), and the baseline tensor-core FP16 GEMM implementation provided by CUTLASS (FP16), all on an NVIDIA RTX 3090 GPU, which has a peak throughput of 142 FP16 TFLOPS and 568 INT4 TFLOPS.

[Figure 4: operator throughput on an NVIDIA RTX 3090]
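For readers who want to reproduce this kind of measurement for the FP16 baseline, a minimal timing harness is sketched below; the matrix sizes are arbitrary, and the custom INT4 CUDA/CUTLASS kernels themselves are not reproduced here.

```python
import torch

def bench_tflops(m=4096, n=4096, k=4096, iters=50, dtype=torch.float16) -> float:
    """Measure matmul throughput (TFLOPS) on the current CUDA device."""
    A = torch.randn(m, k, device="cuda", dtype=dtype)
    B = torch.randn(k, n, device="cuda", dtype=dtype)
    for _ in range(10):                      # warm-up
        A @ B
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        A @ B
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters   # elapsed_time is in ms
    return 2 * m * n * k / seconds / 1e12              # 2*m*n*k FLOPs per GEMM

if torch.cuda.is_available():
    print(f"FP16 GEMM: {bench_tflops():.1f} TFLOPS")
```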

The researchers also compare the training throughput of FP16 PyTorch AMP and their INT4 training algorithm when training BERT-like and GPT-like language models on 8 NVIDIA A100 GPUs. They vary the hidden size, the intermediate fully connected layer size, and the batch size, and plot the speedup of INT4 training in Figure 5 below.

The results show that the INT4 training algorithm achieves up to 35.1% acceleration for BERT-like models and up to 26.5% acceleration for GPT-like models.

[Figure 5: training speedup of INT4 over FP16 AMP for BERT-like and GPT-like models]
