Can't a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!-AI-php.cn

Can't a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Apr 13, 2023 am 09:31 AM

parameterModel

Although large-scale language models (LLM) have strong performance, the number of parameters can easily reach hundreds or hundreds of billions, and the demand for computing equipment and memory is so large that ordinary companies cannot afford it.

Quantization is a common compression operation. By reducing the accuracy of model weights (such as 32bit to 8bit), some model performance is sacrificed in exchange for faster inference speed, and more Low memory requirements.

But for LLMs with more than 100 billion parameters, existing compression methods cannot maintain the accuracy of the model, nor can they run efficiently on hardware.

Recently, researchers from MIT and NVIDIA jointly proposed a general-purpose post-training quantization (GPQ, general-purpose post-training quantization) solution SmoothQuant, for large language models, 8-bit weights and 8-bit activation (W8A8) quantification can be efficiently realized, and the accuracy of the model can be maintained without training.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

##Paper link: https://arxiv.org/pdf/2211.10438.pdf

Code link: https://github.com/mit-han-lab/smoothquant

Since activation is more difficult to quantify than weight, SmoothQuant transfers activations that are difficult to quantify to weights through mathematical equivalent transformation, achieving smooth processing of activation outliers.

SmoothQuant can quantize weights and activations in various layers of all LLMs to INT8, including OPT-175B, BLOOM-176B and GLM-130B.

Compared with existing methods that only perform weight optimization or quantize activations with mixed precision, SmoothQuant has higher hardware efficiency and achieves 1.56 times acceleration. The memory requirements are only half that of the original LLM, and there is almost no loss in accuracy.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

SmoothQuant also has a hardware-friendly design. The researchers integrated SmoothQuant into the LLM service framework FasterTransformer to achieve faster inference speed. Compared with the accuracy of FP16, only half the number of GPUs are required.

Instructor Song Han is an associate professor at MIT EECS. He graduated from Stanford University with a PhD. His main research direction is efficient deep learning. He once proposed deep compression technology, which can transform neural networks into The size is reduced by an order of magnitude without losing accuracy.

SmoothQuant

Quantization (Quantization) is to map high-precision values to lower-precision discrete values. In this paper, researchers mainly focus on improving hardware Efficient integer uniform quantization, especially INT8.

Quantization operations can be performed at different granularities, such as per-tensor quantization is applied to the entire weight matrix, and per-token quantization is applied to activations For each token, per-channel quantization is applied to each output channel of the weight.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times! #By observing the quantitative results of activation, the researchers concluded several patterns:

#1. Quantification is more difficult to quantify than weight.

The distribution of weights is relatively more uniform and flatter. Previous research results have proven that reducing the weight of a large language model to INT8 or even INT4 has little impact on accuracy.

#2. Outliers are the main difficulty in activation quantification.

#Outliers in activation are usually about 100 times higher than normal values, resulting in very low efficiency of quantization bits/levels in channels without outliers.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

3. Abnormal values are fixed in a certain channel.

Outliers will only appear in a small number of channels, but if there is an outlier in one channel, the outlier may appear in all appears in the token.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

The variance of all channels in a given token will be large (some channels will be very large, but most will be small), but given The variance of a channel across all token degrees will be small (outlier channels will be large).

Since outliers have the characteristics of continuous occurrence and small variance within each channel, if per-channel quantization is performed on activations, the quantization error will be much smaller than per-tensor quantization .

Through a simple experiment, the results once again verified the researchers’ ideas. When quantized to INT8, the per-channel accuracy is much higher than per-tensor and per-token. Quantification, the accuracy is almost the same as the FP16 baseline.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

The researchers smoothed the input activation by using a per-channel smoothing factor s. To maintain mathematical equivalence of linear layers, the weights also need to be inversely scaled.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

Since the input X is usually generated by previous linear operations (such as linear layers, layer norms, etc.), it can be easily The smoothing factor is blended into the parameters of the previous layer offline and does not incur the kernel call overhead of additional scaling. For other cases, such as when the input comes from residual add, an additional scaling can be added to the residual branch.

Transfer quantization difficulty from activations to weights

The goal of Smooth is to choose a per-channel smoothing factor s such that the inverse Operations are easier to quantify.

In order to reduce the quantization error, the effective quantization bits of all channels should be increased. When the maximum magnitude of all channels is the same, the total number of effective quantization bits will be the largest.

Therefore, one of the most direct smoothing factor choices is the maximum value of each channel in the input, which can ensure that after division, all activation channels have the same maximum value, thus achieving easier quantification.

But it should be noted that the activation range is dynamic and different for different input samples. So the researchers used calibration samples from the pre-training dataset to estimate the size of the activation channels.

Since this formula transfers all quantification difficulties to the weights, it can be found that in this case, the quantization error of the weights will be very large, resulting in a large decrease in accuracy.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

On the other hand, it is also possible to push all quantization difficulties from weights to activations by choosing sj = 1/ max(|Wj |). Likewise, model performance is also poor due to excessive activation quantization errors. Therefore the quantification difficulty needs to be split between weights and activations to make them both easy to quantify.

The researchers introduced a hyperparameter transfer strength α to control the difficulty of transferring from activations to weights.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

It can be found that for most models, such as OPT and BLOOM models, α=0.5 is a good balance point, which can evenly distribute the quantization difficulty, especially using the same quantizer Perform weighting and activation.

This formula ensures that the weights and activations of corresponding channels have similar maximum values and thus share the same quantization difficulty.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

For some other models with relatively large activation outliers, such as GLM-130B with 30% outliers, which is more difficult for activation quantification, you can choose a larger A large α (such as 0.75) transfers more quantification difficulty to the weights.

SmoothQuant is applied to the Transformer block

The linear layer takes up most of the parameters and calculations of the LLM model. By default, SmoothQuant scales the input activations of all linear layers in the Transformer and quantizes the linear layers with W8A8, which enables quantization of the BMM operator in the attention calculation.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

In the process, INT8 is first used to quantify the inputs and weights of computationally intensive operators such as BMM in the linear layer and attention layer, while other light Operations on magnitude elements, such as Softmax and LayerNorm, remain activated as FP16. This design helps balance accuracy and reasoning efficiency.

Experimental part

The researchers selected three large-scale language models to evaluate SmoothQuant, including OPT, BLOOM and GLM-130B; and used seven zero-shot tasks, including LAMBADA, HellaSwag , PIQA, WinoGrande, OpenBookQA, RTE, COPA, etc.

Experimental results show that SmoothQuant can handle the quantization problem of very large LLMs, and its activation is more difficult to quantify.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

SmoothQuant can match the accuracy of FP16 on all evaluation datasets, while the W8A8, ZeroQuant and Outlier Suppression baselines produce almost random results.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

And SmoothQuant can losslessly quantize all open LLMs with more than 100B parameters

SmoothQuant’s O1 and O2 levels successfully maintain floating point accuracy, while Level O3 (per-tensor static) reduces average accuracy by 0.8%, likely due to the difference between statically collected statistics and activation statistics of real evaluation samples.

Nonetheless, SmoothQuant-O1 can match the accuracy of FP16, while SmoothQuant-O3 only reduces the accuracy by 1%, which is significantly better than the baseline.

SmoothQuant is not only effective for very large LLMs with over 100B parameters, but also has stable results for smaller LLMs. SmoothQuant can work on all scales of OPT models and match the FP16 accuracy of INT8 quantization .

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

To demonstrate the speed improvements and memory savings of SmoothQuant-O3 integrated into PyTorch and FasterTransformer, we measured all hidden states generating a batch of 4 sentences at a time The end-to-end delay, that is, the delay in the context stage, and records the peak GPU memory usage during this process.

Due to Huggingface's lack of support for model parallelism, the researchers only measured the performance of SmoothQuant's PyTorch implementation on a single GPU, so OPT-6.7B, OPT-13B and OPT-30B were selected for evaluation.

In the FasterTransformer library, SmoothQuant can be seamlessly connected with the Tensor Parallelism algorithm, so the researchers tested SmoothQuant’s single-GPU and multi-GPU benchmarks on OPT-13B, OPT-30B, OPT-66B and OPT-175B. .

Experimental results conducted on NVIDIA A100 80GB GPU server show that SmoothQuant is always faster than the FP16 baseline in terms of inference latency and peak memory usage based on PyTorch implementation, when the sequence length is 256, on OPT-30B Obtained a 1.51 times speed increase.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

You can also see a trend that the larger the model, the more obvious the speedup, but LLM.int8() is almost always slower than the FP16 baseline, also due to mixed precision Caused by the huge overhead of activating representations.

In terms of memory, both SmoothQuant and LLM.int8() can almost halve the memory usage of the FP16 model, while SmoothQuant saves slightly more memory because it completely uses INT8 GEMM.

Cant a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!

Compared with FasterTransformer's FP16 implementation of OPT, SmoothQuant-O3 can further reduce the execution latency of OPT-13B and OPT-30B when using a single GPU, by up to 1.56 times.

The above is the detailed content of Can't a language model with 10 billion parameters run? A Chinese doctor from MIT proposed SmoothQuant quantification, which reduced memory requirements by half and increased speed by 1.56 times!. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete

Are You At Risk Of AI Agency Decay? Take The Test To Find OutApr 21, 2025 am 11:31 AM

This article explores the growing concern of "AI agency decay"—the gradual decline in our ability to think and decide independently. This is especially crucial for business leaders navigating the increasingly automated world while retainin

How to Build an AI Agent from Scratch? - Analytics VidhyaApr 21, 2025 am 11:30 AM

Ever wondered how AI agents like Siri and Alexa work? These intelligent systems are becoming more important in our daily lives. This article introduces the ReAct pattern, a method that enhances AI agents by combining reasoning an

Revisiting The Humanities In The Age Of AIApr 21, 2025 am 11:28 AM

"I think AI tools are changing the learning opportunities for college students. We believe in developing students in core courses, but more and more people also want to get a perspective of computational and statistical thinking," said University of Chicago President Paul Alivisatos in an interview with Deloitte Nitin Mittal at the Davos Forum in January. He believes that people will have to become creators and co-creators of AI, which means that learning and other aspects need to adapt to some major changes. Digital intelligence and critical thinking Professor Alexa Joubin of George Washington University described artificial intelligence as a “heuristic tool” in the humanities and explores how it changes

Understanding LangChain Agent FrameworkApr 21, 2025 am 11:25 AM

LangChain is a powerful toolkit for building sophisticated AI applications. Its agent architecture is particularly noteworthy, allowing developers to create intelligent systems capable of independent reasoning, decision-making, and action. This expl

What are the Radial Basis Functions Neural Networks?Apr 21, 2025 am 11:13 AM

Radial Basis Function Neural Networks (RBFNNs): A Comprehensive Guide Radial Basis Function Neural Networks (RBFNNs) are a powerful type of neural network architecture that leverages radial basis functions for activation. Their unique structure make

The Meshing Of Minds And Machines Has ArrivedApr 21, 2025 am 11:11 AM

Brain-computer interfaces (BCIs) directly link the brain to external devices, translating brain impulses into actions without physical movement. This technology utilizes implanted sensors to capture brain signals, converting them into digital comman

Insights on spaCy, Prodigy and Generative AI from Ines MontaniApr 21, 2025 am 11:01 AM

This "Leading with Data" episode features Ines Montani, co-founder and CEO of Explosion AI, and co-developer of spaCy and Prodigy. Ines offers expert insights into the evolution of these tools, Explosion's unique business model, and the tr

A Guide to Building Agentic RAG Systems with LangGraphApr 21, 2025 am 11:00 AM

This article explores Retrieval Augmented Generation (RAG) systems and how AI agents can enhance their capabilities. Traditional RAG systems, while useful for leveraging custom enterprise data, suffer from limitations such as a lack of real-time dat

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks agoByDDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks agoByDDD

Where to find the Crane Control Keycard in Atomfall

3 weeks agoByDDD

Assassin's Creed Shadows - How To Find The Blacksmith And Unlock Weapon And Armour Customisation

1 months agoByDDD

Roblox: Dead Rails - How To Complete Every Challenge

3 weeks agoByDDD

Hot Tools

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

WebStorm Mac version

Useful JavaScript development tools

Atom editor mac version download

The most popular open source editor

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

Hot Topics

Where is the login entrance for gmail email?

7631

CakePHP Tutorial

1389

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

141