Can't run a language model with 100 billion parameters? A Chinese PhD researcher at MIT proposes SmoothQuant quantization, which halves memory requirements and speeds up inference by 1.56x!

WBOY (reposted)
2023-04-13 09:31:06

Although large language models (LLMs) deliver strong performance, their parameter counts easily reach tens or hundreds of billions, and their demands on compute and memory are so large that ordinary companies cannot afford them.

Quantization is a common compression technique. By reducing the precision of model weights (for example from 32 bits to 8 bits), it trades a small amount of model quality for faster inference and lower memory requirements.

But for LLMs with more than 100 billion parameters, existing compression methods cannot maintain the accuracy of the model, nor can they run efficiently on hardware.

Recently, researchers from MIT and NVIDIA jointly proposed SmoothQuant, a general-purpose post-training quantization (PTQ) solution. For large language models, it enables efficient quantization with 8-bit weights and 8-bit activations (W8A8) while preserving model accuracy without any retraining.

Paper link: https://arxiv.org/pdf/2211.10438.pdf

Code link: https://github.com/mit-han-lab/smoothquant

Because activations are harder to quantize than weights, SmoothQuant migrates the quantization difficulty from activations to weights through a mathematically equivalent transformation, which smooths out the activation outliers.

SmoothQuant can quantize both weights and activations to INT8 for all matrix multiplications in LLMs, including OPT-175B, BLOOM-176B and GLM-130B.

Compared with existing methods that quantize only the weights or that quantize activations with mixed precision, SmoothQuant is more hardware-efficient and achieves up to a 1.56x speedup. Memory requirements are only half those of the original LLM, with almost no loss in accuracy.

SmoothQuant also has a hardware-friendly design. The researchers integrated SmoothQuant into the LLM serving framework FasterTransformer to achieve faster inference, and it needs only half as many GPUs as FP16.

The advisor, Song Han, is an associate professor at MIT EECS and received his PhD from Stanford University. His main research direction is efficient deep learning. He previously proposed the Deep Compression technique, which can shrink neural networks by an order of magnitude without losing accuracy.

SmoothQuant

Quantization maps high-precision values to lower-precision discrete values. In this paper, the researchers focus on integer uniform quantization, which is more hardware-efficient, and on INT8 in particular.
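
As a rough illustration of integer uniform quantization, the sketch below quantizes a small tensor to INT8 with a single symmetric scale and maps it back to floating point. The function names `quantize_int8` and `dequantize` and the sample values are illustrative assumptions, not code from the paper or its repository.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric uniform quantization of a float tensor to INT8 with one scale."""
    scale = float(np.max(np.abs(x))) / 127.0  # step size between adjacent levels
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

x = np.array([0.02, -0.5, 1.27, -1.0], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
max_error = float(np.max(np.abs(x - x_hat)))  # bounded by about scale / 2
```

The largest magnitude in the tensor fixes the step size, so every other value is rounded to a multiple of that step.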

Quantization can be performed at different granularities: per-tensor quantization uses one scale for the entire matrix, per-token quantization uses one scale for each token of the activations, and per-channel quantization uses one scale for each output channel of the weights.
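
The three granularities differ only in the axis over which the scale is computed. The following sketch shows this for toy shapes; the helper `int8_scale`, the shapes, and the layout (activations as [tokens, input channels], weights as [input channels, output channels]) are assumptions chosen for illustration.

```python
import numpy as np

# Toy shapes (illustrative): 4 tokens, 8 input channels, 3 output channels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)).astype(np.float32)  # activations: [tokens, in_channels]
W = rng.normal(size=(8, 3)).astype(np.float32)  # weights: [in_channels, out_channels]

def int8_scale(t, axis=None):
    """Scale(s) that map the largest magnitude along `axis` to 127."""
    return np.max(np.abs(t), axis=axis, keepdims=axis is not None) / 127.0

w_per_tensor = int8_scale(W)           # one scale for the whole matrix
x_per_token = int8_scale(X, axis=1)    # one scale per token (row of X)
w_per_channel = int8_scale(W, axis=0)  # one scale per output channel (column of W)
```

Finer granularity means more scales to store and apply, but each scale only has to cover the range of its own row or column.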

By examining how activations behave under quantization, the researchers identified several patterns:


1. Activations are harder to quantize than weights.

The distribution of weights is relatively uniform and flat. Previous research has shown that quantizing the weights of a large language model to INT8 or even INT4 has little impact on accuracy.

2. Outliers are the main difficulty in activation quantization.

Outliers in activations are usually about 100 times larger than normal values, which leaves very few effective quantization bits/levels for the channels without outliers.

3. Outliers persist in fixed channels.

Outliers appear in only a small number of channels, but if a channel has an outlier, the outlier tends to appear in that channel for all tokens.

Within a given token, the variance across channels is large (a few channels are very large, but most are small), whereas within a given channel, the variance across tokens is small (outlier channels are consistently large).

Because outliers occur persistently and the variance within each channel is small, per-channel quantization of activations yields a much smaller quantization error than per-tensor quantization.

A simple experiment confirmed this idea: when activations are quantized to INT8, per-channel quantization is far more accurate than per-tensor or per-token quantization, and its accuracy is almost the same as the FP16 baseline.
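
The effect described above can be reproduced on synthetic data. The sketch below builds activations with one persistent outlier channel (roughly 100 times larger, matching the ratio quoted earlier) and compares the rounding error on the normal channels under per-tensor and per-channel scales. The data, the helper `fake_quant`, and the choice of channel 3 are assumptions for illustration only.

```python
import numpy as np

def fake_quant(x, scale):
    """Quantize to INT8 levels and map back to float, to expose rounding error."""
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))  # 16 tokens, 8 channels, values of order 1
X[:, 3] *= 100.0              # channel 3 is a persistent outlier channel

per_tensor_scale = np.abs(X).max() / 127.0
per_channel_scale = np.abs(X).max(axis=0, keepdims=True) / 127.0

normal = [c for c in range(8) if c != 3]  # channels without outliers
err_tensor = np.abs(fake_quant(X, per_tensor_scale) - X)[:, normal].mean()
err_channel = np.abs(fake_quant(X, per_channel_scale) - X)[:, normal].mean()
```

With a single per-tensor scale, the outlier channel dictates the step size, so values in the normal channels collapse onto only a handful of levels; with per-channel scales each channel keeps its full INT8 range.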

However, per-channel activation quantization does not map well onto hardware-accelerated INT8 GEMM kernels, so the researchers instead smooth the input activations by dividing them by a per-channel smoothing factor s. To keep the linear layer mathematically equivalent, the weights are scaled in the opposite direction:

Y = (X · diag(s)^(-1)) · (diag(s) · W) = X̂ · Ŵ

Since the input X is usually produced by a preceding linear operation (such as a linear layer or layer norm), the smoothing factor can easily be fused into the parameters of that previous layer offline, so no extra scaling kernel call is incurred. In other cases, such as when the input comes from a residual add, an additional scaling can be added to the residual branch.
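
The equivalence and the offline fusion can be checked numerically. The sketch below assumes a LayerNorm followed by a linear layer Y = X · W, picks an arbitrary positive smoothing factor, and folds the division by s into the LayerNorm weight and bias. All shapes, names and random values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, c_in, c_out = 4, 6, 3

h = rng.normal(size=(tokens, c_in))       # input to the preceding LayerNorm
gamma = rng.uniform(0.5, 2.0, size=c_in)  # LayerNorm weight
beta = rng.normal(size=c_in)              # LayerNorm bias
W = rng.normal(size=(c_in, c_out))        # linear layer weight, Y = X @ W
s = rng.uniform(0.5, 4.0, size=c_in)      # per-input-channel smoothing factor (arbitrary here)

def layer_norm(h, gamma, beta, eps=1e-5):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps) * gamma + beta

# Original computation.
X = layer_norm(h, gamma, beta)
Y = X @ W

# Smoothed computation: activations are divided by s, weight rows are multiplied by s.
# The division is folded into the LayerNorm parameters, so it adds no run-time work.
X_smooth = layer_norm(h, gamma / s, beta / s)
W_smooth = W * s[:, None]
Y_smooth = X_smooth @ W_smooth
```

Because the LayerNorm output is affine in its weight and bias, rescaling both by 1/s rescales each output channel by exactly 1/s, which is why the output of the linear layer is unchanged.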

Transfer quantization difficulty from activations to weights

The goal of smoothing is to choose a per-channel smoothing factor s such that the smoothed activations X̂ = X · diag(s)^(-1) are easy to quantize.

In order to reduce the quantization error, the effective quantization bits of all channels should be increased. When the maximum magnitude of all channels is the same, the total number of effective quantization bits will be the largest.

The most direct choice of smoothing factor is therefore the maximum magnitude of each input channel, s_j = max(|X_j|), which ensures that after the division all activation channels have the same maximum value and are therefore easy to quantize.

Note, however, that the activation range is dynamic and differs between input samples, so the researchers use calibration samples from the pre-training dataset to estimate the scale of each activation channel.

Because this choice transfers all of the quantization difficulty to the weights, the weight quantization error becomes very large, which causes a large drop in accuracy.

Conversely, all of the quantization difficulty can be pushed from the weights to the activations by choosing s_j = 1 / max(|W_j|). In that case model performance is also poor, because the activation quantization error is too large. The quantization difficulty therefore needs to be split between weights and activations so that both are easy to quantize.

The researchers introduced a hyperparameter, the migration strength α, to control how much quantization difficulty is transferred from activations to weights:

s_j = max(|X_j|)^α / max(|W_j|)^(1−α)

For most models, such as the OPT and BLOOM models, α = 0.5 is a good balance point that splits the quantization difficulty evenly, especially when the same quantizer is used for weights and activations.

This formula ensures that the weights and activations of corresponding channels have similar maximum values and thus share the same quantization difficulty.
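
A small worked example makes the balancing effect concrete. The sketch below computes s_j = max(|X_j|)^α / max(|W_j|)^(1−α) per input channel and then checks the per-channel maxima after smoothing. The calibration statistics `act_absmax`, the weight values, and the function name `smoothing_factor` are hypothetical and chosen only for illustration.

```python
import numpy as np

def smoothing_factor(act_absmax, W, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one value per input channel j."""
    w_absmax = np.abs(W).max(axis=1)  # max over the output dimension of row j
    return act_absmax**alpha / w_absmax**(1.0 - alpha)

# Hypothetical calibration statistics: channel 1 is an activation outlier channel.
act_absmax = np.array([1.0, 64.0, 2.0, 0.5])
W = np.array([[0.25, -0.10],
              [0.05,  0.25],
              [-0.50, 0.20],
              [0.10, -0.40]])

s = smoothing_factor(act_absmax, W, alpha=0.5)
smoothed_act_max = act_absmax / s                    # max|X_j| after dividing X by s
smoothed_w_max = np.abs(W * s[:, None]).max(axis=1)  # max|W_j| after multiplying W by s
```

With α = 0.5 both maxima become sqrt(max|X_j| · max|W_j|) for every channel, so in this example the outlier channel's activation maximum drops from 64 to 4 while its weight maximum rises from 0.25 to 4. Setting α = 1 reduces to the choice s_j = max(|X_j|), which makes every smoothed activation maximum equal to 1.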

For models with more significant activation outliers, such as GLM-130B with about 30% outliers, activations are harder to quantize, so a larger α (such as 0.75) can be chosen to transfer more of the quantization difficulty to the weights.

Applying SmoothQuant to Transformer blocks

Linear layers account for most of the parameters and computation in an LLM. By default, SmoothQuant smooths the input activations of all linear layers in the Transformer and quantizes the linear layers with W8A8. The BMM operators in the attention computation are quantized as well.

In this pipeline, the inputs and weights of compute-intensive operators, such as the linear layers and the BMMs in the attention layers, are quantized to INT8, while the activations of lightweight element-wise operations, such as Softmax and LayerNorm, stay in FP16. This design helps balance accuracy and inference efficiency.

Experiments

The researchers selected three families of large language models to evaluate SmoothQuant: OPT, BLOOM and GLM-130B. They used seven zero-shot tasks: LAMBADA, HellaSwag, PIQA, WinoGrande, OpenBookQA, RTE and COPA.

The experimental results show that SmoothQuant can handle the quantization of very large LLMs, whose activations are harder to quantize.

SmoothQuant matches FP16 accuracy on all evaluation datasets, while the naive W8A8, ZeroQuant and Outlier Suppression baselines produce almost random results.

SmoothQuant can also quantize all open LLMs with more than 100B parameters without loss of accuracy.

SmoothQuant's O1 and O2 levels successfully maintain floating-point accuracy, while the O3 level (per-tensor static) reduces average accuracy by 0.8%, likely because of the discrepancy between the statically collected statistics and the activation statistics of the real evaluation samples.
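
The discrepancy mentioned above can be illustrated with a toy example. The sketch below compares a per-tensor scale fixed from calibration data (static) with one recomputed from the actual input (dynamic). It assumes that the non-static levels compute activation scales at run time; that is an assumption here, since the text above only states that O3 is per-tensor static. The sample values and the helper `fake_quant` are hypothetical.

```python
import numpy as np

def fake_quant(x, scale):
    """Quantize to INT8 levels and map back to float, to expose the error."""
    return np.clip(np.round(x / scale), -127, 127) * scale

# Hypothetical activations: one evaluation value exceeds the calibrated range.
calibration = np.array([[0.5, -1.0, 2.0], [1.5, -0.5, -2.0]])
evaluation = np.array([[0.5, -3.0, 1.0]])

static_scale = np.abs(calibration).max() / 127.0  # fixed offline, reused for every input
dynamic_scale = np.abs(evaluation).max() / 127.0  # recomputed from the actual input

err_static = np.abs(fake_quant(evaluation, static_scale) - evaluation).max()
err_dynamic = np.abs(fake_quant(evaluation, dynamic_scale) - evaluation).max()
```

The static scale clips the value -3.0 to the calibrated limit of -2.0, while the dynamic scale only incurs rounding error. The trade-off is that a dynamic scale requires an extra reduction over the activations at run time.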

Nonetheless, SmoothQuant-O1 can match the accuracy of FP16, while SmoothQuant-O3 only reduces the accuracy by 1%, which is significantly better than the baseline.

SmoothQuant is effective not only for very large LLMs with over 100B parameters but also gives stable results for smaller LLMs. It works across all scales of OPT models and matches FP16 accuracy with INT8 quantization.

To demonstrate the speedup and memory savings of SmoothQuant-O3 integrated into PyTorch and FasterTransformer, the researchers measured the end-to-end latency of generating all hidden states for a batch of 4 sentences in one pass, that is, the context-stage latency, and recorded the peak GPU memory usage during this process.

Due to Huggingface's lack of support for model parallelism, the researchers only measured the performance of SmoothQuant's PyTorch implementation on a single GPU, so OPT-6.7B, OPT-13B and OPT-30B were selected for evaluation.

In the FasterTransformer library, SmoothQuant works seamlessly with the tensor parallelism algorithm, so the researchers ran single-GPU and multi-GPU benchmarks of SmoothQuant on OPT-13B, OPT-30B, OPT-66B and OPT-175B.

Experiments on an NVIDIA A100 80GB GPU server show that the PyTorch implementation of SmoothQuant consistently beats the FP16 baseline in both inference latency and peak memory usage, achieving a 1.51x speedup on OPT-30B at a sequence length of 256.

There is also a trend that the larger the model, the more pronounced the speedup. LLM.int8(), by contrast, is almost always slower than the FP16 baseline because of the large overhead of its mixed-precision activation representation.

In terms of memory, both SmoothQuant and LLM.int8() can almost halve the memory usage of the FP16 model, while SmoothQuant saves slightly more memory because it completely uses INT8 GEMM.
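
The halving follows from simple arithmetic on bytes per weight. The sketch below estimates the weight memory of a 175B-parameter model in FP16 and INT8 and the minimum number of 80 GB GPUs needed to hold the weights. It is a back-of-the-envelope estimate under stated assumptions: it counts weights only, ignores activations and framework overhead, and uses the hypothetical helpers `weight_memory_gb` and `min_gpus`.

```python
import math

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def min_gpus(memory_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(memory_gb / gpu_memory_gb)

params = 175e9                         # e.g. OPT-175B
fp16_gb = weight_memory_gb(params, 2)  # 2 bytes per FP16 weight -> 350.0 GB
int8_gb = weight_memory_gb(params, 1)  # 1 byte per INT8 weight  -> 175.0 GB

print(fp16_gb, min_gpus(fp16_gb))  # 350.0 5
print(int8_gb, min_gpus(int8_gb))  # 175.0 3
```

Real deployments need extra headroom for activations and usually pick a GPU count that suits tensor parallelism, so actual counts are higher than these lower bounds, but the roughly 2:1 ratio between FP16 and INT8 carries over.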

Compared with FasterTransformer's FP16 implementation of OPT, SmoothQuant-O3 further reduces the execution latency of OPT-13B and OPT-30B on a single GPU, with a speedup of up to 1.56x.

Statement:
This article is reproduced from 51cto.com. In case of infringement, please contact admin@php.cn for removal.