Can't run a language model with 100 billion parameters? A Chinese PhD at MIT proposes SmoothQuant quantization, which cuts memory requirements in half and speeds up inference by 1.56 times!
Although large language models (LLMs) perform strongly, their parameter counts easily reach tens or hundreds of billions, and their demand for compute and memory is so large that ordinary companies cannot afford it.
Quantization is a common compression technique: by lowering the precision of the model weights (for example, from 32-bit to 8-bit), it trades some model performance for faster inference and lower memory requirements.
But for LLMs with more than 100 billion parameters, existing compression methods cannot maintain the accuracy of the model, nor can they run efficiently on hardware.
Recently, researchers from MIT and NVIDIA jointly proposed SmoothQuant, a general-purpose post-training quantization (PTQ) solution that efficiently enables 8-bit weight, 8-bit activation (W8A8) quantization for large language models and preserves model accuracy without any training.
Paper link: https://arxiv.org/pdf/2211.10438.pdf
Code link: https://github.com/mit-han-lab/smoothquant
Since activations are harder to quantize than weights, SmoothQuant migrates the quantization difficulty from activations to weights through a mathematically equivalent transformation, smoothing out the activation outliers.
SmoothQuant can quantize both the weights and the activations across the layers of LLMs such as OPT-175B, BLOOM-176B and GLM-130B to INT8.
Compared with existing methods that quantize only the weights or quantize activations with mixed precision, SmoothQuant is more hardware-efficient, achieving up to a 1.56x speedup while needing only half the memory of the original LLM, with almost no loss in accuracy.
SmoothQuant also has a hardware-friendly design. The researchers integrated it into the LLM serving framework FasterTransformer, achieving faster inference with only half as many GPUs as FP16.
The advisor, Song Han, is an associate professor in MIT EECS. He received his PhD from Stanford University, and his main research direction is efficient deep learning. He previously proposed the Deep Compression technique, which can shrink neural networks by an order of magnitude without losing accuracy.
SmoothQuant
Quantization maps high-precision values to lower-precision discrete values. In this paper, the researchers focus on integer uniform quantization, specifically INT8, because of its hardware efficiency.
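As a minimal sketch of what integer uniform quantization to INT8 looks like, the snippet below uses the common symmetric scheme in which the largest magnitude is mapped to 127. The function names and sample values are illustrative assumptions, not code from the SmoothQuant repository.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric uniform quantization of a whole tensor to INT8 (illustrative)."""
    # Step size: the largest magnitude is mapped onto the largest INT8 level, 127.
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 levels back to approximate floating-point values."""
    return q.astype(np.float32) * scale

x = np.array([0.024, -0.5, 1.27, -1.003], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
print(q, scale, x_hat)
```

Every value is rounded to the nearest multiple of the step size, so the rounding error of each element is at most half a step. A single very large value inflates the step size for everything else, which is why outliers matter so much in the discussion that follows.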
Quantization can be performed at different granularities: per-tensor quantization uses a single step size for the entire matrix, per-token quantization uses a separate step size for the activations of each token, and per-channel quantization uses a separate step size for each output channel of the weights.
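The three granularities differ only in the axis over which the step size is computed. The sketch below assumes activations of shape (tokens, input channels) and weights of shape (input channels, output channels); the helper names are hypothetical.

```python
import numpy as np

def absmax_scale(x, axis=None):
    """INT8 step size(s) from the absolute maximum along `axis` (None = whole tensor)."""
    return np.max(np.abs(x), axis=axis, keepdims=axis is not None) / 127.0

def fake_quant(x, scale):
    """Round to INT8 levels, then map back to float so the error is easy to inspect."""
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # activations: 4 tokens x 8 input channels
W = rng.normal(size=(8, 3))   # weights: 8 input channels x 3 output channels

per_tensor_scale = absmax_scale(W)            # one scalar for the whole matrix
per_token_scale = absmax_scale(X, axis=1)     # shape (4, 1): one step size per token
per_channel_scale = absmax_scale(W, axis=0)   # shape (1, 3): one per output channel

X_q = fake_quant(X, per_token_scale)
W_q = fake_quant(W, per_channel_scale)
```

Finer granularity never gives a larger step size than per-tensor quantization, because the maximum over a row or column cannot exceed the maximum over the whole matrix.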
By observing how activations quantize, the researchers identified several patterns:
1. Activations are harder to quantize than weights.
The weight distribution is comparatively uniform and flat. Previous work has shown that quantizing the weights of large language models to INT8, or even INT4, has little impact on accuracy.
2. Outliers are the main difficulty in activation quantization.
Outliers in the activations are typically about 100 times larger than normal values, which leaves very few effective quantization bits/levels for the channels without outliers.
3. Outliers persist in fixed channels.
Outliers appear in only a small number of channels, but if a channel has an outlier, the outlier tends to appear in every token. For a given token, the variance across channels is large (a few channels are very large, but most are small), whereas for a given channel, the variance across tokens is small (outlier channels are consistently large).
Because outliers are persistent and the variance within each channel is small, per-channel quantization of the activations would give a much smaller quantization error than per-tensor quantization.
A simple experiment confirmed this idea: when activations are quantized to INT8, per-channel quantization is far more accurate than per-tensor or per-token quantization and nearly matches the FP16 baseline.
Per-channel activation quantization does not map well onto hardware-accelerated INT8 GEMM kernels, however, so the researchers instead smooth the input activations by dividing them by a per-channel smoothing factor s. To keep the linear layer mathematically equivalent, the weights are scaled by s in the opposite direction.
Since the input X is usually produced by a preceding linear operation (such as a linear layer or layer norm), the smoothing factor can easily be fused into the parameters of the previous layer offline, so no extra scaling kernel call is needed. In other cases, such as when the input comes from a residual add, an extra scaling can be added to the residual branch.
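The equivalence behind smoothing can be written as Y = (X diag(s)^-1)(diag(s) W) = XW. The sketch below checks this numerically and also shows the offline fusion into a preceding linear layer. The layer shapes, the variable names and the choice of s as each channel's absolute maximum are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical preceding linear layer that produces the input X of the layer to quantize.
H = rng.normal(size=(4, 6))          # 4 tokens, 6 features
W_prev = rng.normal(size=(6, 8))
b_prev = rng.normal(size=8)
W_prev[:, 2] *= 100.0                # makes input channel 2 of X an outlier channel
X = H @ W_prev + b_prev              # activations: 4 tokens x 8 input channels

W = rng.normal(size=(8, 3))          # weights of the layer to quantize
s = np.max(np.abs(X), axis=0)        # illustrative per-channel factor, shape (8,)

# Smoothing: divide activation channel j by s_j, multiply weight row j by s_j.
X_smooth = X / s                     # X diag(s)^-1
W_scaled = W * s[:, None]            # diag(s) W
Y_smooth = X_smooth @ W_scaled       # same output as X @ W, up to float rounding

# Offline fusion: fold 1/s into the previous layer so it emits X_smooth directly.
X_fused = H @ (W_prev / s) + b_prev / s
```

With this particular s every smoothed activation channel has a maximum magnitude of 1, so the outlier channel no longer dominates the quantization range, while the layer output is unchanged.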
Transfer quantization difficulty from activations to weights
The goal of smoothing is to choose a per-channel smoothing factor s such that the activations, once divided by s, are easier to quantize. To reduce the quantization error, the effective quantization bits of all channels should be increased, and the total number of effective bits is largest when all channels have the same maximum magnitude.
The most direct choice is therefore sj = max(|Xj|), the maximum magnitude of each input channel, which ensures that after division all activation channels have the same maximum value and are thus easier to quantize.
Note, however, that the activation range is dynamic and differs between input samples, so the researchers use calibration samples from the pre-training dataset to estimate the scale of each activation channel.
Because this choice pushes all of the quantization difficulty onto the weights, the weight quantization error becomes very large and accuracy drops sharply.
Conversely, choosing sj = 1/ max(|Wj|) pushes all of the quantization difficulty from the weights onto the activations, and model performance is again poor because of the excessive activation quantization error.
The quantization difficulty therefore has to be split between weights and activations so that both are easy to quantize.
The researchers introduce a hyperparameter, the migration strength α, to control how much difficulty is moved from activations to weights: sj = max(|Xj|)^α / max(|Wj|)^(1−α).
For most models, such as OPT and BLOOM, α=0.5 is a good balance point that splits the quantization difficulty evenly, especially when the same quantizer is used for weights and activations. With α=0.5, this formula ensures that the weights and activations of corresponding channels have similar maximum values and thus share the same quantization difficulty.
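A small experiment along these lines can be sketched as follows. It computes sj = max(|Xj|)^α / max(|Wj|)^(1−α) and compares the output error of simulated per-tensor W8A8 with and without smoothing. The synthetic data, the function names and the float "fake quantization" that stands in for a real INT8 GEMM are all assumptions, and the same tensor serves as both calibration and evaluation data, which real deployments would not do.

```python
import numpy as np

def smoothing_factor(X, W, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one factor per input channel j."""
    act_max = np.max(np.abs(X), axis=0)   # per input channel, over tokens
    w_max = np.max(np.abs(W), axis=1)     # per input channel, over output channels
    return act_max**alpha / w_max**(1.0 - alpha)

def fake_quant_per_tensor(x):
    """Round to INT8 levels with one step size for the whole tensor, then map back."""
    scale = np.max(np.abs(x)) / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def w8a8_error(X, W, alpha=None):
    """Mean absolute output error of per-tensor W8A8; alpha=None skips smoothing."""
    s = np.ones(X.shape[1]) if alpha is None else smoothing_factor(X, W, alpha)
    X_s, W_s = X / s, W * s[:, None]
    Y_q = fake_quant_per_tensor(X_s) @ fake_quant_per_tensor(W_s)
    return float(np.mean(np.abs(Y_q - X @ W)))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))   # synthetic activations: 64 tokens x 32 channels
X[:, 5] *= 100.0                # a fixed outlier channel, roughly 100x larger
W = rng.normal(size=(32, 16))

s = smoothing_factor(X, W, alpha=0.5)
err_naive = w8a8_error(X, W)               # plain W8A8, no smoothing
err_smooth = w8a8_error(X, W, alpha=0.5)   # W8A8 after smoothing
print(f"naive: {err_naive:.3f}  smoothed: {err_smooth:.3f}")
```

Setting `alpha=1.0` or `alpha=0.0` in `w8a8_error` reproduces the two extreme choices described above, which makes it easy to see how the error moves between the weight side and the activation side.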
For some other models with relatively large activation outliers, such as GLM-130B, which has about 30% outliers and is therefore harder for activation quantization, a larger α (such as 0.75) can be chosen to move more of the quantization difficulty onto the weights.
SmoothQuant applied to the Transformer block
Linear layers account for most of the parameters and computation of an LLM. By default, SmoothQuant smooths the input activations of all linear layers in the Transformer and quantizes the linear layers with W8A8; it also quantizes the BMM operators in the attention computation.
In this scheme, the inputs and weights of compute-intensive operators, such as the linear layers and the BMMs in the attention layers, are quantized to INT8, while the activations of other lightweight element-wise operations, such as Softmax and LayerNorm, are kept in FP16. This design helps balance accuracy and inference efficiency.
Experimental part
The researchers selected three large language models to evaluate SmoothQuant, namely OPT, BLOOM and GLM-130B, and used seven zero-shot tasks, including LAMBADA, HellaSwag, PIQA, WinoGrande, OpenBookQA, RTE and COPA.
The experimental results show that SmoothQuant can handle the quantization of very large LLMs, whose activations are harder to quantize.
SmoothQuant matches FP16 accuracy on all evaluation datasets, while the W8A8, ZeroQuant and Outlier Suppression baselines produce almost random results. SmoothQuant can also losslessly quantize all open LLMs with more than 100B parameters.
SmoothQuant's O1 and O2 levels successfully preserve floating-point accuracy, while level O3 (per-tensor static) lowers average accuracy by 0.8%, probably because of the difference between the statically collected statistics and the activation statistics of the real evaluation samples.
Nonetheless, SmoothQuant-O1 matches the accuracy of FP16, while SmoothQuant-O3 reduces accuracy by only 1%, which is significantly better than the baselines.
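The suspected cause of the O3 accuracy drop, a mismatch between statically collected statistics and the statistics of real inputs, can be illustrated with a toy comparison of static and dynamic per-tensor step sizes. The data below is synthetic and the variable names are hypothetical; this is only a sketch of the mechanism, not a reproduction of the paper's setup.

```python
import numpy as np

def int8_fake_quant(x, scale):
    """Round to INT8 levels with the given step size, clipping out-of-range values."""
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
calib = rng.normal(size=(256, 16))             # hypothetical calibration activations
eval_batch = 1.5 * rng.normal(size=(32, 16))   # evaluation batch with a wider range

# Static: the step size is fixed offline from calibration data and reused for every input.
static_scale = np.max(np.abs(calib)) / 127.0
# Dynamic: the step size is recomputed at runtime from the tensor actually being quantized.
dynamic_scale = np.max(np.abs(eval_batch)) / 127.0

q_static = int8_fake_quant(eval_batch, static_scale)
q_dynamic = int8_fake_quant(eval_batch, dynamic_scale)

# Values beyond the calibrated range are clipped under static quantization.
clipped = np.abs(eval_batch) > 127 * static_scale
print("clipped values under static scale:", int(clipped.sum()))
print("mean abs error (static): ", float(np.mean(np.abs(q_static - eval_batch))))
print("mean abs error (dynamic):", float(np.mean(np.abs(q_dynamic - eval_batch))))
```

Static quantization avoids computing statistics at runtime, which is what makes it the fastest option, at the cost of depending on how well the calibration data represents the inputs seen later.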
SmoothQuant is effective not only for very large LLMs with more than 100B parameters but also gives stable results on smaller LLMs: it works across all sizes of OPT models and matches FP16 accuracy with INT8 quantization.
To demonstrate the speedup and memory savings of SmoothQuant-O3 integrated into PyTorch and FasterTransformer, the researchers measured the end-to-end latency of generating all hidden states for a batch of 4 sentences in one pass, that is, the latency of the context stage, and recorded the peak GPU memory usage during this process.
Because Huggingface lacks support for model parallelism, the researchers measured SmoothQuant's PyTorch implementation on a single GPU only, so OPT-6.7B, OPT-13B and OPT-30B were selected for evaluation.
In the FasterTransformer library, SmoothQuant works seamlessly with the tensor parallelism algorithm, so the researchers ran single-GPU and multi-GPU benchmarks on OPT-13B, OPT-30B, OPT-66B and OPT-175B.
Experiments on an NVIDIA A100 80GB GPU server show that, with the PyTorch implementation, SmoothQuant is consistently better than the FP16 baseline in both inference latency and peak memory usage, reaching a 1.51x speedup on OPT-30B at a sequence length of 256. There is also a trend that the larger the model, the more pronounced the speedup. LLM.int8(), by contrast, is almost always slower than the FP16 baseline, owing to the large overhead of its mixed-precision activation representation.
In terms of memory, both SmoothQuant and LLM.int8() nearly halve the memory usage of the FP16 model, with SmoothQuant saving slightly more because it uses INT8 GEMMs throughout.
Compared with FasterTransformer's FP16 implementation of OPT, SmoothQuant-O3 further reduces the execution latency of OPT-13B and OPT-30B on a single GPU, by up to 1.56 times.
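The "roughly half the memory" figure follows largely from storage width: FP16 uses 2 bytes per parameter and INT8 uses 1. The back-of-the-envelope sketch below uses nominal parameter counts taken from the model names and counts weights only, ignoring activations, caches and quantization scales, so measured peak usage will differ.

```python
BYTES_PER_PARAM = {"FP16": 2, "INT8": 1}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Nominal parameter counts inferred from the model names (an approximation).
models = {"OPT-13B": 13e9, "OPT-30B": 30e9, "OPT-66B": 66e9, "OPT-175B": 175e9}

for name, n in models.items():
    fp16 = weight_memory_gb(n, "FP16")
    int8 = weight_memory_gb(n, "INT8")
    print(f"{name}: FP16 ~{fp16:.0f} GB, INT8 ~{int8:.0f} GB")
```

Under these assumptions the FP16 weights of OPT-175B alone take about 350 GB, more than four 80 GB GPUs can hold, while the INT8 weights take about 175 GB.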