ICLR 2024 Spotlight | All-round low-bit quantization of large language model weights and activations, now integrated into a commercial app
Model quantization is a key technique for model compression and acceleration. By quantizing model weights and activation values to low bit widths, it lets a model occupy less memory and run inference faster. For large language models with massive parameter counts, quantization matters even more. For example, the 175B parameters of GPT-3 consume about 350GB of memory when loaded in FP16, requiring at least five 80GB A100 GPUs.
But if the weights of GPT-3 can be compressed to 3 bits, a single A100-80GB is enough to hold all of the model weights.
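As a rough sanity check on these figures, the weight footprint is simply the parameter count multiplied by the bits per parameter. The small script below is illustrative only and ignores activations, the KV cache and runtime overhead:

```python
# Back-of-the-envelope memory footprint of model weights at different bit widths.
# Ignores activations, KV cache and runtime overhead; numbers are approximate.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB."""
    return num_params * bits_per_param / 8 / 1e9

gpt3_params = 175e9
for bits in (16, 4, 3, 2):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(gpt3_params, bits):6.1f} GB")

# 16-bit weights:  350.0 GB  -> needs roughly five A100-80GB GPUs
#  3-bit weights:   65.6 GB  -> fits on a single A100-80GB
```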
Existing post-training quantization algorithms for large language models face an obvious challenge: they rely on manually set quantization parameters and lack a corresponding optimization process. As a result, they often suffer performance degradation under low-bit quantization. Although quantization-aware training can effectively determine the optimal quantization configuration, it requires additional training cost and data. For large language models, whose computation is already enormous, this makes applying quantization-aware training especially difficult.
This raises the question: can we achieve the performance of quantization-aware training while retaining the time and data efficiency of post-training quantization?
To address the problem of optimizing quantization parameters in post-training quantization of large language models, researchers from the Shanghai Artificial Intelligence Laboratory, the University of Hong Kong and the Chinese University of Hong Kong proposed "OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models". The algorithm not only supports quantizing both weights and activations of large language models, but also adapts to a variety of quantization bit-width settings.
arXiv paper address: https://arxiv.org/abs/2308.13137
OpenReview paper address: https://openreview.net/forum?id=8Wuvhh0LYW
Code address: https://github.com/OpenGVLab/OmniQuant
Method
As shown in the figure above, OmniQuant is a differentiable quantization technique for large language models (LLMs), supporting both weight-only quantization and joint weight-activation quantization. It delivers high-performance quantized models while retaining the training-time efficiency and data efficiency of post-training quantization. For example, OmniQuant can update the quantization parameters of the LLaMA-7B to LLaMA-70B models within 1 to 16 hours on a single A100-40GB GPU.
To achieve this, OmniQuant adopts a block-wise quantization error minimization framework. On top of it, OmniQuant introduces two novel strategies that add learnable quantization parameters: learnable weight clipping (LWC), which reduces the difficulty of quantizing weights, and a learnable equivalent transformation (LET), which further shifts the quantization challenge from activation values to weights.
In addition, all learnable parameters introduced by OmniQuant can be fused and eliminated once quantization is complete, and the quantized model can be deployed on multiple platforms, including GPU, Android and iOS, using existing tools.
Block-wise quantization error minimization
OmniQuant proposes a new optimization pipeline that minimizes the block-wise quantization error and optimizes the additional quantization parameters in a differentiable way. The optimization objective is formulated as follows:

$$\arg\min_{\Theta_1, \Theta_2} \left\| \mathcal{F}(\mathbf{W}, \mathbf{X}) - \mathcal{F}\big(Q_w(\mathbf{W}; \Theta_1, \Theta_2),\ Q_a(\mathbf{X}; \Theta_2)\big) \right\|^2$$

where F denotes the mapping function of a transformer block in the LLM, W and X are the full-precision weights and activations, Q_w and Q_a are the weight and activation quantizers respectively, and Θ₁ and Θ₂ are the quantization parameters of learnable weight clipping (LWC) and the learnable equivalent transformation (LET) respectively. OmniQuant applies block-wise quantization, fully optimizing the parameters of one transformer block before moving on to the next.
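A minimal PyTorch-style sketch of this objective might look as follows. Here `fp_block` and `q_block` are placeholders for a frozen full-precision transformer block and its quantizer-wrapped copy; this is an illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def block_quantization_loss(fp_block, q_block, x):
    """Block-wise quantization error: compare the output of the full-precision
    transformer block with the output of its quantized counterpart on the same
    calibration input x.

    fp_block: frozen full-precision block, computing F(W, X)
    q_block:  copy of the block whose weights and activations pass through the
              learnable quantizers, computing F(Q_w(W; Θ1, Θ2), Q_a(X; Θ2))
    """
    with torch.no_grad():
        y_fp = fp_block(x)      # target: full-precision block output
    y_q = q_block(x)            # output of the quantized block
    return F.mse_loss(y_q, y_fp)
```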
Learnable Weight Clipping (LWC)
An equivalent transformation migrates magnitude between model weights and activation values. Because the learnable equivalent transformation adopted by OmniQuant causes the weight distribution to keep changing during optimization, previous methods that directly learn a weight clipping threshold [1,2] are only suitable when the weight distribution does not change drastically; otherwise they are difficult to converge. For this reason, instead of directly learning the clipping threshold, LWC optimizes the clipping strength as follows:
$$\mathbf{W}_q = \mathrm{clamp}\left(\left\lfloor \frac{\mathbf{W}}{h} \right\rceil + z,\ 0,\ 2^N - 1\right), \quad h = \frac{\gamma \max(\mathbf{W}) - \beta \min(\mathbf{W})}{2^N - 1}, \quad z = -\left\lfloor \frac{\beta \min(\mathbf{W})}{h} \right\rceil$$

where ⌊⋅⌉ denotes the rounding operation, N is the target bit width, W_q and W are the quantized and full-precision weights respectively, h is the normalization factor of the weights, and z is the zero point. The clamp operation limits the quantized value to the range of N-bit integers, i.e. [0, 2^N − 1]. In the formula above, γ and β are the learnable clipping strengths of the upper and lower bounds of the weights respectively. Therefore, Θ₁ = {γ, β} in the optimization objective.
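The following is a simplified per-tensor PyTorch sketch of LWC as read from the formula above, using a sigmoid to keep γ and β in (0, 1) and a straight-through estimator for rounding. It is an illustration rather than the released implementation:

```python
import torch
import torch.nn as nn

def round_ste(x: torch.Tensor) -> torch.Tensor:
    """Rounding with a straight-through gradient estimator."""
    return (x.round() - x).detach() + x

class LearnableWeightClipping(nn.Module):
    """Fake-quantizes weights with learnable clipping strengths gamma/beta,
    following W_q = clamp(round(W / h) + z, 0, 2^N - 1)."""

    def __init__(self, n_bits: int = 4):
        super().__init__()
        self.n_bits = n_bits
        # Raw parameters; sigmoid maps them into (0, 1).
        # Init near 1.0 (sigmoid(4) ~ 0.98) so clipping starts almost disabled.
        self.gamma = nn.Parameter(torch.tensor(4.0))  # upper-bound strength
        self.beta = nn.Parameter(torch.tensor(4.0))   # lower-bound strength

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        qmax = 2 ** self.n_bits - 1
        w_max = torch.sigmoid(self.gamma) * w.max()   # clipped upper bound
        w_min = torch.sigmoid(self.beta) * w.min()    # clipped lower bound
        h = (w_max - w_min) / qmax                    # normalization (step) factor
        z = round_ste(-w_min / h)                     # zero point
        w_q = torch.clamp(round_ste(w / h) + z, 0, qmax)
        return (w_q - z) * h                          # de-quantize (fake quantization)
```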
Learnable Equivalent Transformation (LET)
In addition to LWC, which optimizes the clipping threshold to make weights more amenable to quantization, OmniQuant further reduces the difficulty of quantizing activation values through LET. Considering that outliers in LLM activations reside in specific channels, previous methods such as SmoothQuant [3] and Outlier Suppression+ [4] transfer the difficulty of quantization from activations to weights through mathematically equivalent transformations.
However, equivalent-transformation parameters obtained by manual selection or greedy search limit the performance of the quantized model. Thanks to the block-wise quantization error minimization, OmniQuant's LET can determine the optimal equivalent-transformation parameters in a differentiable way. Inspired by Outlier Suppression+ [4], channel-wise scaling and channel-wise shifting are used to manipulate the activation distribution, providing an effective solution to the outlier problem in activation values. Specifically, OmniQuant explores equivalent transformations in both linear layers and attention operations.
Equivalent transformation in a linear layer: the linear layer takes an input token sequence X ∈ R^{T×C_in}, where T is the token length, and computes its product with the weight matrix W ∈ R^{C_in×C_out} plus the bias vector B ∈ R^{1×C_out}. The mathematically equivalent expression of the linear layer is:

$$\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{B} = \left[(\mathbf{X} - \boldsymbol{\delta}) \oslash \mathbf{s}\right]\left[\mathbf{s} \odot \mathbf{W}\right] + \left[\mathbf{B} + \boldsymbol{\delta}\mathbf{W}\right]$$
Here Y denotes the output, s ∈ R^{1×C_in} and δ ∈ R^{1×C_in} are the channel-wise scaling and shifting parameters respectively, X̃ = (X − δ) ⊘ s, W̃ = s ⊙ W and B̃ = B + δW are the equivalent activation, weight and bias respectively, and ⊘ and ⊙ denote element-wise division and multiplication. Through this equivalent transformation, the activations are converted into a form that is easier to quantize, at the cost of making the weights harder to quantize. In this sense, LWC can improve the quantization performance achieved by LET, because it makes the weights easier to quantize. Finally, OmniQuant quantizes the transformed activations and weights as follows:

$$\mathbf{Y} = Q_a(\tilde{\mathbf{X}})\, Q_w(\tilde{\mathbf{W}}) + \tilde{\mathbf{B}}$$
where Q_a is an ordinary MinMax quantizer and Q_w is the MinMax quantizer equipped with the proposed learnable weight clipping (LWC).
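A sketch of this transformation for a linear layer is given below. `LETLinear` and `minmax_fake_quant` are illustrative names, and a plain per-tensor MinMax quantizer stands in for both Q_a and Q_w (in OmniQuant proper, Q_w would use LWC as above):

```python
import torch
import torch.nn as nn

def minmax_fake_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Plain per-tensor MinMax fake quantization (stand-in for Q_a)."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-5) / qmax
    zero = (-x.min() / scale).round()
    x_q = torch.clamp((x / scale).round() + zero, 0, qmax)
    return (x_q - zero) * scale

class LETLinear(nn.Module):
    """Linear layer with a learnable equivalent transformation:
    Y = X W + B = [(X - delta) / s] [s * W] + [B + delta W]."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        c_in = linear.in_features
        self.s = nn.Parameter(torch.ones(1, c_in))       # channel-wise scaling
        self.delta = nn.Parameter(torch.zeros(1, c_in))  # channel-wise shifting

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w, b = self.linear.weight, self.linear.bias      # w: [C_out, C_in]
        x_eq = (x - self.delta) / self.s                 # easier-to-quantize activation
        w_eq = w * self.s                                # scaling folded into the weights
        b_eq = (b if b is not None else 0) + self.delta @ w.t()
        x_q = minmax_fake_quant(x_eq)                    # Q_a
        w_q = minmax_fake_quant(w_eq)                    # Q_w (LWC in OmniQuant proper)
        return x_q @ w_q.t() + b_eq
```

After calibration, s and δ can be folded into the model weights, so the transformation leaves no extra computation at inference time, which matches the fusion property described in the article.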
Equivalent transformation in attention operations: besides linear layers, attention operations account for a large share of the computation in LLMs. Moreover, the autoregressive inference mode of LLMs requires storing a key-value (KV) cache for every token, which leads to huge memory demands for long sequences. Therefore, OmniQuant also considers quantizing the Q/K/V matrices in the self-attention computation to low bits. Specifically, the learnable equivalent transformation in self-attention can be written as:

$$\mathbf{P} = \mathrm{Softmax}(\mathbf{Q}\mathbf{K}^{\top}) = \mathrm{Softmax}\big((\mathbf{Q} \oslash s_a)(s_a \odot \mathbf{K}^{\top})\big)$$
where s_a is the scaling factor. The quantized self-attention computation is expressed as P = Softmax(Q_a(Q̃) Q_a(K̃ᵀ)). Here OmniQuant again uses the MinMax quantization scheme as Q_a to quantize the Q̃/K̃ matrices. Therefore, Θ₂ = {δ, s, s_a} is ultimately optimized in the objective function.
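A corresponding sketch for the attention scores is shown below, again with illustrative names and a plain MinMax fake quantizer standing in for Q_a. The scaling of Q and K cancels mathematically, so the full-precision result is unchanged while the quantized matrices become easier to represent:

```python
import torch
import torch.nn as nn

def minmax_fake_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Per-tensor MinMax fake quantization (same helper as in the linear-layer sketch)."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-5) / qmax
    zero = (-x.min() / scale).round()
    x_q = torch.clamp((x / scale).round() + zero, 0, qmax)
    return (x_q - zero) * scale

class LETAttentionScores(nn.Module):
    """Scales Q down and K up by a learnable per-channel factor s_a before
    low-bit quantization, so that Softmax(Q K^T) is mathematically unchanged."""

    def __init__(self, head_dim: int, n_bits: int = 4):
        super().__init__()
        self.s_a = nn.Parameter(torch.ones(1, head_dim))  # learnable scaling factor
        self.n_bits = n_bits

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: [..., tokens, head_dim]
        q_eq = q / self.s_a                            # migrate magnitude out of Q ...
        k_eq = k * self.s_a                            # ... and into K (equivalent overall)
        q_q = minmax_fake_quant(q_eq, self.n_bits)     # Q_a applied to the Q matrix
        k_q = minmax_fake_quant(k_eq, self.n_bits)     # Q_a applied to the K matrix
        scores = q_q @ k_q.transpose(-2, -1)           # Q_a(Q~) Q_a(K~^T)
        return torch.softmax(scores, dim=-1)
```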
Pseudocode
OmniQuant's pseudocode is shown in the figure above. Note that the extra parameters introduced by LWC and LET can be eliminated after the model is quantized, i.e. OmniQuant introduces no additional overhead into the quantized model, so it can be used directly with existing quantization deployment tools.
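Since the pseudocode figure is not reproduced here, the following is a rough, hypothetical outline of the block-wise calibration loop it describes. It reuses `block_quantization_loss` from the objective sketch above and assumes `quant_blocks` are copies of the transformer blocks in which only the LWC/LET parameters are trainable; hyperparameters are placeholders:

```python
import torch

def omniquant_calibrate(fp_blocks, quant_blocks, calib_inputs, iters=20, lr=1e-2):
    """Block-wise calibration sketch: for each transformer block, train only the
    LWC/LET parameters of its quantized copy to match the full-precision output,
    then pass the result on as input to the next block.

    fp_blocks:    list of frozen full-precision transformer blocks
    quant_blocks: matching copies whose linear/attention layers are wrapped with
                  the LWC/LET modules sketched above (only those parameters
                  have requires_grad=True)
    """
    x = calib_inputs
    for block, block_q in zip(fp_blocks, quant_blocks):
        trainable = [p for p in block_q.parameters() if p.requires_grad]
        opt = torch.optim.AdamW(trainable, lr=lr)
        for _ in range(iters):
            # In practice the full-precision target would be cached rather than recomputed.
            loss = block_quantization_loss(block, block_q, x)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            x = block_q(x)  # the next block is calibrated on this block's output
    return quant_blocks
```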
Experimental performance
The figure above shows OmniQuant's weight-only quantization results on the LLaMA models; detailed results for the OPT models can be found in the original paper. As can be seen, OmniQuant consistently outperforms previous weight-only LLM quantization methods across various model families (OPT, LLaMA-1, LLaMA-2) and diverse quantization configurations (including W2A16, W2A16g128, W2A16g64, W3A16, W3A16g128, W4A16 and W4A16g128). These experiments also demonstrate OmniQuant's versatility and its ability to adapt to a variety of quantization configurations. For example, while AWQ [5] is particularly effective for group-wise quantization, OmniQuant shows superior performance in both channel-wise and group-wise quantization. Moreover, OmniQuant's performance advantage becomes even more pronounced as the number of quantization bits decreases.
In the setting where both weights and activations are quantized, the experiments focus on W6A6 and W4A4 quantization. W8A8 quantization is excluded because the earlier SmoothQuant already achieves nearly lossless W8A8 quantization compared with full-precision models. The figure above shows OmniQuant's weight-activation quantization results on the LLaMA models. Notably, OmniQuant significantly improves the average W4A4 accuracy across different models, with gains ranging from 4.99% to 11.80%. In particular, on LLaMA-7B, OmniQuant even surpasses the recent quantization-aware training method LLM-QAT [6] by a significant margin of 6.22%. This improvement demonstrates the effectiveness of the additional learnable parameters, which turn out to be more beneficial than the global weight updates employed in quantization-aware training.
Meanwhile, models quantized with OmniQuant can be seamlessly deployed with MLC-LLM [7]. The figure above shows the memory requirements and inference speed of the quantized LLaMA family models on an NVIDIA A100-80G.
Weights Memory (WM) denotes the storage of the quantized weights, while Running Memory (RM) denotes the memory used during inference; the latter is higher because certain activation values are retained. Inference speed is measured by generating 512 tokens. Clearly, the quantized models significantly reduce memory usage compared with the 16-bit full-precision model. In addition, W4A16g128 and W2A16g128 quantization nearly double the inference speed.
It is worth noting that MLC-LLM [7] also supports deploying OmniQuant-quantized models on other platforms, including Android and iOS phones. As shown in the figure above, the recent app Private LLM uses the OmniQuant algorithm to achieve memory-efficient deployment of LLMs on iPhone, iPad, macOS and other platforms.
Summary
OmniQuant is an advanced quantization algorithm that pushes large language models to low-bit formats. Its core principle is to keep the original full-precision weights intact while adding learnable quantization parameters. It uses learnable weight clipping and a learnable equivalent transformation to improve the quantization compatibility of weights and activations. Although it incorporates gradient updates, OmniQuant maintains training-time and data efficiency comparable to existing PTQ methods. In addition, OmniQuant is hardware-friendly, since its added learnable parameters can be fused into the original model without any extra overhead.
References
[1] PACT: Parameterized clipping activation for quantized neural networks.
[2] LSQ: Learned step size quantization.
[3] SmoothQuant: Accurate and efficient post-training quantization for large language models.
[4] Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.
[5] AWQ: Activation-aware weight quantization for LLM compression and acceleration.
[6] LLM-QAT: Data-free quantization aware training for large language models.
[7] MLC-LLM: https://github.com/mlc-ai/mlc-llm