
ICLR 2024 Spotlight | All-round low-bit differentiable quantization of large language model weights and activations, now integrated into a commercial app

PHPz | 2024-03-07

Model quantization is a key technique for model compression and acceleration. It quantizes model weights and activation values to low bit widths, so that the model occupies less memory and runs inference faster. For large language models with massive numbers of parameters, quantization is even more important. For example, the 175B parameters of GPT-3 consume 350 GB of memory when loaded in FP16, requiring at least five 80 GB A100 GPUs.

But if the weights of GPT-3 can be compressed to 3 bits, then a single A100-80GB is enough to hold all of the model weights.
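As a quick back-of-the-envelope check of these numbers (a rough sketch; real deployments also need memory for activations, the KV cache and quantization metadata such as scales and zero points):

```python
# Back-of-the-envelope weight-memory estimate for a 175B-parameter model.
# Only the weights are counted; activations, KV cache and quantization
# metadata are ignored here.

def weight_memory_gb(num_params: float, bits: int) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


params = 175e9
for bits in (16, 4, 3):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(params, bits):6.1f} GB")

# 16-bit weights:  350.0 GB  -> needs ~5 x A100-80GB for the weights alone
#  4-bit weights:   87.5 GB
#  3-bit weights:   65.6 GB  -> fits on a single A100-80GB
```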

Existing post-training quantization (PTQ) algorithms for large language models face an obvious challenge: they rely on hand-crafted quantization parameters and lack a corresponding optimization process, so they often suffer performance degradation at low bit widths. Quantization-aware training (QAT) is effective at finding the optimal quantization configuration, but it requires extra training cost and data. For large language models, whose computation is already enormous, this makes applying QAT to quantization especially difficult.

This begs the question: Can we achieve the performance of quantization-aware training while maintaining the time and data efficiency of post-training quantization?

To address the problem of optimizing quantization parameters in post-training quantization of large language models, researchers from the Shanghai Artificial Intelligence Laboratory, the University of Hong Kong and the Chinese University of Hong Kong proposed "OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models". The algorithm supports quantization of both weights and activations in large language models and adapts to a variety of quantization bit settings.


arXiv paper address: https://arxiv.org/abs/2308.13137

OpenReview paper address: https://openreview.net/forum?id=8Wuvhh0LYW

Code address: https://github.com/OpenGVLab/OmniQuant

Method

[Figure: overview of the OmniQuant framework for weight-only and weight-activation quantization]

As shown in the figure above, OmniQuant is a differentiable quantization technique for large language models (LLMs) that supports both weight-only quantization and simultaneous weight-activation quantization. It achieves high-performance quantized models while retaining the training-time and data efficiency of post-training quantization. For example, OmniQuant can update the quantization parameters of the LLaMA-7B to LLaMA-70B models within 1 to 16 hours on a single A100-40GB GPU.

To achieve this goal, OmniQuant adopts a block-wise quantization error minimization framework. On top of this, it designs two novel strategies that introduce learnable quantization parameters: learnable weight clipping (LWC), which reduces the difficulty of quantizing weights, and a learnable equivalent transformation (LET), which further shifts the quantization challenge from activations to weights.

In addition, all learnable parameters introduced by OmniQuant can be fused and eliminated after quantization is completed, and the quantized model can be deployed with existing tools on multiple platforms, including GPU, Android, iOS, etc.

Block-wise quantization error minimization

OmniQuant proposes a new optimization procedure that minimizes the block-wise quantization error and optimizes the additional quantization parameters in a differentiable way. The optimization objective is formulated as follows:

$$\arg\min_{\Theta_1,\Theta_2}\ \big\lVert\, F(\mathbf{W},\mathbf{X}) - F\big(Q_w(\mathbf{W};\Theta_1,\Theta_2),\, Q_a(\mathbf{X};\Theta_2)\big) \big\rVert^{2}$$

where F represents the mapping function of a transformer block in the LLM, W and X are its full-precision weights and activations, Q_w(·) and Q_a(·) are the weight and activation quantizers, and Θ1 and Θ2 are the learnable parameters of learnable weight clipping (LWC) and the learnable equivalent transformation (LET), respectively. OmniQuant performs block-wise quantization: it optimizes and quantizes the parameters of one transformer block before moving on to the next.
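To make the objective concrete, here is a minimal PyTorch-style sketch of the block-wise loss, with a single linear layer standing in for the transformer block F and with placeholder quantizer callables (illustrative names, not the actual OmniQuant code):

```python
import torch
from torch.nn.functional import linear


def block_quant_loss(x, W, b, quant_w, quant_a):
    """Block-wise objective || F(W, X) - F(Q_w(W), Q_a(X)) ||^2, illustrated
    with a single linear layer standing in for the transformer block F.

    quant_w / quant_a are differentiable fake-quantizers; their learnable
    parameters (Theta_1, Theta_2) receive gradients through this loss."""
    with torch.no_grad():
        y_fp = linear(x, W, b)                 # full-precision reference output
    y_q = linear(quant_a(x), quant_w(W), b)    # forward pass with fake-quantized tensors
    return torch.mean((y_fp - y_q) ** 2)       # squared block-wise quantization error
```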

Learnable Weight Clipping (LWC)

The equivalent transformation migrates magnitudes between model weights and activations. Because the learnable equivalent transformation used in OmniQuant keeps changing the weight distribution during optimization, previous methods that directly learn weight clipping thresholds [1, 2] are only suitable when the weight distribution does not change drastically; otherwise they are hard to converge. To address this, instead of directly learning clipping thresholds, LWC optimizes the clipping strength as follows:

$$\mathbf{W}_q=\operatorname{clamp}\!\left(\Big\lfloor \frac{\mathbf{W}}{h}\Big\rceil + z,\ 0,\ 2^{N}-1\right),\qquad h=\frac{\gamma\max(\mathbf{W})-\beta\min(\mathbf{W})}{2^{N}-1},\qquad z=-\Big\lfloor \frac{\beta\min(\mathbf{W})}{h}\Big\rceil$$

where ⌊·⌉ denotes the rounding operation, N is the target bit width, W_q and W are the quantized and full-precision weights respectively, h is the normalization factor (quantization step size) of the weights, and z is the zero point. The clamp operation limits the quantized values to the N-bit integer range [0, 2^N − 1]. In the formula above, γ ∈ [0, 1] and β ∈ [0, 1] are the learnable clipping strengths for the upper and lower bounds of the weights respectively (instantiated through a sigmoid), so the parameter set contributed to the optimization objective is Θ1 = {γ, β}.
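A minimal PyTorch sketch of this LWC fake-quantization step, assuming per-output-channel statistics, sigmoid-parameterized clipping strengths and straight-through rounding (names and defaults are illustrative):

```python
import torch


def round_ste(t: torch.Tensor) -> torch.Tensor:
    """Round to the nearest integer with a straight-through gradient estimator."""
    return (t.round() - t).detach() + t


def lwc_fake_quant(W: torch.Tensor, gamma_hat: torch.Tensor,
                   beta_hat: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Learnable Weight Clipping (sketch): gamma = sigmoid(gamma_hat) and
    beta = sigmoid(beta_hat) shrink the upper/lower quantization bounds.
    Uses per-output-channel statistics; W has shape (out_features, in_features)."""
    qmax = 2 ** n_bits - 1
    gamma = torch.sigmoid(gamma_hat)               # learnable upper clipping strength in (0, 1)
    beta = torch.sigmoid(beta_hat)                 # learnable lower clipping strength in (0, 1)

    w_max = W.amax(dim=1, keepdim=True)            # per-channel max
    w_min = W.amin(dim=1, keepdim=True)            # per-channel min

    h = (gamma * w_max - beta * w_min) / qmax + 1e-8   # step size (normalization factor)
    z = round_ste(-beta * w_min / h)                   # zero point

    W_int = torch.clamp(round_ste(W / h) + z, 0, qmax) # N-bit integer weights
    return h * (W_int - z)                             # de-quantized (fake-quant) weights
```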

Learnable Equivalent Transformation (LET)

In addition to LWC, which optimizes clipping thresholds to make the weights more quantization-friendly, OmniQuant further reduces the difficulty of quantizing activations through LET. Since outliers in LLM activations are concentrated in specific channels, previous methods such as SmoothQuant [3] and Outlier Suppression+ [4] transfer the difficulty of quantization from activations to weights through mathematically equivalent transformations.

However, equivalent-transformation parameters obtained by manual selection or greedy search limit the performance of the quantized model. Thanks to the block-wise quantization error minimization framework, OmniQuant's LET can determine the optimal equivalent-transformation parameters in a differentiable way. Inspired by Outlier Suppression+ [4], it uses channel-wise scaling and channel-wise shifting to manipulate the activation distribution, providing an effective solution to the activation-outlier problem. Specifically, OmniQuant explores equivalent transformations in linear layers and in the attention operation.

Equivalent transformation in a linear layer: a linear layer takes an input token sequence X ∈ R^{T×C_in}, where T is the token length, multiplies it by the weight matrix W ∈ R^{C_in×C_out}, and adds the bias vector B ∈ R^{1×C_out}. The mathematically equivalent expression of the linear layer is:

$$\mathbf{Y}=\mathbf{X}\mathbf{W}+\mathbf{B}=\big[(\mathbf{X}-\boldsymbol{\delta})\oslash\mathbf{s}\big]\cdot\big[\mathbf{s}\odot\mathbf{W}\big]+\big[\mathbf{B}+\boldsymbol{\delta}\mathbf{W}\big]=\tilde{\mathbf{X}}\tilde{\mathbf{W}}+\tilde{\mathbf{B}}$$

where Y is the output, s ∈ R^{1×C_in} and δ ∈ R^{1×C_in} are the channel-wise scaling and shifting parameters respectively, X̃, W̃ and B̃ are the equivalent activation, weight and bias respectively, and ⊘ and ⊙ denote element-wise division and multiplication. Through this equivalent transformation, the activations are converted into a form that is easier to quantize, at the cost of making the weights harder to quantize. In this sense LWC and LET complement each other: LWC makes the weights easier to quantize and thereby improves the quantization performance achievable with LET. Finally, OmniQuant quantizes the transformed activations and weights as follows:

$$\mathbf{Y}=Q_a(\tilde{\mathbf{X}})\,Q_w(\tilde{\mathbf{W}})+\tilde{\mathbf{B}}$$

where Q_a is an ordinary MinMax quantizer and Q_w is the MinMax quantizer equipped with the proposed learnable weight clipping (LWC).
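The following small numeric sketch (with illustrative shapes and names) checks that the transformation is indeed exact before quantization, and indicates where Q_a and Q_w would be applied afterwards:

```python
import torch


def let_linear(x, W, b, s, delta):
    """Learnable equivalent transformation of a linear layer (sketch).
    x: (T, C_in) activations, W: (C_in, C_out) weights, b: (C_out,) bias,
    s, delta: (C_in,) channel-wise scale and shift."""
    x_t = (x - delta) / s          # X~ = (X - delta) ./ s : outlier channels are tamed
    W_t = s.unsqueeze(1) * W       # W~ = s .* W           : scale absorbed into the weights
    b_t = b + delta @ W            # B~ = B + delta W      : shift absorbed into the bias
    return x_t, W_t, b_t


# Numeric check that the transformation is exact before quantization.
T, C_in, C_out = 4, 8, 16
x = torch.randn(T, C_in)
W = torch.randn(C_in, C_out)
b = torch.randn(C_out)
s = torch.rand(C_in) + 0.5         # channel-wise scaling (kept away from zero)
delta = torch.randn(C_in)          # channel-wise shifting

x_t, W_t, b_t = let_linear(x, W, b, s, delta)
assert torch.allclose(x @ W + b, x_t @ W_t + b_t, atol=1e-4)

# Afterwards OmniQuant computes  Y ~= Q_a(x_t) @ Q_w(W_t) + b_t,
# e.g. with a MinMax quantizer for Q_a and the LWC quantizer for Q_w.
```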

Equivalent transformation in the attention operation: besides linear layers, the attention operation also accounts for a large share of LLM computation. Moreover, the autoregressive inference mode of LLMs requires storing a key-value (KV) cache for each token, which leads to huge memory demands for long sequences. Therefore, OmniQuant also quantizes the Q/K/V matrices in the self-attention computation to low bits. Specifically, the learnable equivalent transformation in self-attention can be written as:

$$\mathbf{P}=\operatorname{Softmax}(\mathbf{Q}\mathbf{K}^{\top})=\operatorname{Softmax}\big((\mathbf{Q}\oslash\mathbf{s}_a)(\mathbf{s}_a\odot\mathbf{K}^{\top})\big)=\operatorname{Softmax}(\tilde{\mathbf{Q}}\tilde{\mathbf{K}}^{\top})$$

where s_a is the channel-wise scaling factor in self-attention. The quantized self-attention computation is then expressed as P = Softmax(Q_a(Q̃) Q_a(K̃)^⊤), where OmniQuant again uses a MinMax quantization scheme as Q_a to quantize the Q̃/K̃ matrices. In summary, the parameter set optimized in the objective is Θ2 = {δ, s, s_a}.
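Analogously, a tiny sketch (illustrative shapes) of the scaling applied to the query/key matrices, confirming that the attention scores are unchanged before quantization:

```python
import torch

# Sketch of the equivalent transformation inside self-attention:
# Softmax(Q K^T) == Softmax((Q ./ s_a) (s_a .* K)^T), with a per-channel scale s_a.
T, d = 4, 8
Q = torch.randn(T, d)
K = torch.randn(T, d)
s_a = torch.rand(d) + 0.5          # learnable channel-wise scaling factor

Q_t = Q / s_a                      # Q~ : outlier channels scaled down
K_t = s_a * K                      # K~ : scale absorbed into the keys
assert torch.allclose(Q @ K.T, Q_t @ K_t.T, atol=1e-4)

# In OmniQuant, Q~ and K~ are then quantized with a MinMax quantizer (Q_a)
# before the attention scores are computed.
```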

Pseudocode

[Figure: pseudocode of the OmniQuant algorithm]

The pseudocode of OmniQuant is shown in the figure above. Note that the extra parameters introduced by LWC and LET can be eliminated after the model is quantized; that is, OmniQuant introduces no additional overhead to the quantized model and can be used directly with existing quantization deployment tools.
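In code, the block-wise calibration described by the pseudocode can be outlined roughly as follows; the callables passed in are placeholders standing for LWC/LET initialization, the block-wise loss and the final parameter fusion, not the actual API of the OmniQuant repository:

```python
import torch


def calibrate_blockwise(blocks, calib_x, init_params, quant_loss, fuse_params,
                        n_steps=1000, lr=1e-2):
    """Outline of OmniQuant-style block-wise calibration (a sketch, not the
    repository's actual API). The callables are user-supplied placeholders:
      init_params(block)            -> list of learnable LWC/LET tensors (Theta_1, Theta_2)
      quant_loss(block, x, params)  -> scalar block-wise quantization error
      fuse_params(block, params)    -> folds the LET scales/shifts into the weights
    """
    x = calib_x
    for block in blocks:
        params = init_params(block)                # learnable clipping / transform parameters
        opt = torch.optim.AdamW(params, lr=lr)

        for _ in range(n_steps):                   # minimize the block-wise error
            loss = quant_loss(block, x, params)
            opt.zero_grad()
            loss.backward()
            opt.step()

        fuse_params(block, params)                 # extra parameters are eliminated here,
                                                   # so the deployed model has no overhead
        with torch.no_grad():
            x = block(x)                           # feed outputs to the next block
    return blocks
```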

Experimental performance

[Table: weight-only quantization results of OmniQuant on the LLaMA models]

The table above shows OmniQuant's weight-only quantization results on the LLaMA models; detailed results on more OPT models can be found in the original paper. As can be seen, OmniQuant consistently outperforms previous weight-only quantization methods across various LLM families (OPT, LLaMA-1, LLaMA-2) and diverse quantization configurations (including W2A16, W2A16g128, W2A16g64, W3A16, W3A16g128, W4A16 and W4A16g128). These experiments also demonstrate OmniQuant's versatility and its ability to adapt to a variety of quantization configurations. For example, while AWQ [5] is particularly effective for group-wise quantization, OmniQuant shows superior performance in both channel-wise and group-wise quantization. Additionally, as the number of quantization bits decreases, OmniQuant's performance advantages become even more apparent.

[Table: weight-activation quantization results (W6A6 / W4A4) of OmniQuant on the LLaMA models]

In the setting where both weights and activations are quantized, the experiments focus on W6A6 and W4A4 quantization. W8A8 is excluded because SmoothQuant already achieves almost lossless W8A8 quantization compared to the full-precision model. The table above shows OmniQuant's weight-activation quantization results on the LLaMA models. Notably, OmniQuant significantly improves the average accuracy of W4A4 quantization across different models, with gains ranging from 4.99% to 11.80%. On LLaMA-7B in particular, OmniQuant even surpasses the recent quantization-aware training method LLM-QAT [6] by a significant margin of 6.22%. This improvement demonstrates the effectiveness of introducing additional learnable parameters, which turns out to be more beneficial than the global weight tuning used in quantization-aware training.

[Table: weight memory, running memory and inference speed of quantized LLaMA models on an NVIDIA A100-80G]

Meanwhile, models quantized with OmniQuant can be seamlessly deployed with MLC-LLM [7]. The table above shows the memory requirements and inference speed of the quantized LLaMA-family models on an NVIDIA A100-80G.

Weights Memory (WM) denotes the storage of the quantized weights, while Running Memory (RM) denotes the memory used during inference; the latter is higher because certain activations are retained. Inference speed is measured by generating 512 tokens. It is clear that the quantized models significantly reduce memory usage compared to the 16-bit full-precision model. Furthermore, W4A16g128 and W2A16g128 quantization nearly double the inference speed.

[Figure: the Private LLM app deploying OmniQuant-quantized models on iPhone, iPad and macOS]

It is worth noting that MLC-LLM [7] also supports deploying OmniQuant-quantized models on other platforms, including Android and iOS phones. As shown in the figure above, the recent application Private LLM uses the OmniQuant algorithm to achieve memory-efficient LLM deployment on iPhone, iPad, macOS and other platforms.

Summary

OmniQuant is an advanced quantization algorithm for large language models that pushes quantization to low-bit formats. Its core principle is to keep the original full-precision weights unchanged while adding learnable quantization parameters. It uses learnable weight clipping and a learnable equivalent transformation to optimize the quantization-friendliness of weights and activations. Although it incorporates gradient updates, OmniQuant maintains training-time and data efficiency comparable to existing PTQ methods. Furthermore, OmniQuant ensures hardware compatibility, since its additional learnable parameters can be folded into the original model without any extra overhead.

References

[1] PACT: Parameterized Clipping Activation for Quantized Neural Networks.

[2] LSQ: Learned Step Size Quantization.

[3] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.

[4] Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling.

[5] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.

[6] LLM-QAT: Data-Free Quantization Aware Training for Large Language Models.

[7] MLC-LLM: https://github.com/mlc-ai/mlc-llm

