ICLR 2024 Spotlight | All-round low-bit quantization of large language model weights and activations, now integrated into a commercial app
Model quantization is a key technique for model compression and acceleration. By quantizing model weights and activation values to low bit widths, it lets a model occupy less memory and run inference faster. For large language models with massive parameter counts, quantization matters even more. For example, the 175B parameters of GPT-3 consume about 350GB of memory when loaded in FP16, requiring at least five 80GB A100 GPUs.
But if the weights of GPT-3 can be compressed to 3 bits, a single A100-80GB is enough to hold all of the model weights.
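As a rough sanity check on these figures, the weight footprint is simply the parameter count multiplied by the bits per parameter. The small script below is illustrative only and ignores activations, the KV cache and runtime overhead:

```python
# Back-of-the-envelope memory footprint of model weights at different bit widths.
# Ignores activations, KV cache and runtime overhead; numbers are approximate.

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB."""
    return num_params * bits_per_param / 8 / 1e9

gpt3_params = 175e9
for bits in (16, 4, 3, 2):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(gpt3_params, bits):6.1f} GB")

# 16-bit weights:  350.0 GB  -> needs roughly five A100-80GB GPUs
#  3-bit weights:   65.6 GB  -> fits on a single A100-80GB
```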
Existing post-training quantization algorithms for large language models face an obvious challenge: they rely on manually set quantization parameters and lack a corresponding optimization process. As a result, they often suffer performance degradation under low-bit quantization. Although quantization-aware training can effectively determine the optimal quantization configuration, it requires additional training cost and data. For large language models, whose computation is already enormous, this makes applying quantization-aware training especially difficult.
This raises the question: can we achieve the performance of quantization-aware training while retaining the time and data efficiency of post-training quantization?
To address the problem of optimizing quantization parameters in post-training quantization of large language models, researchers from the Shanghai Artificial Intelligence Laboratory, the University of Hong Kong and the Chinese University of Hong Kong proposed "OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models". The algorithm not only supports quantizing both weights and activations of large language models, but also adapts to a variety of quantization bit-width settings.
arXiv paper address: https://arxiv.org/abs/2308.13137
OpenReview paper address: https://openreview.net/forum?id=8Wuvhh0LYW
Code address: https://github.com/OpenGVLab/OmniQuant
Method
As shown in the figure above, OmniQuant is a differentiable quantization technique for large language models (LLMs), supporting both weight-only quantization and joint weight-activation quantization. It delivers high-performance quantized models while retaining the training-time efficiency and data efficiency of post-training quantization. For example, OmniQuant can update the quantization parameters of the LLaMA-7B to LLaMA-70B models within 1 to 16 hours on a single A100-40GB GPU.
To achieve this, OmniQuant adopts a block-wise quantization error minimization framework. On top of it, OmniQuant introduces two novel strategies that add learnable quantization parameters: learnable weight clipping (LWC), which reduces the difficulty of quantizing weights, and a learnable equivalent transformation (LET), which further shifts the quantization challenge from activation values to weights.
In addition, all learnable parameters introduced by OmniQuant can be fused and eliminated once quantization is complete, and the quantized model can be deployed on multiple platforms, including GPU, Android and iOS, using existing tools.
Block-wise quantization error minimization
OmniQuant proposes a new optimization pipeline that minimizes the block-wise quantization error and optimizes the additional quantization parameters in a differentiable way. The optimization objective is formulated as follows:

$$\arg\min_{\Theta_1, \Theta_2} \left\| \mathcal{F}(\mathbf{W}, \mathbf{X}) - \mathcal{F}\big(Q_w(\mathbf{W}; \Theta_1, \Theta_2),\ Q_a(\mathbf{X}; \Theta_2)\big) \right\|^2$$

where F denotes the mapping function of a transformer block in the LLM, W and X are the full-precision weights and activations, Q_w and Q_a are the weight and activation quantizers respectively, and Θ₁ and Θ₂ are the quantization parameters of learnable weight clipping (LWC) and the learnable equivalent transformation (LET) respectively. OmniQuant applies block-wise quantization, fully optimizing the parameters of one transformer block before moving on to the next.
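A minimal PyTorch-style sketch of this objective might look as follows. Here `fp_block` and `q_block` are placeholders for a frozen full-precision transformer block and its quantizer-wrapped copy; this is an illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def block_quantization_loss(fp_block, q_block, x):
    """Block-wise quantization error: compare the output of the full-precision
    transformer block with the output of its quantized counterpart on the same
    calibration input x.

    fp_block: frozen full-precision block, computing F(W, X)
    q_block:  copy of the block whose weights and activations pass through the
              learnable quantizers, computing F(Q_w(W; Θ1, Θ2), Q_a(X; Θ2))
    """
    with torch.no_grad():
        y_fp = fp_block(x)      # target: full-precision block output
    y_q = q_block(x)            # output of the quantized block
    return F.mse_loss(y_q, y_fp)
```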
Learnable Weight Clipping (LWC)
An equivalent transformation migrates magnitude between model weights and activation values. Because the learnable equivalent transformation adopted by OmniQuant causes the weight distribution to keep changing during optimization, previous methods that directly learn a weight clipping threshold [1,2] are only suitable when the weight distribution does not change drastically; otherwise they are difficult to converge. For this reason, instead of directly learning the clipping threshold, LWC optimizes the clipping strength as follows:
$$\mathbf{W}_q = \mathrm{clamp}\left(\left\lfloor \frac{\mathbf{W}}{h} \right\rceil + z,\ 0,\ 2^N - 1\right), \quad h = \frac{\gamma \max(\mathbf{W}) - \beta \min(\mathbf{W})}{2^N - 1}, \quad z = -\left\lfloor \frac{\beta \min(\mathbf{W})}{h} \right\rceil$$

where ⌊⋅⌉ denotes the rounding operation, N is the target bit width, W_q and W are the quantized and full-precision weights respectively, h is the normalization factor of the weights, and z is the zero point. The clamp operation limits the quantized value to the range of N-bit integers, i.e. [0, 2^N − 1]. In the formula above, γ and β are the learnable clipping strengths of the upper and lower bounds of the weights respectively. Therefore, Θ₁ = {γ, β} in the optimization objective.
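The following is a simplified per-tensor PyTorch sketch of LWC as read from the formula above, using a sigmoid to keep γ and β in (0, 1) and a straight-through estimator for rounding. It is an illustration rather than the released implementation:

```python
import torch
import torch.nn as nn

def round_ste(x: torch.Tensor) -> torch.Tensor:
    """Rounding with a straight-through gradient estimator."""
    return (x.round() - x).detach() + x

class LearnableWeightClipping(nn.Module):
    """Fake-quantizes weights with learnable clipping strengths gamma/beta,
    following W_q = clamp(round(W / h) + z, 0, 2^N - 1)."""

    def __init__(self, n_bits: int = 4):
        super().__init__()
        self.n_bits = n_bits
        # Raw parameters; sigmoid maps them into (0, 1).
        # Init near 1.0 (sigmoid(4) ~ 0.98) so clipping starts almost disabled.
        self.gamma = nn.Parameter(torch.tensor(4.0))  # upper-bound strength
        self.beta = nn.Parameter(torch.tensor(4.0))   # lower-bound strength

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        qmax = 2 ** self.n_bits - 1
        w_max = torch.sigmoid(self.gamma) * w.max()   # clipped upper bound
        w_min = torch.sigmoid(self.beta) * w.min()    # clipped lower bound
        h = (w_max - w_min) / qmax                    # normalization (step) factor
        z = round_ste(-w_min / h)                     # zero point
        w_q = torch.clamp(round_ste(w / h) + z, 0, qmax)
        return (w_q - z) * h                          # de-quantize (fake quantization)
```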
Learnable Equivalent Transformation (LET)
In addition to LWC, which optimizes the clipping threshold to make weights more amenable to quantization, OmniQuant further reduces the difficulty of quantizing activation values through LET. Considering that outliers in LLM activations reside in specific channels, previous methods such as SmoothQuant [3] and Outlier Suppression+ [4] transfer the difficulty of quantization from activations to weights through mathematically equivalent transformations.
However, equivalent-transformation parameters obtained by manual selection or greedy search limit the performance of the quantized model. Thanks to the block-wise quantization error minimization, OmniQuant's LET can determine the optimal equivalent-transformation parameters in a differentiable way. Inspired by Outlier Suppression+ [4], channel-wise scaling and channel-wise shifting are used to manipulate the activation distribution, providing an effective solution to the outlier problem in activation values. Specifically, OmniQuant explores equivalent transformations in both linear layers and attention operations.
Equivalent transformation in a linear layer: the linear layer takes an input token sequence X ∈ R^{T×C_in}, where T is the token length, and computes its product with the weight matrix W ∈ R^{C_in×C_out} plus the bias vector B ∈ R^{1×C_out}. The mathematically equivalent expression of the linear layer is:

$$\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{B} = \left[(\mathbf{X} - \boldsymbol{\delta}) \oslash \mathbf{s}\right]\left[\mathbf{s} \odot \mathbf{W}\right] + \left[\mathbf{B} + \boldsymbol{\delta}\mathbf{W}\right]$$
Here Y denotes the output, s ∈ R^{1×C_in} and δ ∈ R^{1×C_in} are the channel-wise scaling and shifting parameters respectively, X̃ = (X − δ) ⊘ s, W̃ = s ⊙ W and B̃ = B + δW are the equivalent activation, weight and bias respectively, and ⊘ and ⊙ denote element-wise division and multiplication. Through this equivalent transformation, the activations are converted into a form that is easier to quantize, at the cost of making the weights harder to quantize. In this sense, LWC can improve the quantization performance achieved by LET, because it makes the weights easier to quantize. Finally, OmniQuant quantizes the transformed activations and weights as follows:

$$\mathbf{Y} = Q_a(\tilde{\mathbf{X}})\, Q_w(\tilde{\mathbf{W}}) + \tilde{\mathbf{B}}$$
where Q_a is an ordinary MinMax quantizer and Q_w is the MinMax quantizer equipped with the proposed learnable weight clipping (LWC).
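A sketch of this transformation for a linear layer is given below. `LETLinear` and `minmax_fake_quant` are illustrative names, and a plain per-tensor MinMax quantizer stands in for both Q_a and Q_w (in OmniQuant proper, Q_w would use LWC as above):

```python
import torch
import torch.nn as nn

def minmax_fake_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Plain per-tensor MinMax fake quantization (stand-in for Q_a)."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-5) / qmax
    zero = (-x.min() / scale).round()
    x_q = torch.clamp((x / scale).round() + zero, 0, qmax)
    return (x_q - zero) * scale

class LETLinear(nn.Module):
    """Linear layer with a learnable equivalent transformation:
    Y = X W + B = [(X - delta) / s] [s * W] + [B + delta W]."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        c_in = linear.in_features
        self.s = nn.Parameter(torch.ones(1, c_in))       # channel-wise scaling
        self.delta = nn.Parameter(torch.zeros(1, c_in))  # channel-wise shifting

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w, b = self.linear.weight, self.linear.bias      # w: [C_out, C_in]
        x_eq = (x - self.delta) / self.s                 # easier-to-quantize activation
        w_eq = w * self.s                                # scaling folded into the weights
        b_eq = (b if b is not None else 0) + self.delta @ w.t()
        x_q = minmax_fake_quant(x_eq)                    # Q_a
        w_q = minmax_fake_quant(w_eq)                    # Q_w (LWC in OmniQuant proper)
        return x_q @ w_q.t() + b_eq
```

After calibration, s and δ can be folded into the model weights, so the transformation leaves no extra computation at inference time, which matches the fusion property described in the article.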
Equivalent transformation in attention operations: besides linear layers, attention operations account for a large share of the computation in LLMs. Moreover, the autoregressive inference mode of LLMs requires storing a key-value (KV) cache for every token, which leads to huge memory demands for long sequences. Therefore, OmniQuant also considers quantizing the Q/K/V matrices in the self-attention computation to low bits. Specifically, the learnable equivalent transformation in self-attention can be written as:

$$\mathbf{P} = \mathrm{Softmax}(\mathbf{Q}\mathbf{K}^{\top}) = \mathrm{Softmax}\big((\mathbf{Q} \oslash s_a)(s_a \odot \mathbf{K}^{\top})\big)$$
where s_a is the scaling factor. The quantized self-attention computation is expressed as P = Softmax(Q_a(Q̃) Q_a(K̃ᵀ)). Here OmniQuant again uses the MinMax quantization scheme as Q_a to quantize the Q̃/K̃ matrices. Therefore, Θ₂ = {δ, s, s_a} is ultimately optimized in the objective function.
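A corresponding sketch for the attention scores is shown below, again with illustrative names and a plain MinMax fake quantizer standing in for Q_a. The scaling of Q and K cancels mathematically, so the full-precision result is unchanged while the quantized matrices become easier to represent:

```python
import torch
import torch.nn as nn

def minmax_fake_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Per-tensor MinMax fake quantization (same helper as in the linear-layer sketch)."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-5) / qmax
    zero = (-x.min() / scale).round()
    x_q = torch.clamp((x / scale).round() + zero, 0, qmax)
    return (x_q - zero) * scale

class LETAttentionScores(nn.Module):
    """Scales Q down and K up by a learnable per-channel factor s_a before
    low-bit quantization, so that Softmax(Q K^T) is mathematically unchanged."""

    def __init__(self, head_dim: int, n_bits: int = 4):
        super().__init__()
        self.s_a = nn.Parameter(torch.ones(1, head_dim))  # learnable scaling factor
        self.n_bits = n_bits

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: [..., tokens, head_dim]
        q_eq = q / self.s_a                            # migrate magnitude out of Q ...
        k_eq = k * self.s_a                            # ... and into K (equivalent overall)
        q_q = minmax_fake_quant(q_eq, self.n_bits)     # Q_a applied to the Q matrix
        k_q = minmax_fake_quant(k_eq, self.n_bits)     # Q_a applied to the K matrix
        scores = q_q @ k_q.transpose(-2, -1)           # Q_a(Q~) Q_a(K~^T)
        return torch.softmax(scores, dim=-1)
```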
Pseudocode
OmniQuant's pseudocode is shown in the figure above. Note that the extra parameters introduced by LWC and LET can be eliminated after the model is quantized, i.e. OmniQuant introduces no additional overhead into the quantized model, so it can be used directly with existing quantization deployment tools.
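Since the pseudocode figure is not reproduced here, the following is a rough, hypothetical outline of the block-wise calibration loop it describes. It reuses `block_quantization_loss` from the objective sketch above and assumes `quant_blocks` are copies of the transformer blocks in which only the LWC/LET parameters are trainable; hyperparameters are placeholders:

```python
import torch

def omniquant_calibrate(fp_blocks, quant_blocks, calib_inputs, iters=20, lr=1e-2):
    """Block-wise calibration sketch: for each transformer block, train only the
    LWC/LET parameters of its quantized copy to match the full-precision output,
    then pass the result on as input to the next block.

    fp_blocks:    list of frozen full-precision transformer blocks
    quant_blocks: matching copies whose linear/attention layers are wrapped with
                  the LWC/LET modules sketched above (only those parameters
                  have requires_grad=True)
    """
    x = calib_inputs
    for block, block_q in zip(fp_blocks, quant_blocks):
        trainable = [p for p in block_q.parameters() if p.requires_grad]
        opt = torch.optim.AdamW(trainable, lr=lr)
        for _ in range(iters):
            # In practice the full-precision target would be cached rather than recomputed.
            loss = block_quantization_loss(block, block_q, x)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            x = block_q(x)  # the next block is calibrated on this block's output
    return quant_blocks
```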
Experimental performance
The figure above shows OmniQuant's weight-only quantization results on the LLaMA models; detailed results for the OPT models can be found in the original paper. As can be seen, OmniQuant consistently outperforms previous weight-only LLM quantization methods across various model families (OPT, LLaMA-1, LLaMA-2) and diverse quantization configurations (including W2A16, W2A16g128, W2A16g64, W3A16, W3A16g128, W4A16 and W4A16g128). These experiments also demonstrate OmniQuant's versatility and its ability to adapt to a variety of quantization configurations. For example, while AWQ [5] is particularly effective for group-wise quantization, OmniQuant shows superior performance in both channel-wise and group-wise quantization. Moreover, OmniQuant's performance advantage becomes even more pronounced as the number of quantization bits decreases.
In the setting where both weights and activations are quantized, the experiments focus on W6A6 and W4A4 quantization. W8A8 quantization is excluded because the earlier SmoothQuant already achieves nearly lossless W8A8 quantization compared with full-precision models. The figure above shows OmniQuant's weight-activation quantization results on the LLaMA models. Notably, OmniQuant significantly improves the average W4A4 accuracy across different models, with gains ranging from 4.99% to 11.80%. In particular, on LLaMA-7B, OmniQuant even surpasses the recent quantization-aware training method LLM-QAT [6] by a significant margin of 6.22%. This improvement demonstrates the effectiveness of the additional learnable parameters, which turn out to be more beneficial than the global weight updates employed in quantization-aware training.
Meanwhile, models quantized with OmniQuant can be seamlessly deployed with MLC-LLM [7]. The figure above shows the memory requirements and inference speed of the quantized LLaMA family models on an NVIDIA A100-80G.
Weights Memory (WM) denotes the storage of the quantized weights, while Running Memory (RM) denotes the memory used during inference; the latter is higher because certain activation values are retained. Inference speed is measured by generating 512 tokens. Clearly, the quantized models significantly reduce memory usage compared with the 16-bit full-precision model. In addition, W4A16g128 and W2A16g128 quantization nearly double the inference speed.
It is worth noting that MLC-LLM [7] also supports deploying OmniQuant-quantized models on other platforms, including Android and iOS phones. As shown in the figure above, the recent app Private LLM uses the OmniQuant algorithm to achieve memory-efficient deployment of LLMs on iPhone, iPad, macOS and other platforms.
Summary
OmniQuant is an advanced quantization algorithm that pushes large language models to low-bit formats. Its core principle is to keep the original full-precision weights intact while adding learnable quantization parameters. It uses learnable weight clipping and a learnable equivalent transformation to improve the quantization compatibility of weights and activations. Although it incorporates gradient updates, OmniQuant maintains training-time and data efficiency comparable to existing PTQ methods. In addition, OmniQuant is hardware-friendly, since its added learnable parameters can be fused into the original model without any extra overhead.
References
[1] PACT: Parameterized clipping activation for quantized neural networks.
[2] LSQ: Learned step size quantization.
[3] SmoothQuant: Accurate and efficient post-training quantization for large language models.
[4] Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling.
[5] AWQ: Activation-aware weight quantization for LLM compression and acceleration.
[6] LLM-QAT: Data-free quantization aware training for large language models.
[7] MLC-LLM: https://github.com/mlc-ai/mlc-llm