ByteDance open-sources a new approach to large-model quantization: 2-bit quantized models with accuracy on par with fp16


May 29, 2024 am 09:29 AM


The AIxiv column is a column where this site publishes academic and technical content. In the past few years, the AIxiv column of this site has received more than 2,000 reports, covering top laboratories from major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work that you want to share, please feel free to contribute or contact us for reporting. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

As large language models grow in both popularity and size, the cost of inference keeps rising, and model quantization has become an active research topic.

Recently, ByteDance proposed a new approach to quantization that abandons the traditional quantization paradigm and instead models quantization as a mathematical optimization problem: define an objective (loss) function and find the solution that minimizes it. The paper is available on arXiv, the code has been open-sourced, and all results in the paper can be reproduced with one click. The approach achieves good results in experiments.


Paper link: https://arxiv.org/abs/2404.12759
Project link: https://github.com/bytedance/decoupleQ
W2 operator: https://github.com/NVIDIA/TensorRT-LLM/pull/1568

1. Background

The rapid development of large models has made inference increasingly expensive. Model quantization, as a technique for reducing inference cost, has received growing attention. Under the traditional quantization paradigm, however, model accuracy drops rapidly at very low bit widths. To address this, the authors propose a new approach: decouple the model parameters into an integer part and a floating-point part, and model quantization as a mathematical optimization problem, so that the model maintains high accuracy even at very low bit widths. The advantage is clear: we no longer need to handle quantization-specific issues such as sensitive channels or outliers. Instead, we only need to formulate quantization mathematically, choose a suitable objective function, and solve it.

2. Traditional quantization

Traditionally, a model is quantized as follows:

$$\widehat{W} = \mathrm{clip}\left(\left\lfloor \frac{W_0 - z}{s} \right\rceil, \alpha, \beta\right) \tag{1}$$

where $W_0$ is the floating-point weights of the model before quantization; s and z are the coefficients of a linear transformation, namely the scale and the zero point; α and β are the lower and upper bounds of the integer representation range (for int4 quantization, for example, α = -8 and β = 7); and $\lfloor \cdot \rceil$ is the rounding function, which generally rounds to the nearest integer.

For asymmetric quantization, s and z are generally chosen as:

$$s = \frac{\max(W_0) - \min(W_0)}{\beta - \alpha}, \qquad z = \min(W_0) - s\alpha \tag{2}$$

In this way, floating-point weights distributed in $[\min(W_0), \max(W_0)]$ are linearly mapped to the integer range $[\alpha, \beta]$.

Dequantization generally uses the following formula:

$$\widetilde{W} = \widehat{W} \cdot s + z \tag{3}$$
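The traditional scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration of asymmetric min/max int4 quantization followed by dequantization; the function names `quantize` and `dequantize` and the sample weights are assumptions made for this example and are not taken from the decoupleQ repository.

```python
def quantize(weights, alpha=-8, beta=7):
    """Asymmetric min/max quantization to integers in [alpha, beta]."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (beta - alpha)   # assumes w_max > w_min
    zero = w_min - scale * alpha
    # Round to the nearest integer, then clip to the representable range
    ints = [min(max(round((w - zero) / scale), alpha), beta) for w in weights]
    return ints, scale, zero


def dequantize(ints, scale, zero):
    """Map the integers back to floating point."""
    return [q * scale + zero for q in ints]


weights = [-0.9, -0.2, 0.0, 0.33, 0.6]   # made-up sample weights
ints, scale, zero = quantize(weights)
restored = dequantize(ints, scale, zero)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The smallest weight maps to α and the largest to β, and the round-trip error of every weight is bounded by half the scale.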

In this traditional scheme, we have to attend to many details specific to quantization: sensitive channels need their own handling, outliers need their own handling, and so on. This paradigm of treating each symptom separately struggles to cope with complex and ever-changing business scenarios. The ByteDance researchers abstract these issues away and view quantization from a macro perspective: establish an abstract objective function, then solve it.

3. decoupleQ

Looking at the roles of equations (1) to (3), a change of perspective shows that we do not actually need equations (1) and (2). Once a large model has been quantized and handed over to the downstream inference-engine team, only $\widehat{W}$ and (s, z) in equation (3) are needed. In other words, (s, z) in equation (3) can be treated as the coefficients of an ordinary affine transformation, with no need to retain the meaning they have in equation (2). These coefficients can be obtained through mathematical optimization.

Taking equation (3) further, we can decouple the parameters of a large model into an integer part $\widehat{W}$ and a floating-point part (s, z). After this decoupling, model quantization becomes the problem of solving for the integer part $\widehat{W}$ and the floating-point part (s, z), which can be optimized alternately. To do so, we need to define the objective function and its constraints.

For a linear layer, we can construct the following objective function:

$$\min_{\widehat{W},\, s,\, z} \left\| X\widetilde{W} - XW_0 \right\|_2^2 \tag{4}$$

where X is the input of the layer and $H = X^T X$ is a symmetric matrix (H is positive semidefinite, and positive definite when the columns of X are linearly independent).
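The matrix H matters because the output error of a linear layer depends on the weight error only through $X^T X$. The following plain-Python check is a minimal sketch under the assumption that H = X^T X; the calibration inputs `X` and the weight error `d` of one column are made-up values for illustration.

```python
import random

random.seed(0)
n_samples, n_in = 6, 3
X = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_samples)]
d = [0.05, -0.02, 0.01]   # hypothetical quantization error of one weight column

# H = X^T X, computed once from the calibration inputs
H = [[sum(X[k][i] * X[k][j] for k in range(n_samples)) for j in range(n_in)]
     for i in range(n_in)]

# Output error computed directly: ||X d||^2
direct = sum(sum(X[k][i] * d[i] for i in range(n_in)) ** 2
             for k in range(n_samples))

# The same error expressed as the quadratic form d^T H d
quadratic = sum(d[i] * H[i][j] * d[j] for i in range(n_in) for j in range(n_in))
```

Both quantities agree up to floating-point rounding, so the optimization can work with the small matrix H instead of the full calibration input X.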

To improve quantization accuracy, per-channel quantization is generally applied to the model weights. With per-channel quantization, each column of $\widehat{W}$ is optimized independently when solving equation (4), so we only need to consider a single column.

The optimization problem can then be written as follows (for simplicity of notation, the symbols are redefined as in the paper):

$$\min_{w;\, s,\, z}\; g(w; s, z) \quad \text{s.t.}\quad \forall i:\; w_i - \mathrm{round}(w_i) = 0,\quad \alpha \le w_i \le \beta \tag{5}$$

The objective function is

$$g(w; s, z) = \frac{1}{2}\,(w \odot s + z - b)^T H\,(w \odot s + z - b) \tag{6}$$

where w is a column of $\widehat{W}$ and b is the corresponding column of $W_0$. The other symbols are defined as before.

In fact, objective functions (6) and (4) are fully consistent: $w \odot s + z$ is exactly the dequantization process.

Turning quantization into a mathematical optimization problem of the form (5) is what distinguishes decoupleQ from traditional quantization papers. With this formulation, we only need to focus on solving (5) and no longer have to deal with the minutiae of quantization itself, such as outliers.

Solving (5) is not easy because of the constraints on w, especially the non-convex constraint $w_i - \mathrm{round}(w_i) = 0$. The paper proposes an alternating scheme: after obtaining a good initialization of (s, z) and w, solve for (s, z) and w in turn, iteratively. When solving for (s, z), note that (5) is an unconstrained quadratic problem in (s, z), so we can differentiate the objective, set the derivative to zero, and obtain the analytical solution directly. When solving for w, the authors adopt two levels of approximation. The first level converges better but is slower to solve; the second level borrows the idea of GPTQ [1], converging slightly worse but solving faster.
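The (s, z) step of the alternating scheme can be sketched as follows. This is a simplified illustration, not the decoupleQ implementation: it assumes a single scalar scale and zero point for the whole column (no groups), a made-up positive definite H, and plain round-to-nearest for the integer part w. The helper names `quad`, `loss`, and `solve_scale_zero` are hypothetical. Setting the derivatives of the quadratic objective with respect to s and z to zero gives a 2x2 linear system, which the code solves directly.

```python
def quad(u, H, v):
    """Return u^T H v for list-based vectors and a list-of-lists matrix."""
    n = len(u)
    return sum(u[i] * H[i][j] * v[j] for i in range(n) for j in range(n))


def loss(w, s, z, b, H):
    """1/2 (w*s + z - b)^T H (w*s + z - b) with scalar s and z."""
    r = [wi * s + z - bi for wi, bi in zip(w, b)]
    return 0.5 * quad(r, H, r)


def solve_scale_zero(w, b, H):
    """Closed-form minimizer of the loss over (s, z) for a fixed integer w."""
    ones = [1.0] * len(w)
    a11, a12, a22 = quad(w, H, w), quad(w, H, ones), quad(ones, H, ones)
    r1, r2 = quad(w, H, b), quad(ones, H, b)
    det = a11 * a22 - a12 * a12   # nonzero if H is positive definite and w is not constant
    return (r1 * a22 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det


# Toy problem: a positive definite H and one float weight column b (made-up values)
H = [[2.0, 0.3, 0.1, 0.0],
     [0.3, 1.5, 0.2, 0.1],
     [0.1, 0.2, 1.8, 0.4],
     [0.0, 0.1, 0.4, 1.2]]
b = [-0.41, 0.07, 0.38, -0.12]
alpha, beta = -2, 1               # 2-bit integer range

# Min/max initialization of (s, z) and round-to-nearest for w
s0 = (max(b) - min(b)) / (beta - alpha)
z0 = min(b) - s0 * alpha
w = [min(max(round((bi - z0) / s0), alpha), beta) for bi in b]

# One analytical (s, z) update with w held fixed
s1, z1 = solve_scale_zero(w, b, H)
loss_before = loss(w, s0, z0, b, H)
loss_after = loss(w, s1, z1, b, H)
```

Because (s1, z1) is the exact minimizer for the fixed w, `loss_after` cannot exceed `loss_before`. A full alternating solver would follow this with an update of w and repeat.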

To further improve the accuracy of the quantized model, the authors point out that, in addition to MSE minimization at the layer level, MSE can also be minimized at the block level:

$$\min \left\| \mathrm{Block}(X) - \widetilde{\mathrm{Block}}(X) \right\|_2^2 \tag{7}$$

where $\mathrm{Block}(X)$ and $\widetilde{\mathrm{Block}}(X)$ denote the outputs of the original and the quantized transformer block for input X. In this step, after every linear layer in a transformer block has been quantized, the integer part $\widehat{W}$ is fixed and the floating-point part (s, z) and the layer norm parameters are fine-tuned. Experimental results show that this fine-tuning step further improves model accuracy.

4. W2 operator implementation

Running inference on the quantized model requires quantized operators. No ready-made w2a16 operator was available in the industry, so the authors developed a w2 GEMM CUDA kernel based on the w4 operator in TensorRT-LLM, enabling efficient inference for w2a16 models.

The quantized model is loaded and stored in GPU memory as 2-bit weights, so it occupies relatively little memory. Our CUDA kernel loads the 2-bit weights into registers at runtime, uses hardware instructions to convert them efficiently to bf16, and then performs the GEMM with the activations. Because our scenario is latency-bound, the batch size in the generation stage is relatively small, and matrix multiplication is then limited by weight memory access. This implementation greatly reduces the amount of memory access and improves model performance. The implementation also combines algorithm search with SplitK parallel reduce to improve performance further. In measurements at batch size 1, the w2a16 GEMM on the L card is 1.4x to 1.7x faster than w4a16.
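To illustrate why 2-bit storage cuts weight memory traffic, the sketch below packs four 2-bit values into each byte and unpacks them again. It is written in plain Python for readability; the bit layout and the function names `pack_2bit` and `unpack_2bit` are assumptions for this example and do not reflect the actual memory layout used by the TensorRT-LLM kernel, which performs the unpacking in GPU registers.

```python
def pack_2bit(values):
    """Pack unsigned 2-bit values (0..3) into bytes, four values per byte."""
    if len(values) % 4 != 0 or any(not 0 <= v <= 3 for v in values):
        raise ValueError("need a multiple of four values, each in 0..3")
    packed = bytearray()
    for i in range(0, len(values), 4):
        v0, v1, v2, v3 = values[i:i + 4]
        # Lowest two bits hold the first value, highest two bits the fourth
        packed.append(v0 | (v1 << 2) | (v2 << 4) | (v3 << 6))
    return bytes(packed)


def unpack_2bit(packed):
    """Recover the 2-bit values from the packed bytes."""
    return [(byte >> shift) & 0b11 for byte in packed for shift in (0, 2, 4, 6)]


weights = [0, 3, 1, 2, 2, 2, 0, 1]   # integer weights shifted to the range 0..3
packed = pack_2bit(weights)
restored = unpack_2bit(packed)
```

Eight weights fit in two bytes here, compared with four bytes for 4-bit storage and sixteen bytes for fp16, which is what reduces memory access when the GEMM is bound by weight loading.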

Operator link: https://github.com/NVIDIA/TensorRT-LLM/pull/1568

[Figure: implementation principle of the w2 CUDA kernel]

5. Experiment

The paper reports results from ByteDance's internal ASR experiments as well as comparisons on open-source models.

The internal experimental results are:

[Table: internal ASR results (WER and quantization time) for W2A16g64 quantization]

In this table, word error rate (WER) measures ASR accuracy. The authors quantized the model to W2A16g64 using different methods. The WER of the floating-point model before quantization is 6.68%; after quantization with GPTQ [1] it is 6.83%; and with decoupleQ plus block-level minimization it is 6.70%, very close to the WER of the floating-point model. The table also reports the time required for quantization: the price of high quantization accuracy is a longer quantization time. In actual business use, after quantizing the model with decoupleQ, the integer part is fixed and the scale and zero point are fine-tuned on a labeled dataset, which further improves accuracy.
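To make the W2A16g64 notation concrete, here is a small arithmetic sketch. It assumes that W2 means 2-bit weights, A16 means 16-bit activations, and "g64" denotes a group size of 64, with each group storing one 16-bit scale and one 16-bit zero point. The article does not spell out the storage format, so these bit widths and the function name `bits_per_weight` are assumptions.

```python
def bits_per_weight(weight_bits, group_size, scale_bits=16, zero_bits=16):
    """Average storage per weight when each group carries its own scale and zero point."""
    return weight_bits + (scale_bits + zero_bits) / group_size


w2_g64 = bits_per_weight(2, 64)      # 2-bit weights, groups of 64
w4_g64 = bits_per_weight(4, 64)      # 4-bit weights, groups of 64
compression_vs_fp16 = 16 / w2_g64    # relative to 16 bits per weight
```

Under these assumptions the per-group parameters add half a bit per weight, so the 2-bit model costs about 2.5 bits per weight rather than exactly 2.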

The results of the open-source comparison experiments are:

[Table: perplexity of decoupleQ and other methods on Llama-1/2]

This table compares the quantization results of decoupleQ and other methods on Llama-1/2, with perplexity (PPL) as the evaluation metric. Under the same quantization configuration, the PPL of decoupleQ is lower than that of the other methods in most cases.

6. Business benefits

decoupleQ is now widely used in ByteDance's speech business. It has been deployed in speech generation models (text-to-speech), speech recognition models (automatic speech recognition), and others, and has shipped in products such as Doubao, Feishu, and Douyin. A large number of online services show that, with decoupleQ quantization, W4A16 inference accuracy is fully on par with fp16/bf16 inference, while W2A16 accuracy is only slightly worse than fp16/bf16 (after SFT of the floating-point part, it is on par with fp16/bf16). Although the paper covers only weight-only quantization, in actual business use, once the weights are well quantized, quantizing the activations becomes much simpler.

In terms of hardware acceleration, w2 achieves good speedups compared with fp16, w8fp16, and w4fp16. At small batch sizes, w2 matrix multiplication is 5 to 6 times faster than fp16 and 1.5 to 1.7 times faster than w4. On internal business models, w2fp16 is 3 to 5 times faster than fp16 and 1.25 to 1.4 times faster than w4fp16. It also significantly reduces the memory occupied by model weights, leaving much more memory available to the runtime.


7. Summary and Discussion

In the summary and discussion section, the authors also point out two current risks of decoupleQ:

1. decoupleQ uses mathematical optimization to minimize the L2 loss between the outputs before and after quantization. However, minimizing the L2 loss at the layer level or block level does not necessarily yield the best accuracy for the final model;

2. When solving for the integer part and (s, z) in the optimization of equations (5) and (7), only a small amount of calibration data is used, which makes decoupleQ prone to overfitting the calibration data.

Nonetheless, the authors argue that decoupling the model parameters into an integer part and a floating-point part is very meaningful. If a labeled dataset exists, we can fix the integer part after quantization and use the labeled dataset to train (s, z) specifically, further improving model accuracy. This preserves the generalization ability of the model (which comes from the fixed integer part $\widehat{W}$) while letting it perform well on specific subtasks (which comes from the fine-tuned floating-point part). In ByteDance's actual business, after one version of a model has been quantized and deployed, the next version can be produced by training only the floating-point part.

References:

[1] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pretrained transformers. In The Eleventh International Conference on Learning Representations, 2022.

[2] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.

[3] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
