


The first 100-billion-parameter-scale model compression algorithm, SparseGPT, is here: cutting compute costs while maintaining high accuracy
Since GPT-3 emerged in 2020, and especially since the rise of ChatGPT, the GPT family's generative large language models have been back in the spotlight, showing strong performance across a wide range of tasks.
But the sheer scale of these models also drives up computing costs and makes deployment harder.
For example, the GPT‑175B model takes up at least 320 GB of storage in half-precision (FP16) format (175 billion parameters × 2 bytes ≈ 350 GB, roughly 326 GiB), so at least five A100 GPUs with 80 GB of memory each are needed just for inference.
Model compression is currently the most common way to reduce the computational cost of large models, but so far almost all existing GPT compression methods have focused on quantization, i.e. reducing the precision of the numerical representation of individual weights.
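To make the distinction concrete, here is a minimal illustrative sketch (in Python/PyTorch, not code from any of the papers discussed) of the simplest form of weight quantization, symmetric round-to-nearest at a chosen bit-width:

```python
import torch

def quantize_rtn(weight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric round-to-nearest quantization of a weight matrix.

    Each row is scaled onto the integer grid [-(2^(bits-1)-1), 2^(bits-1)-1],
    rounded, and mapped back to floating point. This is the simplest form of
    "reduce the precision of individual weights"; methods such as GPTQ improve
    on it using approximate second-order information.
    """
    qmax = 2 ** (bits - 1) - 1
    # Per-row scale so that the largest-magnitude weight maps to qmax.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax, qmax)
    return q * scale  # dequantized weights, now restricted to a coarse grid

# Example: quantize a random 8x16 layer to 4 bits and measure the error.
w = torch.randn(8, 16)
w_q = quantize_rtn(w, bits=4)
print((w - w_q).abs().mean())
```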
Another model compression method is pruning, which removes network elements, ranging from individual weights (unstructured pruning) to coarser-grained components such as entire rows or columns of weight matrices (structured pruning). This approach works well for vision models and smaller language models, but it causes an accuracy loss that must be recovered through extensive retraining, which becomes prohibitively expensive for GPT-scale models. Some one-shot pruning methods can compress a model without retraining, but they are too computationally intensive to apply to models with billions of parameters.
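For contrast, unstructured magnitude pruning, the baseline the article compares against later, can be sketched in a few lines; the function below is illustrative rather than taken from any library, and simply zeroes out the smallest-magnitude weights:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Unstructured magnitude pruning: zero the smallest-|w| entries.

    `sparsity` is the fraction of weights removed (0.5 = 50%). No weight
    update or retraining is performed, which is why this baseline degrades
    quickly on very large models.
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

w = torch.randn(8, 16)
w_sparse = magnitude_prune(w, sparsity=0.5)
print((w_sparse == 0).float().mean())  # roughly 0.5
```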
So for a model as large as GPT-3, is there a way to prune it accurately while keeping the accuracy loss minimal and the computational cost low?
Recently, two researchers from the Institute of Science and Technology Austria (ISTA), Elias Frantar and Dan Alistarh, proposed SparseGPT, the first accurate one-shot pruning method that works at the scale of 10 to 100+ billion parameters.
There are already several post-training quantization methods for GPT-scale models, such as ZeroQuant, LLM.int8() and nuQmm, but quantizing activations remains difficult because of outlier features. GPTQ uses approximate second-order information to quantize weights accurately to 2‑4 bits, scales to the largest models, and, combined with efficient GPU kernels, can deliver 2‑5x inference speedups.
SparseGPT, by contrast, focuses on sparsification rather than quantization, so it complements these quantization methods, and the two can be applied in combination.
In addition to unstructured pruning, SparseGPT also supports semi-structured patterns such as the popular n:m sparsity format, whose 2:4 variant can be accelerated on NVIDIA Ampere GPUs.
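The n:m format keeps at most n non-zero weights in every consecutive group of m weights; 2:4 is the pattern that Ampere sparse tensor cores accelerate. Below is a simplified, magnitude-based sketch of enforcing a 2:4 mask (SparseGPT chooses its masks with a more sophisticated criterion):

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every group of 4 along each row.

    Assumes the number of columns is divisible by 4. The resulting pattern is
    the one NVIDIA Ampere sparse tensor cores can execute at up to ~2x speed.
    """
    rows, cols = weight.shape
    assert cols % 4 == 0, "columns must be a multiple of 4 for 2:4 sparsity"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the 2 largest |w| within each group of 4.
    topk = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, topk, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_24 = apply_2_4_sparsity(w)
print((w_24 == 0).float().mean())  # exactly 0.5
```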
SparseGPT: high sparsity, low accuracy loss
Evaluating the effectiveness of SparseGPT, the researchers found that the difficulty of sparsifying a large language model is inversely related to its size: compared with the existing magnitude pruning method, SparseGPT achieves a much higher degree of sparsity while keeping the accuracy loss minimal.
The researchers implemented SparseGPT in PyTorch, using HuggingFace's Transformers library to handle the models and datasets, all on a single NVIDIA A100 GPU with 80 GB of memory. Under these conditions, SparseGPT can fully sparsify a 175-billion-parameter model in roughly 4 hours.
The researchers sparsify the Transformer layers sequentially, which significantly reduces memory requirements and also greatly improves accuracy compared with processing all layers in parallel. All compression experiments were done in one shot, without any fine-tuning.
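The sequential scheme can be sketched roughly as follows; `prune_block` is a placeholder for the actual SparseGPT per-layer solver, and the blocks are assumed, for simplicity, to map a hidden-state tensor directly to a tensor (real HuggingFace decoder layers also take attention masks and return tuples):

```python
import torch
from torch import nn

@torch.no_grad()
def sparsify_sequentially(blocks, hidden, prune_block, sparsity=0.5):
    """Prune Transformer blocks one at a time (schematic).

    blocks      : list of nn.Module blocks, processed in order.
    hidden      : hidden states of a small calibration set.
    prune_block : stand-in for the per-layer solver (SparseGPT in the paper);
                  it sparsifies one block in place given that block's inputs.

    Only one block is handled at a time, which keeps memory low, and each
    block is pruned against the outputs of the *already pruned* earlier
    blocks, which improves accuracy over pruning all layers in parallel.
    """
    for block in blocks:
        prune_block(block, hidden, sparsity)  # compress this block in place
        hidden = block(hidden)                # propagate calibration data onward
    return blocks

# Tiny demo with toy blocks and naive magnitude pruning as the "solver".
def toy_prune(block, inputs, sparsity):
    w = block.weight
    thresh = w.abs().flatten().kthvalue(int(w.numel() * sparsity)).values
    w.mul_((w.abs() > thresh).float())

blocks = [nn.Linear(16, 16) for _ in range(3)]
calib = torch.randn(4, 16)
sparsify_sequentially(blocks, calib, toy_prune)
```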
The evaluation mainly covers the OPT family, a set of models ranging from 125 million to 175 billion parameters, which makes it easy to observe how pruning scales with model size. The 176-billion-parameter BLOOM variant was also analyzed.
For datasets and metrics, the experiments use perplexity on the raw WikiText2 test set to assess the accuracy of SparseGPT-compressed models, along with some zero-shot accuracy metrics for better interpretability. The evaluation focuses on the accuracy of the sparse models relative to the dense baseline rather than on absolute numbers.
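Perplexity is simply the exponential of the average token-level cross-entropy; below is a hedged sketch of how it is typically computed on WikiText2 with HuggingFace `transformers` and `datasets` (the small OPT checkpoint and the 2048-token window are illustrative choices, not the paper's exact setup):

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # small OPT variant, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Concatenate the raw WikiText2 test split into one long token stream.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

seq_len, nlls, n_tokens = 2048, [], 0
with torch.no_grad():
    for i in range(0, enc.input_ids.size(1) - 1, seq_len):
        ids = enc.input_ids[:, i : i + seq_len]
        # labels=ids makes the model return the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
        nlls.append(loss * ids.size(1))
        n_tokens += ids.size(1)

print("perplexity:", math.exp(torch.stack(nlls).sum().item() / n_tokens))
```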
The researchers pruned all linear layers of the full OPT model series (excluding the standard embeddings and the head) to 50% unstructured sparsity, full 4:8, or full 2:4 semi-structured sparsity; the results are shown below.
The accuracy of models compressed with magnitude pruning is poor at every size, and the larger the model, the steeper the accuracy drop.
Models compressed with SparseGPT show a different trend: at 2.7 billion parameters the perplexity loss is already small, and it keeps shrinking as the model grows.
Larger models are easier to sparsify
The general trend is that larger models are easier to sparsify: at a fixed sparsity level, the relative accuracy drop of the sparse model compared with its dense counterpart shrinks as the model size grows. The authors speculate that this is due to the higher degree of over-parameterization and greater overall noise resistance of larger models.
At the largest scale, compressing the model with SparseGPT to 4:8 and 2:4 sparsity increases perplexity over the dense baseline by only 0.11 and 0.39 respectively. Since commercial NVIDIA Ampere GPUs already support 2:4 sparsity, this translates into a roughly 2x speedup in practice.
The authors also studied how the performance of the two hundred-billion-parameter-scale models, OPT-175B and BLOOM-176B, varies with the degree of sparsity imposed by SparseGPT; the results are shown in the figure below.
For BLOOM-176B, magnitude pruning can reach 30% sparsity without significant accuracy loss, whereas SparseGPT reaches 50% sparsity, a 1.66x improvement. Even at 80% sparsity, the perplexity of the SparseGPT-compressed model remains at a reasonable level, while magnitude pruning already drives perplexity above 100 at 40% sparsity on OPT and 60% sparsity on BLOOM.
Additionally, SparseGPT is able to remove approximately 100 billion weights from these models, with limited impact on model accuracy.
Finally, this study shows for the first time that a large-scale Transformer-based pre-trained model can be compressed to high sparsity through one-shot weight pruning, without any retraining and with only a small accuracy loss.
It is worth noting that SparseGPT's approach is local: after each pruning step it performs weight updates designed to preserve the input-output relationship of each layer, and these updates are computed without any global gradient information. The high degree of over-parameterization of large GPT-scale models therefore appears to allow this approach to find an accurate sparse model in the immediate "neighborhood" of the dense pre-trained one.
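This local update can be illustrated as a per-layer least-squares reconstruction: for a fixed pruning mask, the surviving weights are re-fitted so that the sparse layer reproduces the dense layer's outputs on calibration data, with no global gradients involved. The sketch below solves that objective directly with `numpy` least squares; SparseGPT itself uses an efficient approximate-second-order solver rather than explicit `lstsq`:

```python
import numpy as np

def reconstruct_sparse_layer(W: np.ndarray, X: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Given dense weights W (out x in), calibration inputs X (in x samples)
    and a pruning mask (out x in, True = keep), return sparse weights W_hat
    minimizing ||W X - W_hat X||^2 row by row.

    This is the layer-local objective SparseGPT targets; the paper solves it
    with approximate second-order information instead of explicit lstsq.
    """
    W_hat = np.zeros_like(W)
    target = W @ X                                    # dense layer outputs to preserve
    for i in range(W.shape[0]):
        keep = mask[i]                                # columns kept in this output row
        # Least squares over the surviving weights only.
        sol, *_ = np.linalg.lstsq(X[keep].T, target[i], rcond=None)
        W_hat[i, keep] = sol
    return W_hat

# Tiny demo: prune ~50% of a random layer and reconstruct on calibration data.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
X = rng.standard_normal((8, 64))
mask = rng.standard_normal(W.shape) > 0               # random ~50% mask, for illustration
W_hat = reconstruct_sparse_layer(W, X, mask)
print(np.linalg.norm(W @ X - W_hat @ X))              # small reconstruction error
```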
Moreover, because the accuracy metric used in the experiments (perplexity) is very sensitive, the outputs of the resulting sparse model appear to be closely correlated with those of the dense model.
This work goes a long way toward easing the compute constraints of large models. One direction for future work is to study fine-tuning mechanisms for large models to recover accuracy further; another is to extend SparseGPT's approach to the training phase, reducing the computational cost of training large models.