


The first 100-billion-parameter-scale model compression algorithm, SparseGPT, is here: cutting compute costs while maintaining high accuracy
Since GPT-3 emerged in 2020, and especially since the rise of ChatGPT, the GPT family's generative large language models have been back in the spotlight, showing strong performance across a wide range of tasks.
But the sheer scale of these models also drives up computing costs and makes deployment harder.
For example, the GPT‑175B model takes up at least 320 GB of storage in half-precision (FP16) format (175 billion parameters × 2 bytes ≈ 350 GB, roughly 326 GiB), so at least five A100 GPUs with 80 GB of memory each are needed just for inference.
Model compression is currently the most common way to reduce the computational cost of large models, but so far almost all existing GPT compression methods have focused on quantization, i.e. reducing the precision of the numerical representation of individual weights.
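To make the distinction concrete, here is a minimal illustrative sketch (in Python/PyTorch, not code from any of the papers discussed) of the simplest form of weight quantization, symmetric round-to-nearest at a chosen bit-width:

```python
import torch

def quantize_rtn(weight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric round-to-nearest quantization of a weight matrix.

    Each row is scaled onto the integer grid [-(2^(bits-1)-1), 2^(bits-1)-1],
    rounded, and mapped back to floating point. This is the simplest form of
    "reduce the precision of individual weights"; methods such as GPTQ improve
    on it using approximate second-order information.
    """
    qmax = 2 ** (bits - 1) - 1
    # Per-row scale so that the largest-magnitude weight maps to qmax.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax, qmax)
    return q * scale  # dequantized weights, now restricted to a coarse grid

# Example: quantize a random 8x16 layer to 4 bits and measure the error.
w = torch.randn(8, 16)
w_q = quantize_rtn(w, bits=4)
print((w - w_q).abs().mean())
```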
Another model compression method is pruning, which removes network elements, ranging from individual weights (unstructured pruning) to coarser-grained components such as entire rows or columns of weight matrices (structured pruning). This approach works well for vision models and smaller language models, but it causes an accuracy loss that must be recovered through extensive retraining, which becomes prohibitively expensive for GPT-scale models. Some one-shot pruning methods can compress a model without retraining, but they are too computationally intensive to apply to models with billions of parameters.
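For contrast, unstructured magnitude pruning, the baseline the article compares against later, can be sketched in a few lines; the function below is illustrative rather than taken from any library, and simply zeroes out the smallest-magnitude weights:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Unstructured magnitude pruning: zero the smallest-|w| entries.

    `sparsity` is the fraction of weights removed (0.5 = 50%). No weight
    update or retraining is performed, which is why this baseline degrades
    quickly on very large models.
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

w = torch.randn(8, 16)
w_sparse = magnitude_prune(w, sparsity=0.5)
print((w_sparse == 0).float().mean())  # roughly 0.5
```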
So for a model as large as GPT-3, is there a way to prune it accurately while keeping the accuracy loss minimal and the computational cost low?
Recently, two researchers from the Institute of Science and Technology Austria (ISTA), Elias Frantar and Dan Alistarh, proposed SparseGPT, the first accurate one-shot pruning method that works at the scale of 10 to 100+ billion parameters.
There are already several post-training quantization methods for GPT-scale models, such as ZeroQuant, LLM.int8() and nuQmm, but quantizing activations remains difficult because of outlier features. GPTQ uses approximate second-order information to quantize weights accurately to 2‑4 bits, scales to the largest models, and, combined with efficient GPU kernels, can deliver 2‑5x inference speedups.
SparseGPT, by contrast, focuses on sparsification rather than quantization, so it complements these quantization methods, and the two can be applied in combination.
In addition to unstructured pruning, SparseGPT also supports semi-structured patterns such as the popular n:m sparsity format, whose 2:4 variant can be accelerated on NVIDIA Ampere GPUs.
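The n:m format keeps at most n non-zero weights in every consecutive group of m weights; 2:4 is the pattern that Ampere sparse tensor cores accelerate. Below is a simplified, magnitude-based sketch of enforcing a 2:4 mask (SparseGPT chooses its masks with a more sophisticated criterion):

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every group of 4 along each row.

    Assumes the number of columns is divisible by 4. The resulting pattern is
    the one NVIDIA Ampere sparse tensor cores can execute at up to ~2x speed.
    """
    rows, cols = weight.shape
    assert cols % 4 == 0, "columns must be a multiple of 4 for 2:4 sparsity"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the 2 largest |w| within each group of 4.
    topk = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, topk, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_24 = apply_2_4_sparsity(w)
print((w_24 == 0).float().mean())  # exactly 0.5
```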
SparseGPT: high sparsity, low accuracy loss
Evaluating the effectiveness of SparseGPT, the researchers found that the difficulty of sparsifying a large language model is inversely related to its size: compared with the existing magnitude pruning method, SparseGPT achieves a much higher degree of sparsity while keeping the accuracy loss minimal.
The researchers implemented SparseGPT in PyTorch, using HuggingFace's Transformers library to handle the models and datasets, all on a single NVIDIA A100 GPU with 80 GB of memory. Under these conditions, SparseGPT can fully sparsify a 175-billion-parameter model in roughly 4 hours.
The researchers sparsify the Transformer layers sequentially, which significantly reduces memory requirements and also greatly improves accuracy compared with processing all layers in parallel. All compression experiments were done in one shot, without any fine-tuning.
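The sequential scheme can be sketched roughly as follows; `prune_block` is a placeholder for the actual SparseGPT per-layer solver, and the blocks are assumed, for simplicity, to map a hidden-state tensor directly to a tensor (real HuggingFace decoder layers also take attention masks and return tuples):

```python
import torch
from torch import nn

@torch.no_grad()
def sparsify_sequentially(blocks, hidden, prune_block, sparsity=0.5):
    """Prune Transformer blocks one at a time (schematic).

    blocks      : list of nn.Module blocks, processed in order.
    hidden      : hidden states of a small calibration set.
    prune_block : stand-in for the per-layer solver (SparseGPT in the paper);
                  it sparsifies one block in place given that block's inputs.

    Only one block is handled at a time, which keeps memory low, and each
    block is pruned against the outputs of the *already pruned* earlier
    blocks, which improves accuracy over pruning all layers in parallel.
    """
    for block in blocks:
        prune_block(block, hidden, sparsity)  # compress this block in place
        hidden = block(hidden)                # propagate calibration data onward
    return blocks

# Tiny demo with toy blocks and naive magnitude pruning as the "solver".
def toy_prune(block, inputs, sparsity):
    w = block.weight
    thresh = w.abs().flatten().kthvalue(int(w.numel() * sparsity)).values
    w.mul_((w.abs() > thresh).float())

blocks = [nn.Linear(16, 16) for _ in range(3)]
calib = torch.randn(4, 16)
sparsify_sequentially(blocks, calib, toy_prune)
```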
The evaluation mainly covers the OPT family, a set of models ranging from 125 million to 175 billion parameters, which makes it easy to observe how pruning scales with model size. The 176-billion-parameter BLOOM variant was also analyzed.
For datasets and metrics, the experiments use perplexity on the raw WikiText2 test set to assess the accuracy of SparseGPT-compressed models, along with some zero-shot accuracy metrics for better interpretability. The evaluation focuses on the accuracy of the sparse models relative to the dense baseline rather than on absolute numbers.
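Perplexity is simply the exponential of the average token-level cross-entropy; below is a hedged sketch of how it is typically computed on WikiText2 with HuggingFace `transformers` and `datasets` (the small OPT checkpoint and the 2048-token window are illustrative choices, not the paper's exact setup):

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # small OPT variant, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Concatenate the raw WikiText2 test split into one long token stream.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

seq_len, nlls, n_tokens = 2048, [], 0
with torch.no_grad():
    for i in range(0, enc.input_ids.size(1) - 1, seq_len):
        ids = enc.input_ids[:, i : i + seq_len]
        # labels=ids makes the model return the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
        nlls.append(loss * ids.size(1))
        n_tokens += ids.size(1)

print("perplexity:", math.exp(torch.stack(nlls).sum().item() / n_tokens))
```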
The researchers pruned all linear layers of the full OPT model series (excluding the standard embeddings and the head) to 50% unstructured sparsity, full 4:8, or full 2:4 semi-structured sparsity; the results are shown below.
The accuracy of models compressed with magnitude pruning is poor at every size, and the larger the model, the steeper the accuracy drop.
Models compressed with SparseGPT show a different trend: at 2.7 billion parameters the perplexity loss is already small, and it keeps shrinking as the model grows.
Larger models are easier to sparsify
The general trend is that larger models are easier to sparsify: at a fixed sparsity level, the relative accuracy drop of the sparse model compared with its dense counterpart shrinks as the model size grows. The authors speculate that this is due to the higher degree of over-parameterization and greater overall noise resistance of larger models.
At the largest scale, compressing the model with SparseGPT to 4:8 and 2:4 sparsity increases perplexity over the dense baseline by only 0.11 and 0.39 respectively. Since commercial NVIDIA Ampere GPUs already support 2:4 sparsity, this translates into a roughly 2x speedup in practice.
The authors also studied how the performance of the two hundred-billion-parameter-scale models, OPT-175B and BLOOM-176B, varies with the degree of sparsity imposed by SparseGPT; the results are shown in the figure below.
For BLOOM-176B, magnitude pruning can reach 30% sparsity without significant accuracy loss, whereas SparseGPT reaches 50% sparsity, a 1.66x improvement. Even at 80% sparsity, the perplexity of the SparseGPT-compressed model remains at a reasonable level, while magnitude pruning already drives perplexity above 100 at 40% sparsity on OPT and 60% sparsity on BLOOM.
Additionally, SparseGPT is able to remove approximately 100 billion weights from these models, with limited impact on model accuracy.
Finally, this study shows for the first time that a large-scale Transformer-based pre-trained model can be compressed to high sparsity through one-shot weight pruning, without any retraining and with only a small accuracy loss.
It is worth noting that SparseGPT's approach is local: after each pruning step it performs weight updates designed to preserve the input-output relationship of each layer, and these updates are computed without any global gradient information. The high degree of over-parameterization of large GPT-scale models therefore appears to allow this approach to find an accurate sparse model in the immediate "neighborhood" of the dense pre-trained one.
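This local update can be illustrated as a per-layer least-squares reconstruction: for a fixed pruning mask, the surviving weights are re-fitted so that the sparse layer reproduces the dense layer's outputs on calibration data, with no global gradients involved. The sketch below solves that objective directly with `numpy` least squares; SparseGPT itself uses an efficient approximate-second-order solver rather than explicit `lstsq`:

```python
import numpy as np

def reconstruct_sparse_layer(W: np.ndarray, X: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Given dense weights W (out x in), calibration inputs X (in x samples)
    and a pruning mask (out x in, True = keep), return sparse weights W_hat
    minimizing ||W X - W_hat X||^2 row by row.

    This is the layer-local objective SparseGPT targets; the paper solves it
    with approximate second-order information instead of explicit lstsq.
    """
    W_hat = np.zeros_like(W)
    target = W @ X                                    # dense layer outputs to preserve
    for i in range(W.shape[0]):
        keep = mask[i]                                # columns kept in this output row
        # Least squares over the surviving weights only.
        sol, *_ = np.linalg.lstsq(X[keep].T, target[i], rcond=None)
        W_hat[i, keep] = sol
    return W_hat

# Tiny demo: prune ~50% of a random layer and reconstruct on calibration data.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
X = rng.standard_normal((8, 64))
mask = rng.standard_normal(W.shape) > 0               # random ~50% mask, for illustration
W_hat = reconstruct_sparse_layer(W, X, mask)
print(np.linalg.norm(W @ X - W_hat @ X))              # small reconstruction error
```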
Moreover, because the accuracy metric used in the experiments (perplexity) is very sensitive, the outputs of the resulting sparse model appear to be closely correlated with those of the dense model.
This work goes a long way toward easing the compute constraints of large models. One direction for future work is to study fine-tuning mechanisms for large models to recover accuracy further; another is to extend SparseGPT's approach to the training phase, reducing the computational cost of training large models.