
70 times ultimate compression! No matter how many checkpoints your large model has, you won't be afraid.
The AIxiv column is where this site publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

The authors of this paper are all from Huawei Noah's Ark Lab. The first author is Li Wenshuo, and the corresponding authors are Wang Yunhe and Chen Xinghao. In recent years, the team has published a number of representative works at top conferences such as ICML, CVPR, NeurIPS, ICCV, and ECCV, has produced rich results in fields such as efficient large language models and vision models, and cooperates extensively with well-known universities and research institutions.

As the undisputed "king of traffic" in today's AI industry and academia, large models have attracted a large number of scholars and companies to invest resources in research and training. As model scale grows, systems and engineering issues have become unavoidable problems in large-model training. For example, during the 54-day training run of Llama 3.1, the system crashed 466 times, averaging once every 2.78 hours!


Frequent checkpointing is therefore essential. But storing checkpoints is itself a major undertaking.


Meta has made great efforts to speed up checkpoint saving and increase checkpoint frequency to combat frequent system failures. But frequent saving also means heavy storage overhead: its training cluster is equipped with 240PB of SSDs to meet this challenge, and the storage alone costs 100 million yuan!

Huawei Noah's ExCP method was born to address this. To deal with the huge storage overhead, they propose an extreme checkpoint-compression technique that can compress checkpoints 70 times losslessly, significantly reducing the storage overhead during training.


The code is currently open source, released under the Apache 2.0 license. Some users have already reproduced the results, as reported in the repository's issues.


  • Paper address: https://arxiv.org/abs/2406.11257
  • Repository address: https://github.com/Gaffey/ExCP

The method is also quite innovative. The article introduces two key ideas: one is to use the residual information between checkpoints during training, exploiting its sparsity along the time axis to achieve a higher pruning ratio; the other is to compress the optimizer states and the weights jointly to achieve a high overall compression rate.


Specific method

1. Checkpoint residual

During training, the current parameters can be viewed as the weights stored at the previous checkpoint plus the sum of gradient updates over the intervening iterations. This residual is relatively sparse and carries little information, so compressing it yields a better compression ratio. In contrast, the momentum stored in the optimizer is an exponential moving average of the first and second moments of the gradient; for the first moment, the default moving-average decay is 0.9, so after hundreds to thousands of iterations it has little correlation with the content stored at the last checkpoint. The optimizer states are therefore compressed directly rather than as residuals. The final checkpoint to be compressed is expressed as

ΔW_t = W_t − W_{t−1}, together with the optimizer momentum o_t, which is compressed as-is.
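In code, the residual is just an elementwise difference between consecutive checkpoints (a minimal sketch; the function and variable names are illustrative, not taken from the ExCP repository):

```python
import numpy as np

def checkpoint_residual(curr, prev):
    """Elementwise difference between consecutive checkpoints.

    Between two nearby checkpoints most entries barely move, so the
    residual is sparse/low-entropy and compresses far better than the
    raw weights themselves."""
    return {name: curr[name] - prev[name] for name in curr}

# Toy example: two tiny "checkpoints" of a single layer.
prev = {"layer0": np.array([0.50, 0.50, 0.50, 0.50], dtype=np.float32)}
curr = {"layer0": np.array([0.50, 0.51, 0.50, 0.48], dtype=np.float32)}
residual = checkpoint_residual(curr, prev)  # mostly (near-)zero entries
```

The optimizer momenta, by contrast, would be passed to the compressor directly, without differencing.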

2. Joint compression of weights and optimizer momentum

Existing work on model compression generally focuses only on the model's inference performance or on the size of the final stored checkpoint, not on the storage overhead incurred over the entire training process. As a result, existing work compresses only the weights, ignoring the fact that common optimizers such as Adam actually store momentum amounting to twice the number of weights. This work compresses the two together: on one hand, this significantly improves the overall compression ratio; on the other, it exploits the correlation between weights and optimizer momentum so that each improves the other's compression ratio.

Weight pruning: since what is pruned is the residual value, and the second moment of the optimizer momentum roughly represents the magnitude of the weight residual's changes over the recent past, the second moment can be used as an indicator to determine the pruning ratio of different layers. The pruning strategy is shown in the following formula:

W̃ = W ⊙ 𝟙(v ≥ τ_W), where W and v represent the weight (residual) and the second moment respectively, ⊙ denotes elementwise multiplication, and τ_W is the per-layer pruning threshold determined from v.
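A sketch of this per-layer pruning step, using the second moment as the importance score (the `keep_ratio` parameter and the top-k thresholding here are illustrative stand-ins; the paper derives the actual per-layer ratios from the second moment):

```python
import numpy as np

def prune_residual(delta_w, v, keep_ratio=0.5):
    """Keep only the entries of the weight residual whose optimizer
    second moment v is largest; zero the rest."""
    k = max(1, int(keep_ratio * v.size))
    thresh = np.sort(v.ravel())[-k]   # k-th largest second moment
    mask = v >= thresh
    return delta_w * mask, mask

delta_w = np.array([0.01, -0.02, 0.03, 0.00])  # weight residual
v = np.array([0.1, 0.4, 0.2, 0.3])             # second moments
pruned, mask = prune_residual(delta_w, v, keep_ratio=0.5)
```

The returned mask is kept because, as described below, momentum pruning must also zero the positions that were pruned here.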


Optimizer momentum pruning: for momentum pruning, the first moment can be used as the indicator; the paper gives a brief proof of convergence. In addition, if the weight at a position has already been pruned, the optimizer momentum at the corresponding position should be pruned as well, so the pruning strategy is as shown in the following formula:

õ = o ⊙ 𝟙(|m| ≥ τ_o) ⊙ 𝟙(W̃ ≠ 0)

where m represents the first moment and τ_o is the momentum pruning threshold; the last factor zeroes the momentum wherever the weight residual W̃ has been pruned.
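The momentum side can be sketched the same way, scoring by |first moment| and additionally zeroing positions whose weight residual was already pruned (again an illustrative sketch, not the repository's code):

```python
import numpy as np

def prune_momentum(m, v, weight_mask, keep_ratio=0.5):
    """Prune first/second moments: keep the largest-|m| entries that
    also survived weight pruning (weight_mask)."""
    k = max(1, int(keep_ratio * m.size))
    thresh = np.sort(np.abs(m).ravel())[-k]
    mask = (np.abs(m) >= thresh) & weight_mask
    return m * mask, v * mask

m = np.array([0.5, -0.1, 0.9, 0.2])               # first moments
v = np.array([0.1, 0.4, 0.2, 0.3])                # second moments
weight_mask = np.array([True, True, False, True])  # from weight pruning
pm, pv = prune_momentum(m, v, weight_mask, keep_ratio=0.5)
```

Note how position 2 is dropped despite having the largest |m|, because its weight residual was already pruned.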

3. Overall compression process

The overall compression process is shown in Algorithm 1: the steps of computing the weight residual, joint pruning, non-uniform quantization, and entropy-coding compression are performed in sequence to obtain the final compressed result.
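The non-uniform quantization step can be sketched with a quantile-based codebook (the paper's exact codebook construction may differ; this stand-in just shows the idea of storing a small codebook plus per-entry indices instead of full floats):

```python
import numpy as np

def nonuniform_quantize(x, n_levels=4):
    """Build a codebook from value quantiles (denser where values
    cluster) and store each entry as a small codebook index."""
    codebook = np.quantile(x, np.linspace(0.0, 1.0, n_levels))
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx.astype(np.uint8)

def dequantize(codebook, idx):
    """Recover float values by codebook lookup."""
    return codebook[idx]

x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, -1.0])  # pruned residual values
cb, idx = nonuniform_quantize(x, n_levels=4)
x_hat = dequantize(cb, idx)
```

The uint8 index array (plus the tiny codebook) is then what gets entropy-coded in the final step.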


The process of recovering a complete checkpoint file is shown in Algorithm 2. After decompression, the floating-point values are first recovered from the codebook and indices stored by non-uniform quantization, and then added to the baseline weights (either the original weights of the previous checkpoint or the previously reconstructed weights) to obtain the complete checkpoint file. The process of restoring all checkpoint files of a training run is shown in Algorithm 3: after training, only the random seed used to initialize the weights and the compressed result stored at each checkpoint are kept; the checkpoints are then restored in sequence to obtain the complete checkpoint sequence, from which one or more checkpoints can be selected to resume training, run tests, and so on.
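The recovery loop of Algorithm 3 can be sketched as: regenerate the initial weights from the saved seed, then add the decompressed residuals one by one (the function name and the normal-initialization scheme are illustrative assumptions):

```python
import numpy as np

def recover_checkpoints(seed, residuals, shape):
    """Rebuild every checkpoint from the init seed plus stored
    (already-decompressed) residuals, in order."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(shape)   # re-create the initial weights
    checkpoints = []
    for dw in residuals:             # each stored residual ΔW_t
        w = w + dw
        checkpoints.append(w.copy())
    return checkpoints

residuals = [np.full(3, 0.1), np.full(3, -0.05)]
cps = recover_checkpoints(seed=42, residuals=residuals, shape=(3,))
```

Because the seed makes the initialization reproducible, running the recovery twice yields bit-identical checkpoint sequences.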

Experimental results

The article evaluates not only large language models; the method also achieves good results on larger vision models such as ViT-L/32.


The ablation experiments also show that residual pruning greatly reduces the loss caused by pruning.


The article also provides question-answering examples from large language models before and after compression. As can be seen, the compression itself does no visible damage to the model's question-answering ability.

