The authors of this paper are all from Huawei Noah's Ark Lab. The first author is Li Wenshuo, and the corresponding authors are Wang Yunhe and Chen Xinghao. In recent years the team has published a number of representative works at top conferences such as ICML, CVPR, NeurIPS, ICCV, and ECCV, has produced rich results in areas such as efficient large language models and vision models, and collaborates extensively with well-known universities and research institutions.

As the undisputed center of attention in today's AI industry and academia, large models have drawn a huge investment of resources into research and training from scholars and companies alike. As their scale grows, systems and engineering issues have become unavoidable problems in large-model training. For example, during the 54-day training run of Llama 3.1, the system crashed 466 times, averaging once every 2.78 hours!
Frequent checkpointing is therefore essential. But saving checkpoints is itself a major undertaking.
Meta has put a great deal of effort into speeding up checkpoint writes so that checkpoints can be saved more frequently to combat these failures. But frequent saving also means heavy storage overhead: its training cluster is equipped with 240 PB of SSDs to meet this challenge, at a cost of 100 million yuan for storage alone! Huawei Noah's ExCP method was created to tackle this huge overhead: an extreme checkpoint compression technique that compresses checkpoints by a factor of about 70 with essentially no loss, drastically reducing the storage overhead during training.
The code is open source and released under the Apache 2.0 license; several users have already reproduced the results, as reported in the repository's issues.
- Paper: https://arxiv.org/abs/2406.11257
- Repository: https://github.com/Gaffey/ExCP
The method itself is also quite innovative. The article introduces two key ideas: first, it exploits the residual information between checkpoints during training, whose sparsity along the time axis enables a much higher pruning ratio; second, it compresses the optimizer momentum jointly with the weights to achieve a high overall compression rate.
1. Checkpoint residuals during training

During training, the current parameters can be viewed as the weights stored at the previous checkpoint plus the sum of the gradient updates over the intervening iterations. This residual is relatively sparse and carries little information, so compressing it yields a better compression ratio. In contrast, the momentum stored in the optimizer is an exponential moving average of the first and second moments of the gradient; for the first moment, the default decay coefficient is 0.9, so after the hundreds to thousands of iterations between checkpoints the momentum retains little correlation with what was stored in the last checkpoint. The optimizer momentum is therefore compressed directly, rather than as a residual. The final checkpoint to be compressed can thus be expressed as

$$\mathcal{C}_t = \{\Delta W_t,\ M_t\}, \qquad \Delta W_t = W_t - W_{t-1},$$

where $W_t$ denotes the current weights and $M_t$ the optimizer momentum.
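As a rough illustration of this step, here is a minimal sketch (not the official ExCP implementation) of assembling the payload to compress. It assumes a hypothetical checkpoint layout: a dict with a 'weights' map (name → tensor) and an Adam-style 'optimizer' map (name → {'exp_avg', 'exp_avg_sq'}):

```python
import torch

def checkpoint_residual(curr_ckpt: dict, prev_ckpt: dict) -> dict:
    """Assemble the payload to compress: weight residuals plus raw momentum.

    Hypothetical checkpoint layout: {'weights': {name: tensor},
    'optimizer': {name: {'exp_avg': ..., 'exp_avg_sq': ...}}}.
    """
    payload = {"weight_residual": {}, "optimizer": curr_ckpt["optimizer"]}
    for name, w_t in curr_ckpt["weights"].items():
        # The residual W_t - W_{t-1} is sparse and low-information,
        # so it compresses far better than the raw weights.
        payload["weight_residual"][name] = w_t - prev_ckpt["weights"][name]
    return payload
```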
2. Weight / optimizer-momentum joint compression

Existing work on model compression generally focuses only on a model's inference performance, or on the size of its final stored checkpoint, and pays no attention to the storage overhead across the entire training process. Such work therefore compresses only the weights, overlooking the fact that common optimizers such as Adam actually store momentum amounting to twice the number of weights. This work compresses the two together, which on the one hand significantly improves the overall compression ratio, and on the other exploits the correlation between weights and optimizer momentum so that each improves the other's compression ratio.

Weight pruning: since what is pruned is the weight residual, and the second moment of the optimizer momentum roughly reflects the magnitude of the weight residual's change over the recent past, the second moment can serve as the indicator that determines the pruning ratio of each layer. The pruning strategy is given by

$$\Delta \tilde{W}_t = \Delta W_t \odot \mathbb{1}\!\left(|\Delta W_t| > \alpha \sqrt{v_t}\right),$$

where $\Delta W_t$ and $v_t$ denote the weight residual and the second moment respectively, and $\alpha$ controls the pruning ratio.
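A sketch of this pruning rule; `alpha` is a hypothetical hyperparameter (not a value from the paper), and the paper's actual per-layer ratio selection may differ:

```python
import torch

def prune_weight_residual(delta_w: torch.Tensor, v: torch.Tensor,
                          alpha: float = 5e-5):
    """Prune the weight residual using the second moment v as the indicator.

    `alpha` is a hypothetical hyperparameter controlling how aggressive
    the pruning is; returns the pruned residual and the boolean keep-mask.
    """
    mask = delta_w.abs() > alpha * v.sqrt()
    return delta_w * mask, mask
```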
Optimizer momentum pruning: for the momentum, the first moment is used as the pruning indicator (the paper gives a brief proof of convergence). At the same time, if the weight at some position has already been pruned, the optimizer momentum at the corresponding position should be pruned as well, so the pruning strategy is

$$\tilde{m}_t = m_t \odot \mathbb{1}\!\left(|m_t| > \beta\right) \odot \mathbb{1}\!\left(\Delta \tilde{W}_t \neq 0\right),$$

where $m_t$ denotes the first moment and $\beta$ is a threshold chosen for the target pruning ratio; the second moment at the pruned positions is dropped in the same way.
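A corresponding sketch of the joint momentum pruning; `beta` is again a hypothetical threshold:

```python
import torch

def prune_momentum(m: torch.Tensor, v: torch.Tensor,
                   weight_mask: torch.Tensor, beta: float = 1e-5):
    """Joint momentum pruning: keep a position only if its first moment is
    significant AND the weight residual there survived pruning.
    `beta` is a hypothetical threshold, not a value from the paper."""
    mask = (m.abs() > beta) & weight_mask
    return m * mask, v * mask
```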
3. Overall compression pipeline

The overall compression pipeline is shown in Algorithm 1: the steps of computing the weight residual, joint pruning, non-uniform quantization, and coding compression are performed in sequence to obtain the final compressed result.
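Putting the pieces together, here is a minimal end-to-end sketch of such a pipeline, reusing the helper functions sketched above. The quantile-based codebook and zlib are stand-ins for the paper's actual non-uniform quantization and coding schemes:

```python
import pickle
import zlib
import torch

def nonuniform_quantize(vals: torch.Tensor, n_bins: int = 16):
    """Toy non-uniform quantizer: a quantile-based codebook over the
    surviving values. Stands in for the paper's actual scheme."""
    codebook = torch.quantile(vals, torch.linspace(0, 1, n_bins))
    idx = torch.bucketize(vals, codebook).clamp(max=n_bins - 1)
    return codebook, idx

def excp_style_compress(curr_ckpt: dict, prev_ckpt: dict, path: str):
    """Pipeline sketch (cf. Algorithm 1): weight residual -> joint pruning
    -> non-uniform quantization -> coding compression (zlib stand-in)."""
    payload = checkpoint_residual(curr_ckpt, prev_ckpt)
    compressed = {}
    for name, delta_w in payload["weight_residual"].items():
        state = payload["optimizer"][name]
        delta_w, w_mask = prune_weight_residual(delta_w, state["exp_avg_sq"])
        m, v = prune_momentum(state["exp_avg"], state["exp_avg_sq"], w_mask)
        # Quantize only the surviving residual entries into a small codebook.
        codebook, idx = nonuniform_quantize(delta_w[w_mask])
        # Momenta are kept as (mostly zero) tensors here for brevity;
        # the real pipeline quantizes them as well.
        compressed[name] = {"codebook": codebook, "indices": idx,
                            "mask": w_mask, "m": m, "v": v}
    with open(path, "wb") as f:
        f.write(zlib.compress(pickle.dumps(compressed), level=9))
```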
Recovering a complete checkpoint file follows Algorithm 2: after decompression, the floating-point values are first recovered from the codebook and indices stored by the non-uniform quantization, and are then added to the base weights (the original weights of the previous checkpoint, or the previously reconstructed weights) to obtain the complete checkpoint file. Recovering the checkpoint files of an entire training run follows Algorithm 3: after training finishes, only the random seed used to initialize the weights and the compressed result stored at each checkpoint are kept; the checkpoints are then restored in sequence to obtain the complete checkpoint sequence, from which one or more checkpoints can be selected to resume training, run evaluation, and so on. The article evaluates not only large language models: the method also achieves good results on vision models such as ViT-L/32.
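A matching recovery sketch, under the same assumptions as the compression sketch above:

```python
import pickle
import zlib
import torch

def excp_style_recover(path: str, base_weights: dict) -> dict:
    """Recovery sketch (cf. Algorithm 2): decompress, rebuild floats from
    the (codebook, indices) pairs, then add the residual onto the base
    weights (the previous checkpoint's or previously reconstructed ones)."""
    with open(path, "rb") as f:
        compressed = pickle.loads(zlib.decompress(f.read()))
    weights, optimizer = {}, {}
    for name, rec in compressed.items():
        delta_w = torch.zeros_like(base_weights[name])
        # Scatter dequantized values back to the surviving positions.
        delta_w[rec["mask"]] = rec["codebook"][rec["indices"]]
        weights[name] = base_weights[name] + delta_w
        optimizer[name] = {"exp_avg": rec["m"], "exp_avg_sq": rec["v"]}
    return {"weights": weights, "optimizer": optimizer}
```

To recover an entire run in the spirit of Algorithm 3, one would re-initialize the weights from the saved random seed and apply this recovery step to each compressed checkpoint in order, feeding each reconstructed checkpoint back in as the next base.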
The ablation study also shows that pruning the residual, rather than the raw weights, greatly reduces the loss caused by pruning.
The article also shows question-answering examples from a large language model before and after compression; as can be seen, the compression does no visible damage to the model's question-answering ability.