What happens when you shear the "alpaca" that is Llama 2? Today, Chen Danqi's team at Princeton University proposed LLM-Shearing, a large-model pruning method that, with only a small amount of compute and cost, achieves better performance than models of the same size.
Since the emergence of large language models (LLMs), they have achieved remarkable results on a wide range of natural language tasks, but training them requires massive computing resources. As a result, the industry has become increasingly interested in building equally capable mid-scale models, as shown by the emergence of LLaMA, MPT, and Falcon, which enable efficient inference and fine-tuning. These LLMs of varying sizes suit different use cases, but training each model from scratch (even a small 1-billion-parameter one) still demands substantial compute, which remains a heavy burden for most research institutions.

In this paper, Chen Danqi's team at Princeton University therefore tries to answer the following question: can an existing pre-trained LLM be used to build a smaller, general-purpose, and competitive LLM while requiring far less computation than training from scratch? The researchers explore structured pruning to achieve this goal. The difficulty is that, for general-purpose LLMs, pruning usually causes a performance drop, especially when little compute is invested after pruning. The efficient pruning method they propose can be used to develop smaller yet still competitive LLMs, with training requiring significantly less computation than training from scratch.
- Paper address: https://arxiv.org/abs/2310.06694
- Code address: https://github.com/princeton-nlp/LLM-Shearing
- Models: Sheared-LLaMA-1.3B, Sheared-LLaMA-2.7B
Before pruning the LLM, the researchers identified two key technical challenges. First, how to determine a final pruned architecture that has strong performance and efficient inference? Existing structured pruning techniques for LLMs do not specify a target architecture, which leads to pruned models with unsatisfactory performance and inference speed. Second, how to continue pre-training the pruned model so that it reaches the expected performance? They observed that training on the original pre-training data mixture produces loss reductions across domains that differ from those seen when training the model from scratch.

To address these two challenges, the researchers proposed the LLM-Shearing algorithm. Its novel pruning component, called targeted structured pruning, prunes the source model to a specified target architecture, which is chosen based on the configurations of existing pre-trained models. They show that the pruning method searches for substructures within the source model that maximize performance under resource constraints. In addition, they designed a dynamic batch loading algorithm that loads training data from each domain in proportion to its rate of loss reduction, thereby using the data more efficiently and accelerating overall performance improvement. Finally, the researchers pruned LLaMA2-7B into two smaller LLMs, Sheared-LLaMA-1.3B and Sheared-LLaMA-2.7B, confirming the effectiveness of their method.
They used only 50 billion tokens (i.e., 5% of OpenLLaMA's pre-training budget) for pruning and continued pre-training, yet on 11 representative downstream tasks (such as common sense, reading comprehension, and world knowledge) and on instruction tuning for open-ended generation, both models still outperform other popular LLMs of similar size, including Pythia, INCITE, and OpenLLaMA.
It should be noted, however, that by the time this paper was released, the record for the strongest open-source 3B model had already been taken by StableLM-3B.
In addition, the downstream-task performance trajectories suggest that training the pruned models with more tokens would bring even greater gains. The researchers only experimented with models of up to 7 billion parameters, but LLM-Shearing is general enough to be extended to large language models of any size in future work.

Given an existing large model M_S (the source model), the goal of this work is to study how to efficiently produce a smaller, strong model M_T (the target model). The authors argue that this requires two stages:
- The first stage prunes M_S to M_T. Although this reduces the number of parameters, it inevitably leads to performance degradation;
- The second stage continues to pre-train M_T to strengthen its performance.
Structured pruning removes a large number of parameters from a model, thereby compressing it and accelerating inference. However, existing structured pruning methods can cause models to deviate from conventional architectural configurations. For example, CoFiPruning produces models with non-uniform layer configurations, which incur extra inference overhead compared with standard uniform configurations. This paper extends CoFiPruning so that the source model can be pruned to any specified target configuration; for instance, the INCITE-Base-3B architecture is used as the target structure when producing the 2.7B model. The method also learns a set of pruning masks over model parameters at different granularities. The mask variables are as follows:
Each mask variable controls whether the corresponding substructure is pruned or retained. For example, if the corresponding z^layer = 0, that layer is removed. Figure 2 below illustrates how the pruning masks control which structures are pruned.
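To make the mask mechanism concrete, here is a minimal sketch (in PyTorch-style Python, not the authors' code) of how mask variables at four granularities, layers, attention heads, FFN intermediate dimensions, and hidden dimensions, could gate a Transformer layer's computation; the dimensions and function names are illustrative assumptions.

```python
import torch

# Hypothetical dimensions roughly in the range of LLaMA2-7B, for illustration only.
L, H, d_int, d_hid = 32, 32, 11008, 4096

# One learnable mask per granularity. During mask learning these are relaxed to
# continuous values in [0, 1]; after pruning they are rounded, and a value of 0
# means the corresponding substructure is removed.
z_layer  = torch.ones(L)          # whole Transformer layers
z_head   = torch.ones(L, H)       # attention heads, per layer
z_int    = torch.ones(L, d_int)   # FFN intermediate dimensions, per layer
z_hidden = torch.ones(d_hid)      # hidden (model) dimensions, shared by all layers

def masked_layer(x, attn_fn, ffn_fn, i):
    """Sketch of how the masks gate one layer. attn_fn / ffn_fn stand in for the
    layer's attention and feed-forward sub-modules; in practice z_head and z_int
    are applied inside those sub-modules (per head and per intermediate unit)."""
    h = x + z_layer[i] * attn_fn(x)   # z_layer[i] = 0 removes this layer's contribution
    h = h + z_layer[i] * ffn_fn(h)
    return h * z_hidden               # z_hidden prunes model dimensions globally
```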
After pruning, the final architecture is obtained by keeping, in each substructure, the highest-scoring components associated with the mask variables, and the pruned model is then further pre-trained with the language modeling objective. The authors argue that substantial continued pre-training is necessary for the pruned model to recover its performance. Inspired by prior work, the paper proposes a more efficient algorithm, dynamic batch loading, which simply adjusts each domain's data proportion on the fly based on the model's performance in that domain. The algorithm is as follows:
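As a rough reconstruction of the idea (not the released implementation; function names and the example numbers below are assumptions), dynamic batch loading can be written as a small update rule: every few steps, compare each domain's current loss with its reference loss and exponentially upweight the domains that still lag behind.

```python
import numpy as np

def update_domain_weights(current_loss, reference_loss, prev_weights):
    """One dynamic-batch-loading update: upweight domains whose loss is still
    far above the reference value, then renormalize to a sampling distribution."""
    lag = np.maximum(np.asarray(current_loss) - np.asarray(reference_loss), 0.0)
    alpha = np.asarray(prev_weights) * np.exp(lag)
    return alpha / alpha.sum()

# Example with RedPajama-style domains and placeholder loss values.
domains  = ["CommonCrawl", "C4", "GitHub", "Book", "Wikipedia", "ArXiv", "StackExchange"]
weights  = np.full(len(domains), 1.0 / len(domains))        # e.g. start from the original proportions
ref_loss = np.array([2.0, 2.1, 1.2, 2.3, 1.8, 1.6, 1.7])    # reference losses (made-up numbers)
cur_loss = np.array([2.2, 2.4, 1.2, 2.8, 1.9, 1.6, 1.8])    # pruned model's current losses (made-up)

weights = update_domain_weights(cur_loss, ref_loss, weights)
# Training batches for the next evaluation interval are then sampled according to `weights`.
```

In this toy example the Book and C4 domains lag the most and therefore receive larger weights, which is consistent with the behavior reported in Table 3 below.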
Model configuration: the paper uses LLaMA2-7B as the source model and performs structured pruning experiments on it, compressing LLaMA2-7B into two smaller target sizes of 2.7B and 1.3B parameters. The pruned models are compared with models of similar size, including OPT-1.3B, Pythia-1.4B, OPT-2.7B, Pythia-2.8B, INCITE-Base-3B, OpenLLaMA-3B-v1, and OpenLLaMA-3B-v2. Table 8 summarizes the architectural details of all these models.
Data: since LLaMA2's training data is not publicly accessible, the paper uses the RedPajama dataset. Table 1 lists the pre-training data used by the paper's models and the baseline models.
Training: up to 16 Nvidia A100 GPUs (80GB) were used in all experiments.

Sheared-LLaMA outperforms LMs of equivalent size. The paper shows that Sheared-LLaMA significantly outperforms existing LLMs of similar size while using only a fraction of the compute budget needed to train such models from scratch. Downstream tasks: Table 2 shows the zero-shot and few-shot performance of Sheared-LLaMA and existing pre-trained models of similar size on downstream tasks.
Instruction tuning: as shown in Figure 3, the instruction-tuned Sheared-LLaMA achieves a higher win rate than all other pre-trained models of the same scale.
Figure 4 shows that the INCITE-Base-3B model starts out with much higher accuracy, but its performance plateaus during continued pre-training.
Finally, the researchers analyzed the advantages of the method.

Effectiveness of dynamic batch loading. To analyze the effectiveness of dynamic batch loading, the researchers studied its impact on three aspects: (1) the final LM loss in each domain, (2) each domain's data usage over the course of training, and (3) downstream task performance. The results are based on Sheared-LLaMA-1.3B.

Loss differences across domains. The purpose of dynamic batch loading is to balance the rate of loss reduction in each domain so that the losses reach their reference values at roughly the same time. Figure 5 plots the difference between the model's loss (with original batch loading and with dynamic batch loading) and the reference loss. In contrast with original batch loading, dynamic batch loading reduces the loss evenly, and the loss differences across domains are very similar, which indicates that the data is used more efficiently.
Data usage. Table 3 compares RedPajama's original data proportions with the domain data usage under dynamic batch loading (Figure 7 shows how the domain weights change throughout training). Dynamic batch loading increases the weights of the Book and C4 domains relative to the other domains, suggesting that these domains are harder for the pruned model to recover.
Downstream performance. As shown in Figure 6, the pruned models trained with dynamic batch loading achieve better downstream performance than those trained on the original RedPajama distribution, suggesting that the more balanced loss reduction brought by dynamic batch loading improves downstream performance.
Comparison with other pruning methods. The researchers also compared LLM-Shearing with other pruning methods and reported validation perplexity, a strong indicator of overall model capability. Due to computational constraints, the experiments below control the total compute budget of all compared methods rather than running each method to completion. As shown in Table 4, at the same sparsity, the targeted pruned models in this paper achieve higher inference throughput than the non-uniformly pruned CoFiPruning models, at the cost of slightly higher perplexity.
Table 5 shows that, with the total number of tokens held fixed, increasing the pruning budget consistently improves perplexity. However, since pruning is more expensive than continued pre-training, the researchers allocate only 0.4B tokens to pruning.
For more research details, please refer to the original paper.