
How to Prune LLaMA 3.2 and Similar Large Language Models


Ever-larger models keep delivering better performance, but the demand for more efficient and compact models is also growing. However, reducing a model's size without losing its core capabilities is a complex task.

Techniques such as quantization and pruning are often used to reduce model size, while methods such as knowledge distillation or transfer learning help retain or recover the capabilities lost during the reduction process.

Among these techniques, pruning is one of the most effective strategies for reducing model size. Unlike quantization, which simplifies the numerical representations, pruning removes specific parts of the model, such as neurons or entire layers. But this effectiveness comes at a cost: pruning is difficult to apply correctly. You must not only decide which part of the model to prune, but also carefully select the elements to remove so that the impact on the model's capabilities is minimized.

This article focuses on structured width pruning (removing selected neurons) and demonstrates how to effectively apply it to MLP layers with gated linear unit (GLU) structures. By following the outlined steps, you will understand how pruning can significantly reduce model size while retaining its ability to generate coherent output and perform well in critical benchmarks.

What is pruning and how does it affect the model?

As mentioned earlier, pruning involves removing the parts of the model that are considered to contribute the least to its final output. By carefully selecting these less important components, pruning aims to create a more efficient model with fewer parameters and lower computational requirements, without sacrificing its core capabilities.

The main challenge of pruning is deciding which parts of the model to remove. Not all parts affect performance equally; each has its own role.

To illustrate this, let's examine the structure of the model used in this article: Llama 3.2–1B.

<code>LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)</code>

When examining the structure, we can identify three main modules that can serve as pruning targets: the embeddings, the self-attention mechanism, and the MLP layers. To decide which of these should be the focus of the pruning process, it is necessary to understand the potential benefits and the possible side effects.

The first step is to evaluate how much of the model each of these parts occupies, to get an idea of the potential size reduction.

Parameter distribution analysis

Embedding and output layers (embed_tokens, lm_head):

  • 128256 × 2048 ≈ 262M parameters/layer
  • The two layers have a total of 524M parameters

Self Attention Mechanism (self_attn):

  • 16 layers, each layer contains four projection sub-layers
  • Per layer: 2048 × (2048 + 512 + 512 + 2048) ≈ 10.5M parameters
  • Total: 10.5 × 16 ≈ 168M parameters

MLP layer (mlp):

  • 16 layers with a GLU structure (gate_proj, up_proj and down_proj)
  • Per layer: 2048 × 8192 + 2048 × 8192 + 8192 × 2048 ≈ 50.3M parameters
  • Total: 50.3 × 16 ≈ 805M parameters

We can see that the MLP layers account for more than 50% of the model's size, which makes them clear pruning candidates. However, before making this decision, it is important to understand how each section contributes to the model's behavior.
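These figures can be double-checked directly on the loaded checkpoint. Below is a minimal sketch, assuming the model is loaded through Hugging Face transformers; the grouping labels are mine, and if the embedding and output weights are tied they will only be counted once:

<code>from collections import Counter

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

counts = Counter()
for name, param in model.named_parameters():
    if "embed_tokens" in name or "lm_head" in name:
        group = "embedding / output"
    elif "self_attn" in name:
        group = "self-attention"
    elif "mlp" in name:
        group = "mlp"
    else:
        group = "other (norms)"
    counts[group] += param.numel()

total = sum(counts.values())
for group, n in counts.items():
    print(f"{group:<20} {n / 1e6:7.1f}M  ({100 * n / total:.1f}%)")</code>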

Impact Analysis

The embedding layer is responsible for converting the input into dense vector representations that the model can process effectively. Pruning the embedding layer can cause the model to lose the ability to understand certain words, or at least reduce its capacity to create vectors that correctly capture the semantics of the input. If, for example, you want to create a highly specialized model that uses only a very specific part of its input vocabulary (say, a model for financial or medical analysis), pruning this layer might be an option.

The attention mechanism allows the model to focus on the most relevant parts of the input sequence when processing each token. It computes weighted importance scores between pairs of tokens in the input sequence, allowing the model to capture context and focus on relevant information. Pruning this section reduces the model's ability to perform tasks that require a broad understanding of the input context, such as text summarization or translation. It also affects the coherence of the generated text.

The MLP layers, together with the attention mechanism, enhance the model's ability to understand complex patterns through a series of data expansions and contractions. Pruning this section limits the model's performance on tasks not seen or not covered during training. In other words, it reduces the model's generalization ability and its capacity to provide coherent responses to unfamiliar inputs.

Once you decide which part of the model to target, the next step is to determine whether to perform width pruning (removing individual neurons) or depth pruning (removing entire layers).

As you can see, pruning a model is a fairly complex process involving many decisions. You must evaluate not only the capabilities of the resulting model but also its capacity for further training. Pruned models are designed to be fine-tuned, usually for specific tasks, where they can be more efficient than the base model from which they are derived.

Characteristics of gated linear units

Gated linear unit (GLU) architectures are commonly used in modern neural networks, including LLaMA, Gemma, Mistral, Qwen and similar large language models. GLU introduces an element-wise gating mechanism that allows the model to selectively filter and control the flow of information. The architecture consists of paired layers, usually gate_proj, up_proj and down_proj (as shown in the model structure above), that work together to expand and contract the data.

This mechanism allows the model to handle more complex patterns while maintaining efficiency. However, this also means that the layers in the GLU structure are tightly coupled and pruning these layers requires careful consideration.

Any operation on one layer (for example, removing neurons) must be mirrored in its paired layer. If you remove a neuron from gate_proj, the same neuron must be removed from up_proj, and the down_proj layer must be resized accordingly. Most importantly, when calculating neuron importance to decide which neurons to keep, each pair of neurons has to be evaluated together.

Disturbing the balance between these layers can lead to degraded performance or even a completely unusable model, even if only a small number of neurons are removed.
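To make this coupling concrete, here is roughly what the forward pass of such an MLP block computes. This is a simplified sketch of the standard GLU formulation, not the exact transformers implementation:

<code>import torch.nn as nn

class GLUMLP(nn.Module):
    """Simplified GLU block: gate_proj and up_proj expand to the intermediate
    size, and their element-wise product is then projected back down."""
    def __init__(self, hidden_size=2048, intermediate_size=8192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        # Intermediate neuron i of gate_proj is multiplied with neuron i of
        # up_proj, so both must be kept or removed together, and the input
        # dimension of down_proj has to shrink to match.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))</code>

Because the product is element-wise, row i of gate_proj, row i of up_proj and column i of down_proj all describe the same intermediate neuron, which is why importance has to be scored per neuron pair.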

Pruning Llama 3.2 Model

The example is demonstrated with the Llama model, but the code has also been tested successfully on Gemma and Qwen models.

You can access the full code in my GitHub repository.

GitHub – peremartra/Large-Language-Model-Notebooks-Course: Practical Courses on Large Languages…

The first thing I did, with the original model still in memory, was to run a short prompt and save the result. This gives me an easy, intuitive and quick way to check whether the models produced by the pruning process are coherent or, on the contrary, have lost the ability to generate understandable text.
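A minimal sketch of that sanity check, assuming the model and tokenizer are loaded with transformers; the prompt and generation settings are illustrative:

<code>from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_sample(model, prompt="Paris is the capital of", max_new_tokens=50):
    """Generate a short continuation so pruned variants can be compared by eye."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

reference_output = generate_sample(model)  # saved before pruning, for comparison
print(reference_output)</code>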

I can assure you that on my first attempt, the resulting text left no doubt that there was a fundamental flaw in the pruning process, caused by not respecting the model's GLU structure.

The original prompt is: "Paris is the capital of...". Let's look at the response of the original model and compare it to the response returned by my first failed pruning attempt.

Basic Model:

"Paris is the capital of France and one of the most visited cities in the world. It is a capital of art, culture, fashion and food. The city has a rich history and is home to many famous landmarks, including… …”

Incorrectly pruned model (only 20% removed):

"Paris is the capital of France. This is the main area of.... This is the city of... France..."

Obviously, something didn't work on that first attempt. It may seem trivial, but an empirical check like this can save you a lot of time.

Implementation details

Let's first look at the function responsible for calculating the importance of neurons, which will ultimately determine which neurons remain in the model and which neurons are removed.

<code>def compute_neuron_pair_importance(gate_weight, up_weight):
    """
    Compute neuron pair importance scores (maximum absolute weight).
    Args:
    - gate_weight: weight matrix from the gate_proj layer.
    - up_weight: weight matrix from the up_proj layer.
    Returns:
    - importance_scores: importance score for each neuron pair.
    """
    gate_max_abs = torch.max(gate_weight, dim=1).values + torch.abs(torch.min(gate_weight, dim=1).values)
    up_max_abs = torch.max(up_weight, dim=1).values + torch.abs(torch.min(up_weight, dim=1).values)
    importance_scores = gate_max_abs + up_max_abs
    return importance_scores</code>

This function receives the weights of the gate_proj and up_proj layers which, as explained, work in pairs. Therefore, the importance of the neurons must be calculated jointly.

The calculation is straightforward: for each neuron, the maximum weight is added to the absolute value of the minimum weight. Both positive and negative values are taken into account because, in theory, the neurons with the most extreme values have the greatest influence on the model's output, since they change the values that pass through them the most.

Here I must thank MariusZ Kurman for his contribution of incorporating the minimum values into the calculation. The method works well without them, but including them improves the results.

The importance of each layer is calculated separately, but the function returns a combined value.

<code>def prune_neuron_pairs(mlp, prune_percent):
    """
    Reduce the dimensions of the gate_proj, up_proj and down_proj layers,
    removing the least important neurons.
    Args:
    - mlp: the MLP block to prune.
    - prune_percent: percentage of neurons to prune.
    Returns:
    - new_gate_proj, new_up_proj, new_down_proj: the new pruned layers.
    - k: the new intermediate size.
    """
    # Extract the weights from the MLP layers
    gate_weight = mlp.gate_proj.weight.data.float()
    up_weight = mlp.up_proj.weight.data.float()

    # Compute importance scores
    importance_scores = compute_neuron_pair_importance(gate_weight, up_weight)
    original_intermediate_size = gate_weight.size(0)

    # Compute the number of neurons to keep
    num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size),
                                    original_intermediate_size - 1)
    k = original_intermediate_size - num_neuron_pairs_to_prune

    # Validation check
    if k < 1:
        raise ValueError("k must be greater than 0")

    # Select the neurons to keep
    _, indices_to_keep = torch.topk(importance_scores, k, largest=True, sorted=True)
    indices_to_keep = indices_to_keep.sort().values

    # Create and populate the new layers
    new_gate_proj = nn.Linear(mlp.gate_proj.in_features, k, bias=False).to(device)
    new_up_proj = nn.Linear(mlp.up_proj.in_features, k, bias=False).to(device)
    new_down_proj = nn.Linear(k, mlp.down_proj.out_features, bias=False).to(device)

    # Copy the selected weights into the new layers.
    new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
    new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
    new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]

    return new_gate_proj, new_up_proj, new_down_proj, k</code>

This function creates new, smaller layers while retaining the most important neurons. This process includes:

  • Extract the current weights:
<code># Extract the weights from the MLP layers
    gate_weight = mlp.gate_proj.weight.data.float()
    up_weight = mlp.up_proj.weight.data.float()</code>
  • Compute the importance scores of the neuron pairs:
<code># Compute importance scores
    importance_scores = compute_neuron_pair_importance(gate_weight, up_weight)
    original_intermediate_size = gate_weight.size(0)</code>

We obtain a tensor containing the importance scores calculated for each neuron. These scores reflect each neuron's contribution to the final output, indicating which ones should be kept.

  • Determine the number of neurons to retain:
<code># Compute the number of neurons to keep
    num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size),
                                    original_intermediate_size - 1)
    k = original_intermediate_size - num_neuron_pairs_to_prune</code>

Using the pruning percentage provided as a parameter and the original size of the layers, the total number of neurons to keep is calculated.

  • Select the most important neurons:
<code># Select the neurons to keep
    _, indices_to_keep = torch.topk(importance_scores, k, largest=True, sorted=True)
    indices_to_keep = indices_to_keep.sort().values</code>

torch.topk retrieves the k neurons with the highest importance scores, ordered from most to least important. Since torch returns them in descending order of importance, the indices are then rearranged into ascending order with the sort method, which is what we need to index the weight matrices.

  • Create new, smaller layers:
<code># Create and populate the new layers
    new_gate_proj = nn.Linear(mlp.gate_proj.in_features, k, bias=False).to(device)
    new_up_proj = nn.Linear(mlp.up_proj.in_features, k, bias=False).to(device)
    new_down_proj = nn.Linear(k, mlp.down_proj.out_features, bias=False).to(device)</code>

Three new layers are created whose dimensions are adjusted according to the selected indices. In new_gate_proj and new_up_proj the input dimension is preserved while the output dimension is reduced; conversely, in new_down_proj the input dimension is adjusted while the output dimension stays the same.

  • Copy the selected weights to the new layers:
<code># Copy the selected weights into the new layers.
    new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
    new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
    new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]</code>

The relevant weights are transferred from the original layer to the new layer, ensuring that only the weights corresponding to the selected neuron are retained.

Now, let's look at the function responsible for iterating over all the layers and building the modified model.

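The complete routine is available in the linked repository; what follows is only a minimal sketch of that pass, assuming the prune_neuron_pairs function defined above and a standard LlamaForCausalLM. The name update_model and the exact config handling are illustrative:

<code>def update_model(model, prune_percent):
    """Apply prune_neuron_pairs to the MLP block of every decoder layer and
    keep the configuration in sync with the new intermediate size."""
    new_intermediate_size = None

    for layer in model.model.layers:
        mlp = layer.mlp
        new_gate_proj, new_up_proj, new_down_proj, k = prune_neuron_pairs(mlp, prune_percent)

        # Swap the original projections for the pruned ones.
        mlp.gate_proj = new_gate_proj
        mlp.up_proj = new_up_proj
        mlp.down_proj = new_down_proj
        new_intermediate_size = k

    # Without this, save_pretrained / from_pretrained would rebuild the original
    # 8192-wide MLP blocks and fail to load the pruned weights.
    model.config.intermediate_size = new_intermediate_size
    return model</code>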

This function iterates over each layer of the model, applies the pruning process and updates the configuration of the model to reflect the new architecture.

If the configuration file is not updated, the model cannot be used after saving, whether on Hugging Face or locally. Many libraries (such as Hugging Face's Transformers) rely on model.config to interpret the model's architecture. If the configuration does not match the actual structure, fine-tuning or inference operations performed through these libraries may fail.

Result Analysis

Using this code, I created several models that are available on Hugging Face Hub.

These include:

  • Three models derived from Llama-3.2–1B, with 20%, 40% and 60% of the neurons in their MLP layers pruned, respectively.
  • A model based on Gemma-2–2B, pruned by 40%.

You can download these models and, besides using them, study their architecture and how it has changed compared to the original models they are based on.
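Loading them works like any other Hub checkpoint; here is a short sketch with a placeholder repository id (replace it with the actual model name from the Hub):

<code>from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-user/llama-3.2-1b-pruned-20"  # placeholder id, not the real repository name
tokenizer = AutoTokenizer.from_pretrained(repo_id)
pruned_model = AutoModelForCausalLM.from_pretrained(repo_id)
print(pruned_model)  # inspect the reduced intermediate size of the MLP blocks</code>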

Let's analyze the architectural changes after applying 20% pruning to the Llama 3.2–1B model.

<code>LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (up_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (down_proj): Linear(in_features=6554, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)</code>

The structure of the model remains unchanged except for the size of the intermediate layers in the MLP blocks. As you can see, the gate_proj and up_proj layers have been reduced from 8192 output features to 6554, and the same happened to the down_proj layer, but in its input features.

This change matches exactly what the code does: modify these layers while keeping the neurons most critical to the model's performance. If we remove 20% of 8192, we get 6553.6, which confirms that the correct proportion of neurons has been pruned.
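The rounding can be checked with the same formula used in prune_neuron_pairs:

<code># Pruning 20% of an intermediate size of 8192:
original_intermediate_size = 8192
prune_percent = 0.2
num_pruned = min(int(prune_percent * original_intermediate_size),
                 original_intermediate_size - 1)
k = original_intermediate_size - num_pruned
print(num_pruned, k)  # 1638 neurons removed, 6554 kept per MLP block</code>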

Empirical prompt test

Now, let's see how the pruned model performs in the test prompt:

Paris is the capital of France. It is also one of the most beautiful cities in the world. There are so many things worth seeing and experiencing in Paris that it is impossible to cover them all in one day. But, there are some things...

The response is not identical to that of the original model, but it remains coherent. This suggests that the model keeps most of its capabilities and, more importantly, that any losses could be recovered through knowledge distillation or fine-tuning.

EleutherAI / lm-evaluation

In addition to this empirical check, I also evaluated the model using some of the most common benchmarks. Let's analyze how different degrees of pruning affect the performance of the model.
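For reference, an evaluation of this kind can be run with EleutherAI's lm-evaluation-harness. The sketch below assumes harness version 0.4.x and uses a placeholder model id; exact task names and arguments may differ between versions:

<code>from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=your-user/llama-3.2-1b-pruned-20",  # placeholder id
    tasks=["boolq", "lambada_openai"],
    batch_size=8,
)
print(results["results"])  # per-task accuracy and other metrics</code>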

As we have seen, the effect of pruning is somewhat asymmetric. The task evaluated by the BoolQ test does not suffer a significant decline: for a model that lost 40% of the neurons in its MLP layers, the drop is only about 2%.

In contrast, the impact on Lambada tests was very significant, with an accuracy drop of more than 50%.

This suggests that the model retains much of its comprehension ability but struggles with tests that require more open-ended generation.

BoolQ simply presents the model with a text and a question that must be answered with yes/no. It is a test focused on measuring the model's ability to understand relationships within the input text.

Lambada, on the other hand, asks the model to guess the last word of a paragraph, a complex task in which the final word tests the model's capacity for sophisticated language modeling.

Hugging Face Open LLM Leaderboard

The results of the 20%-pruned model on the Hugging Face Open LLM Leaderboard are even more surprising: it outperforms both its base model and the widely used TinyLlama-1.1B-v1.1.

In this chart, we can see the results of the two models.

By studying this chart, we can draw the following conclusion: the average performance of the pruned model is better than that of the base model (4.86 vs. 4.03). This suggests that the pruning process effectively retains or enhances performance in key areas while reducing redundancy.

From these results, we can identify the advantages and disadvantages of the pruned model.

Advantages:

  • IFEval: Significant improvement (19.94 vs. 14.78) indicates that pruning either reduces overfitting or improves the model's ability to extract information efficiently.
  • MUSR: The better performance (4.39 vs. 2.56) shows that the pruned model handles tasks requiring long-context reasoning or narrative understanding better, possibly thanks to the concentration of weights.

Disadvantages:

  • BBH: The drop in reasoning under uncertainty (3.19 vs. 4.37) may indicate that pruning reduces the model's ability to handle ambiguous scenarios or scenarios with multiple interpretations.
  • MMLU-PRO: The decline on specialized-domain tasks (1.36 vs. 2.26) may be attributed to the removal of weights that retained detailed knowledge of specific domains.

Energy efficiency: The pruned model is slightly more energy efficient (0.4 kg vs. 0.42 kg of CO₂), in line with the goal of reducing computational overhead while maintaining competitive performance.

A more comprehensive study of the model's behavior across different benchmarks is still needed, but these results suggest we have a promising model that could be significantly improved with proper knowledge distillation or fine-tuning. Most importantly, these results are consistent with the pruning performed on the MLP layers.

Conclusion

The pruning process of the model was a success. This way of handling the GLU layers allows us to prune while retaining most of the model's capabilities, considerably reducing its size and resource consumption.

It is important to note that these test results were obtained before performing any capability-recovery process (such as knowledge distillation or fine-tuning) on the pruned model, which is usually done with pruned models.

Future work

There are many pruning techniques still worth exploring. Perhaps the most direct is depth pruning, which involves removing the layers that contribute least to the model's performance.

Another important line of research is applying knowledge distillation to these pruned models and assessing whether they retain the ability to learn new tasks. This could bring their performance closer to the base model, especially on the benchmarks where the pruned models show the greatest losses.

Developing lighter, more efficient models remains an attractive area, especially for companies seeking to deploy LLM capabilities without extensive infrastructure requirements. This work lays the foundation for further research on how to make these powerful models easier to access and deploy.

This article is part of a complete course on large language models, available on GitHub. To hear about new article updates, consider following the repository or starring it.

That way, you will be notified when new content is added. I am the author of the book "Large Language Models Projects: Apply and Implement Strategies for Large Language Models", published by Apress.

I write regularly about generative AI, deep learning, and TensorFlow. Please consider following my account on Medium for updates on new articles. Of course, you are welcome to contact me on LinkedIn.

