
NVIDIA plays with pruning and distillation: halving the parameters of Llama 3.1 8B for stronger performance at the same size

WBOY | Original | 2024-08-16 16:42:23 | 1056 views

The rise of small models.

Last month, Meta released the Llama 3.1 series of models, which includes Meta's largest model to date, 405B, as well as two smaller models with 70 billion and 8 billion parameters respectively.

Llama 3.1 is considered the start of a new era for open source. However, although the new generation of models performs strongly, it still requires a large amount of computing resources to deploy.

Therefore, another trend has emerged in the industry: developing small language models (SLMs) that perform well enough on many language tasks and are also very cheap to deploy.

Recently, NVIDIA research showed that structured weight pruning combined with knowledge distillation can progressively produce smaller language models from an initially larger model. Yann LeCun, Meta's chief AI scientist, also praised the study.

After pruning and distillation, the NVIDIA research team refined Llama 3.1 8B into Llama-3.1-Minitron 4B and open-sourced it. This is NVIDIA's first release in the Llama 3.1 open source series.

Llama-3.1-Minitron 4B outperforms state-of-the-art open source models of similar size, including Minitron 4B, Phi-2 2.7B, Gemma2 2.6B and Qwen2-1.5B.

The paper related to this research was released as early as last month.

Paper link: https://www.arxiv.org/pdf/2407.14679

Paper title: Compact Language Models via Pruning and Knowledge Distillation

Pruning and Distillation

Pruning makes a model smaller and leaner. It can be achieved by dropping layers (depth pruning) or by dropping neurons, attention heads and embedding channels (width pruning). Pruning is usually accompanied by some degree of retraining to recover accuracy.

Model distillation is a technique for transferring knowledge from a large, complex model (often called the teacher model) to a smaller, simpler student model. The goal is to create a more efficient model that retains much of the predictive power of the original larger model while running faster and using fewer resources.

There are two main distillation methods: SDG (synthetic data generation) fine-tuning and classical knowledge distillation. The two methods are complementary. This article focuses on classical knowledge distillation.

NVIDIA builds large models by combining pruning with classical knowledge distillation. The figure below shows the pruning and distillation process for a single model (top) and the chain of model pruning and distillation (bottom). The specific process is as follows:

1. NVIDIA starts with a 15B model, evaluates the importance of each component (layers, neurons, heads and embedding channels), then ranks and prunes the model to reach the target size: an 8B model.

2. Light retraining is then performed using model distillation, with the original model as the teacher and the pruned model as the student.

3. After training, the small model (8B) is taken as the starting point and is pruned and distilled into a smaller 4B model.

Note that before pruning a model, you need to understand which parts of the model are important. NVIDIA proposes a purely activation-based importance estimation strategy that computes information simultaneously along all relevant axes (depth, neuron, head and embedding channel), using a small calibration dataset of 1024 samples and requiring only forward passes. This approach is simpler and more cost-effective than strategies that rely on gradient information and require backpropagation.
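To make the idea of activation-based importance concrete, here is a minimal sketch in plain Python. It is not NVIDIA's implementation: the function names, the toy activations and the aggregation choice (mean absolute activation over tokens, then L2 norm across calibration samples) are assumptions made for illustration. What it does share with the description above is that only recorded forward-pass activations are needed, with no gradients.

```python
import math

def neuron_importance(calibration_activations):
    """Score each neuron from recorded forward-pass activations.

    calibration_activations[sample][token][neuron] holds one activation value.
    Assumed aggregation: mean |activation| over tokens, L2 norm over samples.
    """
    num_neurons = len(calibration_activations[0][0])
    squared_sums = [0.0] * num_neurons
    for sample in calibration_activations:
        for j in range(num_neurons):
            mean_abs = sum(abs(token[j]) for token in sample) / len(sample)
            squared_sums[j] += mean_abs ** 2
    return [math.sqrt(s) for s in squared_sums]

def rank_neurons(scores):
    """Neuron indices ordered from most to least important."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Toy calibration set: 2 samples, 2 tokens each, 3 neurons (made-up values).
toy_activations = [
    [[0.1, -2.0, 0.0], [0.3, 1.0, 0.0]],
    [[-0.2, 3.0, 0.1], [0.2, -1.0, 0.1]],
]
scores = neuron_importance(toy_activations)
ranking = rank_neurons(scores)
print(ranking)
```

In this toy case neuron 1 has by far the largest activations, so it ranks first, and the lowest-ranked neurons would be the first candidates for pruning.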

During pruning, you can alternate between pruning and importance estimation for a given axis or combination of axes. Empirical studies show that a single importance estimate is sufficient and that iterative estimation brings no additional benefit.

Retraining with classical knowledge distillation

Figure 2 below shows the distillation process, where the N-layer student model (the pruned model) is distilled from the M-layer teacher model (the original unpruned model). The student model is learned by minimizing a combination of embedding output loss, logit loss, and Transformer encoder-specific losses mapped across student blocks S and teacher blocks T.

Figure 2: Distillation training loss.
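As an illustration of the logit loss term only, the sketch below computes a KL divergence between the teacher's and the student's next-token distributions for a single token position. The function names and the `temperature` parameter are assumptions for this example, and the embedding and intermediate-state terms mentioned above are left out.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token position."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

teacher = [2.0, 1.0, 0.1]                      # made-up logits
loss_same = logit_distillation_loss(teacher, teacher)
loss_diff = logit_distillation_loss(teacher, [0.1, 1.0, 2.0])
print(loss_same, loss_diff)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what pushes the pruned student back toward the teacher's behavior during retraining.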

Best Practices for Pruning and Distillation

Based on extensive ablation studies on pruning and knowledge distillation in compact language models, NVIDIA summarizes its findings into the following structured compression best practices.

The first is sizing.

To train a set of LLMs, train the largest one first, then iteratively prune and distill to obtain smaller LLMs.

If using a multi-stage training strategy to train the largest model, it is best to prune and retrain the model obtained in the last stage of training.

Prune the available source model closest to the target size.

The second is pruning.

Prioritize width pruning over depth pruning, which works well for models under 15B parameter size.

Use single-shot importance estimation as there is no benefit in iterative importance estimation.

The third is retraining.

Retrain using only distillation loss instead of regular training.

Use logit, intermediate state and embedding distillation when depth is significantly reduced.

Use logit-only distillation when depth does not decrease significantly.

Llama-3.1-Minitron: Putting best practices into action

Meta recently launched the powerful Llama 3.1 family of open source models that rival closed source models in many benchmarks. Llama 3.1's parameters range from a massive 405B to 70B and 8B.

With the experience of Nemotron distillation, NVIDIA set out to distill the Llama 3.1 8B model into a smaller and more efficient 4B model, taking the following measures:
Teacher fine-tuning

Depth-only pruning

Width-only pruning

Accuracy Benchmark

Performance Benchmark

Teacher Fine-tuning

To correct for the distribution shift relative to the original dataset on which the model was trained, NVIDIA first fine-tuned the unpruned 8B model on their own dataset (94B tokens). Experiments show that if the distribution shift is not corrected, the teacher model provides suboptimal guidance on the dataset during distillation.

Depth-only pruning
To reduce from 8B to 4B, NVIDIA pruned 16 layers (50%). They first evaluate the importance of each layer or group of consecutive layers by removing it from the model and observing the increase in LM loss or the decrease in accuracy on downstream tasks.
Figure 5 below shows the LM loss on the validation set after removing 1, 2, 8 or 16 layers. For example, the red plot at layer 16 indicates the LM loss if the first 16 layers are removed. Layer 17 indicates the LM loss if the first layer is retained and layers 2 to 17 are removed. NVIDIA observes that the starting and ending layers are the most important.
Figure 5: Layer importance in depth-only pruning.
However, NVIDIA observes that this LM loss is not necessarily directly related to downstream performance.
Figure 6 below shows the Winogrande accuracy of each pruned model. It shows that it is best to remove layers 16 to 31, where layer 31 is the penultimate layer; the 5-shot accuracy of that pruned model is significantly higher than random accuracy (0.5). NVIDIA adopted this insight and removed layers 16 through 31.
Figure 6: Accuracy on the Winogrande task when 16 layers are removed.
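The search over contiguous blocks described above can be sketched as follows. This is an illustrative sketch, not NVIDIA's code: `score_without` is a hypothetical callback standing in for "evaluate the model with these layers removed", and the per-layer costs in the toy scorer are invented numbers chosen only to mimic the observation that the first and last layers matter most. They are not measurements.

```python
def contiguous_blocks(num_layers, block_size):
    """All (start, end) pairs, 1-based and inclusive, of consecutive layers."""
    return [(s, s + block_size - 1)
            for s in range(1, num_layers - block_size + 2)]

def best_block_to_remove(num_layers, block_size, score_without):
    """Pick the block whose removal yields the highest downstream score."""
    return max(contiguous_blocks(num_layers, block_size),
               key=lambda block: score_without(*block))

# Hypothetical removal costs for a 32-layer model (illustrative only):
# early layers are costly to drop, the final layer is very costly.
TOY_COST = {layer: 1.0 for layer in range(1, 16)}
TOY_COST.update({layer: 0.1 for layer in range(16, 32)})
TOY_COST[32] = 5.0

def toy_score_without(start, end):
    """Stand-in for a real evaluation, e.g. accuracy after pruning."""
    return -sum(TOY_COST[layer] for layer in range(start, end + 1))

best = best_block_to_remove(32, 16, toy_score_without)
kept_layers = [layer for layer in range(1, 33)
               if not best[0] <= layer <= best[1]]
print(best, kept_layers)
```

With a real scorer, each candidate block costs one evaluation run, so a 32-layer model with 16-layer blocks needs 17 evaluations.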

Width-only pruning
NVIDIA prunes the embedding (hidden) and MLP intermediate dimensions along the width axis to compress Llama 3.1 8B. Specifically, they use the previously described activation-based strategy to compute importance scores for each attention head, embedding channel, and MLP hidden dimension. After importance estimation, NVIDIA chose to:

Prune the MLP intermediate dimension from 14336 to 9216.

Prune the hidden size from 4096 to 3072.

Retain the number of attention heads and layers.
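A minimal sketch of what width pruning does to one MLP block, assuming importance scores are already available: keep the highest-scoring intermediate neurons and slice both weight matrices accordingly. The function names, the weight layout and the toy numbers are assumptions for illustration; a real model would apply this to tensors and also slice the embedding dimension.

```python
def keep_top_k(scores, k):
    """Indices of the k highest-scoring channels, in their original order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

def prune_mlp(w_in, w_out, scores, k):
    """Drop low-importance intermediate neurons from a two-matrix MLP.

    w_in[neuron][hidden] projects into the intermediate dimension;
    w_out[hidden][neuron] projects back out of it.
    """
    keep = keep_top_k(scores, k)
    pruned_in = [w_in[i] for i in keep]
    pruned_out = [[row[i] for i in keep] for row in w_out]
    return pruned_in, pruned_out

# Toy MLP: hidden size 2, intermediate size 4, pruned to 2 neurons.
w_in = [[1, 2], [3, 4], [5, 6], [7, 8]]
w_out = [[1, 2, 3, 4], [5, 6, 7, 8]]
toy_scores = [0.9, 0.1, 0.5, 0.7]
pruned_in, pruned_out = prune_mlp(w_in, w_out, toy_scores, k=2)

# Fractions kept by the sizes quoted in the article.
mlp_keep_ratio = 9216 / 14336
hidden_keep_ratio = 3072 / 4096
print(pruned_in, pruned_out, mlp_keep_ratio, hidden_keep_ratio)
```

For the sizes quoted above, this corresponds to keeping about 64% of the MLP intermediate neurons and 75% of the hidden channels.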

It is worth mentioning that immediately after single-shot pruning, the LM loss of width pruning is higher than that of depth pruning. However, after a brief period of retraining, the trend reverses.

Accuracy Benchmark

NVIDIA uses the following parameters to distill the model:

Peak learning rate = 1e-4

Minimum learning rate = 1e-5

40-step linear warmup

Cosine decay schedule

Global batch size = 1152
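The learning-rate settings listed above can be combined into a schedule roughly like the sketch below. The peak rate, minimum rate and warmup length come from the article; the total number of steps (`total_steps`), the warmup starting just above zero, and the decay ending exactly at the minimum rate are assumptions made for this example.

```python
import math

PEAK_LR = 1e-4       # from the article
MIN_LR = 1e-5        # from the article
WARMUP_STEPS = 40    # from the article

def learning_rate(step, total_steps):
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    decay_steps = max(1, total_steps - WARMUP_STEPS)
    progress = min((step - WARMUP_STEPS) / decay_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

# total_steps=1000 is a hypothetical value, not taken from the article.
for step in (0, 39, 40, 520, 1000):
    print(step, learning_rate(step, total_steps=1000))
```

The rate climbs linearly for the first 40 steps, peaks at 1e-4, and then follows half a cosine wave down to 1e-5 by the final step.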

Table 1 below compares the performance of the Llama-3.1-Minitron 4B model variants (width-pruned and depth-pruned) with the original Llama 3.1 8B model and other similarly sized models on benchmarks across multiple domains. Overall, NVIDIA once again confirmed that a width pruning strategy is more effective than depth pruning, in line with the best practices.

To verify whether the distilled model can become a powerful instruction model, NVIDIA used NeMo-Aligner to fine-tune the Llama-3.1-Minitron 4B model.
They used Nemotron-4 340B training data and evaluated on IFEval, MT-Bench, ChatRAG-Bench and Berkeley Function Calling Leaderboard (BFCL) to test instruction following, role playing, RAG and function calling capabilities. Finally, it was confirmed that the Llama-3.1-Minitron 4B model can be a reliable instruction model, outperforming other baseline SLMs.

Table 2: Accuracy of the aligned models compared with similarly sized aligned models.

Performance Benchmark
NVIDIA optimized the Llama 3.1 8B and Llama-3.1-Minitron 4B models with NVIDIA TensorRT-LLM, an open source toolkit for optimizing LLM inference. The next two figures show the throughput in requests per second of the different models at FP8 and FP16 precision under different use cases, expressed as input sequence length/output sequence length (ISL/OSL) combinations. The 8B model runs with a batch size of 32 and the 4B models with a batch size of 64, since the smaller weights allow a larger batch size on a single NVIDIA H100 80GB GPU.

The Llama-3.1-Minitron-4B-Depth-Base variant is the fastest, with an average throughput of about 2.7 times that of Llama 3.1 8B, while the Llama-3.1-Minitron-4B-Width-Base variant has an average throughput of about 1.8 times that of Llama 3.1 8B. Deployment in FP8 also improves the performance of all three models by approximately 1.3x compared to BF16.

Figure 8: Batch size (BS) combinations: Llama 3.1 8B at BS=32, Llama-3.1-Minitron 4B models at BS=64. 1x H100 80GB GPU.
Conclusion
Pruning and classical knowledge distillation is a highly cost-effective way to progressively obtain smaller LLMs, achieving higher accuracy across all domains than training them from scratch. It is a more effective and data-efficient approach than fine-tuning on synthetic data or pretraining from scratch.

Llama-3.1-Minitron 4B is NVIDIA's first attempt at using the state-of-the-art open source Llama 3.1 series. To use SDG fine-tuning of Llama-3.1 with NVIDIA NeMo, see the /sdg-law-title-generation section on GitHub.

For more information, please see the following resources:

https://arxiv.org/abs/2407.14679
https://github.com/NVlabs/Minitron

https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base

https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Depth-Base

Reference link:

https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
