


Taotian Group and Aicheng Technology jointly release Megatron-LLaMA, an open-source training framework for large models
On September 12, Taotian Group and Aicheng Technology officially open-sourced Megatron-LLaMA, a training framework for large models. It aims to make it easier for developers to improve the training performance of large language models and reduce training costs, while remaining compatible with the LLaMA community. Tests show that in 32-card training, Megatron-LLaMA achieves a 176% speedup over the code obtained directly from HuggingFace; in large-scale training, it scales almost linearly relative to 32 cards and shows a high tolerance for network instability. Megatron-LLaMA is now available in the open source community.
Open source address: https://github.com/alibaba/Megatron-LLaMA

In 32-card training, Megatron-LLaMA achieves a 176% speedup over the code obtained directly from HuggingFace; even against a version optimized with DeepSpeed and FlashAttention, Megatron-LLaMA still reduces training time by at least 19%. In large-scale training, Megatron-LLaMA scales almost linearly relative to 32 cards. For example, when reproducing LLaMA-13B training on 512 A100 GPUs, the backward-pass mechanism of Megatron-LLaMA saves at least two days compared with the DistributedOptimizer of native Megatron-LM, without any loss of accuracy.
Megatron-LLaMA shows a high tolerance for network instability. Even on today's cost-effective 8xA100-80GB training clusters with 4x200Gbps communication bandwidth (such clusters are usually mixed-deployment environments in which only half of the bandwidth is available, so network bandwidth is a serious bottleneck, but the rental price is relatively low), Megatron-LLaMA still achieves a linear scaling ratio of 0.85, whereas Megatron-LM achieves less than 0.7 on this metric.

Megatron-LM technology brings high-performance LLaMA training opportunities

LLaMA is currently an important piece of work in the open-source large language model community. LLaMA introduces optimizations such as BPE tokenization, RoPE positional encoding, the SwiGLU activation function, RMSNorm normalization, and untied embeddings into the LLM architecture, and has achieved excellent results in many objective and subjective evaluations. LLaMA is available in 7B, 13B, 30B, and 65B/70B versions, which suit a wide range of large-model scenarios and are popular with developers. As with many open-source large models, the official release provides only inference code, so there is no standard paradigm for training efficiently at the lowest cost.

Megatron-LM is an elegant high-performance training solution. It provides tensor parallelism (TP, which splits large matrix multiplications across multiple cards for parallel computation), pipeline parallelism (PP, which assigns different layers of the model to different cards), sequence parallelism (SP, in which different parts of the sequence are processed by different cards to save GPU memory), and the DistributedOptimizer (similar to DeepSpeed ZeRO Stage-2, which partitions gradients and optimizer states across all compute nodes). These techniques significantly reduce GPU memory usage and improve GPU utilization.
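To make the idea of tensor parallelism concrete, here is a minimal pure-Python sketch. It is not Megatron-LM code: the "cards" are simulated as list entries, and the function names (`split_columns`, `tensor_parallel_matmul`) are hypothetical. It only shows that splitting a weight matrix by columns, multiplying each shard separately, and concatenating the partial outputs gives the same result as the full multiplication.

```python
# Toy illustration of column-parallel tensor parallelism.
# Assumption: "cards" are simulated in one process; no GPU or Megatron-LM API is used.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, num_cards):
    """Split weight matrix w column-wise into num_cards equal shards."""
    n = len(w[0])
    assert n % num_cards == 0, "columns must divide evenly in this sketch"
    width = n // num_cards
    return [[row[c * width:(c + 1) * width] for row in w] for c in range(num_cards)]

def tensor_parallel_matmul(x, w, num_cards):
    """Each simulated card multiplies x by its own shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_cards)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, num_cards=2)
print(full == sharded)
```

In a real framework the concatenation step is a collective communication between GPUs, which is why the interconnect bandwidth discussed above matters.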
Megatron-LM has an active open source community, and new optimizations and features continue to be merged into the framework. However, developing on top of Megatron-LM is not simple, and debugging and functional verification on expensive multi-card machines are costly. Megatron-LLaMA first provides a set of LLaMA training code built on the Megatron-LM framework. It supports model versions of various sizes, can be easily adapted to LLaMA variants, and directly supports tokenizers in the HuggingFace format. Megatron-LLaMA can therefore be integrated into existing offline training pipelines without excessive adaptation. In small and medium-scale training and fine-tuning scenarios for LLaMA-7B and LLaMA-13B, Megatron-LLaMA can easily achieve industry-leading hardware utilization (MFU) of 54% or above.
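The article does not say how MFU is computed. As a rough sketch, assuming the common estimate of about 6 FLOPs per parameter per token for a forward plus backward pass (which ignores attention-score FLOPs), MFU can be approximated as below. The throughput value is a made-up input chosen only to illustrate the arithmetic, not a measurement from the article; 312e12 is the A100's published dense BF16 peak.

```python
def model_flops_utilization(num_params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Approximate MFU: achieved training FLOP/s divided by aggregate peak FLOP/s.

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    which leaves out attention-score FLOPs.
    """
    achieved = 6 * num_params * tokens_per_second
    return achieved / (num_gpus * peak_flops_per_gpu)

# Hypothetical inputs for illustration only:
#   13e9 parameters, 32 GPUs, 312e12 peak FLOP/s per GPU,
#   and an assumed cluster throughput of 69,000 tokens per second.
mfu = model_flops_utilization(13e9, 69_000, 32, 312e12)
print(f"approximate MFU: {mfu:.2%}")
```

Different projects count FLOPs slightly differently, so MFU figures are only comparable when the same accounting is used.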
Megatron-LLaMA's backward-pass optimization

Figure: DeepSpeed ZeRO Stage-2

DeepSpeed is a distributed training framework released by Microsoft, and the ZeRO techniques it proposed have had a profound impact on many subsequent frameworks. DeepSpeed ZeRO Stage-2 (hereinafter ZeRO-2) is a technique in that framework that reduces memory usage without adding extra computation or communication. As shown in the figure above, each Rank needs to hold all parameters for computation. The optimizer state, however, does not need to be replicated: each Rank can be responsible for only part of it, and there is no need for all Ranks to perform exactly the same operations at the same time. ZeRO-2 therefore divides the optimizer state evenly across the Ranks (note that an individual variable does not have to be split evenly or kept whole on a single Rank). During training, each Rank only needs to update the optimizer state and model parameters of its own part. In this setting, gradients can be partitioned in the same way. By default, ZeRO-2 aggregates gradients across all Ranks with Reduce during the backward pass, after which each Rank keeps only the portion corresponding to the parameters it is responsible for. This both eliminates redundant computation and reduces memory usage.

Megatron-LM DistributedOptimizer

Native Megatron-LM implements ZeRO-2-like partitioning of gradients and optimizer state through DistributedOptimizer to reduce GPU memory usage during training. As shown in the figure above, once all gradients have been accumulated over the preset number of gradient accumulation steps, DistributedOptimizer uses the ReduceScatter operator to distribute the accumulated gradients to the different Ranks. Each Rank receives only the part of the gradient it needs to process, and then updates the optimizer state and the corresponding parameters.
Finally, each Rank obtains the updated parameters from the other nodes through AllGather, so that every Rank ends up with all parameters. Actual training results show that Megatron-LM performs gradient and parameter communication serially with the rest of the computation. For large-scale pre-training tasks, the total batch size must remain unchanged, so it is usually not possible to use a larger number of gradient accumulation (GA) steps. The share of time spent on communication therefore grows as machines are added, and serial communication then leads to very weak scalability. The community also feels this need acutely.

To address this, Megatron-LLaMA overlaps communication with computation: its communication operators can run in parallel with the calculation. In particular, compared with ZeRO's implementation, Megatron-LLaMA uses a more scalable collective communication method, made possible under the premise of parallelism by a careful redesign of the optimizer partitioning strategy. The main design of OverlappedDistributedOptimizer ensures the following points:

a) The data volume of a single collective communication operator is large enough to make full use of the communication bandwidth.
b) The amount of data communicated under the new partitioning method equals the minimum communication volume required for data parallelism.
c) Converting between complete parameters or gradients and partitioned parameters or gradients does not introduce too many GPU memory copies.

Specifically, Megatron-LLaMA improves the DistributedOptimizer mechanism and proposes OverlappedDistributedOptimizer, which, combined with the new partitioning method, optimizes the backward pass in training.
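The effect of overlapping can be seen in a simple timing model. The sketch below is an illustration under stated assumptions, not Megatron-LLaMA's scheduler: gradients are produced in chunks, each chunk has a hypothetical compute time and communication time, and there is a single communication stream. In the serial case communication waits until all computation is done; in the overlapped case a chunk's communication starts as soon as its gradients are ready and the link is free.

```python
def serial_time(compute, comm):
    """Backward pass where communication runs only after all computation."""
    return sum(compute) + sum(comm)

def overlapped_time(compute, comm):
    """Backward pass where each chunk's communication overlaps later computation.

    Assumes one communication stream: a chunk is sent once its gradients
    are ready AND the previous chunk's communication has finished.
    """
    compute_done = 0.0
    comm_done = 0.0
    for c, m in zip(compute, comm):
        compute_done += c                               # this chunk's gradients are ready
        comm_done = max(comm_done, compute_done) + m    # wait for data and for the link
    return max(compute_done, comm_done)

# Hypothetical per-chunk times in arbitrary units.
compute = [4.0, 4.0, 4.0, 4.0]
comm = [3.0, 3.0, 3.0, 3.0]
print(serial_time(compute, comm), overlapped_time(compute, comm))
```

With these made-up numbers, only the last chunk's communication remains exposed in the overlapped case, which is why the benefit grows as communication takes a larger share of each step.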
As shown in the figure above, when OverlappedDistributedOptimizer is initialized, every parameter is pre-assigned to a Bucket. Parameters within a Bucket are complete: a parameter belongs to exactly one Bucket, and a Bucket may contain multiple parameters. Logically, each Bucket is divided contiguously into P equal parts (P is the number of Ranks in the data-parallel group), and each Rank in the data-parallel group is responsible for one of them. Buckets are placed in a local queue (the local grad bucket queue) to guarantee communication order. During training, the Ranks in a data-parallel group exchange the gradients they need through collective communication, one Bucket at a time. The Bucket implementation in Megatron-LLaMA uses address indexing wherever possible and allocates new space only when the required values change, avoiding wasted GPU memory.

The above design, combined with a large number of engineering optimizations, allows Megatron-LLaMA to make full use of the hardware in large-scale training and achieve better acceleration than native Megatron-LM. When scaling training from 32 A100 cards to 512 A100 cards, Megatron-LLaMA still achieves a scaling ratio of 0.85 in a commonly used mixed network environment.
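The Bucket partitioning described above can be sketched in a few lines of pure Python. This is a simulation under assumptions, not the project's implementation: Ranks are list entries in one process, the greedy bucket-filling policy and all function names are hypothetical, and plain SGD stands in for the real optimizer. It shows whole parameters being assigned to Buckets, a Bucket's gradient being reduce-scattered so each Rank owns one contiguous shard, and the updated shards being all-gathered back.

```python
def build_buckets(param_sizes, bucket_capacity):
    """Greedily place whole parameters into buckets; each lands in exactly one."""
    buckets, current, used = [], [], 0
    for name, size in param_sizes:
        if current and used + size > bucket_capacity:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets

def reduce_scatter(grads_per_rank):
    """Sum a bucket across ranks; rank r keeps only the r-th contiguous shard."""
    num_ranks = len(grads_per_rank)
    length = len(grads_per_rank[0])
    assert length % num_ranks == 0, "sketch assumes the bucket is padded to divide evenly"
    shard = length // num_ranks
    summed = [sum(vals) for vals in zip(*grads_per_rank)]
    return [summed[r * shard:(r + 1) * shard] for r in range(num_ranks)]

def all_gather(shards):
    """Every rank receives the concatenation of all shards."""
    full = [v for s in shards for v in s]
    return [list(full) for _ in shards]

def sharded_sgd_step(params, grads_per_rank, lr):
    """One data-parallel step in which each rank updates only its own shard."""
    num_ranks = len(grads_per_rank)
    shard = len(params) // num_ranks
    grad_shards = reduce_scatter(grads_per_rank)
    new_shards = []
    for r in range(num_ranks):
        own = params[r * shard:(r + 1) * shard]
        # Average the summed gradient over ranks, then apply plain SGD.
        new_shards.append([p - lr * g / num_ranks for p, g in zip(own, grad_shards[r])])
    return all_gather(new_shards)

buckets = build_buckets([("a", 4), ("b", 2), ("c", 6), ("d", 3)], bucket_capacity=6)
updated = sharded_sgd_step([1.0, 2.0, 3.0, 4.0], [[1.0] * 4, [3.0] * 4], lr=0.5)
print(buckets, updated[0])
```

Because every Rank ends up with the same full parameter list after the all-gather, the result matches an ordinary all-reduce followed by a full update, while each Rank only stores and updates its own shard of the optimizer state.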
Megatron-LLaMA's future plans

Megatron-LLaMA is jointly open-sourced and maintained by Taotian Group and Aicheng Technology, and the training framework is already widely used internally. As more and more developers join LLaMA's open source community and contribute experience that others can learn from, we believe there will be more challenges and opportunities at the training-framework level. Megatron-LLaMA will follow the community's development closely and work with developers on the following directions:

- Adaptive selection of the optimal configuration
- Support for more model structures or local design changes
- Extreme-performance training solutions for more types of hardware environments
Project address: https://github.com/alibaba/Megatron-LLaMA