
Beidahetu releases Galvatron, a distributed training artifact, to realize efficient and automatic parallelization of large models with one click

WBOY | 2023-04-11

In recent times, "large models" have shone across application scenarios in the AI field. Among them, large-scale pre-trained models based on the Transformer are the most typical, and the Transformer has become the core architecture of today's foundation models. For example, the BERT and GPT series in NLP, the ViT and Swin Transformer series in CV, as well as the recently popular mixture-of-experts (MoE) models and the multi-modal model CLIP, all use the Transformer as their core backbone. Correspondingly, such dense large models often have billions, tens of billions, or even trillions of parameters; they incur high computation, storage, and communication overhead and pose huge challenges to AI infrastructure.

To support the training of large models, many tools have been developed (such as Megatron from NVIDIA, DeepSpeed from Microsoft, FairSeq from Meta, etc.) to implement various parallel methods: data parallelism, tensor model parallelism, pipeline parallelism, sharded data parallelism, and so on. These systems provide good encapsulations of these parallel methods and shield the corresponding implementation details, allowing users to implement hybrid parallel strategies by adding configuration options.
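For instance, a hybrid parallel setup in such systems is often expressed as a handful of degree settings rather than hand-written communication code. The snippet below is a purely illustrative sketch (it is not the configuration format of Megatron, DeepSpeed, or any other specific system); the field names are assumptions chosen for readability.

```python
# Hypothetical hybrid-parallel configuration for a 16-GPU job; the keys are
# illustrative only and do not correspond to any real framework's options.
hybrid_parallel_config = {
    "data_parallel_degree": 2,       # replicate the model, split the batch
    "tensor_parallel_degree": 4,     # split individual weight matrices
    "pipeline_parallel_degree": 2,   # split the model into stages
    "sharded_data_parallel": False,  # whether to shard states across DP replicas
}
# 2 (DP) x 4 (TP) x 2 (PP) = 16 GPUs in total; the framework is expected to
# map these degrees onto devices and insert the required communication.
assert 2 * 4 * 2 == 16
```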

Based on these ideas, a great deal of work has focused on how to express the various parallel methods at the tensor or operator level. The "automation" in this line of work is mainly reflected in the transformation from parallel APIs to the execution layer. However, if the effort stops at designing parallel APIs or intermediate representations, this kind of engineering encapsulation does not fundamentally solve the problem of distributed training; the most obvious consequence is that users are still not freed from the burden of distributed deployment. In fact, distributed deployment of large models is a very complex problem. Most current distributed training systems rely on users' repeated manual attempts and the experience of system experts for deployment, which leads to seriously low resource utilization and leaves a considerable gap from true "automatic parallelism".

To address this, the Hetu team from Peking University proposed Galvatron, a distributed training tool that achieves efficient automatic parallelization of large models with one click. The research paper has been accepted by VLDB 2023, a top international conference.


  • Paper address: https://arxiv.org/abs/2211.13878
  • Project code link: https://github.com/PKU-DAIR/Hetu/tree/main/tools/Galvatron

Why automatic parallelization of large models is difficult

The researchers believe that the difficulties in automatic parallelization of large models mainly lie in the following three aspects:

(1) Diversity. First, in terms of parallel methods, today's parallel methods for large models are blooming. Even for the same operator, and even without considering hybrid parallelism, different basic parallel methods differ significantly in memory overhead, communication cost, and computational efficiency. The figure below shows the distributed execution of a simple matrix multiplication operator across multiple GPUs under the four most important basic parallel methods: data parallelism, tensor parallelism, pipeline parallelism, and sharded data parallelism.


Comparison diagram of parallel methods
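To make the contrast concrete, the toy snippet below shows, on plain PyTorch tensors rather than real distributed code, how a single matrix multiplication Y = X · W can be partitioned under data parallelism versus tensor parallelism across two hypothetical devices; the shapes and names are illustrative assumptions.

```python
import torch

batch, d_in, d_out = 8, 16, 32
X = torch.randn(batch, d_in)   # activations
W = torch.randn(d_in, d_out)   # weight matrix

# Data parallelism: each "device" keeps a full copy of W and half of the batch;
# in real training, gradients of W would be synchronized with an all-reduce.
X0, X1 = X.chunk(2, dim=0)
Y_dp = torch.cat([X0 @ W, X1 @ W], dim=0)

# Tensor parallelism (column split): each "device" keeps half of W's columns
# and the full batch; outputs are gathered along the output dimension.
W0, W1 = W.chunk(2, dim=1)
Y_tp = torch.cat([X @ W0, X @ W1], dim=1)

# Both partitionings reproduce the single-device result, but their memory
# footprints and communication patterns differ.
assert torch.allclose(Y_dp, X @ W, atol=1e-5)
assert torch.allclose(Y_tp, X @ W, atol=1e-5)
```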

Second, in terms of models, various model architectures have emerged recently, often with different configurations (such as input sequence length, number of layers, and hidden dimension), which leads to different computational loads. In addition, in terms of hardware, users often face highly differentiated cluster environments with different memory capacities, communication bandwidths, and computing capabilities. In general, because of this diversity, no single parallel technique always achieves the best training efficiency, and "automatic parallelism" has become the core challenge of distributed training.

(2) Complexity. The analysis above is in fact simplified. Even for the same operator, several different basic parallel methods can be applied at the same time, and once combinations of these basic methods (that is, hybrid parallelism) are taken into account, the problem becomes very complicated. More importantly, the computation graph of a large model often has a very large structure and requires a large cluster. Exploring every operator (including selecting appropriate computing resources in the cluster and designing the corresponding hybrid parallel method) leads to a combinatorial explosion of the search space, making it difficult to find the optimal distributed execution plan for the entire model.

(3) Practicality. Practicality is also a very important issue. On the one hand, during automatic parallel search, reasonably accurate estimates of memory, communication, and computation overhead must be provided for the various candidate distributed execution plans; otherwise, the results will deviate too far from actual execution, yielding suboptimal plans or plans that cannot be used at all. This requires an accurate cost model that can capture different model structures and hardware conditions. On the other hand, the additional time introduced by the system's automatic parallelization must stay within an acceptable range; an excessively high search cost is also unacceptable.

Galvatron, a distributed training tool for one-click efficient automatic parallelization of large models

System features:

To solve the above problems, researchers have proposed a series of works exploring automatic search of hybrid parallelism. One line of work mainly explores search spaces that combine data parallelism and model parallelism, with representative works such as FlexFlow and Tofu; another line of work starts from pipeline parallelism and combines it with data parallelism, with representative works such as PipeDream and DAPPLE. On this basis, derivative works such as Unity and Alpa further expand the scope of automatic parallel exploration. The Galvatron system proposed by the Peking University Hetu team also belongs to the field of automatic parallel search, but compared with existing work it has the following three advantages:

(1) In terms of diversity, the parallel dimensions supported by existing work are still relatively limited. Galvatron not only supports more parallel dimensions, but can also accurately model more differentiated Transformer model structures, and its adaptive tuning capability has been verified under different cluster hardware conditions.


Comparison of large model distributed training systems

(2) In terms of complexity, because of its advantages in diversity, Galvatron faces an unprecedentedly large search space. To this end, the researchers made several important observations about current large-scale distributed training, verified them experimentally or theoretically, and used them as pruning rules for the search space, thereby enabling efficient optimization of distributed execution plans.

(3) In terms of practicality, this work combines the advantages of theoretical modeling and experimental measurement to achieve accurate estimates of memory, communication, and computation overhead, and even takes into account the reduced GPU execution efficiency caused by overlapping computation and communication, ensuring sufficiently accurate automatic parallel optimization results.

In addition, Galvatron uses PyTorch as its underlying execution engine and is compatible with mainstream Transformer model implementations such as Huggingface, so it brings no extra burden to PyTorch users. It also requires no additional system installation or debugging cost: adding just a few lines of code is enough to complete the entire automatic parallelization process (see the hypothetical sketch below).

Galvatron workflow and user interface display
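As a rough illustration of what such a few-line integration could look like, the hypothetical sketch below wraps a Huggingface-style model with an automatically searched parallel plan. The function and module names (e.g. `galvatron.auto_parallelize`) are assumptions made for illustration and are not Galvatron's actual API; please refer to the project repository linked above for real usage.

```python
# Hypothetical usage sketch only; the `galvatron` calls below are illustrative
# assumptions, not the real API of the open-source project.
import torch
from transformers import BertConfig, BertForMaskedLM
# import galvatron   # assumed package name

model = BertForMaskedLM(BertConfig())   # an ordinary PyTorch/Huggingface model

# One assumed call that profiles the model and cluster, searches the hybrid
# parallel strategy space, and returns a parallelized model plus a data loader
# wrapper bound to the chosen global batch size.
# model, get_loader = galvatron.auto_parallelize(model, memory_limit_gb=20)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# for batch in get_loader(train_dataset):
#     loss = model(**batch).loss
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
```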

Key technologies

1. Search space decomposition based on decision tree

Galvatron is designed to efficiently and automatically search a complex, enormous space of parallel strategies and generate the optimal parallel execution plan for a given Transformer model and distributed environment. In terms of search space, Galvatron is the first automatic parallel training system in the industry that considers four mainstream parallel methods: data parallelism (DP), sharded data parallelism (SDP), tensor parallelism (TP), and pipeline parallelism (PP). Since a hybrid parallel strategy can contain any combination of these four methods, the resulting search space is very large in multi-GPU scenarios. For example, in a two-machine, four-GPU scenario, one feasible strategy is to use 2-way TP within each machine and 2-way PP between machines; another is to use 2-way PP within each machine and 2-way DP between machines. When the number of GPUs per node grows to 8, each layer of the model has hundreds of candidate strategies, and as the number of layers increases the search space grows exponentially, making it difficult to explore effectively.

To efficiently explore such a huge search space, the study first makes the following observations to guide the search:

  • Takeaway#1: PP tends to be placed across device islands. Here a "device island" refers to a group of devices with high internal bandwidth. For most Transformer models, PP has significantly lower communication volume than the other parallel methods, so it is usually preferable to slice the model with PP and place the stages across device islands.
  • Takeaway#2: With homogeneous devices, parallel strategies tend to divide the devices evenly. For example, 2-way DP on 4 GPUs tends to split them into two groups of 2 GPUs, rather than one group of 1 GPU and one group of 3 GPUs. In this case, the optimal hybrid parallel strategy within one device group is consistent with the optimal strategy within the other groups.
  • Takeaway#3: Generally, when DP and SDP can be mixed, using only SDP is theoretically better. According to the analysis, the communication overhead and memory overhead of N-way SDP are both lower than those of a combination of N_1-way DP and N_2-way SDP, where N_1 × N_2 = N (N_1, N_2 > 1).

Based on these important observations, the study proposes a decision-tree-based method for constructing the search space:

(1) Given a Transformer model, based on Takeaway#1 and Takeaway#2, Galvatron first uses PP to divide the model into multiple stages, and divides the devices evenly and contiguously into the corresponding number of device groups. For example, in the 8-GPU scenario, the model can be split into 1/2/4/8-way PP, corresponding to device group sizes of 8/4/2/1 respectively.

(2) Each PP split corresponds to one decision tree and one sub-search-space. The total number of leaf nodes in the decision tree equals the device group size, and the height of the decision tree equals the number of available parallel methods, i.e., each level of the decision tree can apply one parallel method.

(3) Parallel methods cannot be reused across different levels of the decision tree.

(4) The degrees of non-leaf nodes are chosen by default from powers of 2, i.e., {2, 4, 8, …}.

Using these construction rules, the decision trees built by Galvatron can represent any combination of the above parallel methods. Takeaway#1 and Takeaway#2 help Galvatron avoid inefficient parallel combinations and reduce the search space. For training a one-layer model on 8 GPUs, the rules above yield 34 candidate hybrid parallel strategies; after further applying Takeaway#3 to prune the cases where DP and SDP appear in the same decision tree, the number of candidates drops to 22.
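The candidate counts above can be reproduced with a short enumeration under the stated construction rules. The sketch below is a simplified reimplementation for illustration (it assumes power-of-two degrees, at most one use of each method, and that the order of tree levels matters); it is not Galvatron's actual search code.

```python
from itertools import permutations

METHODS = ["TP", "DP", "SDP"]  # in-group methods; PP is handled by the outer split

def degree_factorizations(group_size):
    """All ordered factorizations of group_size into power-of-two factors >= 2."""
    if group_size == 1:
        return [()]
    results, d = [], 2
    while d <= group_size:
        if group_size % d == 0:
            results += [(d,) + rest for rest in degree_factorizations(group_size // d)]
        d *= 2
    return results

def candidate_strategies(num_gpus, prune_dp_sdp=False):
    """Enumerate per-layer hybrid strategies for one Transformer layer."""
    strategies, pp = [], 1
    while pp <= num_gpus:
        group_size = num_gpus // pp
        for degrees in degree_factorizations(group_size):
            # assign a distinct parallel method to each decision-tree level
            for methods in permutations(METHODS, len(degrees)):
                if prune_dp_sdp and {"DP", "SDP"} <= set(methods):
                    continue  # Takeaway#3: never mix DP and SDP in one tree
                strategies.append((f"{pp}-way PP", tuple(zip(methods, degrees))))
        pp *= 2
    return strategies

print(len(candidate_strategies(8)))                     # 34 candidates
print(len(candidate_strategies(8, prune_dp_sdp=True)))  # 22 after pruning
```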

The following figure shows the decision trees under different PP degrees (8/4/2/1) in the 8-GPU scenario.


Schematic diagram of decision trees under different PP parallelism degrees (8/4/2/1) in the 8-GPU scenario

2. Parallel optimization algorithm based on dynamic programming

Existing systems such as Megatron and DeepSpeed usually require the user to specify a global parallel scheme and its corresponding degrees of parallelism, which severely limits the ability to express distributed execution plans. Galvatron's optimization goal is: given a model definition and a distributed environment, automatically generate the optimal distributed execution plan without requiring the user to specify any parallel configuration. Specifically, given an L-layer model M and N GPU devices each with memory capacity E, Galvatron searches for the parallel plan with the highest system throughput T_pt and returns it; here a parallel plan refers to a fine-grained hybrid parallel strategy that uses the layer (or operator) as its basic unit.


Algorithm 1: Galvatron optimization process

Optimization process: Galvatron's optimization process is shown in Algorithm 1. The outermost loop gradually increases the searched batch size until the device memory is exceeded. For each candidate batch size B, Galvatron first performs PP partitioning of the model according to Takeaway#1 and searches over different PP degrees P (line 4). After selecting a P-way PP, the model is divided into P stages (line 6) and all devices are divided into P groups, each containing N/P devices. Galvatron then builds the corresponding decision trees, which represent all combinations of DP, SDP, and TP without duplication or omission, yielding the strategy set S. For each model stage M_i, under the device memory limit E, Galvatron uses a dynamic programming search to obtain the optimal hybrid parallel strategy for each layer and returns the minimum time cost (line 9). Finally, Galvatron selects the strategy with the highest throughput among all PP degrees and batch sizes and returns it (line 15).
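The following pseudocode is a simplified sketch of this outer search loop, matching the structure described above. The helper functions (`estimate_memory`, `build_strategy_set`, `dp_search`, and so on) are placeholders for Galvatron's internal components, not its real function names.

```python
import math

def galvatron_optimize(model_layers, num_gpus, memory_limit,
                       build_strategy_set, dp_search, estimate_memory):
    """Sketch of Algorithm 1: search over batch size and PP degree, then run a
    per-stage dynamic programming search for the intra-stage hybrid strategy."""
    best_throughput, best_plan, B = 0.0, None, 1
    while True:
        if estimate_memory(model_layers, B) > memory_limit:
            break                      # outer loop: grow the batch size until it no longer fits
        P = 1
        while P <= num_gpus:           # search over PP degrees (1, 2, 4, ...)
            stages = split_into_stages(model_layers, P)   # P pipeline stages
            group_size = num_gpus // P                    # devices per stage
            S = build_strategy_set(group_size)            # DP/SDP/TP decision-tree strategies
            total_time, plan = 0.0, []
            for stage in stages:       # per-stage dynamic programming search
                t, layer_strategies = dp_search(stage, S, memory_limit, B)
                total_time += t
                plan.append(layer_strategies)
            if math.isfinite(total_time):
                throughput = B / total_time
                if throughput > best_throughput:
                    best_throughput, best_plan = throughput, (B, P, plan)
            P *= 2
        B *= 2                         # simplified growth schedule for the sketch
    return best_plan, best_throughput

def split_into_stages(layers, P):
    """Placeholder: split layers into P contiguous, roughly equal stages."""
    base, rem, stages, start = len(layers) // P, len(layers) % P, [], 0
    for i in range(P):
        end = start + base + (1 if i < rem else 0)
        stages.append(layers[start:end])
        start = end
    return stages
```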

Dynamic programming search: The following describes the dynamic programming search used in Galvatron's parallel optimization workflow. For a given model stage containing L layers, the cost function C(L, E) denotes the total execution time of these L layers under the device memory limit E, and c(L, S_j) denotes the execution time of the L-th layer under strategy S_j, where S_j is taken from the parallel strategy candidate set S. With the initial value C(0, E) = 0, Galvatron's dynamic programming search follows the state transition equation (Formula 1):

C(L, E) = min_{S_j ∈ S} { C(L - 1, E - O(L, S_j)) + c(L, S_j) + R(L, S_i, S_j) }

where O(L, S_j) is the memory overhead of the L-th layer under strategy S_j, and R(L, S_i, S_j) is the transition overhead incurred when the L-th layer uses strategy S_j while the previous layer uses strategy S_i. During the state transition, whenever the accumulated memory overhead exceeds the device memory limit E, the cost function C returns infinity.

Complexity analysis: The computational complexity of Galvatron's dynamic programming search (Formula 1) is O(L·E·|S|). The size of the per-layer search space S is therefore crucial to the overall search cost: the decision-tree-based search space decomposition proposed by Galvatron significantly reduces the search space and keeps the search overhead within a reasonable range.
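A minimal sketch of this recurrence is shown below, assuming integer memory units, hashable strategy objects, and externally supplied cost functions c (execution time), o (memory), and r (strategy transition overhead). It tracks the previous layer's strategy explicitly, which is one straightforward way to realize Formula 1, not necessarily Galvatron's exact implementation.

```python
def dp_search_stage(L, E, S, c, o, r):
    """Dynamic programming over (memory used, strategy of the latest layer).
    L: number of layers in the stage; E: integer memory budget;
    S: candidate strategy set; c(l, s), o(l, s), r(l, s_prev, s): cost functions.
    r(l, None, s) is assumed to be 0 for the first layer."""
    INF = float("inf")
    # best[(e, s)] = minimum total time after placing the first l layers,
    # having consumed e memory units, with layer l running under strategy s.
    best = {(0, None): 0.0}
    for l in range(1, L + 1):
        new_best = {}
        for (e, s_prev), t_prev in best.items():
            for s in S:
                e_new = e + o(l, s)
                if e_new > E:
                    continue           # memory limit exceeded: cost is infinite
                t_new = t_prev + c(l, s) + r(l, s_prev, s)
                if t_new < new_best.get((e_new, s), INF):
                    new_best[(e_new, s)] = t_new
        best = new_best
    return min(best.values(), default=INF)
```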

3. Execution cost estimation method based on hybrid modeling

Galvatron uses a cost estimation module to estimate the computation, communication, and memory costs of hybrid parallel strategies. Existing cost estimation methods mainly include profiling and simulating; Galvatron draws on the strengths of both and designs a cost estimation approach that is cheap, efficient, and accurate. Specifically, for memory overhead, Galvatron estimates it from the shapes and data types of tensors; for computation time, Galvatron measures the per-sample computation time by profiling on a single device and combines it with the batch size and a fitting function to estimate the overall computation time; for communication time, Galvatron divides the communication volume by the device communication bandwidth, where the communication volume is derived theoretically and the bandwidth is measured by profiling.

Based on these estimates, Galvatron simulates the execution process to compute the cost c(l, s) of a given layer l under a given strategy s. Unlike the cost models of existing distributed training systems, Galvatron is the first to model the GPU performance degradation caused by overlapping computation and communication. The study found experimentally that this overlap-induced slowdown can significantly affect system execution efficiency, something previous work had ignored. As a result, Galvatron's cost estimation is more accurate and its parallel optimization is better.
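The sketch below illustrates this hybrid estimation style with made-up numbers: compute time from a profiled per-sample cost plus a linear fit in batch size, communication time from a theoretical volume divided by a profiled bandwidth, memory from tensor shapes and data types, and an assumed slowdown coefficient for the overlapped region. All constants and names here are illustrative assumptions, not Galvatron's measured values.

```python
from dataclasses import dataclass

@dataclass
class LayerProfile:
    per_sample_time_ms: float   # profiled on a single device
    fixed_overhead_ms: float    # intercept of the linear fit in batch size
    comm_volume_mb: float       # derived theoretically from the parallel strategy
    param_count: int
    activation_elems_per_sample: int

def estimate_layer_cost(p: LayerProfile, batch_size: int,
                        bandwidth_mb_per_ms: float,
                        overlap_slowdown: float = 1.3,
                        bytes_per_elem: int = 2) -> dict:
    """Hybrid profiling + modeling estimate for one layer under one strategy."""
    # Computation: per-sample profiled time scaled by batch size (linear fit).
    compute_ms = p.fixed_overhead_ms + p.per_sample_time_ms * batch_size
    # Communication: theoretical volume divided by profiled bandwidth.
    comm_ms = p.comm_volume_mb / bandwidth_mb_per_ms
    # Overlap penalty: overlapped computation runs slower on the GPU, so the
    # overlapped region is stretched by an assumed slowdown coefficient.
    overlapped = min(compute_ms, comm_ms) * overlap_slowdown
    total_ms = max(compute_ms, comm_ms) - min(compute_ms, comm_ms) + overlapped
    # Memory: estimated from tensor shapes (element counts) and data types.
    memory_mb = (p.param_count +
                 p.activation_elems_per_sample * batch_size) * bytes_per_elem / 2**20
    return {"time_ms": total_ms, "memory_mb": memory_mb}

# Example with made-up numbers:
prof = LayerProfile(per_sample_time_ms=0.8, fixed_overhead_ms=0.2,
                    comm_volume_mb=48.0, param_count=12_000_000,
                    activation_elems_per_sample=4_000_000)
print(estimate_layer_cost(prof, batch_size=16, bandwidth_mb_per_ms=10.0))
```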

Experimental results

Experimental settings: In the experiments, the researchers compared Galvatron against four baseline systems each using a single parallel strategy (DP, SDP, TP, PP) and against expert-configured DeepSpeed 3D parallelism. In addition, two weakened versions of Galvatron were included as auxiliary baselines, performing automatic parallel search within restricted strategy spaces (i.e., TP+DP and PP+DP). The study selected the NLP Transformer models BERT and T5, and the CV Transformer models ViT and Swin Transformer as experimental subjects.


Throughput comparison between Galvatron and baseline systems on 8 GPUs with a 20GB GPU memory limit

Experimental comparison: The study first ran experiments in an environment with eight Nvidia RTX TITAN 24GB GPUs. The experiments show that, under different model sizes and memory constraints, Galvatron always achieves the best throughput and significantly outperforms both state-of-the-art single parallel methods and hybrid parallel methods in training throughput. Specifically, on the ViT model, Galvatron's throughput speedup reaches up to 338% over single strategies and up to 55% over other hybrid parallel strategies; on the other three models, the speedups reach 200%-334% over single strategies and 28%-52% over existing hybrid strategies.


Illustration of some optimal parallel strategies found by Galvatron's search

Interpretability experiment: The study presents some of the optimal parallel strategies found by Galvatron's search. For the BERT model with 8GB of GPU memory (Case A), Galvatron chose two hybrid parallel strategies, PP-TP-DP and PP-TP-SDP; when the available memory increased to 12GB, Galvatron dropped PP, used more DP, and introduced SDP to save memory. The situation is slightly different on Swin Transformer, where different layers of the model show obvious heterogeneity: when memory is relatively scarce (Case C), the shallow layers use higher degrees of SDP, and as the layer index increases, activations become smaller and parameters become larger, so TP gradually replaces SDP. When memory increases (Case D), not only is PP re-enabled to replace part of the inefficient SDP, but the shallow layers also show a clearer tendency to use DP.

Scalability experiment: The study further tested Galvatron on larger clusters, including an environment with 16 Nvidia RTX TITAN GPUs and an environment with 64 Nvidia A100 GPUs. In the 16-GPU environment, Galvatron still achieved the highest throughput among all strategies; compared with the 8-GPU results under the same memory limit, Galvatron obtained more than a 2x speedup on 16 GPUs thanks to its more diverse hybrid parallel strategies. In the 64-GPU experiment, Galvatron also achieved the highest throughput among the compared strategies. This shows that Galvatron has good scalability; detailed results can be found in the original paper.

Introduction to the Peking University Hetu Team

The Hetu development team comes from the Data and Intelligence Research Lab at Peking University, led by Professor Cui Bin of the School of Computer Science at Peking University. Over the years, the lab has conducted cutting-edge research in artificial intelligence and big data, achieved many results in theoretical and technological innovation as well as system development, and published more than 100 academic papers in top international conferences and journals.

The Hetu system is a distributed deep learning system designed for very large models. Compared with existing distributed deep learning frameworks, it has advantages in system functionality, system complexity, and ease of use, with many innovative contributions such as automatic distributed parallel strategies, consistency protocols and communication architectures, and GPU operator optimizations. The Hetu team has carried out academic innovations in a variety of distributed machine learning and deep learning scenarios, and the related results have been accepted at top international conferences such as SIGMOD, VLDB, ICML, and KDD. Among them, HET, a distributed training system for sparse large models, won the VLDB 2022 Best Scalable Data Science Paper Award. Galvatron, the paper accepted by VLDB 2023, is another breakthrough by the Hetu team in distributed training of dense large models; it has been integrated into the Hetu system and open-sourced. The Hetu team has also carried out research collaborations and application deployments with many well-known companies such as Tencent, Alibaba, Kuaishou, and ByteDance.

