


In recent times, "large models" have shone in various application scenarios in the AI field. Among them, large-scale pre-trained models based on the Transformer are the most typical, and the Transformer has become the core architecture of today's foundation models. For example, the BERT and GPT series in NLP, the ViT and Swin Transformer series in CV, as well as the recently popular mixture-of-experts (MoE) models and the multi-modal model CLIP, all use the Transformer as their core infrastructure. Correspondingly, such dense large models often have billions, tens of billions, or even trillions of parameters. They incur high computing, storage, and communication overhead, and pose huge challenges to AI infrastructure.
In order to support the training of large models, people have developed many tools (such as Megatron from NVIDIA, DeepSpeed from Microsoft, FairSeq from Meta, etc.) that implement various parallel methods: data parallelism, tensor model parallelism, pipeline parallelism, sharded data parallelism, and so on. These systems encapsulate the above parallel methods well and hide the corresponding implementation details, allowing users to realize hybrid parallel strategies simply by adding configuration options.
Building on these ideas, much work has focused on how to express the various parallel methods at the tensor or operator level. The "automation" of this type of work is mainly reflected in the transformation from parallel APIs to the execution layer. However, if the effort stops at designing parallel APIs or intermediate representations, this engineering encapsulation does not fundamentally solve the problem of distributed training. The most obvious consequence is that users still cannot be freed from the burden of distributed deployment. In fact, the distributed deployment of large models is a very complex problem. Most current distributed training systems rely on users' repeated manual attempts and the experience of system experts for deployment, which causes serious problems of low resource utilization, so there is still a considerable gap from true "automatic parallelism".
In response, the Peking University Hetu team proposed Galvatron, a distributed training tool that achieves efficient automatic parallelization of large models. The research paper has been accepted at the top international conference VLDB 2023.
- Paper address: https://arxiv.org/abs/2211.13878
- Project code link: https://github.com/PKU-DAIR/Hetu/tree/main/tools/Galvatron
What makes automatic parallelization of large models difficult?
Researchers believe that the difficulties in automatic parallelization of large models are mainly reflected in the following three aspects:
(1) Diversity: First, in terms of parallel methods, today's parallel methods for large models are blossoming everywhere. Even for the same operator, and without considering hybrid parallelism, different basic parallel methods already differ significantly in memory overhead, communication cost, and computational efficiency. The following figure shows the four most important basic parallel methods, namely Data Parallelism, Tensor Parallelism, Pipeline Parallelism, and Sharded Data Parallelism, and how a simple matrix multiplication operator is executed in a distributed manner on two GPUs under each of them.
Comparison diagram of parallel methods
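To make the differences concrete, the following is a minimal sketch written for this article (not Galvatron code): it uses plain PyTorch on a single CPU to simulate how one matrix multiplication could be split across two devices under DP, TP, and SDP, while PP instead splits the model across layers.

```python
# Minimal illustration: simulate two "devices" on one CPU and show how a single
# matmul Y = X @ W is split under the basic parallel methods described above.
import torch

B, H = 8, 4                      # batch size and hidden size
X = torch.randn(B, H)            # input activations
W = torch.randn(H, H)            # weight of the matmul "operator"

# Data parallelism (DP): every device keeps a full copy of W, the batch is split.
X0, X1 = X.chunk(2, dim=0)
Y_dp = torch.cat([X0 @ W, X1 @ W], dim=0)

# Tensor parallelism (TP): every device keeps the full batch, W is split by columns.
W0, W1 = W.chunk(2, dim=1)
Y_tp = torch.cat([X @ W0, X @ W1], dim=1)

# Sharded data parallelism (SDP): like DP, but W is also sharded across devices
# and all-gathered right before use (the concat below stands in for all-gather).
W_full = torch.cat(W.chunk(2, dim=0), dim=0)
Y_sdp = torch.cat([X0 @ W_full, X1 @ W_full], dim=0)

# Pipeline parallelism (PP) splits the model by layers rather than inside one
# operator, so a single matmul would simply be placed entirely on one stage.

assert torch.allclose(Y_dp, X @ W) and torch.allclose(Y_tp, X @ W) and torch.allclose(Y_sdp, X @ W)
```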
Second, in terms of models, various model architectures have emerged one after another recently, often accompanied by different model configurations (such as different input sequence lengths, numbers of model layers, and hidden layer widths), resulting in different computational loads. In addition, in terms of hardware, users often face very heterogeneous cluster environments, with different memory capacities, communication bandwidths, computing capabilities, and so on. In general, because of this diversity, no single parallel technique can always achieve the best training efficiency, and "automatic parallelism" has become the core challenge of distributed training.
(2) Complexity: The above analysis is still relatively simple. In fact, even for the same operator, multiple different basic parallel methods can be applied at the same time. If combinations of these basic parallel methods, i.e. hybrid parallelism, are taken into account, the problem becomes very complicated. More importantly, the computation graph of a large model often has a very large structure and requires a large cluster. Exploring every operator (including selecting appropriate computing resources in the cluster and designing the corresponding hybrid parallel method) leads to a combinatorial explosion of the search space, making it difficult to find the optimal distributed execution plan for the entire model.
(3) Practicality: In addition, practicality is also a very important issue. On the one hand, during automatic parallel search, reasonably accurate estimates of memory, communication, and computation overhead must be provided for the various distributed execution plans; otherwise, the results will deviate too much from actual execution, leading to suboptimal plans or plans that cannot be used at all. This requires an accurate cost model that can model different model structures and hardware conditions. On the other hand, the extra time overhead introduced by the system's automatic parallel capability must stay within an acceptable range; an excessively high search cost is also unacceptable.
Galvatron: a distributed training tool that realizes efficient automatic parallelization of large models with one click
System features:
To solve the above problems, researchers have proposed a series of works exploring automatic search of hybrid parallelism. One type of work mainly discusses search spaces that consider both data parallelism and model parallelism, with representative works including FlexFlow and Tofu; the other type starts from pipeline parallelism and combines it with data parallelism, with representative works including PipeDream and DAPPLE. On this basis, there are also derivative works such as Unity and Alpa that further expand the scope of automatic parallel exploration. The system "Galvatron" proposed by the Peking University Hetu team also belongs to the field of automatic parallel search, but compared with existing work, it has the following three advantages:
(1) In terms of diversity, the parallel dimensions supported by existing work are still relatively limited. Galvatron not only supports more parallel dimensions, but can also accurately model more differentiated Transformer model structures, and its adaptive tuning capability has been verified under different cluster hardware conditions.
Comparison diagram of large model distributed training systems
(2) In terms of complexity, because of its advantages in diversity, Galvatron faces an unprecedentedly large search space. To this end, the researchers made several important observations about current large-scale distributed training, verified them experimentally or theoretically, and used them as pruning criteria for the search space, thus enabling efficient optimization of distributed execution plans.
(3) In terms of practicality, this research combines the advantages of theoretical modeling and experimental measurement to achieve accurate estimates of memory, communication, and computation overhead, and even takes into account the reduction in GPU execution efficiency caused by overlapping computation and communication, ensuring that sufficiently accurate automatic parallel optimization results are obtained.
In addition, Galvatron uses PyTorch as its underlying execution engine and is compatible with mainstream Transformer model implementations such as Huggingface, so it places no additional burden on PyTorch users. At the same time, it requires no extra system installation or debugging effort; users only need to add a few lines of code to complete the entire automatic parallelization process.
Galvatron workflow and user interface display
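As noted above, only a few lines of code need to be added around an ordinary PyTorch/Huggingface model. The snippet below is a hypothetical usage sketch: the package import and the function names galvatron.initialize and galvatron.auto_parallelize are illustrative assumptions rather than the actual Galvatron interface; see the project repository linked above for the real entry points.

```python
# Hypothetical usage sketch -- galvatron.initialize / galvatron.auto_parallelize
# are placeholder names, not the real API; the point is that the user keeps an
# ordinary Huggingface model and training loop and only adds a few lines.
from transformers import BertConfig, BertForMaskedLM
# import galvatron   # assumed package name

config = BertConfig(num_hidden_layers=4, hidden_size=256)
model = BertForMaskedLM(config)          # a plain Huggingface Transformer model

# galvatron.initialize()                       # set up the distributed environment
# model = galvatron.auto_parallelize(model)    # search for and apply the optimal
#                                              # hybrid parallel execution plan
# ...the training loop then proceeds as with any ordinary PyTorch model...
```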
Key technologies
1. Search space decomposition based on decision tree
The design goal of Galvatron is to efficiently and automatically search a complex and large parallel strategy space and generate the optimal parallel execution plan for a given Transformer model and distributed environment. In terms of search space, Galvatron is the first automatic parallel training system in the industry to consider four mainstream parallel methods: data parallelism (DP), sharded data parallelism (SDP), tensor parallelism (TP), and pipeline parallelism (PP). Since a hybrid parallel strategy may include any combination of these four parallel methods, the resulting search space is enormous in multi-GPU scenarios. For example, in a two-machine, four-GPU scenario, one feasible strategy is to use 2-way TP within each machine and 2-way PP between machines; another feasible strategy is to use 2-way PP within each machine and 2-way DP between machines. When the number of GPUs per node grows to 8, there are hundreds of candidate strategies for each layer of the model. As the number of model layers increases, the size of the search space grows exponentially, making it difficult to explore effectively.
To efficiently search such a huge search space, the study first proposes the following observations as a guide:
- Takeaway #1: PP tends to be placed across device islands. Here a "device island" refers to a group of devices with high internal bandwidth. In most Transformer models, the communication volume of PP is significantly smaller than that of the other parallel methods, so PP is usually applied first to slice the model, and the resulting stages are placed across device islands.
- Takeaway #2: Given homogeneous devices, parallel strategies tend to divide the devices evenly. For example, 2-way DP over 4 GPUs tends to split the devices into two groups of 2 GPUs, rather than one group of 1 GPU and one group of 3 GPUs. In this case, the optimal hybrid parallel strategy within one device group is consistent with the optimal strategy within the other groups.
- Takeaway #3: Generally speaking, when DP and SDP can both be used, using only SDP is theoretically better. According to the analysis, the communication overhead and memory overhead of N-way SDP are both lower than those of a combination of N1-way DP and N2-way SDP, where N = N1 × N2 and N1, N2 > 1.
Based on the above important observations, this study proposes a search space construction method based on decision trees:
(1) Given a Transformer model, based on Takeaway #1 and Takeaway #2, Galvatron first uses PP to divide the model into multiple stages, and at the same time divides the devices evenly and contiguously into multiple device groups. For example, in the 8-GPU scenario, the model is divided using 1/2/4/8-way PP, corresponding to device group sizes of 8/4/2/1 respectively.
(2) Each PP division corresponds to a decision tree and a sub-search-space. The total number of leaf nodes of the decision tree equals the device group size, and the height of the decision tree equals the number of available parallel methods, i.e. one parallel method can be applied at each level of the decision tree.
(3) Parallel methods cannot be reused across different levels of the decision tree.
(4) The degrees of non-leaf nodes are by default selected from powers of 2, i.e. {2, 4, 8, ...}.
Based on the above construction rules, the decision trees built by Galvatron can represent any combination of the above parallel methods. Takeaway #1 and Takeaway #2 help Galvatron avoid inefficient parallel combinations and reduce the search space. For the scenario of training a one-layer model on 8 GPUs, these rules produce 34 candidate hybrid parallel strategies. Furthermore, after applying Takeaway #3 to prune the cases where DP and SDP appear in the same decision tree, the number of 8-GPU candidate strategies is reduced to 22 (the enumeration sketch after the figure below reproduces both counts).
The following figure shows schematic diagrams of the decision trees under different PP degrees (8/4/2/1) in the 8-GPU scenario.
Schematic diagram of the decision trees under different PP degrees (8/4/2/1) in the 8-GPU scenario
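The strategy counts quoted above (34 candidates for a one-layer model on 8 GPUs, 22 after pruning with Takeaway #3) can be reproduced with a short enumeration that follows the decision-tree rules. The script below is an illustration written for this article, not part of Galvatron.

```python
# Enumerate hybrid parallel strategies for a single layer following the
# decision-tree rules: PP first splits the devices into equal groups, each tree
# level applies one of {DP, SDP, TP} (no reuse across levels), and node degrees
# are powers of 2.
from itertools import permutations

METHODS = ("DP", "SDP", "TP")

def degree_seqs(group_size):
    """All ordered sequences of powers of 2 (each >= 2) whose product is group_size."""
    if group_size == 1:
        return [()]
    seqs, d = [], 2
    while d <= group_size:
        if group_size % d == 0:
            seqs += [(d,) + rest for rest in degree_seqs(group_size // d)]
        d *= 2
    return seqs

def strategies(num_gpus, prune_dp_sdp=False):
    """All (pp_degree, ((method, degree), ...)) combinations for one layer."""
    result, pp = [], 1
    while pp <= num_gpus:
        group_size = num_gpus // pp
        for degrees in degree_seqs(group_size):
            for methods in permutations(METHODS, len(degrees)):
                if prune_dp_sdp and "DP" in methods and "SDP" in methods:
                    continue  # Takeaway #3: pure SDP dominates DP+SDP mixtures
                result.append((pp, tuple(zip(methods, degrees))))
        pp *= 2
    return result

print(len(strategies(8)))                     # 34 candidate strategies
print(len(strategies(8, prune_dp_sdp=True)))  # 22 strategies after pruning
```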
2. Parallel optimization algorithm based on dynamic programming
Existing systems such as Megatron or DeepSpeed usually require the user to specify the global parallel scheme and its corresponding degrees of parallelism, which severely limits their ability to express distributed execution plans. The optimization goal of Galvatron is to automatically generate the optimal distributed execution plan, given only a model definition and a distributed environment, without requiring the user to specify any parallel configuration. Specifically, given an L-layer model M and N GPU devices with memory capacity E, Galvatron searches for the parallel plan with the highest system throughput T_pt and returns it. The parallel plan here refers to a fine-grained hybrid parallel strategy that takes the layer (or operator) as its basic unit.
Algorithm 1: Galvatron optimization process
Optimization process: The optimization process of Galvatron is shown in Algorithm 1. Galvatron's outermost loop gradually increases the searched batch size until it exceeds the device memory. Given each candidate batch size B, Galvatron first applies PP to the model according to Takeaway #1 and searches over different PP degrees P (line 4). After selecting P-way PP, the model is divided into P stages (line 6), and all devices are divided into P groups, each containing N/P devices. Galvatron then builds the corresponding decision trees, which represent any combination of DP, SDP, and TP without duplication or omission, thereby obtaining the strategy set S. Then, for each model stage M_i, under the device memory limit E, Galvatron uses dynamic programming to search for the optimal hybrid parallel strategy of each layer and returns the minimum time cost (line 9). Finally, Galvatron selects the strategy with the highest throughput over all PP degrees and batch sizes and returns it (line 15).
Dynamic programming search: The following introduces the dynamic programming search algorithm in Galvatron's parallel optimization workflow. For a given model stage containing L layers, the cost function C(L, E) denotes the total execution time of the L-layer model under the device memory limit E, and c(L, S_j) denotes the execution time of layer L using strategy S_j, where S_j is a strategy in the parallel strategy candidate set S. Setting the initial value C(0, E) = 0, Galvatron's dynamic programming search follows the state transition equation (Formula 1):

C(L, E) = min_{S_j ∈ S} { C(L-1, E - O(L, S_j)) + c(L, S_j) + R(L, S_i, S_j) }

where O(L, S_j) is the memory overhead of layer L using strategy S_j, and R(L, S_i, S_j) is the transformation overhead incurred when layer L uses strategy S_j and the previous layer uses strategy S_i. During the state transition, once the memory overhead exceeds the device memory limit E, the cost function C returns infinity.
Complexity analysis: The computational complexity of Galvatron's dynamic programming search (Formula 1) is O(LE|S|). The size of the per-layer search space S is therefore crucial to the overall search complexity, and the decision-tree-based search space decomposition proposed by Galvatron significantly reduces the search space and keeps the search overhead within a reasonable range.
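The following is a minimal, self-contained sketch of the dynamic programming search in Formula 1, written for this article. The per-layer time and memory costs and the constant transition cost are made-up numbers purely for illustration; in Galvatron they come from the cost estimation module described below. The sketch also tracks the previous layer's strategy explicitly so that the transition overhead R can be applied.

```python
# Dynamic programming over (memory budget, strategy of the most recent layer).
import math

# Candidate strategies S with illustrative (time cost, memory cost) per layer.
STRATEGIES = [
    ("8-way DP",            1.0, 8.0),
    ("8-way SDP",           1.3, 3.0),
    ("8-way TP",            1.6, 2.0),
    ("4-way TP + 2-way DP", 1.4, 4.0),
]

def transition_cost(prev_idx, cur_idx):
    """R(l, S_i, S_j): re-sharding cost when consecutive layers change strategy."""
    return 0.0 if prev_idx == cur_idx else 0.2   # illustrative constant

def dp_search(num_layers, memory_limit, mem_step=1.0):
    # C[e][j]: minimum total time so far, using at most e * mem_step memory,
    # with the most recent layer executed under strategy j.
    E = int(memory_limit / mem_step)
    C = [[0.0] * len(STRATEGIES) for _ in range(E + 1)]
    for _ in range(num_layers):
        new_C = [[math.inf] * len(STRATEGIES) for _ in range(E + 1)]
        for e in range(E + 1):
            for j, (_, time_j, mem_j) in enumerate(STRATEGIES):
                need = math.ceil(mem_j / mem_step)
                if need > e:
                    continue                      # exceeds the memory limit
                for i in range(len(STRATEGIES)):
                    cand = C[e - need][i] + time_j + transition_cost(i, j)
                    new_C[e][j] = min(new_C[e][j], cand)
        C = new_C
    return min(C[E])                              # best total time under the limit

print(dp_search(num_layers=24, memory_limit=48.0))   # ~38.4 with these toy costs
```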
3. Execution cost estimation method based on hybrid modeling
Galvatron uses a strategy cost estimation module to estimate the computation, communication, and memory costs of hybrid parallel strategies. Existing cost estimation methods mainly include profiling and simulating. Galvatron draws on the strengths of both and designs an efficient and accurate cost estimation method with low overhead. Specifically, for memory overhead, Galvatron uses the shape and data type of the tensors to complete the estimation; for computation time, Galvatron measures the per-sample computation time via profiling on a single device and combines it with the batch size and a fitting function to estimate the overall computation time; for communication time, Galvatron divides the communication volume by the device communication bandwidth, where the communication volume is derived theoretically and the communication bandwidth is measured by profiling.
Based on the above estimates, Galvatron simulates the execution process to compute the cost c(l, s) of a given layer l using a given strategy s. Unlike the cost models of existing distributed training systems, Galvatron is the first to account, in its modeling, for the GPU performance degradation caused by overlapping computation and communication. The study found experimentally that the slowdown caused by overlap can significantly affect system execution efficiency, which had been ignored in previous work. As a result, Galvatron's cost estimates are more accurate and its parallel optimization is better.
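The sketch below, written for this article, illustrates the hybrid estimation idea: memory is derived from tensor shapes and data types, computation time from a profiled per-sample time scaled by the local batch size, and communication time from a theoretical volume divided by a measured bandwidth. All constants (bandwidth, per-sample time, layer sizes) are placeholder values assumed for the example, not measurements from the paper.

```python
# Illustrative cost estimation: memory from shapes/dtypes, compute from profiled
# per-sample time, communication from theoretical volume / measured bandwidth.
from dataclasses import dataclass

BYTES_PER_ELEM = 2          # fp16
COMM_BANDWIDTH = 300e9      # bytes/s, would be measured by profiling
PER_SAMPLE_TIME = 1.5e-3    # s per sample per layer, would be measured by profiling

@dataclass
class LayerSpec:
    hidden_size: int
    seq_len: int
    num_params: int

def memory_cost(layer, batch_size, dp_degree, tp_degree):
    """Estimated bytes per GPU: weights sharded by TP, activations split by DP."""
    weight_bytes = layer.num_params * BYTES_PER_ELEM / tp_degree
    act_bytes = (batch_size / dp_degree) * layer.seq_len * layer.hidden_size * BYTES_PER_ELEM
    return weight_bytes + act_bytes

def compute_time(batch_size, dp_degree):
    """Profiled per-sample time scaled by the local batch size (fitting function)."""
    return PER_SAMPLE_TIME * batch_size / dp_degree

def comm_time(volume_bytes):
    """Theoretical communication volume divided by the measured bandwidth."""
    return volume_bytes / COMM_BANDWIDTH

layer = LayerSpec(hidden_size=1024, seq_len=512, num_params=12 * 1024 * 1024)
print(memory_cost(layer, batch_size=32, dp_degree=2, tp_degree=4) / 2**20, "MiB per GPU")
print(compute_time(batch_size=32, dp_degree=2), "s compute")
print(comm_time(2 * layer.num_params * BYTES_PER_ELEM), "s communication")
```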
Experimental results
Experimental settings: In the experiments, the researchers compared Galvatron with four baseline systems that each use a single parallel strategy (DP, SDP, TP, PP) and with expert-configured DeepSpeed 3D Parallelism. In addition, two weakened versions of Galvatron were set up as auxiliary baselines, which perform automatic parallel search in restricted strategy combination spaces (i.e. TP+DP and PP+DP). The study selected the NLP Transformer models BERT and T5, and the CV Transformer models ViT and Swin Transformer as experimental subjects.
Throughput comparison between Galvatron and baseline systems with 8 GPUs and a 20GB memory limit
Experimental results: The study first conducted experiments in an environment with eight Nvidia RTX TITAN 24GB GPUs. The experiments show that under different model sizes and different memory constraints, Galvatron always achieves the best throughput, with significant improvements over existing state-of-the-art single parallel methods and hybrid parallel methods. Specifically, on the ViT model, Galvatron's throughput speedup reaches up to 338% compared with single strategies, and up to 55% compared with other hybrid parallel strategies; on the other three models, it achieves speedups of up to 200%-334% over single strategies and 28%-52% over existing hybrid strategies.
Illustration of some of the optimal parallel strategies found by Galvatron's search
Interpretability experiment: The study selected some of the optimal parallel strategies found by Galvatron's search for display. For the BERT model with 8GB of memory (Case A), Galvatron chose the hybrid parallel strategies PP-TP-DP and PP-TP-SDP; when the available memory increased to 12GB, Galvatron gave up PP, chose to use more DP, and introduced SDP to save memory. The situation is slightly different on Swin Transformer, where different layers of the model show obvious heterogeneity. When memory is relatively scarce (Case C), the shallow layers use a higher degree of SDP; as the layer index increases, the activations become smaller and the parameters become larger, so TP gradually replaces SDP. When memory increases (Case D), not only is PP re-enabled to replace part of the inefficient SDP, but the shallow layers also show a clearer preference for DP.
Scalability experiment: The study further tested Galvatron on larger clusters, including an environment with 16 Nvidia RTX TITAN GPUs and an environment with 64 Nvidia A100 GPUs. In the 16-GPU environment, Galvatron still achieves the highest throughput among all strategies; compared with the 8-GPU results under the same memory limit, Galvatron obtains more than a 2x speedup on 16 GPUs thanks to its more diverse hybrid parallel strategies. In the 64-GPU experiment, Galvatron also achieves the highest throughput among all strategies. This shows that Galvatron has good scalability; detailed results can be found in the original paper.
Introduction to the Peking University Hetu team
The Hetu development team comes from the Data and Intelligence Research Lab at Peking University (hereinafter referred to as the lab), led by Professor Cui Bin from the School of Computer Science at Peking University. Over the years, the lab has conducted cutting-edge research in artificial intelligence and big data, achieved many results in theoretical and technological innovation and system development, and published more than 100 academic papers in top international conferences and journals.
Hetu is a distributed deep learning system aimed at very large models. Compared with existing distributed deep learning frameworks, it has advantages in system functionality, system complexity, and system ease of use, with many innovative contributions such as automatic distributed parallel strategies, consistency protocols and communication architectures, and GPU operator optimization. The Hetu team has carried out academic innovation in a variety of distributed machine learning and deep learning scenarios, with relevant results accepted at top international conferences such as SIGMOD, VLDB, ICML, and KDD; among them, the sparse large model distributed training system HET won the VLDB 2022 Best Scalable Data Science Paper Award. Galvatron, the paper accepted at VLDB 2023, is another breakthrough achieved by the Hetu team in dense large model distributed training, and it has been integrated into the Hetu system and open-sourced. The Hetu team has also carried out research cooperation and application deployment with many well-known companies such as Tencent, Alibaba, Kuaishou, and ByteDance.