
Training big models takes "energy" into account! Tao Dacheng leads the team: one article covers all the "efficient training" solutions, stop saying that hardware is the only bottleneck

WBOY (forwarded)
2023-05-23 17:04:08

The field of deep learning has made significant progress, especially in computer vision, natural language processing, and speech. Large-scale models trained on big data hold great promise for practical applications, for raising industrial productivity, and for advancing social development.


However, large models also require enormous computing power to train. Although demand for compute keeps rising and many studies have explored efficient training methods, there has been no comprehensive review of acceleration techniques for training deep learning models.

Recently, researchers from the University of Sydney, the University of Science and Technology of China, and other institutions published a review that comprehensively summarizes efficient training techniques for large-scale deep learning models and lays out the mechanisms commonly used within each component of the training process.


Paper link: https://arxiv.org/pdf/2304.03589.pdf

The researchers start from the most basic weight-update formula and divide its components into five main aspects (a generic form of this update is sketched after the list below):


1. Data-centric: includes dataset regularization, data sampling, and data-centric curriculum learning, which can significantly reduce the computational cost of processing data samples;

2. Model-centric: includes acceleration of basic modules, compression training, model initialization, and model-centric curriculum learning, focusing on accelerating training by reducing the amount of parameter computation;

3. Optimization-centric: includes the choice of learning rate, the use of large batch sizes, the design of efficient objective functions, and model weight averaging, focusing on training strategies that improve the generality of large-scale models;

4. Budgeted training: includes acceleration techniques used when hardware resources are limited;

5. System-centric: includes efficient distributed frameworks and open-source libraries that provide the hardware support needed to implement the acceleration algorithms.
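To make the taxonomy concrete, here is a minimal sketch of the generic mini-batch update the five directions attach to; the notation is illustrative and not necessarily the paper's exact symbols.

```latex
% Generic mini-batch weight update (illustrative notation):
\[
  w_{t+1} \;=\; w_t \;-\; \eta_t \,\nabla_w \mathcal{L}\!\left(w_t;\ \mathcal{B}_t\right)
\]
% \mathcal{B}_t            -- the sampled mini-batch        (data-centric)
% w_t                      -- the model parameters          (model-centric)
% \eta_t,\ \mathcal{L}     -- step size and objective       (optimization-centric)
% The remaining two directions constrain how this update is executed:
% a limit on total measurable cost (budgeted training) and the hardware/
% software stack that actually runs the loop (system-centric).
```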

Efficient data-centric training

Recently, large-scale models have made great progress, and their demands on datasets have grown dramatically: huge numbers of data samples are needed to drive training toward excellent performance. Data-centric research is therefore critical for real acceleration.

The basic role of data processing is to efficiently increase the diversity of data samples without increasing labeling cost. Because data labeling is often prohibitively expensive for many organizations, research in the data-centric direction is all the more important; at the same time, data processing also aims to improve the efficiency of loading data samples in parallel.

The researchers call all of this efficient handling of data the "data-centric" approach, which can significantly improve the performance of training large-scale models.

The review studies the relevant techniques from the following aspects:

Data Regularization

Data regularization is a preprocessing technique that enhances the diversity of raw data samples through a series of transformations. It enriches how training samples are represented in feature space without requiring any additional labeling information.

Efficient data regularization methods are widely used in training and can significantly improve the generalization performance of large-scale models.
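As a minimal sketch of what such a regularization pipeline looks like in practice, assuming PyTorch/torchvision (the specific transforms and magnitudes are illustrative choices, not prescribed by the survey):

```python
# Label-preserving data regularization: every transform reuses the original annotation.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),          # spatial diversity
    transforms.RandomHorizontalFlip(p=0.5),     # label-preserving flip
    transforms.ColorJitter(0.4, 0.4, 0.4),      # photometric diversity
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```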

Data Sampling

Data sampling is another effective approach: it selects a subset of samples from a large batch to compute the gradient update. Like small-batch training, it reduces the influence of unimportant or low-quality samples in the current batch.

Usually the sampled data are the more informative ones, so performance is comparable to training on the full batch; the sampling probabilities need to be gradually adjusted over the course of training to keep the sampling unbiased.
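A sketch of one simple instance of this idea, loss-proportional subset selection, using a hypothetical helper (the survey covers many concrete sampling and re-weighting schemes):

```python
import torch

def select_subset(per_sample_loss: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices of k samples drawn with probability proportional to loss."""
    probs = (per_sample_loss + 1e-12) / (per_sample_loss + 1e-12).sum()
    return torch.multinomial(probs, num_samples=k, replacement=False)

# Usage inside a training step (sketch):
#   losses = per_sample_criterion(model(x_big), y_big)     # shape [B], no reduction
#   idx = select_subset(losses.detach(), k=losses.numel() // 4)
#   loss = criterion(model(x_big[idx]), y_big[idx])         # update on the subset only
```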

Data-centric Curriculum Learning

Curriculum learning adopts a progressive training setup across the different stages of the training process to reduce the overall computational cost.

At the beginning, training uses lower-quality data that is sufficient for learning low-level features; later, higher-quality data (with more augmentations and more complex preprocessing) gradually helps the model learn complex features and reach the same accuracy as training on the entire training set from the start.
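A minimal sketch of such a data-centric curriculum, assuming an image task where "quality" is controlled by resolution and augmentation strength (the schedule values are illustrative assumptions):

```python
from torchvision import transforms

def curriculum_transform(epoch: int, total_epochs: int) -> transforms.Compose:
    progress = epoch / max(total_epochs - 1, 1)
    size = int(96 + progress * (224 - 96))        # grow input resolution over training
    jitter = 0.1 + progress * 0.3                 # strengthen augmentation over training
    return transforms.Compose([
        transforms.RandomResizedCrop(size),
        transforms.ColorJitter(jitter, jitter, jitter),
        transforms.ToTensor(),
    ])
```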

Model-centric efficient training

Designing efficient model architectures has always been one of the most important lines of research in deep learning. A good model should be an efficient feature extractor that projects inputs into high-level features that are easy to separate.

Unlike work that focuses on novel efficient architectures, the "model-centric" part of this survey pays more attention to equivalent substitutes for common modules that achieve higher training efficiency at comparable performance.

Almost all large-scale models are composed of small modules or layers, so studying these building blocks provides guidance for efficient training of large models. The researchers mainly focus on the following aspects:

Architecture Efficiency

The sharp increase in the number of parameters in deep models brings enormous computational cost, so efficient substitutes that approximate the performance of the original architecture are needed; this direction has gradually attracted attention from the research community. Such replacement is not only about numerically approximating computations, but also includes structural simplification and fusion within deep models.

The researchers categorize existing acceleration techniques by architecture and present observations and conclusions for each.
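One classic illustration of replacing a module with a cheaper approximation is swapping a dense 3x3 convolution for a depthwise-separable pair; this is a hedged sketch of the general idea, not a recipe singled out by the survey for every architecture:

```python
import torch.nn as nn

def separable_conv(c_in: int, c_out: int) -> nn.Sequential:
    """Approximate a dense 3x3 conv at a fraction of the FLOPs."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),  # depthwise: spatial mixing
        nn.Conv2d(c_in, c_out, kernel_size=1),                         # pointwise: channel mixing
    )
```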

Compression Training Efficiency

Compression has long been one of the research directions in accelerating computation and plays a key role in digital signal processing (multimedia computing / image processing).

Traditional compression has two main branches, quantization and sparsification; the article details their existing results and their contributions to deep model training.
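As a toy sketch of the two branches applied to a single weight tensor (the sparsity level and bit-width are illustrative assumptions; real compression training applies these inside the training loop with careful calibration):

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude weights (sparsification branch)."""
    k = max(1, int(w.numel() * sparsity))
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)

def uniform_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Fake-quantize weights to a uniform grid (quantization branch)."""
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(w / scale) * scale
```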

Initialization Efficiency

The initialization of model parameters is a very important factor both in existing theoretical analyses and in practical scenarios.

A bad initial state can cause the whole run to collapse and stagnate in the early training phase, while a good one helps speed up convergence across the entire run within a smooth region of the loss landscape. This part of the article mainly studies evaluation and algorithm design from the perspective of model initialization.
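A minimal sketch of one widely used principled initialization in PyTorch; the survey reviews both such heuristics and more elaborate, architecture-aware or learned schemes:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Kaiming initialization for linear/conv layers, zeros for biases."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(init_weights)   # applied recursively to every submodule
```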

Model-centric Curriculum Learning

From a model-centric perspective, curriculum learning usually starts training from a small model, or from a subset of the parameters of a large model, and then gradually grows back to the full architecture. It shows clear advantages for accelerating training with no obvious negative side effects; the article reviews its implementations and efficiency in the training process.
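A sketch of one simple way to realize this, training only a prefix of the blocks at first and progressively activating the rest; the growth schedule here is an illustrative assumption, not the survey's prescription:

```python
import torch.nn as nn

def set_active_depth(blocks: nn.ModuleList, active: int) -> None:
    """Train only the first `active` blocks; freeze the remaining ones."""
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = (i < active)

# e.g. unfreeze one extra block every few epochs:
#   set_active_depth(model.blocks, active=min(len(model.blocks), 2 + epoch // 5))
```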

Optimization-centric efficient training

Accelerating optimization methods has always been an important research direction in machine learning; reducing complexity while still reaching optimality has long been a pursuit of the academic community.

In recent years, efficient and powerful optimization methods have made important breakthroughs in training deep neural networks. As the basic optimizer family widely used in machine learning, SGD-type optimizers have successfully helped deep models reach a variety of practical applications. However, as problems become increasingly complex, SGD is more likely to fall into local minima and fails to generalize stably.

To address these difficulties, Adam and its variants were proposed to introduce adaptivity into the updates. This approach has achieved good results in training large-scale networks, for example in BERT, Transformer, and ViT models.

Beyond the performance of the optimizer itself, how it is combined with other accelerated training techniques also matters.

From the optimization perspective, the researchers summarize current thinking on accelerated training into the following aspects:

Learning Rate

The learning rate is an important hyperparameter for non-convex optimization and is crucial in training today's deep networks; adaptive methods such as Adam and its variants have achieved remarkable progress on deep models.


Some strategies that adjust the learning rate based on higher-order gradient information also effectively accelerate training, and how the learning rate is decayed over the course of training likewise affects final performance.
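For concreteness, here is a sketch of a commonly used learning-rate schedule (linear warmup followed by cosine decay); the exact shape and warmup length are assumptions, not the survey's recommendation:

```python
import math

def lr_at_step(step: int, total: int, base_lr: float, warmup: int = 500) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(total - warmup, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```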

Large batch size

Using a larger batch size effectively improves training efficiency: it directly reduces the number of iterations needed to complete an epoch. With a fixed total number of samples, processing one large batch is cheaper than processing several small batches, because it improves memory utilization and reduces communication bottlenecks.
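Large batches are usually paired with a rescaled learning rate; a sketch of the widely used linear scaling rule (the reference batch size of 256 is an illustrative convention):

```python
def scaled_lr(base_lr: float, batch_size: int, reference: int = 256) -> float:
    """Linear scaling rule: grow the learning rate with the batch size."""
    return base_lr * batch_size / reference

# e.g. base_lr=0.1 tuned at batch 256 -> 0.8 when training with batch 2048
```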

Efficient objective

The most basic formulation, empirical risk minimization (ERM), plays a key role in the minimization problem and makes many tasks practical.

As research into large networks has deepened, some works pay more attention to the gap between optimization and generalization and propose effective objectives that reduce test error; explaining generalization from different perspectives and jointly optimizing it during training can greatly speed up reaching high test accuracy.
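One family of such generalization-aware objectives evaluates the loss at an adversarially perturbed weight point before updating (sharpness-aware minimization); the following is a hedged sketch of that idea, not the only objective the survey covers:

```python
import torch

def sam_step(model, criterion, optimizer, x, y, rho: float = 0.05) -> None:
    """One sharpness-aware update: perturb weights toward higher loss, then descend."""
    loss = criterion(model(x), y)
    loss.backward()                                   # gradient at w
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads]))
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (norm + 1e-12)
            p.add_(e)                                 # move to the "sharp" point
            perturbations.append((p, e))
    optimizer.zero_grad()
    criterion(model(x), y).backward()                 # gradient at w + e
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)                                 # restore original weights
    optimizer.step()                                  # update with the sharpness-aware gradient
    optimizer.zero_grad()
```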

Weight Averaging

Weight averaging is a practical technique for enhancing model generality: it takes a weighted average of historical parameter states, with a set of frozen or learnable coefficients, and can greatly speed up the training process.
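A minimal sketch of one common instance, an exponential moving average of historical weights kept in a frozen copy of the model (the decay coefficient is an illustrative choice):

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.999) -> None:
    """Blend current parameters into the running average."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# ema_model = copy.deepcopy(model)   # evaluated/served instead of `model`
# call update_ema(ema_model, model) after every optimizer step
```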

Budgeted efficient training

Several recent efforts have focused on training deep learning models with fewer resources while achieving accuracy that is as high as possible.

This class of problems is defined as budgeted training: training under a given budget (a limit on measurable cost) to achieve the best possible model performance.

To systematically account for hardware support and approximate real conditions, the researchers define budgeted training as training on a given device within a limited time, for example training for one day on a single low-end deep learning server, to obtain the best-performing model.


Research on budgeted training can shed light on how to build training recipes under a budget, including the choice of model size, model configuration and structure, the learning rate schedule, and several other tunable factors that affect performance, as well as combinations of efficient training techniques suited to the available budget. The article mainly reviews several advanced budgeted-training techniques.
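As a hedged sketch of one budget-aware ingredient, the learning rate can be annealed as a function of the fraction of the budget already consumed (here wall-clock time), so training ends in a usable state whenever the budget runs out; the linear shape and time-based budget are illustrative assumptions:

```python
import time

def budgeted_lr(base_lr: float, start_time: float, budget_seconds: float) -> float:
    """Decay the learning rate linearly over the remaining time budget."""
    used = (time.time() - start_time) / budget_seconds
    return base_lr * max(0.0, 1.0 - used)
```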

System-centric efficient training

System-centric research provides concrete implementations for the designed algorithms and studies how to execute efficient training truly and practically on hardware.

The researchers focus on implementations for general-purpose computing devices, such as CPUs and GPUs in multi-node clusters; resolving, from the hardware perspective, the potential conflicts in the designed algorithms is the central concern.

The article mainly reviews hardware implementation techniques in existing frameworks and third-party libraries that effectively support the processing of data, models, and optimization, and introduces existing open-source platforms that provide solid frameworks for building models, using data effectively for training, mixed-precision training, and distributed training.

System-centric Data Efficiency

Efficient data processing and data parallelism are two important concerns in system implementation.

With the rapid growth of data volume, inefficient data processing has gradually become a bottleneck for training efficiency, especially for large-scale training across multiple nodes. Designing more hardware-friendly computation and parallelization schemes can effectively avoid wasted time during training.
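A sketch of hardware-friendly data loading in PyTorch, overlapping I/O and augmentation with GPU compute; the worker and prefetch counts are illustrative and hardware-dependent, and `dataset` is assumed to be defined elsewhere:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                # assumed to be defined elsewhere
    batch_size=256,
    shuffle=True,
    num_workers=8,          # parallel decoding/augmentation processes
    pin_memory=True,        # faster host-to-GPU copies
    prefetch_factor=4,      # batches pre-loaded per worker
    persistent_workers=True,
)
```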

System-centric Model Efficiency

With the rapid expansion of the number of model parameters, system efficiency has become one of the important bottlenecks from the model perspective: the storage and compute demands of large-scale models pose huge challenges for hardware implementations.

The article mainly reviews how to achieve efficient I/O for deployment and streamlined implementations of model parallelism to speed up actual training.
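A minimal sketch of the naive form of model parallelism, splitting a network across two devices and moving activations between them; real frameworks add pipelining and sharding on top of this idea, and the device names here are assumptions:

```python
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Place the first half of the network on one GPU and the rest on another."""
    def __init__(self, stage1: nn.Module, stage2: nn.Module):
        super().__init__()
        self.stage1 = stage1.to("cuda:0")
        self.stage2 = stage2.to("cuda:1")

    def forward(self, x):
        h = self.stage1(x.to("cuda:0"))
        return self.stage2(h.to("cuda:1"))   # activations cross the device boundary
```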

System-centric Optimization Efficiency

The optimization process covers back-propagation and parameter updates, which are also the most time-consuming computations in training, so the system-level implementation of the optimizer directly determines training efficiency.

To clearly characterize system-level optimization, the article focuses on the efficiency of the different computation stages and reviews the improvements made to each of them.
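As a hedged sketch of two common system-level levers for cheaper backward passes and better memory use, mixed precision and gradient accumulation (the accumulation factor is an illustrative assumption; `model`, `criterion`, `optimizer`, and `loader` are assumed to exist):

```python
import torch

scaler = torch.cuda.amp.GradScaler()
accum = 4                                       # accumulate gradients over 4 micro-batches

for step, (x, y) in enumerate(loader):
    with torch.cuda.amp.autocast():
        loss = criterion(model(x.cuda()), y.cuda()) / accum
    scaler.scale(loss).backward()               # low-precision forward/backward
    if (step + 1) % accum == 0:
        scaler.step(optimizer)                  # unscale and update in full precision
        scaler.update()
        optimizer.zero_grad()
```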

Open Source Frameworks

Efficient open-source frameworks can facilitate training by serving as the bridge between algorithm design and hardware support. The researchers surveyed a range of open-source frameworks and analyzed the strengths and weaknesses of each design.


Conclusion

The researchers review common acceleration techniques for efficiently training large-scale deep learning models, taking into account all components of the gradient update formula and covering the entire training pipeline in deep learning.

The article also proposes a new taxonomy that organizes these techniques into five main directions: data-centric, model-centric, optimization-centric, budgeted training, and system-centric.

The first four parts mainly conduct a comprehensive study from the perspective of algorithm design and methodology, while the "system-centric efficient training" part summarizes practical implementation from the perspective of paradigm innovation and hardware support.

The article reviews and summarizes the commonly used or recently developed techniques corresponding to each part, along with their advantages and trade-offs, and discusses limitations and promising future research directions. While providing a comprehensive technical review and guidance, the survey also lays out the current breakthroughs and bottlenecks of efficient training.

The researchers hope to help practitioners achieve general training acceleration efficiently and to offer meaningful, promising insights for the future development of efficient training. Beyond the potential directions mentioned at the end of each section, the broader and more promising outlooks are as follows:

1. Efficient Profile search

Efficient training can design pre-built, customizable profile-search strategies for a model, covering data augmentation combinations, model structure, optimizer design, and more; related research has already made some progress.

New model architectures and compression modes, new pre-training tasks, and the use of "model-edge" knowledge are also worth exploring.

2. Adaptive Scheduler

Using optimization-oriented schedulers for curriculum learning, learning rate, batch size, and model complexity may achieve better performance. Budget-aware schedulers can adapt dynamically to the remaining budget, reducing the cost of manual design. Adaptive schedulers could also be used to explore parallelism and communication strategies in more general and practical settings, such as large-scale decentralized training over heterogeneous networks spanning multiple regions and data centers.
