
Training big models takes "energy" into account! Tao Dacheng leads the team: one article covers all the "efficient training" solutions, stop saying that hardware is the only bottleneck

WBOY (forwarded)
2023-05-23 17:04:08

The field of deep learning has made significant progress, especially in computer vision, natural language processing, and speech. Large-scale models trained on big data hold great promise for practical applications, for raising industrial productivity, and for advancing social development.


However, large models also require enormous computing power to train. Although demand for compute keeps rising and many studies have explored efficient training methods, there has been no comprehensive review of acceleration techniques for training deep learning models.

Recently, researchers from the University of Sydney, the University of Science and Technology of China, and other institutions published a review that comprehensively summarizes efficient training techniques for large-scale deep learning models and lays out the mechanisms commonly used within each component of the training process.


Paper link: https://arxiv.org/pdf/2304.03589.pdf

The researchers start from the most basic weight-update formula and divide its components into five main aspects (a generic form of this update is sketched after the list below):


1. Data-centric: includes dataset regularization, data sampling, and data-centric curriculum learning, which can significantly reduce the computational cost of processing data samples;

2. Model-centric: includes acceleration of basic modules, compression training, model initialization, and model-centric curriculum learning, focusing on accelerating training by reducing the amount of parameter computation;

3. Optimization-centric: includes the choice of learning rate, the use of large batch sizes, the design of efficient objective functions, and model weight averaging, focusing on training strategies that improve the generality of large-scale models;

4. Budgeted training: includes acceleration techniques used when hardware resources are limited;

5. System-centric: includes efficient distributed frameworks and open-source libraries that provide the hardware support needed to implement the acceleration algorithms.
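To make the taxonomy concrete, here is a minimal sketch of the generic mini-batch update the five directions attach to; the notation is illustrative and not necessarily the paper's exact symbols.

```latex
% Generic mini-batch weight update (illustrative notation):
\[
  w_{t+1} \;=\; w_t \;-\; \eta_t \,\nabla_w \mathcal{L}\!\left(w_t;\ \mathcal{B}_t\right)
\]
% \mathcal{B}_t            -- the sampled mini-batch        (data-centric)
% w_t                      -- the model parameters          (model-centric)
% \eta_t,\ \mathcal{L}     -- step size and objective       (optimization-centric)
% The remaining two directions constrain how this update is executed:
% a limit on total measurable cost (budgeted training) and the hardware/
% software stack that actually runs the loop (system-centric).
```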

Efficient data-centric training

Recently, large-scale models have made great progress, and their demands on datasets have grown dramatically: huge numbers of data samples are needed to drive training toward excellent performance. Data-centric research is therefore critical for real acceleration.

The basic role of data processing is to efficiently increase the diversity of data samples without increasing labeling cost. Because data labeling is often prohibitively expensive for many organizations, research in the data-centric direction is all the more important; at the same time, data processing also aims to improve the efficiency of loading data samples in parallel.

The researchers call all of this efficient handling of data the "data-centric" approach, which can significantly improve the performance of training large-scale models.

The review studies the relevant techniques from the following aspects:

Data Regularization

Data regularization is a preprocessing technique that enhances the diversity of raw data samples through a series of transformations. It enriches how training samples are represented in feature space without requiring any additional labeling information.

Efficient data regularization methods are widely used in training and can significantly improve the generalization performance of large-scale models.
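As a minimal sketch of what such a regularization pipeline looks like in practice, assuming PyTorch/torchvision (the specific transforms and magnitudes are illustrative choices, not prescribed by the survey):

```python
# Label-preserving data regularization: every transform reuses the original annotation.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),          # spatial diversity
    transforms.RandomHorizontalFlip(p=0.5),     # label-preserving flip
    transforms.ColorJitter(0.4, 0.4, 0.4),      # photometric diversity
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```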

Data Sampling

Data sampling is another effective approach: it selects a subset of samples from a large batch to compute the gradient update. Like small-batch training, it reduces the influence of unimportant or low-quality samples in the current batch.

Usually the sampled data are the more informative ones, so performance is comparable to training on the full batch; the sampling probabilities need to be gradually adjusted over the course of training to keep the sampling unbiased.
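A sketch of one simple instance of this idea, loss-proportional subset selection, using a hypothetical helper (the survey covers many concrete sampling and re-weighting schemes):

```python
import torch

def select_subset(per_sample_loss: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices of k samples drawn with probability proportional to loss."""
    probs = (per_sample_loss + 1e-12) / (per_sample_loss + 1e-12).sum()
    return torch.multinomial(probs, num_samples=k, replacement=False)

# Usage inside a training step (sketch):
#   losses = per_sample_criterion(model(x_big), y_big)     # shape [B], no reduction
#   idx = select_subset(losses.detach(), k=losses.numel() // 4)
#   loss = criterion(model(x_big[idx]), y_big[idx])         # update on the subset only
```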

Data-centric Curriculum Learning

Curriculum learning adopts a progressive training setup across the different stages of the training process to reduce the overall computational cost.

At the beginning, training uses lower-quality data that is sufficient for learning low-level features; later, higher-quality data (with more augmentations and more complex preprocessing) gradually helps the model learn complex features and reach the same accuracy as training on the entire training set from the start.
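A minimal sketch of such a data-centric curriculum, assuming an image task where "quality" is controlled by resolution and augmentation strength (the schedule values are illustrative assumptions):

```python
from torchvision import transforms

def curriculum_transform(epoch: int, total_epochs: int) -> transforms.Compose:
    progress = epoch / max(total_epochs - 1, 1)
    size = int(96 + progress * (224 - 96))        # grow input resolution over training
    jitter = 0.1 + progress * 0.3                 # strengthen augmentation over training
    return transforms.Compose([
        transforms.RandomResizedCrop(size),
        transforms.ColorJitter(jitter, jitter, jitter),
        transforms.ToTensor(),
    ])
```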

Model-centric efficient training

Designing efficient model architectures has always been one of the most important lines of research in deep learning. A good model should be an efficient feature extractor that projects inputs into high-level features that are easy to separate.

Unlike work that focuses on novel efficient architectures, the "model-centric" part of this survey pays more attention to equivalent substitutes for common modules that achieve higher training efficiency at comparable performance.

Almost all large-scale models are composed of small modules or layers, so studying these building blocks provides guidance for efficient training of large models. The researchers mainly focus on the following aspects:

Architecture Efficiency

The sharp increase in the number of parameters in deep models brings enormous computational cost, so efficient substitutes that approximate the performance of the original architecture are needed; this direction has gradually attracted attention from the research community. Such replacement is not only about numerically approximating computations, but also includes structural simplification and fusion within deep models.

The researchers categorize existing acceleration techniques by architecture and present observations and conclusions for each.
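One classic illustration of replacing a module with a cheaper approximation is swapping a dense 3x3 convolution for a depthwise-separable pair; this is a hedged sketch of the general idea, not a recipe singled out by the survey for every architecture:

```python
import torch.nn as nn

def separable_conv(c_in: int, c_out: int) -> nn.Sequential:
    """Approximate a dense 3x3 conv at a fraction of the FLOPs."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),  # depthwise: spatial mixing
        nn.Conv2d(c_in, c_out, kernel_size=1),                         # pointwise: channel mixing
    )
```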

Compression Training Efficiency

Compression has long been one of the research directions in accelerating computation and plays a key role in digital signal processing (multimedia computing / image processing).

Traditional compression has two main branches, quantization and sparsification; the article details their existing results and their contributions to deep model training.
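As a toy sketch of the two branches applied to a single weight tensor (the sparsity level and bit-width are illustrative assumptions; real compression training applies these inside the training loop with careful calibration):

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude weights (sparsification branch)."""
    k = max(1, int(w.numel() * sparsity))
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)

def uniform_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Fake-quantize weights to a uniform grid (quantization branch)."""
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(w / scale) * scale
```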

Initialization Efficiency

The initialization of model parameters is a very important factor both in existing theoretical analyses and in practical scenarios.

A bad initial state can cause the whole run to collapse and stagnate in the early training phase, while a good one helps speed up convergence across the entire run within a smooth region of the loss landscape. This part of the article mainly studies evaluation and algorithm design from the perspective of model initialization.
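A minimal sketch of one widely used principled initialization in PyTorch; the survey reviews both such heuristics and more elaborate, architecture-aware or learned schemes:

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Kaiming initialization for linear/conv layers, zeros for biases."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(init_weights)   # applied recursively to every submodule
```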

Model-centric Curriculum Learning

From a model-centric perspective, curriculum learning usually starts training from a small model, or from a subset of the parameters of a large model, and then gradually grows back to the full architecture. It shows clear advantages for accelerating training with no obvious negative side effects; the article reviews its implementations and efficiency in the training process.
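A sketch of one simple way to realize this, training only a prefix of the blocks at first and progressively activating the rest; the growth schedule here is an illustrative assumption, not the survey's prescription:

```python
import torch.nn as nn

def set_active_depth(blocks: nn.ModuleList, active: int) -> None:
    """Train only the first `active` blocks; freeze the remaining ones."""
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = (i < active)

# e.g. unfreeze one extra block every few epochs:
#   set_active_depth(model.blocks, active=min(len(model.blocks), 2 + epoch // 5))
```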

Optimization-centric efficient training

Accelerating optimization methods has always been an important research direction in machine learning; reducing complexity while still reaching optimality has long been a pursuit of the academic community.

In recent years, efficient and powerful optimization methods have made important breakthroughs in training deep neural networks. As the basic optimizer family widely used in machine learning, SGD-type optimizers have successfully helped deep models reach a variety of practical applications. However, as problems become increasingly complex, SGD is more likely to fall into local minima and fails to generalize stably.

To address these difficulties, Adam and its variants were proposed to introduce adaptivity into the updates. This approach has achieved good results in training large-scale networks, for example in BERT, Transformer, and ViT models.

Beyond the performance of the optimizer itself, how it is combined with other accelerated training techniques also matters.

From the optimization perspective, the researchers summarize current thinking on accelerated training into the following aspects:

Learning Rate

The learning rate is an important hyperparameter for non-convex optimization and is crucial in training today's deep networks; adaptive methods such as Adam and its variants have achieved remarkable progress on deep models.


Some strategies that adjust the learning rate based on higher-order gradient information also effectively accelerate training, and how the learning rate is decayed over the course of training likewise affects final performance.
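For concreteness, here is a sketch of a commonly used learning-rate schedule (linear warmup followed by cosine decay); the exact shape and warmup length are assumptions, not the survey's recommendation:

```python
import math

def lr_at_step(step: int, total: int, base_lr: float, warmup: int = 500) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(total - warmup, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```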

Large batch size

Using a larger batch size effectively improves training efficiency: it directly reduces the number of iterations needed to complete an epoch. With a fixed total number of samples, processing one large batch is cheaper than processing several small batches, because it improves memory utilization and reduces communication bottlenecks.
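Large batches are usually paired with a rescaled learning rate; a sketch of the widely used linear scaling rule (the reference batch size of 256 is an illustrative convention):

```python
def scaled_lr(base_lr: float, batch_size: int, reference: int = 256) -> float:
    """Linear scaling rule: grow the learning rate with the batch size."""
    return base_lr * batch_size / reference

# e.g. base_lr=0.1 tuned at batch 256 -> 0.8 when training with batch 2048
```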

Efficient objective

The most basic formulation, empirical risk minimization (ERM), plays a key role in the minimization problem and makes many tasks practical.

As research into large networks has deepened, some works pay more attention to the gap between optimization and generalization and propose effective objectives that reduce test error; explaining generalization from different perspectives and jointly optimizing it during training can greatly speed up reaching high test accuracy.
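One family of such generalization-aware objectives evaluates the loss at an adversarially perturbed weight point before updating (sharpness-aware minimization); the following is a hedged sketch of that idea, not the only objective the survey covers:

```python
import torch

def sam_step(model, criterion, optimizer, x, y, rho: float = 0.05) -> None:
    """One sharpness-aware update: perturb weights toward higher loss, then descend."""
    loss = criterion(model(x), y)
    loss.backward()                                   # gradient at w
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads]))
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (norm + 1e-12)
            p.add_(e)                                 # move to the "sharp" point
            perturbations.append((p, e))
    optimizer.zero_grad()
    criterion(model(x), y).backward()                 # gradient at w + e
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)                                 # restore original weights
    optimizer.step()                                  # update with the sharpness-aware gradient
    optimizer.zero_grad()
```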

Weight Averaging

Weight averaging is a practical technique for enhancing model generality: it takes a weighted average of historical parameter states, with a set of frozen or learnable coefficients, and can greatly speed up the training process.
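A minimal sketch of one common instance, an exponential moving average of historical weights kept in a frozen copy of the model (the decay coefficient is an illustrative choice):

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.999) -> None:
    """Blend current parameters into the running average."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# ema_model = copy.deepcopy(model)   # evaluated/served instead of `model`
# call update_ema(ema_model, model) after every optimizer step
```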

Budgeted efficient training

Several recent efforts have focused on training deep learning models with fewer resources while achieving accuracy that is as high as possible.

This class of problems is defined as budgeted training: training under a given budget (a limit on measurable cost) to achieve the best possible model performance.

To systematically account for hardware support and approximate real conditions, the researchers define budgeted training as training on a given device within a limited time, for example training for one day on a single low-end deep learning server, to obtain the best-performing model.


Research on budgeted training can shed light on how to build training recipes under a budget, including the choice of model size, model configuration and structure, the learning rate schedule, and several other tunable factors that affect performance, as well as combinations of efficient training techniques suited to the available budget. The article mainly reviews several advanced budgeted-training techniques.
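As a hedged sketch of one budget-aware ingredient, the learning rate can be annealed as a function of the fraction of the budget already consumed (here wall-clock time), so training ends in a usable state whenever the budget runs out; the linear shape and time-based budget are illustrative assumptions:

```python
import time

def budgeted_lr(base_lr: float, start_time: float, budget_seconds: float) -> float:
    """Decay the learning rate linearly over the remaining time budget."""
    used = (time.time() - start_time) / budget_seconds
    return base_lr * max(0.0, 1.0 - used)
```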

System-centric efficient training

System-centric research provides concrete implementations for the designed algorithms and studies how to execute efficient training truly and practically on hardware.

The researchers focus on implementations for general-purpose computing devices, such as CPUs and GPUs in multi-node clusters; resolving, from the hardware perspective, the potential conflicts in the designed algorithms is the central concern.

The article mainly reviews hardware implementation techniques in existing frameworks and third-party libraries that effectively support the processing of data, models, and optimization, and introduces existing open-source platforms that provide solid frameworks for building models, using data effectively for training, mixed-precision training, and distributed training.

System-centric Data Efficiency

Efficient data processing and data parallelism are two important concerns in system implementation.

With the rapid growth of data volume, inefficient data processing has gradually become a bottleneck for training efficiency, especially for large-scale training across multiple nodes. Designing more hardware-friendly computation and parallelization schemes can effectively avoid wasted time during training.
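A sketch of hardware-friendly data loading in PyTorch, overlapping I/O and augmentation with GPU compute; the worker and prefetch counts are illustrative and hardware-dependent, and `dataset` is assumed to be defined elsewhere:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                # assumed to be defined elsewhere
    batch_size=256,
    shuffle=True,
    num_workers=8,          # parallel decoding/augmentation processes
    pin_memory=True,        # faster host-to-GPU copies
    prefetch_factor=4,      # batches pre-loaded per worker
    persistent_workers=True,
)
```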

System-centric Model Efficiency

With the rapid expansion of the number of model parameters, system efficiency has become one of the important bottlenecks from the model perspective: the storage and compute demands of large-scale models pose huge challenges for hardware implementations.

The article mainly reviews how to achieve efficient I/O for deployment and streamlined implementations of model parallelism to speed up actual training.
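A minimal sketch of the naive form of model parallelism, splitting a network across two devices and moving activations between them; real frameworks add pipelining and sharding on top of this idea, and the device names here are assumptions:

```python
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Place the first half of the network on one GPU and the rest on another."""
    def __init__(self, stage1: nn.Module, stage2: nn.Module):
        super().__init__()
        self.stage1 = stage1.to("cuda:0")
        self.stage2 = stage2.to("cuda:1")

    def forward(self, x):
        h = self.stage1(x.to("cuda:0"))
        return self.stage2(h.to("cuda:1"))   # activations cross the device boundary
```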

System-centric Optimization Efficiency

The optimization process covers back-propagation and parameter updates, which are also the most time-consuming computations in training, so the system-level implementation of the optimizer directly determines training efficiency.

To clearly characterize system-level optimization, the article focuses on the efficiency of the different computation stages and reviews the improvements made to each of them.
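As a hedged sketch of two common system-level levers for cheaper backward passes and better memory use, mixed precision and gradient accumulation (the accumulation factor is an illustrative assumption; `model`, `criterion`, `optimizer`, and `loader` are assumed to exist):

```python
import torch

scaler = torch.cuda.amp.GradScaler()
accum = 4                                       # accumulate gradients over 4 micro-batches

for step, (x, y) in enumerate(loader):
    with torch.cuda.amp.autocast():
        loss = criterion(model(x.cuda()), y.cuda()) / accum
    scaler.scale(loss).backward()               # low-precision forward/backward
    if (step + 1) % accum == 0:
        scaler.step(optimizer)                  # unscale and update in full precision
        scaler.update()
        optimizer.zero_grad()
```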

Open Source Frameworks

Efficient open-source frameworks can facilitate training by serving as the bridge between algorithm design and hardware support. The researchers surveyed a range of open-source frameworks and analyzed the strengths and weaknesses of each design.


Conclusion

The researchers review common acceleration techniques for efficiently training large-scale deep learning models, taking into account all components of the gradient update formula and covering the entire training pipeline in deep learning.

The article also proposes a new taxonomy that organizes these techniques into five main directions: data-centric, model-centric, optimization-centric, budgeted training, and system-centric.

The first four parts mainly conduct a comprehensive study from the perspective of algorithm design and methodology, while the "system-centric efficient training" part summarizes practical implementation from the perspective of paradigm innovation and hardware support.

The article reviews and summarizes the commonly used or recently developed techniques corresponding to each part, along with their advantages and trade-offs, and discusses limitations and promising future research directions. While providing a comprehensive technical review and guidance, the survey also lays out the current breakthroughs and bottlenecks of efficient training.

The researchers hope to help practitioners achieve general training acceleration efficiently and to offer meaningful, promising insights for the future development of efficient training. Beyond the potential directions mentioned at the end of each section, the broader and more promising outlooks are as follows:

1. Efficient Profile search

Efficient training can design pre-built, customizable profile-search strategies for a model, covering data augmentation combinations, model structure, optimizer design, and more; related research has already made some progress.

New model architectures and compression modes, new pre-training tasks, and the use of "model-edge" knowledge are also worth exploring.

2. Adaptive Scheduler

Using optimization-oriented schedulers for curriculum learning, learning rate, batch size, and model complexity may achieve better performance. Budget-aware schedulers can adapt dynamically to the remaining budget, reducing the cost of manual design. Adaptive schedulers could also be used to explore parallelism and communication strategies in more general and practical settings, such as large-scale decentralized training over heterogeneous networks spanning multiple regions and data centers.
