Home >Technology peripherals >AI >Simple and universal: 3 times lossless training acceleration of visual basic network, Tsinghua EfficientTrain++ selected for TPAMI 2024
The author of this discussion paper, Wang Yulin, is a 2019 direct doctoral student in the Department of Automation, Tsinghua University. He studied under Academician Wu Cheng and Associate Professor Huang Gao. His main research directions are efficient deep learning, computer vision, etc. He has published discussion papers as the first author in journals and conferences such as TPAMI, NeurIPS, ICLR, ICCV, CVPR, ECCV, etc. He has received Baidu Scholarship, Microsoft Scholar, CCF-CV Academic Emerging Award, ByteDance Scholarship and other honors. Personal homepage: wyl.cool.
This article mainly introduces an article that has just been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI): EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training.
However, "scaling" often brings prohibitive
high model training overhead, which significantly hinders the further development and industrial application of basic vision models.
To solve this problem, the research team of Tsinghua University proposed a generalized curriculum learning algorithm:EfficientTrain++. The core idea is to promote the traditional course learning paradigm of "screening and using data from easy to difficult, and gradually training the model" to "not filtering data dimensions, always using all training data, but gradually revealing each feature during the training process" Characteristics or patterns (pattern) from easy to difficult of each data sample."
EfficientTrain++ has several important highlights:
one. Research motivation
However, large basic models often have high training overhead. Figure 1 gives two typical examples. Taking 8 NVIDIA V100 or higher-performance GPUs as an example, it would take years or even decades to complete just one training session for GPT-3 and ViT-G. Such high training costs are a huge expense that is difficult to afford for both academia and industry. Often only a few high-level institutions consume large amounts of resources to advance the progress of deep learning. Therefore, an urgent problem to be solved is: how to effectively improve the training efficiency of large-scale deep learning models?
Figure 1 Example: High training overhead of large deep learning basic models
For computer vision models, a classic The idea is curriculum learning, as shown in Figure 2, which imitates the progressive and highly structured learning process of humans. During the model training process, we start with the "simplest" training data and gradually introduce it from easy to difficult. The data.
Figure 2 Classic Curriculum Learning Paradigm (Picture Source: "A Survey on Curriculum Learning", TPAMI'22)
However ,Although the motivation is relatively natural, course learning ,has not been widely applied as a general method for training ,visual basic models. The main reason is that there are ,two key bottlenecks, as shown in Figure 3. First, designing an effective training curriculum (curriculum) is not easy. Distinguishing between "simple" and "difficult" samples often requires the help of additional pre-training models, designing more complex AutoML algorithms, introducing reinforcement learning, etc., and has poor versatility. Second, the modeling of course learning itself is somewhat unreasonable. Visual data in natural distribution often has a high degree of diversity. An example is given below in Figure 3 (parrot pictures randomly selected from ImageNet). The model training data contains a large number of parrots with different movements, parrots at different distances from the camera, Parrots from different perspectives and backgrounds, as well as the diverse interactions between parrots and people or objects, etc., it is actually a relatively rough method to distinguish such diverse data only by single-dimensional indicators of "simple" and "difficult" and far-fetched modeling methods.
Figure 3 Two key bottlenecks that hinder large-scale application of course learning in training visual basic models
Inspired by the above challenges, this paper proposes a generalized curriculum learning paradigm. The core idea is to make "filtering and use easy The traditional course learning paradigm of "obtaining difficult data and gradually training the model" has been extended to "does not filter the data dimensions and always uses all training data, but gradually reveals the reasons for each data sample during the training process. Difficult features or patterns", thus effectively avoiding the limitations and sub-optimal designs caused by the data screening paradigm, as shown in Figure 4.
Figure 4 Traditional curriculum learning (sample dimension) vs. generalized curriculum learning (feature dimension)
The main reasons for this paradigm Based on an interesting phenomenon: In the training process of a natural visual model, although the model can always obtain all the information contained in the data at any time, the model will always naturally learn to recognize some simpler information contained in the data first. The discriminant features (pattern), and then gradually learn to identify more difficult discriminant features on this basis. Moreover, this rule is relatively universal, "relatively simple" discriminant features can be found more easily in both the frequency domain and the spatial domain. This paper designed a series of interesting experiments to demonstrate the above findings, as described below.
From a frequency domain perspective, "low-frequency features" are "relatively simple" for the model. In Figure 5, the author of this article trained a DeiT-S model using standard ImageNet-1K training data, and used low-pass filters with different bandwidths to filter the verification set, retaining only the low-frequency components of the verification image, and reports on this basis. The accuracy of DeiT-S on the low-pass filtered verification data during the training process. The curve of the obtained accuracy relative to the training process is shown on the right side of Figure 5.
We can see an interesting phenomenon: in the early stages of training, using only low-pass filtered validation data will not significantly reduce the accuracy, and the curve is consistent with the normal validation set accuracy. The separation point gradually moves to the right as the filter bandwidth increases. This phenomenon shows that although the model always has access to the low- and high-frequency parts of the training data, its learning process naturally starts by focusing only on low-frequency information, and the ability to identify higher-frequency features is gradually acquired later in the training (this phenomenon For more evidence, please refer to the original text).
Figure 5 From a frequency domain perspective, the model naturally tends to learn to identify low-frequency features first
This finding leads to an interesting Question: Can we design a training curriculum that starts with low-frequency information that only provides visual input to the model, and then gradually introduces high-frequency information?
Figure 6 investigates the idea of performing low-pass filtering on the training data only during an early training phase of a specific length, leaving the rest of the training process unchanged. It can be observed from the results that although the final performance improvement is limited, it is interesting to note that the final accuracy of the model can be preserved to a large extent even if only low-frequency components are provided to the model for a considerable period of early training phase, which It also coincides with the observation in Figure 5 that "the model mainly focuses on learning to identify low-frequency features in the early stages of training".
This discovery inspired the author of this article to think about training efficiency: Since the model only needs low-frequency components in the data in the early stages of training, and the information contained in the low-frequency components is smaller than the original data, then it can Can the model efficiently learn from only low-frequency components at less computational cost than processing the original input?
Figure 6 Providing only low-frequency components to the model for a long period of early training does not significantly affect the final performance
In fact, this idea is completely feasible. As shown on the left side of Figure 7, the author of this article introduces a cropping operation in the Fourier spectrum of the image to crop out the low-frequency part and map it back to the pixel space. This low-frequency cropping operation accurately preserves all low-frequency information while reducing the size of the image input, so the computational cost of the model learning from the input can be exponentially reduced.
If you use this low-frequency cropping operation to process the model input in the early stages of training, you can significantly save the overall training cost, but because the information necessary for model learning is maximally retained, The final model with almost no performance loss can still be obtained, and the experimental results are shown in the lower right corner of Figure 7.
Figure 7 Low-frequency cropping: Make the model efficiently learn only from low-frequency information
In addition to frequency domain operations, from the perspective of spatial domain transformation, "relatively simple" features for the model can also be found. For example, natural image information contained in raw visual input that has not undergone strong data enhancement or distortion processing is often "simpler" for the model and easier for the model to learn because it is derived from real-world distributions. , and the additional information, invariance, etc. introduced by preprocessing techniques such as data enhancement are often difficult for the model to learn (a typical example is given on the left side of Figure 8).
In fact, existing research has also observed that data augmentation mainly plays a role in the later stages of training (such as "Improving Auto-Augment via Augmentation-Wise Weight Sharing", NeurIPS' 20).
在这一维度上,为实现广义课程学习的范式,可以简单地通过改变数据增强的强度方便地实现在训练早期阶段仅向模型提供训练数据中较容易学习的自然图像信息。图 8 右侧使用 RandAugment 作为代表性示例来验证了这个思路,RandAugment 包含了一系列常见的空域数据增强变换(例如随机旋转、更改锐度、仿射变换、更改曝光度等)。
可以观察到,从较弱的数据增强开始训练模型可以有效提高模型最终表现,同时这一技术与低频裁切兼容。
图 8 从空域的角度寻找模型 “较容易学习” 的特征:一个数据增强的视角
到此处为止,本文提出了广义课程学习的核心框架与假设,并通过揭示频域、空域的两个关键现象证明了广义课程学习的合理性和有效性。在此基础上,本文进一步完成了一系列系统性工作,在下面列出。由于篇幅所限,关于更多研究细节,可参考原论文。
本文最终得到的 EfficientTrain++ 广义课程学习方案如图 9 所示。EfficientTrain++ 以模型训练总计算开销的消耗百分比为依据,动态调整频域低频裁切的带宽和空域数据增强的强度。
值得注意的是,作为一种即插即用的方法,EfficientTrain++ 无需进一步的超参数调整或搜索即可直接应用于多种视觉基础网络和多样化的模型训练场景,效果比较稳定、显著。
图 9 统一、整合的广义课程学习方案:EfficientTrain++
作为一种即插即用的方法,EfficientTrain++ 在 ImageNet-1K 上,在基本不损失或提升性能的条件下,将多种视觉基础网络的实际训练开销降低了 1.5 倍左右。
图 10 ImageNet-1K 实验结果:EfficientTrain++ 在多种视觉基础网络上的表现
EfficientTrain++ 的增益通用于不同的训练开销预算,严格相同表现的情况下,DeiT/Swin 在 ImageNet-1K 上的训加速比在 2-3 倍左右。
图 11 ImageNet-1K 实验结果:EfficientTrain++ 在不同训练开销预算下的表现
EfficientTrain++ 在 ImageNet-22k 上可以取得 2-3 倍的性能无损预训练加速。
图 12 ImageNet-22K 实验结果:EfficientTrain++ 在更大规模训练数据上的表现
对于较小的模型,EfficientTrain++ 可以实现显著的性能上界提升。
图13 ImageNet-1K 实验结果:EfficientTrain++ 可以显着提升较小模型的性能上界
EfficientTrain++ 对于自监督学习算法(如MAE)同样有效。
图14 EfficientTrain++ 可以应用于自监督学习(如MAE)
EfficientTrain++ 训得的模型在目标检测、实例分割、语义分割等下游任务上同样不损失性能。
图 15 COCO 目标检测、COCO 实例分割、ADE20K 语义分割实验结果
The above is the detailed content of Simple and universal: 3 times lossless training acceleration of visual basic network, Tsinghua EfficientTrain++ selected for TPAMI 2024. For more information, please follow other related articles on the PHP Chinese website!