ConvNeXt V2 is here: using only the simplest convolutional architecture, its performance is not inferior to Transformers
After decades of basic research, the field of visual recognition has ushered in a new era of large-scale visual representation learning. Pretrained large-scale vision models have become an essential tool for feature learning and vision applications. The performance of a visual representation learning system is greatly affected by three main factors: the model's neural network architecture, the method used to train the network, and the training data. Improvements in each factor contribute to improvements in overall model performance.
Innovation in neural network architecture design has always played an important role in representation learning. Convolutional neural network architectures (ConvNets) have had a significant impact on computer vision research, enabling general-purpose feature learning across a wide range of visual recognition tasks without relying on manual feature engineering. In recent years, the transformer architecture, originally developed for natural language processing, has also become widely used in other deep learning fields because it scales well across models and datasets of different sizes.
The ConvNeXt architecture modernized the traditional ConvNet, showing that pure convolutional models can also scale with model and dataset size. However, the most common way to explore the design space of neural network architectures is still to benchmark supervised-learning performance on ImageNet.
Another line of thinking shifts the focus of visual representation learning from labeled supervised learning to self-supervised pre-training. Self-supervised algorithms brought the idea of masked language modeling into vision and quickly became a popular approach for visual representation learning. However, self-supervised learning typically reuses architectures designed for supervised learning and treats the architecture as fixed. For example, the masked autoencoder (MAE) uses a vision transformer architecture.
One approach is to combine these architectures with a self-supervised learning framework, but doing so raises specific problems. For example, combining ConvNeXt with MAE runs into the following issue: MAE has an encoder-decoder design optimized for the sequence-processing ability of transformers, which lets the computationally heavy encoder focus only on the visible patches and thereby reduces pre-training cost. But this design may be incompatible with a standard ConvNet, which uses dense sliding windows. Furthermore, without considering the relationship between the architecture and the training objective, it is unclear whether optimal performance can be achieved. In fact, existing research shows that it is difficult to train ConvNets with mask-based self-supervised learning, and experimental evidence suggests that transformers and ConvNets may diverge in feature learning, which affects the quality of the final representation.
To this end, researchers from KAIST, Meta, and New York University (including Zhuang Liu, first author of ConvNeXt, and Saining Xie, first author of ResNeXt) propose to co-design the network architecture and the masked autoencoder within a single framework. The goal is to make mask-based self-supervised learning work for ConvNeXt models and achieve results comparable to transformers.
Paper address: https://arxiv.org/pdf/2301.00808v1.pdf
When designing the masked autoencoder, the study treats the masked input as a set of sparse patches and uses sparse convolution to process only the visible parts. The idea was inspired by the use of sparse convolution for processing large-scale 3D point clouds. Concretely, the study implements ConvNeXt with sparse convolution during pre-training; at fine-tuning time, the weights can be converted back to standard dense layers without any special handling. To further improve pre-training efficiency, the study replaces the transformer decoder with a single ConvNeXt block, making the entire design fully convolutional. After adding these changes, the researchers observed that the learned features were useful and improved over the baseline, but fine-tuned performance was still inferior to transformer-based models.
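To make the sparse-convolution idea concrete, here is a minimal PyTorch sketch of the common dense emulation of sparse convolution: masked positions are zeroed before and after an ordinary depthwise convolution so they neither contribute to nor receive activations. The function name `masked_depthwise_conv`, the channel count, and the per-pixel random mask in the usage snippet are illustrative assumptions; the paper masks at patch granularity and can also use a true sparse-convolution library.

```python
import torch
import torch.nn as nn

def masked_depthwise_conv(x, mask, conv):
    """Apply a dense depthwise conv while emulating sparse-conv behaviour:
    masked positions contribute nothing and receive nothing."""
    # x:    (N, C, H, W) feature map
    # mask: (N, 1, H, W) binary mask, 1 = visible, 0 = masked out
    x = x * mask          # zero out masked positions before the conv
    x = conv(x)           # ordinary dense convolution (same weights as the sparse version)
    x = x * mask          # discard outputs at masked positions
    return x

# Illustrative usage
conv = nn.Conv2d(96, 96, kernel_size=7, padding=3, groups=96)  # depthwise 7x7
x = torch.randn(2, 96, 56, 56)
mask = (torch.rand(2, 1, 56, 56) > 0.6).float()
out = masked_depthwise_conv(x, mask, conv)
```

Because the convolution weights themselves are unchanged, "converting back to dense layers" at fine-tuning time amounts to simply dropping the masking.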
The study then analyzes the feature space of ConvNeXt under different training configurations. When training ConvNeXt directly on masked inputs, the researchers found a potential feature-collapse problem in the MLP layers. To address this, the study proposes adding a global response normalization (GRN) layer to enhance feature competition between channels. The improvement is most pronounced when the model is pre-trained with the masked autoencoder, suggesting that reusing a fixed architecture design from supervised learning may not be the best approach.
Based on these improvements, the study proposes ConvNeXt V2, which performs better when combined with masked autoencoders. The researchers also found that ConvNeXt V2 significantly improves the performance of pure ConvNets on various downstream tasks, including classification on ImageNet, object detection on COCO, and semantic segmentation on ADE20K.
Fully Convolutional Masked Autoencoder
The method proposed in this study is conceptually simple and runs in a fully convolutional manner. The learning signal is generated by randomly masking the raw visual input at a high masking ratio and letting the model predict the missing parts from the remaining context. The overall framework is shown in the figure below.
The framework consists of a sparse-convolution-based ConvNeXt encoder and a lightweight ConvNeXt decoder, so the autoencoder structure is asymmetric. The encoder processes only the visible pixels, while the decoder reconstructs the image from the encoded pixels and mask tokens. The loss is computed only on the masked region.
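A hedged sketch of how the reconstruction loss can be restricted to masked patches is shown below. The helper `fcmae_loss`, the 32x32 patch size, and the roughly 60% masking ratio in the usage snippet follow the description above but are written from scratch for illustration, not taken from the official implementation; per-patch normalization details may differ.

```python
import torch

def fcmae_loss(pred, target, mask, patch_size=32):
    """Per-patch MSE computed only over masked patches.
    pred, target: (N, 3, H, W) reconstruction and original image.
    mask: (N, L) with 1 for masked patches, 0 for visible ones."""
    def patchify(img):
        n, c, h, w = img.shape
        p = patch_size
        img = img.reshape(n, c, h // p, p, w // p, p)
        return img.permute(0, 2, 4, 3, 5, 1).reshape(n, -1, p * p * c)  # (N, L, p*p*c)

    per_patch_mse = ((patchify(pred) - patchify(target)) ** 2).mean(dim=-1)  # (N, L)
    return (per_patch_mse * mask).sum() / mask.sum()  # average over masked patches only

# Illustrative usage with a ~60% masking ratio at 32x32 patch granularity
N, H, W, p = 2, 224, 224, 32
mask = (torch.rand(N, (H // p) * (W // p)) < 0.6).float()
image = torch.randn(N, 3, H, W)
reconstruction = torch.randn(N, 3, H, W)  # stand-in for the decoder output
print(fcmae_loss(reconstruction, image, mask))
```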
Global response normalization
There are many mechanisms in the brain that promote neuronal diversity. For example, lateral inhibition can help enhance the response of activated neurons, increasing the contrast and selectivity of individual neurons to stimuli while also increasing the response diversity of the entire population of neurons. In deep learning, this form of lateral inhibition can be achieved through response normalization. This study introduces a new response normalization layer called global response normalization (GRN), which aims to increase the contrast and selectivity between channels. The GRN unit consists of three steps: 1) global feature aggregation, 2) feature normalization, and 3) feature calibration. As shown in the figure below, GRN layers can be merged into the original ConvNeXt block.
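The three steps map directly onto a few tensor operations. Below is a sketch of a GRN layer following the paper's description, written for channels-last tensors of shape (N, H, W, C); the epsilon value and the zero initialization of the affine parameters are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization sketch: aggregate, normalize, calibrate."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):                                         # x: (N, H, W, C)
        # 1) global feature aggregation: L2 norm over spatial dims, per channel
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)         # (N, 1, 1, C)
        # 2) feature normalization: divisive normalization across channels
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)      # (N, 1, 1, C)
        # 3) feature calibration with learnable affine params and a residual connection
        return self.gamma * (x * nx) + self.beta + x
```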
The researchers found experimentally that when GRN is applied, LayerScale is no longer necessary and can be removed. Leveraging this new block design, sketched below, the study created a family of models with varying efficiency and capacity, termed the ConvNeXt V2 model family, ranging from the lightweight Atto to the compute-intensive Huge.
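For reference, a ConvNeXt block with GRN substituted for LayerScale can be sketched as follows, reusing the GRN module from the previous sketch. GRN is inserted into the high-dimensional MLP features after the activation, as in the paper's block diagram; drop path and other training-time details are omitted here.

```python
import torch
import torch.nn as nn
# GRN: the module defined in the previous sketch

class ConvNeXtV2Block(nn.Module):
    """ConvNeXt block with GRN in place of LayerScale (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise expansion (channels-last)
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)                  # GRN on the expanded features
        self.pwconv2 = nn.Linear(4 * dim, dim)   # pointwise projection

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # to channels-last (N, H, W, C)
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        x = x.permute(0, 3, 1, 2)                # back to channels-first
        return shortcut + x                      # residual; no LayerScale
```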
To evaluate the role of GRN, the study pre-trained ConvNeXt V2 with the FCMAE framework. From the visualization in Figure 3 below and the cosine distance analysis in Figure 4, it can be observed that ConvNeXt V2 effectively alleviates the feature collapse problem: the cosine distance values stay consistently high, indicating that feature diversity is maintained as features propagate through the network layers, similar to a ViT model pre-trained with MAE. This shows that under a similar masked-image pre-training framework, the learning behavior of ConvNeXt V2 resembles that of ViT.
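One common way to quantify such collapse is the average pairwise cosine distance between spatial features at a given layer, as in the sketch below. The exact normalization used in the paper's Figure 4 may differ (for example, dividing by 2), so treat this as an illustrative metric rather than a reproduction.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine_distance(tokens):
    """tokens: (L, C) features for L spatial positions at one layer.
    Values near 0 suggest collapsed (redundant) features; larger values suggest diversity."""
    t = F.normalize(tokens, dim=-1)   # unit-normalize each position's feature vector
    cos = t @ t.t()                   # (L, L) cosine similarities
    return (1.0 - cos).mean()

# Illustrative usage on a single feature map of shape (C, H, W)
fmap = torch.randn(768, 7, 7)
print(mean_pairwise_cosine_distance(fmap.flatten(1).t()))  # (H*W, C) tokens
```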
The study further evaluated the fine-tuning performance, and the results are shown in the table below.
When equipped with GRN, the FCMAE pre-trained model significantly outperforms the supervised counterpart trained for 300 epochs. GRN improves representation quality by enhancing feature diversity, which is crucial for mask-based pre-training and absent in the ConvNeXt V1 model. Notably, this improvement is achieved without adding parameters or increasing FLOPs.
Finally, the study examines the importance of GRN in pre-training and fine-tuning. As shown in Table 2(f) below, performance drops significantly both when GRN is removed at fine-tuning time and when newly initialized GRN is added only during fine-tuning, indicating that GRN matters in both pre-training and fine-tuning.
Interested readers can read the original text of the paper to learn more about the research details.