
CV opens the era of large models! Google releases the largest ViT in history: 22 billion parameters, visual perception is close to that of humans

王林
2023-04-07 15:09:08

The Transformer is undoubtedly the biggest contributor to the boom in natural language processing, and it is also the foundation of large language models such as GPT-4.

However, compared with language models that have tens of billions of parameters, computer vision has not reaped the benefits of the Transformer to the same extent: the largest vision Transformer to date, ViT-e, has only 4 billion parameters.

Recently, Google released a paper in which researchers propose a method to efficiently and stably train large-scale Vision Transformer (ViT) models, successfully scaling the number of ViT parameters to 22 billion.



Paper link: https://arxiv.org/abs/2302.05442

To scale the model up, ViT-22B borrows ideas from language models such as PaLM: it uses QK normalization to improve training stability, and it proposes a new technique, asynchronous parallel linear operations, to improve training efficiency, allowing the model to be trained on Cloud TPUs with high hardware utilization.

In experiments evaluating downstream task performance, ViT-22B also exhibits behavior similar to that of large language models: as model scale increases, performance keeps improving.

ViT-22B can also be used in PaLM-E, where combining the large vision model with a language model significantly raises the state of the art on robotics tasks.

The researchers also observed other benefits of scale, including a better trade-off between fairness and performance, shape/texture bias that is more consistent with human visual perception, and better robustness.

Model architecture

ViT-22B is based on the Transformer architecture. Compared with the original ViT, the researchers made three main modifications to improve training efficiency and training stability.

Parallel layers

ViT-22B executes the attention block and the MLP block in parallel, whereas the original Transformer executes them sequentially.


PaLM model training also uses this method, which can increase the training speed of large models by 15% without performance degradation.
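For illustration, here is a minimal PyTorch sketch of such a parallel block; the single shared pre-LayerNorm, the GELU MLP, and the layer sizes are placeholder choices for the sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class ParallelBlock(nn.Module):
    """Sketch of a parallel Transformer block: the attention branch and the MLP
    branch both read the same normalized input, and their outputs are summed
    into one residual update instead of being applied one after the other."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)  # parallel branches, one residual sum
```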

Query/key (QK) normalization

While scaling up ViT, the researchers observed in models of around 8 billion parameters that the training loss begins to diverge after a few thousand steps. The instability is mainly caused by excessively large attention logits, which produce near-zero-entropy (almost one-hot) attention weights.

To solve this problem, the researchers apply LayerNorm to the queries and keys before the dot-product attention computation.
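A minimal PyTorch sketch of attention with QK normalization follows; the head layout and the bias-free projections follow the description in this article, but the exact placement of the normalization and all sizes are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    """Illustrative multi-head attention where LayerNorm is applied to the
    queries and keys before the dot product, bounding the attention logits and
    keeping the softmax away from near one-hot, zero-entropy weights."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # no bias, as in ViT-22B
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)                       # (b, n, heads, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)             # the QK-normalization step
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (b, heads, n, head_dim)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        out = F.softmax(logits, dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(b, n, d))
```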


Experimental results on the 8-billion-parameter model show that this normalization alleviates the divergence problem.


Removing bias terms from the QKV projections and LayerNorms

Like the PaLM model, ViT-22B removes the bias terms from the QKV projections, and all LayerNorms are applied without bias or centering, which increases hardware utilization by 3% with no loss in quality.


However, unlike PaLM, ViT-22B keeps bias terms in the MLP dense layers (both the inner and the outer projection); this was observed to improve quality without slowing down training.

In ViT-22B's encoder, the embedding layer, including patch extraction, linear projection, and added position embeddings, is the same as in the original ViT, and multi-head attention pooling is used in the head to aggregate the per-token representations.

The patch size of ViT-22B is 14×14, and the resolution of the image is 224×224 (preprocessed by inception crop and random horizontal flipping).
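The patch embedding step can be sketched as follows in PyTorch; the embedding width of 1024 is an arbitrary placeholder, and the preprocessing augmentations (inception crop, random flip) are omitted.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Toy sketch of the ViT-style embedding described above: a 224x224 image
    is split into non-overlapping 14x14 patches via a strided convolution,
    each patch is linearly projected, and learned position embeddings are
    added."""

    def __init__(self, img_size: int = 224, patch_size: int = 14, dim: int = 1024):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2      # 16 * 16 = 256 tokens
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                            # (b, dim, 16, 16)
        x = x.flatten(2).transpose(1, 2)                 # (b, 256, dim)
        return x + self.pos_embed
```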

Asynchronous parallel linear operations

Large-scale models also require sharding, that is, distributing the model parameters across different computing devices. In addition, the researchers also shard the activations (the intermediate representations of the input).


Because both the input and the matrix itself are distributed across various devices, even simple operations like matrix multiplication require special care.

The researchers developed a method called asynchronous parallel linear operations, which communicates activations and weights between devices while computation is running in the matrix-multiply units (the units that account for the vast majority of a TPU's compute).

Asynchronous methods minimize the time waiting for incoming communication, thereby increasing device efficiency.

The goal of asynchronous parallel linear operations is to compute the matrix multiplication y = Ax when the matrix A and the activation x are distributed across different devices, which requires overlapping communication and computation across devices to achieve efficiently. The matrix A is column-sharded across devices, each device holding a contiguous slice, with each block denoted Aij. See the original paper for more details.
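The blockwise structure of this sharded matrix multiplication can be illustrated with a single-host NumPy sketch; the "devices" here are just list entries, so only the partial-sum structure is shown, not the asynchronous overlap of communication and compute that the paper describes.

```python
import numpy as np

# Minimal single-host simulation of the sharded matmul y = A @ x described
# above. Real TPU code overlaps inter-device transfers with the matrix-multiply
# units; here we only illustrate how column-sharded blocks combine.

num_devices = 4
d_out, d_in = 8, 12
A = np.random.randn(d_out, d_in)
x = np.random.randn(d_in)

# Column-shard A and shard x the same way: "device" j holds A[:, slice_j]
# and x[slice_j].
A_shards = np.split(A, num_devices, axis=1)
x_shards = np.split(x, num_devices)

# Each device computes its partial product; an all-reduce would sum them.
partials = [A_j @ x_j for A_j, x_j in zip(A_shards, x_shards)]
y = np.sum(partials, axis=0)

assert np.allclose(y, A @ x)
```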


Experimental results

To show how rich the representations learned by ViT-22B are, the researchers used LiT-tuning to train a text model whose representations are aligned with the image representations.

In experiments with out-of-distribution images generated by Parti and Imagen, ViT-22B's zero-shot image classification generalizes remarkably well: having seen only natural images crawled from the web, it can recognize unseen objects and scenes.
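Conceptually, zero-shot classification with such aligned encoders reduces to a nearest-neighbor lookup in the shared embedding space, as in the sketch below; the image and text encoders that produce the embeddings are placeholders, not the paper's models.

```python
import torch
import torch.nn.functional as F


def zero_shot_classify(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Schematic zero-shot classification with aligned image/text encoders:
    the image embedding is compared to one text embedding per class name, and
    the most similar class wins."""
    image_emb = F.normalize(image_emb, dim=-1)   # (d,)
    text_embs = F.normalize(text_embs, dim=-1)   # (num_classes, d)
    similarities = text_embs @ image_emb         # cosine similarity per class
    return similarities.argmax()
```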


The paper also evaluates ViT-22B on video classification, depth estimation, and semantic segmentation tasks.

Alignment with human object recognition

To check how consistent ViT-22B's classification decisions are with human classification decisions, the researchers fine-tuned ViT-22B at different resolutions on out-of-distribution (OOD) datasets for which human comparison data is available through the model-vs-human toolbox.

The toolbox measures three key indicators: how well does the model handle distortions (accuracy)? How far apart are human and model accuracy (accuracy difference)? How similar are human and model error patterns (error consistency)?


Shape bias evaluation (higher values indicate greater shape bias). Many vision models have a low shape / high texture bias, while ViT-22B fine-tuned on ImageNet has the highest shape bias recorded among ML models to date, closer to the human shape bias.

Experimental results show that although not all fine-tuning settings perform equally well, ViT-22B variants reach new highs on all three metrics.

In addition, ViT-22B holds the highest shape-bias record among vision models. This means it relies mainly on the shape of objects rather than their texture to make classification decisions, a strategy similar to human perception (whose shape bias is 96%).

Standard models (e.g. ResNet-50, with 20-30% shape bias) typically classify based on texture, while models with high shape bias tend to focus on shape (for example, recognizing a cat-shaped image as a cat even when it carries a different texture). ViT-22B is thus more similar to human visual object recognition, although many differences between human and model perception remain.


Cat or elephant? Car or clock? Bird or bicycle? Images with the shape of one object and the texture of another can be used to measure shape/texture bias.

Out-of-distribution performance

Measuring performance on the OOD data set helps evaluate model generalization.

In this experiment, the researchers constructed label mappings from JFT to ImageNet, and from ImageNet to different out-of-distribution datasets such as ObjectNet.

Results are reported after pre-training on this data, and again after the model is fully fine-tuned on ImageNet.


It can be observed that scaling up Vision Transformers improves OOD performance: even as ImageNet accuracy saturates, moving from ViT-e to ViT-22B significantly improves performance on ObjectNet.

Linear Probe

Linear probing is a technique that trains a single linear layer on top of a frozen model. Compared with full fine-tuning, it is cheaper to train and easier to set up.
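A linear probe can be sketched in a few lines of PyTorch; the `backbone`, the data loader, and the optimizer settings below are placeholders for illustration, not the paper's training setup.

```python
import torch
import torch.nn as nn


def linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int,
                 loader, epochs: int = 10, lr: float = 1e-3) -> nn.Linear:
    """Hypothetical linear-probe loop: the backbone stays frozen and only a
    single linear classifier on top of its features is trained."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)                  # freeze the pretrained model

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)         # frozen features
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```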


Linear probing results: trained on ImageNet and evaluated on ImageNet-ReaL, ImageNet-v2, ObjectNet, ImageNet-R, and ImageNet-A, with a high-resolution fine-tuned ViT-e/14 provided as a reference.

The results show that ViT-22B's linear probing performance approaches that of state-of-the-art full fine-tuning of smaller models on high-resolution images, even though training at higher resolution is generally much more expensive and achieves better results on many tasks.

Distillation

Through distillation, the knowledge of a larger model can be transferred into a smaller one, improving the operating efficiency of models that would otherwise be expensive and slow to run.
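A generic soft-label distillation loss looks like the sketch below; the temperature value is a placeholder, and the paper's exact distillation recipe may differ.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label distillation: the student is trained to match the softened
    class distribution predicted by the larger teacher model.
    Both logits tensors have shape (batch, num_classes)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between teacher and student, rescaled by t^2 as is standard.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```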


The experimental results show that the knowledge of ViT-22B can be transferred to smaller models such as ViT-B/16 and ViT-L/16, setting new ImageNet performance records for those model sizes.

Fairness and Bias

Machine learning models are susceptible to unintended unfair biases, such as picking up spurious correlations or exhibiting performance gaps across subgroups. The researchers found that scaling up the model can help mitigate these issues.

First, scale offers a more favorable trade-off: even when the model is post-processed after training to keep its demographic parity below a prescribed, tolerable level, performance still improves with scale.





Above: accuracy for each subgroup in CelebA before debiasing. Below: the y-axis shows the absolute difference in performance between the two subgroups highlighted in this example (females and males); compared with smaller ViT models, the performance gap of ViT-22B is very small.

More importantly, this holds not only when performance is measured by accuracy, but also for other metrics such as calibration, a statistical measure of the truthfulness of the model's estimated probabilities. Classification across all subgroups tends to improve with scale, and ViT-22B reduces the performance gap between subgroups.

Conclusion

The researchers presented ViT-22B, currently one of the largest vision Transformer models, with 22 billion parameters.

By making small but critical modifications to the original architecture, they achieved higher hardware utilization and training stability, resulting in a model that raises the performance ceiling on several benchmarks.

Using the frozen model to generate embeddings, only a few layers need to be trained on top to achieve very good performance. The evaluation results further show that, compared with existing models, ViT-22B is more similar to human visual perception in terms of shape and texture bias, and offers advantages in fairness and robustness.

