
CV opens the era of large models! Google releases the largest ViT in history: 22 billion parameters, visual perception is close to that of humans

王林
2023-04-07 15:09:08

The Transformer is undoubtedly the biggest contributor to the boom in natural language processing, and it is also the foundation of large language models such as GPT-4.

However, compared with language models that have tens of billions of parameters, computer vision has not reaped the benefits of the Transformer to the same extent: the largest vision Transformer to date, ViT-e, has only 4 billion parameters.

Recently, Google released a paper in which researchers propose a method to efficiently and stably train large-scale Vision Transformer (ViT) models, successfully scaling the number of ViT parameters to 22 billion.



Paper link: https://arxiv.org/abs/2302.05442

To scale the model up, ViT-22B borrows ideas from language models such as PaLM: it uses QK normalization to improve training stability, and it proposes a new technique, asynchronous parallel linear operations, to improve training efficiency, allowing the model to be trained on Cloud TPUs with high hardware utilization.

In experiments evaluating downstream task performance, ViT-22B also exhibits behavior similar to that of large language models: as model scale increases, performance keeps improving.

ViT-22B can also be used in PaLM-E, where combining the large vision model with a language model significantly raises the state of the art on robotics tasks.

The researchers also observed other benefits of scale, including a better trade-off between fairness and performance, shape/texture bias that is more consistent with human visual perception, and better robustness.

Model architecture

ViT-22B is based on the Transformer architecture. Compared with the original ViT, the researchers made three main modifications to improve training efficiency and training stability.

Parallel layers

ViT-22B executes the attention block and the MLP block in parallel, whereas the original Transformer executes them sequentially.


PaLM model training also uses this method, which can increase the training speed of large models by 15% without performance degradation.
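For illustration, here is a minimal PyTorch sketch of such a parallel block; the single shared pre-LayerNorm, the GELU MLP, and the layer sizes are placeholder choices for the sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class ParallelBlock(nn.Module):
    """Sketch of a parallel Transformer block: the attention branch and the MLP
    branch both read the same normalized input, and their outputs are summed
    into one residual update instead of being applied one after the other."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)  # parallel branches, one residual sum
```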

Query/key (QK) normalization

While scaling up ViT, the researchers observed in models of around 8 billion parameters that the training loss begins to diverge after a few thousand steps. The instability is mainly caused by excessively large attention logits, which produce near-zero-entropy (almost one-hot) attention weights.

To solve this problem, the researchers apply LayerNorm to the queries and keys before the dot-product attention computation.
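A minimal PyTorch sketch of attention with QK normalization follows; the head layout and the bias-free projections follow the description in this article, but the exact placement of the normalization and all sizes are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    """Illustrative multi-head attention where LayerNorm is applied to the
    queries and keys before the dot product, bounding the attention logits and
    keeping the softmax away from near one-hot, zero-entropy weights."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)   # no bias, as in ViT-22B
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)                       # (b, n, heads, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)             # the QK-normalization step
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (b, heads, n, head_dim)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        out = F.softmax(logits, dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(b, n, d))
```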


Experimental results on the 8-billion-parameter model show that this normalization alleviates the divergence problem.


Removing bias terms from the QKV projections and LayerNorms

Like the PaLM model, ViT-22B removes the bias terms from the QKV projections, and all LayerNorms are applied without bias or centering, which increases hardware utilization by 3% with no loss in quality.


However, unlike PaLM, ViT-22B keeps bias terms in the MLP dense layers (both the inner and the outer projection); this was observed to improve quality without slowing down training.

In ViT-22B's encoder, the embedding layer, including patch extraction, linear projection, and added position embeddings, is the same as in the original ViT, and multi-head attention pooling is used in the head to aggregate the per-token representations.

The patch size of ViT-22B is 14×14, and the resolution of the image is 224×224 (preprocessed by inception crop and random horizontal flipping).
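The patch embedding step can be sketched as follows in PyTorch; the embedding width of 1024 is an arbitrary placeholder, and the preprocessing augmentations (inception crop, random flip) are omitted.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Toy sketch of the ViT-style embedding described above: a 224x224 image
    is split into non-overlapping 14x14 patches via a strided convolution,
    each patch is linearly projected, and learned position embeddings are
    added."""

    def __init__(self, img_size: int = 224, patch_size: int = 14, dim: int = 1024):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2      # 16 * 16 = 256 tokens
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                            # (b, dim, 16, 16)
        x = x.flatten(2).transpose(1, 2)                 # (b, 256, dim)
        return x + self.pos_embed
```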

Asynchronous parallel linear operations

Large-scale models also require sharding, that is, distributing the model parameters across different computing devices. In addition, the researchers also shard the activations (the intermediate representations of the input).


Because both the input and the matrix itself are distributed across various devices, even simple operations like matrix multiplication require special care.

The researchers developed a method called asynchronous parallel linear operations, which communicates activations and weights between devices while computation is running in the matrix-multiply units (the units that account for the vast majority of a TPU's compute).

Asynchronous methods minimize the time waiting for incoming communication, thereby increasing device efficiency.

The goal of asynchronous parallel linear operations is to compute the matrix multiplication y = Ax when the matrix A and the activation x are distributed across different devices, which requires overlapping communication and computation across devices to achieve efficiently. The matrix A is column-sharded across devices, each device holding a contiguous slice, with each block denoted Aij. See the original paper for more details.
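The blockwise structure of this sharded matrix multiplication can be illustrated with a single-host NumPy sketch; the "devices" here are just list entries, so only the partial-sum structure is shown, not the asynchronous overlap of communication and compute that the paper describes.

```python
import numpy as np

# Minimal single-host simulation of the sharded matmul y = A @ x described
# above. Real TPU code overlaps inter-device transfers with the matrix-multiply
# units; here we only illustrate how column-sharded blocks combine.

num_devices = 4
d_out, d_in = 8, 12
A = np.random.randn(d_out, d_in)
x = np.random.randn(d_in)

# Column-shard A and shard x the same way: "device" j holds A[:, slice_j]
# and x[slice_j].
A_shards = np.split(A, num_devices, axis=1)
x_shards = np.split(x, num_devices)

# Each device computes its partial product; an all-reduce would sum them.
partials = [A_j @ x_j for A_j, x_j in zip(A_shards, x_shards)]
y = np.sum(partials, axis=0)

assert np.allclose(y, A @ x)
```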


Experimental results

To show how rich the representations learned by ViT-22B are, the researchers used LiT-tuning to train a text model whose representations are aligned with the image representations.

In experiments with out-of-distribution images generated by Parti and Imagen, ViT-22B's zero-shot image classification generalizes remarkably well: having seen only natural images crawled from the web, it can recognize unseen objects and scenes.
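Conceptually, zero-shot classification with such aligned encoders reduces to a nearest-neighbor lookup in the shared embedding space, as in the sketch below; the image and text encoders that produce the embeddings are placeholders, not the paper's models.

```python
import torch
import torch.nn.functional as F


def zero_shot_classify(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Schematic zero-shot classification with aligned image/text encoders:
    the image embedding is compared to one text embedding per class name, and
    the most similar class wins."""
    image_emb = F.normalize(image_emb, dim=-1)   # (d,)
    text_embs = F.normalize(text_embs, dim=-1)   # (num_classes, d)
    similarities = text_embs @ image_emb         # cosine similarity per class
    return similarities.argmax()
```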


The paper also evaluates ViT-22B on video classification, depth estimation, and semantic segmentation tasks.

Alignment with human object recognition

To check how consistent ViT-22B's classification decisions are with human classification decisions, the researchers fine-tuned ViT-22B at different resolutions on out-of-distribution (OOD) datasets for which human comparison data is available through the model-vs-human toolbox.

The toolbox measures three key indicators: how well does the model handle distortions (accuracy)? How far apart are human and model accuracy (accuracy difference)? How similar are human and model error patterns (error consistency)?


Shape bias evaluation (higher values indicate greater shape bias). Many vision models have a low shape / high texture bias, while ViT-22B fine-tuned on ImageNet has the highest shape bias recorded among ML models to date, closer to the human shape bias.

Experimental results show that although not all fine-tuning settings perform equally well, ViT-22B variants reach new highs on all three metrics.

In addition, ViT-22B holds the highest shape-bias record among vision models. This means it relies mainly on the shape of objects rather than their texture to make classification decisions, a strategy similar to human perception (whose shape bias is 96%).

Standard models (e.g. ResNet-50, with 20-30% shape bias) typically classify based on texture, while models with high shape bias tend to focus on shape (for example, recognizing a cat-shaped image as a cat even when it carries a different texture). ViT-22B is thus more similar to human visual object recognition, although many differences between human and model perception remain.


Cat or elephant? Car or clock? Bird or bicycle? Images with the shape of one object and the texture of another can be used to measure shape/texture bias.

Out-of-distribution performance

Measuring performance on the OOD data set helps evaluate model generalization.

In this experiment, the researchers constructed label mappings from JFT to ImageNet, and from ImageNet to different out-of-distribution datasets such as ObjectNet.

Results are reported after pre-training on this data, and again after the model is fully fine-tuned on ImageNet.


It can be observed that scaling up Vision Transformers improves OOD performance: even as ImageNet accuracy saturates, moving from ViT-e to ViT-22B significantly improves performance on ObjectNet.

Linear Probe

Linear probing is a technique that trains a single linear layer on top of a frozen model. Compared with full fine-tuning, it is cheaper to train and easier to set up.
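A linear probe can be sketched in a few lines of PyTorch; the `backbone`, the data loader, and the optimizer settings below are placeholders for illustration, not the paper's training setup.

```python
import torch
import torch.nn as nn


def linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int,
                 loader, epochs: int = 10, lr: float = 1e-3) -> nn.Linear:
    """Hypothetical linear-probe loop: the backbone stays frozen and only a
    single linear classifier on top of its features is trained."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)                  # freeze the pretrained model

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)         # frozen features
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```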


Linear probing results: trained on ImageNet and evaluated on ImageNet-ReaL, ImageNet-v2, ObjectNet, ImageNet-R, and ImageNet-A, with a high-resolution fine-tuned ViT-e/14 provided as a reference.

The results show that ViT-22B's linear probing performance approaches that of state-of-the-art full fine-tuning of smaller models on high-resolution images, even though training at higher resolution is generally much more expensive and achieves better results on many tasks.

Distillation

Through distillation, the knowledge of a larger model can be transferred into a smaller one, improving the operating efficiency of models that would otherwise be expensive and slow to run.
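A generic soft-label distillation loss looks like the sketch below; the temperature value is a placeholder, and the paper's exact distillation recipe may differ.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label distillation: the student is trained to match the softened
    class distribution predicted by the larger teacher model.
    Both logits tensors have shape (batch, num_classes)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence between teacher and student, rescaled by t^2 as is standard.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```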


The experimental results show that the knowledge of ViT-22B can be transferred to smaller models such as ViT-B/16 and ViT-L/16, setting new ImageNet performance records for those model sizes.

Fairness and Bias

Machine learning models are susceptible to unintended unfair biases, such as picking up spurious correlations or exhibiting performance gaps across subgroups. The researchers found that scaling up the model can help mitigate these issues.

First, scale offers a more favorable trade-off: even when the model is post-processed after training to keep its demographic parity below a prescribed, tolerable level, performance still improves with scale.





Above: accuracy for each subgroup in CelebA before debiasing. Below: the y-axis shows the absolute difference in performance between the two subgroups highlighted in this example (females and males); compared with smaller ViT models, the performance gap of ViT-22B is very small.

More importantly, this holds not only when performance is measured by accuracy, but also for other metrics such as calibration, a statistical measure of the truthfulness of the model's estimated probabilities. Classification across all subgroups tends to improve with scale, and ViT-22B reduces the performance gap between subgroups.

Conclusion

The researchers presented ViT-22B, currently one of the largest vision Transformer models, with 22 billion parameters.

By making small but critical modifications to the original architecture, they achieved higher hardware utilization and training stability, resulting in a model that raises the performance ceiling on several benchmarks.

Using the frozen model to generate embeddings, only a few layers need to be trained on top to achieve very good performance. The evaluation results further show that, compared with existing models, ViT-22B is more similar to human visual perception in terms of shape and texture bias, and offers advantages in fairness and robustness.

