


As in natural language processing, transferring pre-trained visual backbones improves performance on a wide variety of visual tasks. Larger datasets, scalable architectures, and new training methods have all driven these gains.
However, vision models still lag far behind language models. The largest dense vision model to date, ViT-e, has only 4B parameters, while entry-level language models routinely exceed 10B parameters, to say nothing of large language models such as the 540B-parameter PaLM.
To probe the performance limits of AI models, Google Research recently scaled the Vision Transformer up to 22B parameters, proposing ViT-22B. Compared with the previous 4B-parameter record, this makes it by far the largest dense ViT model to date.
Paper address: https://arxiv.org/pdf/2302.05442.pdf
Table 1 compares ViT-22B with the previous largest models, ViT-G and ViT-e. As the table shows, ViT-22B mainly scales up the width of the model, which increases the parameter count while keeping the depth the same as ViT-G.
Table 1: Current large ViT models.
As one Zhihu user put it: did Google, after losing a round to ChatGPT, decide it had to compete in the CV field instead?
How was this done? Early in the project, the researchers found that scaling up ViT led to training instabilities that required changes to the architecture. They then carefully redesigned the model and trained it with unprecedented parallel efficiency. The quality of ViT-22B was assessed on a comprehensive suite of tasks, from (few-shot) classification to dense prediction, where it matched or exceeded the current SOTA. For example, even when used only as a frozen visual feature extractor, ViT-22B reaches 89.5% accuracy on ImageNet. By training a text tower to match these visual features, it achieves 85.9% zero-shot accuracy on ImageNet. The model also serves well as a distillation teacher: a ViT-B student trained against it reaches 88.6% on ImageNet, a SOTA result for a model of that size.
Model Architecture
ViT-22B is a Transformer-based encoder model similar to the original Vision Transformer architecture, but with three major modifications that improve efficiency and stability in large-scale training: parallel layers, query/key (QK) normalization, and omitted biases.
Parallel layers. Following Wang and Komatsuzaki, the attention and MLP blocks are arranged in parallel rather than sequentially, so each block computes y = x + MLP(LN(x)) + Attention(LN(x)).
This arrangement enables additional parallelization through fused projections: the matrix multiplications for the query/key/value projections and the first linear layer of the MLP are fused into a single operation, as are the attention output projection and the second linear layer of the MLP.
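Below is a minimal Flax sketch of such a parallel block with fused projections. The dimensions are illustrative and, unlike the paper (which keeps biases on the MLP dense layers), all biases are dropped here for brevity; this is not the released ViT-22B code.

```python
# Sketch of a parallel Transformer block: attention and MLP read the same
# normalized input, and their linear projections are fused into two matmuls.
import jax
import jax.numpy as jnp
import flax.linen as nn

class ParallelBlock(nn.Module):
    dim: int = 1024        # hidden size (hypothetical)
    heads: int = 16        # attention heads (hypothetical)
    mlp_dim: int = 4096    # MLP hidden size (hypothetical)

    @nn.compact
    def __call__(self, x):                       # x: [batch, tokens, dim]
        h = nn.LayerNorm(use_bias=False)(x)
        d_head = self.dim // self.heads

        # Fused input projection: Q, K, V and the first MLP layer in one matmul.
        fused = nn.Dense(3 * self.dim + self.mlp_dim, use_bias=False)(h)
        q, k, v, mlp_h = jnp.split(fused, [self.dim, 2 * self.dim, 3 * self.dim], axis=-1)

        to_heads = lambda t: t.reshape(t.shape[0], t.shape[1], self.heads, d_head)
        attn = nn.dot_product_attention(to_heads(q), to_heads(k), to_heads(v))
        attn = attn.reshape(x.shape)             # back to [batch, tokens, dim]

        # Fused output projection: attention output and second MLP layer in one matmul,
        # equivalent to summing the two sub-block outputs.
        out = nn.Dense(self.dim, use_bias=False)(
            jnp.concatenate([attn, nn.gelu(mlp_h)], axis=-1))
        return x + out                           # single residual around the parallel block

# Usage with placeholder shapes.
params = ParallelBlock().init(jax.random.PRNGKey(0), jnp.ones((2, 16, 1024)))
y = ParallelBlock().apply(params, jnp.ones((2, 16, 1024)))
```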
QK normalization. One difficulty in training large models is keeping training stable. While scaling up ViT, the researchers found that the training loss diverged after a few thousand steps, a problem especially pronounced in the 8B-parameter model. To stabilize training, they adopted the approach of Gilmer et al. and applied LayerNorm to the queries and keys before the dot-product attention computation; that is, the attention weights are computed as softmax[ LN(XW^Q) LN(XW^K)^T / sqrt(d) ].
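A minimal sketch, in plain JAX, of the QK-normalization idea (learnable LayerNorm scales are omitted for brevity; this is not the authors' exact implementation):

```python
import jax
import jax.numpy as jnp

def layernorm(x, eps=1e-6):
    # Parameter-free LayerNorm over the feature axis (learnable scale/shift omitted).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / jnp.sqrt(var + eps)

def qk_norm_attention(q, k, v):
    # q, k, v: [batch, tokens, heads, head_dim]
    q, k = layernorm(q), layernorm(k)                      # LN on queries and keys
    logits = jnp.einsum("bqhd,bkhd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
    weights = jax.nn.softmax(logits, axis=-1)              # attention weights
    return jnp.einsum("bhqk,bkhd->bqhd", weights, v)
```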
Omitted biases. Following PaLM, the bias terms are removed from the QKV projections and all LayerNorms are applied without bias, which improves accelerator utilization (by 3%) without degrading quality. Unlike PaLM, however, the researchers kept a bias term for the MLP dense layers, observing that this preserves quality without hurting speed.
Figure 2 shows a ViT-22B encoder block. The embedding layer follows the original ViT: patch extraction, linear projection, and adding position embeddings. Multi-head attention pooling is used in the head to aggregate the per-token representations.
ViT-22B uses 14 × 14 patches at an image resolution of 224 × 224, with learned one-dimensional position embeddings. When fine-tuning on higher-resolution images, the pre-trained position embeddings are interpolated in two dimensions according to their location in the original image.
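A rough sketch of what such an interpolation can look like, assuming the learned 1D position embeddings correspond to a square patch grid; the helper and the resolutions below are illustrative, not the released code:

```python
import jax
import jax.numpy as jnp

def resize_posemb(posemb, old_grid, new_grid):
    # posemb: [old_grid * old_grid, dim] learned position embeddings
    dim = posemb.shape[-1]
    grid = posemb.reshape(old_grid, old_grid, dim)
    # Interpolate on the 2D patch grid, then flatten back to a 1D sequence.
    grid = jax.image.resize(grid, (new_grid, new_grid, dim), method="bilinear")
    return grid.reshape(new_grid * new_grid, dim)

# Example: a 16x16 grid (224px / 14px patches) resized to 32x32 (448px / 14px patches).
posemb = jnp.zeros((16 * 16, 1024))
print(resize_posemb(posemb, 16, 32).shape)   # (1024, 1024)
```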
Training Infrastructure and Efficiency
ViT-22B is implemented in JAX using the FLAX library and built within Scenic. It exploits both model and data parallelism. Notably, the researchers used the jax.xmap API, which gives explicit control over the sharding of all intermediates (such as weights and activations) and over inter-chip communication. The chips are organized into a 2D logical grid of size t × k, where t is the size of the data-parallel axis and k is the size of the model axis. For each of the t groups, the k devices receive the same batch of images; each device keeps only 1/k of the activations and is responsible for computing 1/k of the output of every linear layer (details below).
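For illustration, here is a simplified analogue of this t × k layout using the higher-level jax.sharding API (the paper itself relies on the lower-level jax.xmap); the tensor shapes are placeholders and the snippet assumes eight available devices:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

t, k = 2, 4                                              # data-parallel and model-parallel sizes
mesh = Mesh(np.array(jax.devices()).reshape(t, k), axis_names=("data", "model"))

x = jnp.ones((16, 1024))                                 # activations: batch x hidden
w = jnp.ones((1024, 4096))                               # one large linear layer

# Each data group sees its slice of the batch; activations and weights are split
# over the model axis so every device holds only 1/k of them.
x = jax.device_put(x, NamedSharding(mesh, P("data", "model")))
w = jax.device_put(w, NamedSharding(mesh, P("model", None)))

# XLA inserts the collectives needed so each device computes 1/k of the layer output.
y = jax.jit(lambda x, w: x @ w)(x, w)
print(y.shape)   # (16, 4096)
```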
Figure 3: Asynchronous parallel linear operation (y = Ax): communication and computation across devices are overlapped for model-parallel matrix multiplication.
Asynchronous parallel linear operations. To maximize throughput, computation and communication must be considered jointly: the sharded operations should be analytically equivalent to the unsharded case while communicating as little as possible, and communication should ideally overlap with computation so that the matrix-multiply units (where most of the FLOP capacity resides) stay busy at all times.
Parameter sharding. The model is data-parallel along the first axis. Each parameter can either be fully replicated along this axis, or each device can hold just a chunk of it. The researchers chose to shard some of the large parameter tensors in order to fit larger models and batch sizes.
Using these techniques, ViT-22B processes 1.15k tokens per second per core during training on TPUv4. The model flops utilization (MFU) of ViT-22B is 54.9%, indicating a very efficient use of the hardware. Note that PaLM reports an MFU of 46.2%, while the researchers measured an MFU of 44.0% for ViT-e (data parallelism only) on the same hardware.
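For context, model flops utilization is simply the FLOPs the model usefully performs per second divided by the hardware's peak FLOP rate. A back-of-the-envelope sketch with placeholder numbers (not values from the paper):

```python
# MFU = useful FLOPs per second / peak FLOPs per second of the hardware.
def mfu(tokens_per_sec, flops_per_token, peak_flops_per_sec):
    return tokens_per_sec * flops_per_token / peak_flops_per_sec

# Placeholder values for illustration only.
print(f"{mfu(tokens_per_sec=1.15e3, flops_per_token=1.0e11, peak_flops_per_sec=2.5e14):.1%}")
```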
Experimental results
The experiments first evaluate ViT-22B on image classification.
The results in Table 2 show that ViT-22B still delivers significant improvements across metrics. Moreover, the study shows that linear probing of a large model such as ViT-22B can approach or exceed the high-resolution full fine-tuning performance of smaller models, while being much cheaper and easier to run.
The study further tests linear separability on the fine-grained classification dataset iNaturalist 2017, comparing ViT-22B against other ViT variants at input resolutions of 224px and 384px. The results are shown in Figure 4: ViT-22B significantly outperforms the other ViT variants, especially at the standard 224px input resolution, suggesting that its large parameter count helps it extract detailed information from images.
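As a reminder of what linear probing involves, here is a minimal JAX sketch: the backbone stays frozen and only a linear classifier on its features is trained. The shapes and the `backbone_apply` call mentioned in the comment are hypothetical stand-ins, not part of the released model.

```python
import jax
import jax.numpy as jnp

def init_probe(key, feat_dim=1024, num_classes=1000):
    # Linear classifier parameters on top of frozen backbone features.
    w = jax.random.normal(key, (feat_dim, num_classes)) * 0.01
    b = jnp.zeros((num_classes,))
    return {"w": w, "b": b}

def probe_loss(params, feats, labels):
    # feats come from the frozen backbone; no gradient flows into it.
    logits = feats @ params["w"] + params["b"]
    logp = jax.nn.log_softmax(logits)
    return -jnp.mean(jnp.take_along_axis(logp, labels[:, None], axis=-1))

@jax.jit
def update(params, feats, labels, lr=0.1):
    grads = jax.grad(probe_loss)(params, feats, labels)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

# Usage: feats = jax.lax.stop_gradient(backbone_apply(frozen_params, images))
params = init_probe(jax.random.PRNGKey(0))
params = update(params, jnp.ones((8, 1024)), jnp.zeros((8,), dtype=jnp.int32))
```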
Table 3 reports the zero-shot transfer results of ViT-22B alongside CLIP, ALIGN, BASIC, CoCa, and LiT models. The bottom of Table 3 compares the performance of three ViT model sizes.
ViT-22B achieves comparable or better results on all ImageNet test sets. Notably, zero-shot accuracy on the ObjectNet test set correlates strongly with ViT model size, and the largest model, ViT-22B, sets a new state of the art on this challenging test set.
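Schematically, LiT-style zero-shot classification embeds the class names with the text tower and picks the class whose embedding best matches the frozen image embedding. A minimal sketch with placeholder encoders (not APIs from the paper):

```python
import jax.numpy as jnp

def zero_shot_classify(image_emb, class_text_embs):
    # Normalize both sides and score every class by cosine similarity.
    img = image_emb / jnp.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = class_text_embs / jnp.linalg.norm(class_text_embs, axis=-1, keepdims=True)
    sims = img @ txt.T                      # [batch, num_classes]
    return jnp.argmax(sims, axis=-1)        # predicted class index per image

# Usage (hypothetical encoders): image_emb = image_encoder(images);
# class_text_embs = text_encoder(["a photo of a dog", "a photo of a cat", ...])
preds = zero_shot_classify(jnp.ones((4, 512)), jnp.ones((1000, 512)))
```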
Out-of-distribution (OOD). The study constructs a label mapping from JFT to ImageNet, as well as mappings from ImageNet to several out-of-distribution datasets: ObjectNet, ImageNet-v2, ImageNet-R, and ImageNet-A.
The results confirm that, consistent with the improvements on ImageNet, scaling the model also improves out-of-distribution performance. This holds both for models that have only seen JFT images and for models fine-tuned on ImageNet. In both cases, ViT-22B continues the trend of larger models showing better OOD performance (Figure 5, Table 11).
In addition, the researchers studied how well ViT-22B captures geometric and spatial information, evaluating it on semantic segmentation and monocular depth estimation tasks.
Semantic segmentation. The researchers evaluated ViT-22B as a semantic segmentation backbone on three benchmarks: ADE20K, Pascal Context, and Pascal VOC. As Table 4 shows, the ViT-22B backbone transfers better when only a few segmentation masks are available.
Monocular depth estimation. Table 5 summarizes the main findings. Using ViT-22B features with a DPT decoder (top rows) yields the best performance on all metrics compared with the other backbones. Comparing the ViT-22B backbone with ViT-e, a smaller model trained on the same data, shows that scaling up the architecture improves performance.
In addition, comparing the ViT-e backbone with ViT-L (a similar architecture to ViT-e but trained on less data), the study found that part of the improvement also comes from scaling up the pre-training data. These findings suggest that both larger models and larger datasets help improve performance.
The study also explores video datasets. Table 6 shows video classification results on Kinetics 400 and Moments in Time, demonstrating that competitive results can be achieved with a frozen backbone. The comparison baseline is ViT-e, the largest prior visual backbone at 4 billion parameters, also trained on the JFT dataset. The larger ViT-22B improves on it by 1.5 points on Kinetics 400 and 1.3 points on Moments in Time.
Finally, the study notes that there is room for further improvement through complete end-to-end fine-tuning.
Please refer to the original paper for more technical details.