


A new idea for accelerating ViT models! Meta launches Token Merging, which relies on merging rather than pruning
The Vision Transformer (ViT) entered the public eye two years ago and has become a core component of computer vision research.
It successfully migrated the Transformer model from natural language processing to computer vision, and progress in the field has accelerated ever since.
Despite being surpassed in cost and performance by some newer architectures, vanilla ViT still has many advantages.
ViTs are composed of simple matrix multiplications, which makes them faster than their raw operation count would suggest.
Additionally, they support powerful self-supervised pre-training techniques such as MAE (Masked Autoencoders), which can produce state-of-the-art results while training quickly.
And because they make few assumptions about the data, they can be applied to many modalities such as images, audio, and text with almost no changes.
Of course, the ideal is rosy but reality is harsh: ViT models are large and have high latency, and running such a complex model on resource-constrained devices can be very problematic.
Token pruning: an improvement, but not a complete one
To address the problem of slow inference, researchers have proposed multiple solutions. One common way to speed up vision Transformer models is token pruning: removing less important tokens at runtime to produce a more efficient Transformer. For example, DynamicViT prunes redundant tokens hierarchically to reduce FLOPs in classification tasks.
However, token pruning has several problems, the most important being that removing tokens loses information. To keep that loss small, only unimportant tokens can be pruned, which limits how many tokens can be removed from a ViT model.
Also, for the pruned model to remain accurate, it usually has to be trained again, which consumes additional resources.
More importantly, token pruning is a dynamic process: the number of tokens pruned varies with the input image or sentence. While this flexibility helps accuracy, it is not practical, because inputs with different token counts can no longer be processed in batches.
To work around this, masks must be added during pruning, which further erodes the efficiency gains.
Simply put, token pruning does make ViT run faster, but this speedup is achieved at the cost of information loss.
Token Merging: another idea
How can ViT match the speed of pruning while maintaining higher accuracy? The Meta AI research team has come up with a new solution: Token Merging (ToMe).
Paper link: https://arxiv.org/pdf/2210.09461.pdf
Token Merging combines tokens instead of pruning them. Thanks to its custom matching algorithm, it is as fast as pruning while being more accurate. It also works without any additional training, so it can be used to speed up huge models without sacrificing much accuracy.
Meta's goal is to insert a Token Merging module into an existing ViT and improve training and inference throughput by merging redundant tokens, without requiring additional training.
The basic idea: in each layer of the Transformer model, merging reduces the token count by r. If the model has L layers, merging removes rL tokens in total. The value of r controls the trade-off between speed and accuracy, since fewer tokens means lower accuracy but higher throughput.
Notably, Token Merging removes exactly rL tokens regardless of the image content. This solves the batching problem that token pruning suffers from.
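The per-layer token budget described above can be sketched with a few lines of arithmetic. The function below is illustrative only (the names and the 196-patch / 12-layer figures are assumptions based on a typical ViT-B, not taken from the paper):

```python
def tokens_per_layer(n_tokens, L, r):
    """Token count after each of L layers when every layer merges away r tokens."""
    counts = [n_tokens]
    for _ in range(L):
        # Each layer removes r tokens (never dropping below a single token).
        counts.append(max(counts[-1] - r, 1))
    return counts

# e.g. a 196-patch ViT-B (12 layers) with r=8 merges away 12*8 = 96 tokens:
print(tokens_per_layer(196, 12, 8)[-1])  # → 100
```

Because this schedule depends only on L and r, every image in a batch ends up with the same token count at every layer, which is what makes batching possible.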
With ToMe, batches of similar tokens are merged in each Transformer block: for example, dog fur is merged into a single token.
Token Merging is inserted inside every Transformer block, after the attention module. This contrasts with the token pruning workflow, which tends to place the pruning step at the beginning of each Transformer block.
By merging after attention, information from the tokens to be merged can still propagate, and ViT can use features computed by the attention block to decide which tokens to merge.
Specific method
The first step of merging is determining which tokens are similar. Since Q, K, and V (query, key, value) have already been computed in the Transformer, the research team ran ablation experiments and found that the keys best measure similarity between tokens (purple part in the figure below).
This is because each key already summarizes the information contained in its token for use in the attention dot product, so it can also be used to measure similarity between tokens.
In addition to deciding which feature best measures token similarity, a distance metric is also needed. Through experiments, the research team found that using cosine distance between keys achieves the best trade-off between accuracy and speed.
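The similarity measure above can be sketched as a small numpy function. This is a minimal illustration, not the paper's implementation; the input shape (one key vector per token) and the function name are assumptions:

```python
import numpy as np

def token_similarity(keys):
    """Pairwise cosine similarity between token keys.

    keys: (N, d) array of attention keys, one row per token.
    Returns an (N, N) matrix whose (i, j) entry is the cosine
    similarity between tokens i and j.
    """
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    unit = keys / np.clip(norms, 1e-12, None)  # unit-normalize each key
    return unit @ unit.T                       # dot products of unit vectors
```

Tokens whose keys point in nearly the same direction score close to 1 and are good candidates for merging.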
After defining token similarity, a fast method is needed to decide which tokens to match so that the total count is reduced by r.
The Meta team did not use a k-means clustering algorithm or a graph partitioning algorithm, but a matching algorithm, because matching can precisely control the number of tokens merged in each layer and can perform thousands of matches quickly. Iterative clustering algorithms can do neither.
Therefore, the Meta team came up with a more efficient solution.
The design goals were: (1) avoid any iterations that cannot be parallelized; (2) keep the merging changes gradual, since clustering places no limit on how many tokens can be merged into one group (which may adversely affect the network), whereas with matching most tokens are left unmerged.
- Divide all tokens into two sets A and B of equal size.
- Draw an edge from each token in set A to its most similar token in set B.
- Keep only the r most similar edges and delete the rest.
- Merge the tokens on the edges that remain connected (their features are averaged).
- Concatenate the two sets back together to get the final merged result.
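The steps above can be sketched as a single merging pass. This is a simplified illustration of the bipartite matching idea, not Meta's released code: the alternating A/B split, the plain (unweighted) feature average, and all names here are assumptions, and the actual method additionally tracks how many patches each token represents.

```python
import numpy as np

def bipartite_merge(tokens, keys, r):
    """One merging step: reduce N tokens to N - r by bipartite matching.

    tokens: (N, d) token features; keys: (N, d) attention keys.
    """
    # 1) Split tokens alternately into two equal-sized sets A and B.
    a_idx = np.arange(0, len(tokens), 2)
    b_idx = np.arange(1, len(tokens), 2)

    # 2) Cosine similarity from every A token to every B token.
    ka = keys[a_idx] / np.linalg.norm(keys[a_idx], axis=1, keepdims=True)
    kb = keys[b_idx] / np.linalg.norm(keys[b_idx], axis=1, keepdims=True)
    sim = ka @ kb.T                        # shape (|A|, |B|)

    # 3) Each A token draws one edge to its best match in B;
    #    keep only the r strongest edges.
    best_b = sim.argmax(axis=1)
    best_score = sim.max(axis=1)
    merge_a = np.argsort(-best_score)[:r]

    # 4) Merge connected pairs by averaging features into the B token.
    merged = tokens[b_idx].copy()
    keep_mask = np.ones(len(a_idx), dtype=bool)
    for i in merge_a:
        j = best_b[i]
        merged[j] = (merged[j] + tokens[a_idx[i]]) / 2
        keep_mask[i] = False

    # 5) Concatenate the surviving A tokens with the (updated) B tokens.
    return np.concatenate([tokens[a_idx][keep_mask], merged], axis=0)
```

Because each A token proposes at most one edge and only r edges survive, no parallel-unfriendly iteration is needed and at most two tokens ever merge per edge, matching the "gradual merging" design goal.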
This technique improves both the throughput and the actual training speed of ViT models: using Token Merging can double training speed. It can be applied to image, video, and audio tasks while still achieving state-of-the-art accuracy.

