New ideas for accelerating ViT models! Meta launches Token Merging, which does not rely on pruning but merging
The Vision Transformer (ViT) entered the public eye two years ago and has become a core component of computer vision research.
It successfully migrated the Transformer model from natural language processing to computer vision, and progress in the field has accelerated ever since.
Despite being surpassed in cost and performance by newer architectures, vanilla ViTs still have many advantages.
They are composed of simple matrix multiplications, which makes them faster than their raw operation count would suggest.
Additionally, they support powerful self-supervised pre-training techniques such as MAE (Masked Autoencoder), which can produce state-of-the-art results while training quickly.
And because they make almost no assumptions about the data, they can be applied to many modalities, such as images, audio, and text, with hardly any changes.
Of course, the ideal is appealing, but the reality is harsher: the ViT model is large and its latency is high, so running such a heavy model on a resource-constrained device is problematic.
To address the slow inference, researchers have proposed multiple solutions. One common way to speed up vision Transformer models is token pruning.
Pruning less important tokens at runtime produces a more efficient Transformer. For example, DynamicViT prunes redundant tokens hierarchically to reduce FLOPs in classification tasks.
However, token pruning has several problems, the most important being that removing tokens causes information loss. The number of tokens that can be pruned from a ViT model is therefore limited: to keep the information loss small, only unimportant tokens can be pruned.
Also, for the pruned model to remain effective, it has to be trained again, which consumes additional resources.
More importantly, token pruning is a dynamic process: the number of tokens to prune has to be decided separately for each image or sentence. While this is good for accuracy, it is not practical, because the data can no longer be processed in batches.
To work around this, masks have to be added during pruning, which further limits the efficiency gains.
Simply put, token pruning does make ViT run faster, but the speedup comes at the cost of information loss.
How can ViT match the speed of pruning while maintaining higher accuracy than pruning? The Meta AI research team has come up with a new solution: Token Merging (ToMe).
Paper link: https://arxiv.org/pdf/2210.09461.pdf
Token Merging chooses to combine tokens instead of pruning them. Thanks to its custom matching algorithm, it is as fast as pruning while being more accurate. Plus, it works without requiring any additional training, so you can use it on huge models to speed them up without sacrificing a lot of accuracy.
Meta's goal is to insert a Token Merging module into an existing ViT so that, by merging redundant tokens, it improves training and inference throughput without requiring additional training.
The basic idea is that in the Transformer model, merging removes r tokens at each layer. If a Transformer has L layers, then rL tokens are removed in total through merging. The value of r determines the trade-off between speed and accuracy, since fewer tokens means lower accuracy but higher throughput.
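As a rough illustration (a minimal sketch, not Meta's code; the model sizes below are assumed for a ViT-L/16 at 224x224 resolution, ignoring the class token), the per-layer reduction r compounds across the L blocks:

```python
def token_schedule(num_tokens: int, num_layers: int, r: int):
    """Number of tokens entering each Transformer block when r tokens are merged per block."""
    counts = []
    for _ in range(num_layers):
        counts.append(num_tokens)
        num_tokens = max(num_tokens - r, 1)  # merge away r tokens in this block
    return counts

# Example: ViT-L/16 on 224x224 images has 14 * 14 = 196 patch tokens and 24 blocks.
print(token_schedule(196, 24, r=8))
# The last block sees 196 - 8 * 23 = 12 tokens, i.e. roughly rL tokens removed in total.
```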
It is worth noting that with Token Merging, rL tokens are removed regardless of the content of the image. This neatly solves token pruning's inability to process data in batches.
With ToMe, batches of similar tokens are merged in each Transformer block: for example, dog fur is merged into a single token.
Token Merging is inserted inside every Transformer block, between the attention and the MLP. This also contrasts with the workflow of token pruning, which tends to place the pruning step at the beginning of each Transformer block.
By placing merging after attention, information from the tokens that are about to be merged can still be propagated, and ViT can use features from the attention block to decide which tokens should be merged.
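A schematic sketch of this placement (my own assumptions, not Meta's released code): a standard pre-norm ViT block with a merge step wedged between the attention and the MLP. The `merge_fn` argument here is a stand-in for the matching procedure sketched later in the article.

```python
import torch
import torch.nn as nn

class BlockWithToMe(nn.Module):
    """A pre-norm ViT block with a token-merge step between attention and MLP."""
    def __init__(self, dim: int, num_heads: int, r: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.r = r

    def forward(self, x: torch.Tensor, merge_fn) -> torch.Tensor:
        # Attention branch; in ToMe its keys are reused as the similarity signal.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Merge step: shorten the sequence by r tokens before the MLP sees it.
        x = merge_fn(x, self.r)
        # MLP branch runs on the reduced token sequence.
        return x + self.mlp(self.norm2(x))
```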
The first step of merging is determining which tokens are similar. Since Q, K, V (query, key, value) have already been computed in the Transformer, ablation experiments by the research team showed that the keys best measure the similarity between tokens (the purple part in the figure below).
This is because each key already summarizes the information contained in its token for use in dot-product attention, so it can also be used to measure the similarity between tokens.
Besides choosing which feature to compare, one also needs to choose a distance metric for similarity. Through experiments, the research team found that measuring the similarity between token keys with cosine distance achieves the best trade-off between accuracy and speed.
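As a minimal sketch (shapes and names assumed here, not taken from the paper's code), the pairwise cosine similarity between token keys reduces to a dot product of normalized keys:

```python
import torch
import torch.nn.functional as F

def key_similarity(keys: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between token keys of shape (batch, tokens, key_dim)."""
    k = F.normalize(keys, dim=-1)       # unit-length keys
    return k @ k.transpose(-2, -1)      # (batch, tokens, tokens)

keys = torch.randn(2, 196, 64)          # e.g. 196 patch tokens with 64-dim keys
sim = key_similarity(keys)              # sim[b, i, j] = cosine similarity of tokens i and j
```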
After the similarity between tokens has been determined, a fast method is needed to decide which tokens to match so that the total count is reduced by r.
The Meta team does not use k-means clustering or graph cuts but a matching algorithm, because matching can precisely control the number of tokens merged in each layer and can perform thousands of matches quickly, neither of which iterative clustering algorithms can do.
Therefore, the Meta team came up with a more efficient solution.
The design goals are as follows: 1) avoid any iterations that cannot be parallelized; 2) keep the changes introduced by merging gradual, since clustering places no limit on how many tokens can be merged into one group (which can adversely affect the network), whereas matching leaves most tokens unmerged.
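The following is a simplified, unbatched sketch in the spirit of this bipartite matching, reconstructed from the description above rather than taken from the official implementation (which additionally tracks token sizes for weighted averaging and protects the class token):

```python
import torch
import torch.nn.functional as F

def bipartite_merge(x: torch.Tensor, keys: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r tokens. x: (tokens, dim) features; keys: (tokens, key_dim) attention keys."""
    a, b = x[0::2], x[1::2]                      # 1) split tokens alternately into sets A and B
    ka = F.normalize(keys[0::2], dim=-1)
    kb = F.normalize(keys[1::2], dim=-1)
    sim = ka @ kb.T                              # cosine similarity from each A token to each B token

    best_sim, best_b = sim.max(dim=-1)           # 2) each A token picks its most similar B token
    order = best_sim.argsort(descending=True)
    merged_a, kept_a = order[:r], order[r:]      # 3) keep only the r most similar edges

    # 4) average each merged A token into its matched B partner
    dst = best_b[merged_a]
    summed = b.clone().index_add_(0, dst, a[merged_a])
    counts = torch.ones(b.shape[0], 1).index_add_(0, dst, torch.ones(r, 1))
    b = summed / counts

    return torch.cat([a[kept_a], b], dim=0)      # 5) r fewer tokens than the input

# Hypothetical usage: 196 tokens, 384-dim features, 64-dim keys, merge 16 tokens per block.
x, k = torch.randn(196, 384), torch.randn(196, 64)
print(bipartite_merge(x, k, r=16).shape)         # torch.Size([180, 384])
```

Because the two sets are fixed in advance and every A token proposes exactly one partner, the whole step is a handful of parallel tensor operations with no iterative refinement, which is what the design goals above ask for.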
With this technique, the throughput and actual training speed of ViT models can be improved; using Token Merging can double the training speed. It can be applied to image, video, and audio tasks and still achieve state-of-the-art accuracy.