A surprising approach to temporal redundancy: a new way to reduce the computational cost of visual Transformers
The Transformer was originally designed for natural language processing tasks, but it is now widely used in vision. Vision Transformers have demonstrated excellent accuracy across multiple visual recognition tasks and achieve state-of-the-art performance in image classification, video classification, and object detection.
A major disadvantage of vision Transformers is their high computational cost. Typical convolutional networks (CNNs) require tens of GFlops per image, while vision Transformers often require an order of magnitude more, reaching hundreds of GFlops per image. When processing video, the problem is even more severe because of the sheer volume of data. This high cost makes vision Transformers difficult to deploy on devices with limited resources or strict latency requirements, which limits the application scenarios of an otherwise promising technology.
In a recent paper, Matthew Dutson, Yin Li, and Mohit Gupta, three researchers at the University of Wisconsin-Madison, propose exploiting the temporal redundancy between successive inputs to reduce the cost of vision Transformers in video applications. They have also released the model code, which includes PyTorch modules for building Eventful Transformers.
Temporal redundancy: First, suppose there is a vision Transformer that processes a video sequence frame by frame or clip by clip. This Transformer may be a simple frame-by-frame model (such as an object detector) or an intermediate step of a spatiotemporal model (such as the first stage of ViViT's factorized model). Unlike a language Transformer, where one input is a complete sequence, here the researchers feed the Transformer multiple different inputs (frames or clips) over time.
Natural videos contain significant temporal redundancy: the differences between successive frames are small. Nonetheless, deep networks, including Transformers, typically compute each frame "from scratch." This approach discards potentially relevant information obtained in earlier inference steps, which is extremely wasteful. The three researchers therefore asked: can intermediate results from previous computation steps be reused to process redundant sequences more efficiently?
Adaptive inference: For vision Transformers, and deep networks in general, the cost of inference is usually dictated by the architecture. In real applications, however, the available resources may change over time, for example due to competing processes or power constraints. There may therefore be a need to adjust the model's computational cost at runtime. One of the main design goals in this new work was adaptability: the approach allows real-time control over computational cost. Figure 1 below (bottom) gives an example of modifying the computational budget during video processing.
Eventful Transformer: The paper proposes an Eventful Transformer that exploits the temporal redundancy between inputs to achieve efficient, adaptive inference. The name is inspired by event cameras, sensors that record images discretely as the scene changes. The Eventful Transformer tracks token-level changes over time and selectively updates the token representations and self-attention maps at each time step. The Eventful Transformer module contains a gating module that controls the number of tokens updated.
This method can be applied to existing models (usually without retraining) and is suitable for many video processing tasks. The researchers also conducted experiments showing that the Eventful Transformer can be applied to the best existing models while greatly reducing computational cost and maintaining the original accuracy.
The goal of this research is to accelerate vision Transformers for video recognition, in which the vision Transformer repeatedly processes video frames or video clips. Specific tasks include video object detection and video action recognition. The key idea is to exploit temporal redundancy, i.e., to reuse computation results from previous time steps. The following describes in detail how the Transformer module is modified to be aware of temporal redundancy.
Token Gating: Detecting Redundancy
This section introduces two new modules proposed by the researchers: the token gate and the token buffer. These modules allow the model to identify and update the tokens that have changed significantly since their last update.
Gate module: The gate selects M of the N input tokens and sends them to downstream layers for computation. It maintains a set of reference tokens in memory, denoted u. This reference tensor contains the value of each token at the time of its most recent update. At each time step, each token is compared with its corresponding reference value, and the tokens that differ significantly from their references are updated.
Denote the gate's current input as c. At each time step, the gate's state is updated and its output determined as follows (see Figure 2 below):
1. Calculate the total error e = u − c.
2. Apply a selection policy to the error e. The policy returns a binary mask m (equivalent to a list of token indices) indicating which M tokens should be updated.
3. Extract the tokens selected by the policy. Figure 2 depicts this as the product c × m; in practice it is implemented by a "gather" operation along the first axis of c. The gathered tokens, denoted c̃, are the gate's output.
4. Update the references of the selected tokens. Figure 2 depicts this as updating u with c̃; in practice the operation used is a "scatter". On the first time step, the gate updates all tokens (initializing u ← c and returning c̃ = c). A minimal sketch of such a gate is shown after this list.
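To make the four steps concrete, here is a minimal PyTorch sketch of such a gate (not the authors' released module); it assumes a top-r selection policy, tokens of shape (N, D), and returns update indices rather than a binary mask:

```python
import torch

class TokenGate:
    def __init__(self, r: int):
        self.r = r      # number of tokens to update per time step
        self.u = None   # reference tokens, shape (N, D)

    def __call__(self, c: torch.Tensor):
        if self.u is None:                      # first time step: update everything
            self.u = c.clone()
            return c, torch.arange(c.shape[0])
        e = self.u - c                          # 1. error between references and input
        norms = e.norm(dim=-1)                  # 2. selection policy: per-token L2 norm
        idx = norms.topk(self.r).indices        #    indices of the r largest errors
        c_tilde = c[idx]                        # 3. gather the selected tokens
        self.u[idx] = c_tilde                   # 4. scatter them into the reference set
        return c_tilde, idx                     # gate output and the update indices
```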
Buffer module: The buffer module maintains a state tensor b that tracks the most recent known value of each input token. When it receives new tokens, the buffer scatters the tokens from f(c̃) into their corresponding positions in b. It then returns the updated b as its output; see Figure 3 below.
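A matching sketch of the buffer, under the same assumptions as above (hypothetical class, not the released API):

```python
import torch

class TokenBuffer:
    def __init__(self):
        self.b = None   # state tensor b, shape (N, D): last known value of every token

    def __call__(self, tokens: torch.Tensor, idx: torch.Tensor, n_tokens: int):
        if self.b is None:
            self.b = torch.zeros(n_tokens, tokens.shape[-1], dtype=tokens.dtype)
        self.b[idx] = tokens    # scatter the M updated tokens into their positions
        return self.b           # return the full-shape tensor for downstream layers
```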
The researchers pair each gate with a buffer that follows it. A simple usage pattern is: the gate's output c̃ is passed through a series of per-token operations f(c̃); the resulting tensor is then passed to a buffer, which restores the full token shape.
Building a redundancy-aware Transformer
To take advantage of this temporal redundancy, the researchers propose a modification scheme for the Transformer module. Figure 4 below shows the design of the Eventful Transformer module. The method accelerates per-token operations (such as the MLP) as well as the query-key and attention-value multiplications.
Many operations in a Transformer block are applied per token, meaning they involve no information exchange between tokens; these include the linear transformations in the MLP and in MSA. To save computation, per-token operations can be skipped for tokens not selected by the gate. Because the tokens are independent, this does not change the result for the selected tokens. See Figure 3.
Specifically, the researchers wrap the contiguous sequence of per-token operations, including the W_qkv transformation, the W_p transformation, and the MLP, in a gate-buffer pair. Note that they also add buffers before the skip connections to ensure that the tokens of the two addition operands are correctly aligned.
The cost of a per-token operation is proportional to the number of tokens. Reducing that number from N to M therefore reduces the cost of downstream per-token operations by a factor of N/M. A rough sketch of this composition follows.
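As an illustration of this saving, the gate and buffer sketched above can be composed around an ordinary MLP; only the M selected tokens pass through the linear layers, so their cost scales with M rather than N. The sizes below are hypothetical, and the sketch assumes the TokenGate and TokenBuffer classes defined earlier:

```python
import torch
import torch.nn as nn

N, D, r = 196, 768, 48                      # example token count, width, update budget
gate, buf = TokenGate(r), TokenBuffer()
mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

@torch.no_grad()                            # inference-time acceleration only
def eventful_mlp(x: torch.Tensor) -> torch.Tensor:   # x: (N, D) tokens for one frame
    c_tilde, idx = gate(x)                  # select the M most-changed tokens
    y = mlp(c_tilde)                        # per-token work on M tokens instead of N
    return buf(y, idx, N)                   # restore the full (N, D) shape
```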
Now consider the query-key product B = q k^T.
Figure 5 below shows the method for sparsely updating a subset of the elements of B.
The overall cost of these updates is 2NMD, compared with the N^2 D cost of computing B from scratch. Note that the cost of the new method is proportional to M, the number of tokens selected; when M is small relative to N, the update is much cheaper than a full recomputation.
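For intuition, here is a sketch of such a sparse update (illustrative only; the real module works on gathered blocks rather than dense indexing). When M tokens change, only the corresponding M rows and M columns of B need to be recomputed:

```python
import torch

N, M, D = 196, 48, 64                          # example sizes
q, k = torch.randn(N, D), torch.randn(N, D)
B = q @ k.T                                    # product from the previous time step

idx = torch.randperm(N)[:M]                    # tokens flagged by the gate (example)
q[idx] = torch.randn(M, D)                     # their updated query rows
k[idx] = torch.randn(M, D)                     # their updated key rows

B[idx, :] = q[idx] @ k.T                       # refresh M rows:    ~M*N*D multiplies
B[:, idx] = q @ k[idx].T                       # refresh M columns: ~N*M*D multiplies

assert torch.allclose(B, q @ k.T, atol=1e-4)   # matches a full recomputation
```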
Attention-value product: For the attention-value product, the researchers propose an update strategy based on a delta Δ. Figure 6 shows the newly proposed method for efficiently computing the three incremental terms. When M is less than half of N, this reduces the amount of computation compared with recomputing the product from scratch.
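To see where three incremental terms come from, write the new product as (A + ΔA)(v + Δv) = A v + A Δv + ΔA v + ΔA Δv. The following dense sanity check verifies this identity; it ignores the sparsity of ΔA and Δv (nonzero only in the M updated rows and columns) that makes the update cheap in practice:

```python
import torch

N, D = 196, 64
A = torch.softmax(torch.randn(N, N), dim=-1)   # attention matrix from the previous step
v = torch.randn(N, D)                          # value tensor from the previous step
dA = torch.randn(N, N) * 0.01                  # delta of A (sparse in practice)
dv = torch.randn(N, D) * 0.01                  # delta of v (sparse in practice)

old_product = A @ v
new_product = old_product + A @ dv + dA @ v + dA @ dv   # the three incremental terms
assert torch.allclose(new_product, (A + dA) @ (v + dv), atol=1e-4)
```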
Token selection strategies
One of the most important design elements of the Eventful Transformer is its token selection policy. Given the gate's error tensor e, the goal of the policy is to generate a mask m indicating which tokens should be updated. Specific strategies include:
Top-r strategy: This strategy selects the r tokens whose error e has the largest norm (the L2 norm is used here).
Threshold strategy: This strategy selects all tokens whose error norm exceeds a threshold h.
Other strategies: Better accuracy-cost trade-offs could be achieved with more sophisticated token selection strategies, for example by learning the policy with a lightweight policy network. However, training the policy's decision mechanism may be difficult because the binary mask m is generally non-differentiable. Another idea is to use importance scores as reference information for selection. These ideas still require further investigation. A minimal sketch of the two simple strategies above follows.
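A minimal sketch of the top-r and threshold strategies, assuming the error tensor e has shape (N, D) and using the per-token L2 norm (hypothetical helper functions, not the released API):

```python
import torch

def top_r_mask(e: torch.Tensor, r: int) -> torch.Tensor:
    """Binary mask m selecting the r tokens with the largest error norm."""
    norms = e.norm(dim=-1)                          # per-token L2 norm of the error
    m = torch.zeros(e.shape[0], dtype=torch.bool)
    m[norms.topk(r).indices] = True
    return m

def threshold_mask(e: torch.Tensor, h: float) -> torch.Tensor:
    """Binary mask m selecting every token whose error norm exceeds h."""
    return e.norm(dim=-1) > h
```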
The researchers experimentally evaluated the newly proposed method on video object detection and video action recognition tasks.
Figure 7 below shows the experimental results for video object detection, where the positive axis is the computational savings rate and the negative axis is the relative reduction in mAP50 score for the new method. The new method achieves significant computational savings at a small cost in accuracy.
Figure 8 below shows method comparisons and ablation results for the video object detection task.
Figure 9 below shows the experimental results of video action recognition.
Table 2 below reports runtimes (in milliseconds) on a CPU (Xeon Silver 4214, 2.2 GHz) and a GPU (NVIDIA RTX 3090). Exploiting temporal redundancy yields a 1.74x speedup on the GPU and a 2.47x speedup on the CPU.
For more technical details and experimental results, please refer to the original paper.