A surprising approach to temporal redundancy: a new way to reduce the computational cost of visual Transformers
The Transformer was originally designed for natural language processing tasks, but it is now widely used in vision. Vision Transformers have demonstrated excellent accuracy across multiple visual recognition tasks and achieve state-of-the-art performance in image classification, video classification, and object detection.
A major disadvantage of vision Transformers is their high computational cost. Typical convolutional networks (CNNs) require tens of GFlops per image, while vision Transformers often require an order of magnitude more, reaching hundreds of GFlops per image. When processing video, the problem is even more severe because of the sheer volume of data. This high cost makes vision Transformers difficult to deploy on devices with limited resources or strict latency requirements, which limits the application scenarios of an otherwise promising technology.
In a recent paper, Matthew Dutson, Yin Li, and Mohit Gupta, three researchers at the University of Wisconsin-Madison, propose exploiting the temporal redundancy between successive inputs to reduce the cost of vision Transformers in video applications. They have also released the model code, which includes PyTorch modules for building Eventful Transformers.
Temporal redundancy: First, suppose there is a vision Transformer that processes a video sequence frame by frame or clip by clip. This Transformer may be a simple frame-by-frame model (such as an object detector) or an intermediate step of a spatiotemporal model (such as the first stage of ViViT's factorized model). Unlike a language Transformer, where one input is a complete sequence, here the researchers feed the Transformer multiple different inputs (frames or clips) over time.
Natural videos contain significant temporal redundancy: the differences between successive frames are small. Nonetheless, deep networks, including Transformers, typically compute each frame "from scratch." This approach discards potentially relevant information obtained in earlier inference steps, which is extremely wasteful. The three researchers therefore asked: can intermediate results from previous computation steps be reused to process redundant sequences more efficiently?
Adaptive inference: For vision Transformers, and deep networks in general, the cost of inference is usually dictated by the architecture. In real applications, however, the available resources may change over time, for example due to competing processes or power constraints. There may therefore be a need to adjust the model's computational cost at runtime. One of the main design goals in this new work was adaptability: the approach allows real-time control over computational cost. Figure 1 below (bottom) gives an example of modifying the computational budget during video processing.
Eventful Transformer: The paper proposes an Eventful Transformer that exploits the temporal redundancy between inputs to achieve efficient, adaptive inference. The name is inspired by event cameras, sensors that record images discretely as the scene changes. The Eventful Transformer tracks token-level changes over time and selectively updates the token representations and self-attention maps at each time step. The Eventful Transformer module contains a gating module that controls the number of tokens updated.
This method can be applied to existing models (usually without retraining) and is suitable for many video processing tasks. The researchers also conducted experiments showing that the Eventful Transformer can be applied to the best existing models while greatly reducing computational cost and maintaining the original accuracy.
The goal of this research is to accelerate vision Transformers for video recognition, in which the vision Transformer repeatedly processes video frames or video clips. Specific tasks include video object detection and video action recognition. The key idea is to exploit temporal redundancy, i.e., to reuse computation results from previous time steps. The following describes in detail how the Transformer module is modified to be aware of temporal redundancy.
Token Gating: Detecting Redundancy
This section introduces two new modules proposed by the researchers: the token gate and the token buffer. These modules allow the model to identify and update the tokens that have changed significantly since their last update.
Gate module: The gate selects M of the N input tokens and sends them to downstream layers for computation. It maintains a set of reference tokens in memory, denoted u. This reference tensor contains the value of each token at the time of its most recent update. At each time step, each token is compared with its corresponding reference value, and the tokens that differ significantly from their references are updated.
Denote the gate's current input as c. At each time step, the gate's state is updated and its output determined as follows (see Figure 2 below):
1. Calculate the total error e = u − c.
2. Apply a selection policy to the error e. The policy returns a binary mask m (equivalent to a list of token indices) indicating which M tokens should be updated.
3. Extract the tokens selected by the policy. Figure 2 depicts this as the product c × m; in practice it is implemented by a "gather" operation along the first axis of c. The gathered tokens, denoted c̃, are the gate's output.
4. Update the references of the selected tokens. Figure 2 depicts this as updating u with c̃; in practice the operation used is a "scatter". On the first time step, the gate updates all tokens (initializing u ← c and returning c̃ = c). A minimal sketch of such a gate is shown after this list.
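To make the four steps concrete, here is a minimal PyTorch sketch of such a gate (not the authors' released module); it assumes a top-r selection policy, tokens of shape (N, D), and returns update indices rather than a binary mask:

```python
import torch

class TokenGate:
    def __init__(self, r: int):
        self.r = r      # number of tokens to update per time step
        self.u = None   # reference tokens, shape (N, D)

    def __call__(self, c: torch.Tensor):
        if self.u is None:                      # first time step: update everything
            self.u = c.clone()
            return c, torch.arange(c.shape[0])
        e = self.u - c                          # 1. error between references and input
        norms = e.norm(dim=-1)                  # 2. selection policy: per-token L2 norm
        idx = norms.topk(self.r).indices        #    indices of the r largest errors
        c_tilde = c[idx]                        # 3. gather the selected tokens
        self.u[idx] = c_tilde                   # 4. scatter them into the reference set
        return c_tilde, idx                     # gate output and the update indices
```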
Buffer module: The buffer module maintains a state tensor b that tracks the most recent known value of each input token. When it receives new tokens, the buffer scatters the tokens from f(c̃) into their corresponding positions in b. It then returns the updated b as its output; see Figure 3 below.
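A matching sketch of the buffer, under the same assumptions as above (hypothetical class, not the released API):

```python
import torch

class TokenBuffer:
    def __init__(self):
        self.b = None   # state tensor b, shape (N, D): last known value of every token

    def __call__(self, tokens: torch.Tensor, idx: torch.Tensor, n_tokens: int):
        if self.b is None:
            self.b = torch.zeros(n_tokens, tokens.shape[-1], dtype=tokens.dtype)
        self.b[idx] = tokens    # scatter the M updated tokens into their positions
        return self.b           # return the full-shape tensor for downstream layers
```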
The researchers pair each gate with a buffer that follows it. A simple usage pattern is: the gate's output c̃ is passed through a series of per-token operations f(c̃); the resulting tensor is then passed to a buffer, which restores the full token shape.
Building a redundancy-aware Transformer
To take advantage of this temporal redundancy, the researchers propose a modification scheme for the Transformer module. Figure 4 below shows the design of the Eventful Transformer module. The method accelerates per-token operations (such as the MLP) as well as the query-key and attention-value multiplications.
Many operations in a Transformer block are applied per token, meaning they involve no information exchange between tokens; these include the linear transformations in the MLP and in MSA. To save computation, per-token operations can be skipped for tokens not selected by the gate. Because the tokens are independent, this does not change the result for the selected tokens. See Figure 3.
Specifically, the researchers wrap the contiguous sequence of per-token operations, including the W_qkv transformation, the W_p transformation, and the MLP, in a gate-buffer pair. Note that they also add buffers before the skip connections to ensure that the tokens of the two addition operands are correctly aligned.
The cost of a per-token operation is proportional to the number of tokens. Reducing that number from N to M therefore reduces the cost of downstream per-token operations by a factor of N/M. A rough sketch of this composition follows.
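As an illustration of this saving, the gate and buffer sketched above can be composed around an ordinary MLP; only the M selected tokens pass through the linear layers, so their cost scales with M rather than N. The sizes below are hypothetical, and the sketch assumes the TokenGate and TokenBuffer classes defined earlier:

```python
import torch
import torch.nn as nn

N, D, r = 196, 768, 48                      # example token count, width, update budget
gate, buf = TokenGate(r), TokenBuffer()
mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

@torch.no_grad()                            # inference-time acceleration only
def eventful_mlp(x: torch.Tensor) -> torch.Tensor:   # x: (N, D) tokens for one frame
    c_tilde, idx = gate(x)                  # select the M most-changed tokens
    y = mlp(c_tilde)                        # per-token work on M tokens instead of N
    return buf(y, idx, N)                   # restore the full (N, D) shape
```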
Now consider the query-key product B = q k^T.
Figure 5 below shows the method for sparsely updating a subset of the elements of B.
The overall cost of these updates is 2NMD, compared with the N^2 D cost of computing B from scratch. Note that the cost of the new method is proportional to M, the number of tokens selected; when M is small relative to N, the update is much cheaper than a full recomputation.
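For intuition, here is a sketch of such a sparse update (illustrative only; the real module works on gathered blocks rather than dense indexing). When M tokens change, only the corresponding M rows and M columns of B need to be recomputed:

```python
import torch

N, M, D = 196, 48, 64                          # example sizes
q, k = torch.randn(N, D), torch.randn(N, D)
B = q @ k.T                                    # product from the previous time step

idx = torch.randperm(N)[:M]                    # tokens flagged by the gate (example)
q[idx] = torch.randn(M, D)                     # their updated query rows
k[idx] = torch.randn(M, D)                     # their updated key rows

B[idx, :] = q[idx] @ k.T                       # refresh M rows:    ~M*N*D multiplies
B[:, idx] = q @ k[idx].T                       # refresh M columns: ~N*M*D multiplies

assert torch.allclose(B, q @ k.T, atol=1e-4)   # matches a full recomputation
```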
Attention-value product: For the attention-value product, the researchers propose an update strategy based on a delta Δ. Figure 6 shows the newly proposed method for efficiently computing the three incremental terms. When M is less than half of N, this reduces the amount of computation compared with recomputing the product from scratch.
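To see where three incremental terms come from, write the new product as (A + ΔA)(v + Δv) = A v + A Δv + ΔA v + ΔA Δv. The following dense sanity check verifies this identity; it ignores the sparsity of ΔA and Δv (nonzero only in the M updated rows and columns) that makes the update cheap in practice:

```python
import torch

N, D = 196, 64
A = torch.softmax(torch.randn(N, N), dim=-1)   # attention matrix from the previous step
v = torch.randn(N, D)                          # value tensor from the previous step
dA = torch.randn(N, N) * 0.01                  # delta of A (sparse in practice)
dv = torch.randn(N, D) * 0.01                  # delta of v (sparse in practice)

old_product = A @ v
new_product = old_product + A @ dv + dA @ v + dA @ dv   # the three incremental terms
assert torch.allclose(new_product, (A + dA) @ (v + dv), atol=1e-4)
```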
Token selection strategies
One of the most important design elements of the Eventful Transformer is its token selection policy. Given the gate's error tensor e, the goal of the policy is to generate a mask m indicating which tokens should be updated. Specific strategies include:
Top-r strategy: This strategy selects the r tokens whose error e has the largest norm (the L2 norm is used here).
Threshold strategy: This strategy selects all tokens whose error norm exceeds a threshold h.
Other strategies: Better accuracy-cost trade-offs could be achieved with more sophisticated token selection strategies, for example by learning the policy with a lightweight policy network. However, training the policy's decision mechanism may be difficult because the binary mask m is generally non-differentiable. Another idea is to use importance scores as reference information for selection. These ideas still require further investigation. A minimal sketch of the two simple strategies above follows.
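A minimal sketch of the top-r and threshold strategies, assuming the error tensor e has shape (N, D) and using the per-token L2 norm (hypothetical helper functions, not the released API):

```python
import torch

def top_r_mask(e: torch.Tensor, r: int) -> torch.Tensor:
    """Binary mask m selecting the r tokens with the largest error norm."""
    norms = e.norm(dim=-1)                          # per-token L2 norm of the error
    m = torch.zeros(e.shape[0], dtype=torch.bool)
    m[norms.topk(r).indices] = True
    return m

def threshold_mask(e: torch.Tensor, h: float) -> torch.Tensor:
    """Binary mask m selecting every token whose error norm exceeds h."""
    return e.norm(dim=-1) > h
```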
The researchers experimentally evaluated the newly proposed method on video object detection and video action recognition tasks.
Figure 7 below shows the experimental results for video object detection, where the positive axis is the computational savings rate and the negative axis is the relative reduction in mAP50 score for the new method. The new method achieves significant computational savings at a small cost in accuracy.
Figure 8 below shows method comparisons and ablation results for the video object detection task.
Figure 9 below shows the experimental results of video action recognition.
Table 2 below reports runtimes (in milliseconds) on a CPU (Xeon Silver 4214, 2.2 GHz) and a GPU (NVIDIA RTX 3090). Exploiting temporal redundancy yields a 1.74x speedup on the GPU and a 2.47x speedup on the CPU.
For more technical details and experimental results, please refer to the original paper.