Task-universal! Tsinghua proposes the Flowformer backbone network with linear complexity | ICML 2022


王林

Apr 16, 2023 pm 07:25 PM


Task universality is one of the core goals of foundation model research, and it is a necessary step for deep learning on the path toward advanced intelligence. In recent years, thanks to the general-purpose modeling capability of the attention mechanism, the Transformer has performed well in many fields and is gradually emerging as a universal architecture. However, the computation of the standard attention mechanism grows quadratically with sequence length, which seriously hinders its application to long-sequence modeling and large models.

To address this, a team from the School of Software at Tsinghua University studied this key issue in depth and proposed Flowformer, a task-universal backbone network with linear complexity. It reduces the complexity to linear while maintaining the universality of the standard Transformer. The paper was accepted by ICML 2022.


Authors: Wu Haixu, Wu Jialong, Xu Jiehui, Wang Jianmin, Long Mingsheng

Link: https://arxiv.org/pdf/2202.06258.pdf

Code: https://github.com/thuml/Flowformer

Compared with the standard Transformer, the Flowformer model proposed in this article has the following characteristics:

Linear complexity: it can handle input sequences thousands of tokens long;
No new inductive biases: it maintains the general modeling ability of the original attention mechanism;
Task universality: it achieves excellent results on five major tasks: long sequences, vision, natural language, time series, and reinforcement learning.

1. Problem analysis

The input to the standard attention mechanism consists of three parts: queries ($Q$), keys ($K$), and values ($V$). It is computed as $R = \mathrm{Softmax}(QK^\top)V$, where $\mathrm{Softmax}(QK^\top)$ is the attention weight matrix and the final result $R$ is obtained by a weighted fusion of $V$. The computational complexity of this process is quadratic in the sequence length. Chained matrix multiplication is a well-studied problem in classical algorithms. For the attention mechanism in particular, the associative law of matrix multiplication offers an optimization: computing $Q(K^\top V)$ instead of $(QK^\top)V$ reduces the original quadratic complexity to linear. However, the Softmax function in the attention mechanism prevents the associative law from being applied directly. Removing the Softmax function is therefore the key to achieving linear complexity. Yet much recent work has shown that Softmax plays a key role in avoiding trivial attention. In summary, we look for a model design that achieves the following goals: (1) remove the Softmax function; (2) avoid trivial attention; (3) maintain the universality of the model.
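The associativity argument can be checked numerically. The following is a minimal sketch, not code from the paper: the sizes `n`, `m`, `d` and the random inputs are hypothetical, and Softmax is deliberately left out, because the identity only holds once the non-linearity is removed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 512, 512, 16            # hypothetical sizes: n queries, m keys/values, d channels
Q = rng.standard_normal((n, d))
K = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))

# Quadratic order: materialise the n x m matrix first, cost O(n * m * d).
quadratic = (Q @ K.T) @ V

# Linear order: contract K and V into a d x d matrix first, cost O((n + m) * d^2).
linear = Q @ (K.T @ V)

# Both orders give the same result, up to floating-point error.
same = np.allclose(quadratic, linear)
```

With Softmax applied to `Q @ K.T`, the `n x m` matrix has to be formed before the row-wise normalization, so the cheaper order is no longer available.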

2. Motivation

For goal (1), previous work often uses kernel methods to replace the Softmax function, approximating the attention computation as $\phi(Q)\phi(K)^\top V$ (where $\phi(\cdot)$ is a non-linear function). However, directly removing Softmax causes trivial attention. For goal (2), previous work therefore had to introduce extra inductive biases, such as the locality assumption in cosFormer, which limit the universality of the model and thus fail to meet goal (3).
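As an illustration of the kernel-method decomposition, here is a minimal sketch of a row-normalized kernel attention. The feature map `elu(x) + 1` and the function names are assumptions chosen for the example (a common choice in linear-attention work), not something specified in this article.

```python
import numpy as np

def phi(x):
    # Assumed feature map: elu(x) + 1, which is strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention(Q, K, V):
    """Linear-complexity form: never builds the n x m weight matrix."""
    q, k = phi(Q), phi(K)
    kv = k.T @ V                     # (d, d_v), shared by all queries
    z = q @ k.sum(axis=0)            # (n,), row-wise normaliser
    return (q @ kv) / z[:, None]

def kernel_attention_quadratic(Q, K, V):
    """Reference form with the explicit n x m weight matrix."""
    w = phi(Q) @ phi(K).T
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 4))
K = rng.standard_normal((12, 4))
V = rng.standard_normal((12, 4))
out_linear = kernel_attention(Q, K, V)
out_quadratic = kernel_attention_quadratic(Q, K, V)
```

The two functions compute the same output. The first one only reorders the products, which is what makes the cost linear in the sequence length.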

Competition mechanism in Softmax

To meet the above goals, we start from the basic properties of Softmax. We note that Softmax was originally proposed to extend the "winner-take-all" maximum operation into a differentiable form. Thanks to this inherent "competition" mechanism, it differentiates the attention weights across tokens and thereby avoids the trivial attention problem. Based on these considerations, we try to introduce a competition mechanism into the design of the attention mechanism, so as to avoid the trivial attention caused by kernel-method decomposition.
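The competition inside Softmax can be seen in a small numerical sketch. The scores below are made-up values for illustration: because the weights always sum to 1, raising one score necessarily lowers the weights of the others, and scaling the scores up pushes the result toward a hard "winner-take-all" maximum.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([1.0, 1.0, 1.0])      # hypothetical attention scores
before = softmax(scores)                # uniform weights

boosted = scores.copy()
boosted[0] = 3.0                        # only the first score is raised
after = softmax(boosted)                # the other weights shrink in response

sharp = softmax(10.0 * boosted)         # large scale: close to a one-hot argmax
```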

Competition mechanism in network flow

We note that conservation is an important phenomenon in the classic flow network model from graph theory: the inflow of each node equals its outflow. Inspired by the idea that "fixed resources inevitably cause competition", in this article we re-analyze the information flow in the classic attention mechanism from the perspective of network flow, and use the conservation property to introduce competition into the attention design, thereby avoiding the trivial attention problem.

3. Flowformer

3.1 Attention mechanism from the perspective of network flow

Inside the attention mechanism, the information flow can be described as follows: information is aggregated from the sources (corresponding to $V$) to the sinks (corresponding to $R$) according to the learned flow capacities (corresponding to the attention weights).

Outside the attention mechanism, the information of the sources ($V$) comes from the previous layer of the network, and the information of the sinks ($R$) is passed on to the feed-forward layer that follows.

3.2 Flow-Attention

Based on the above observations, we can control the interaction between the attention mechanism and the external network from two perspectives, inflow and outflow, so that resources are "fixed". This induces competition within the sources and within the sinks respectively, which avoids trivial attention. Without loss of generality, we set the amount of information exchanged between the attention mechanism and the external network to the default value 1.

(1) Inflow conservation of the sink (R):

It is easy to see that, before conservation, the amount of information flowing into the $i$-th sink is $I_i = \phi(Q_i)\sum_{j}\phi(K_j)^\top$, where $\phi(\cdot)$ is the non-linear function of the kernel method. To fix the amount of information flowing into each sink at 1, we introduce $\frac{1}{I}$ as a normalization in the computation of the information flow (the attention weights). After normalization, the inflow of the $i$-th sink is $\frac{\phi(Q_i)}{I_i}\sum_{j}\phi(K_j)^\top = 1$.

At this point, because the inflow of the sinks is conserved, the sources (V) naturally compete with each other. Computing the amount of information each source (V) provides in this setting gives $\widehat{O}_j = \phi(K_j)\sum_{i}\frac{\phi(Q_i)^\top}{I_i}$, the amount of information provided by each source under competition, which also represents the importance of each source.

(2) Outflow conservation of the source (V): Similarly, before conservation, the amount of information flowing out of the $j$-th source is $O_j = \phi(K_j)\sum_{i}\phi(Q_i)^\top$, with $\phi(\cdot)$ the non-linear function of the kernel method. To fix the amount of information flowing out of each source at 1, we introduce $\frac{1}{O}$ as a normalization in the computation of the information flow (the attention weights). After normalization, the outflow of the $j$-th source is $\frac{\phi(K_j)}{O_j}\sum_{i}\phi(Q_i)^\top = 1$. At this point, because the outflow of the sources is conserved, the sinks (R) naturally compete with each other. Computing the amount of information each sink (R) receives in this setting gives $\widehat{I}_i = \phi(Q_i)\sum_{j}\frac{\phi(K_j)^\top}{O_j}$, the amount of information each final result needs to receive under competition.
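The two conservation steps can be sketched numerically as follows. This is an illustrative sketch under stated assumptions: the sigmoid feature map `phi`, the sizes, and the variable names (`I`, `O`, `I_hat`, `O_hat`) are chosen for the example. It checks that normalization fixes each sink's inflow and each source's outflow at 1, and that the conserved flows then share a fixed total, which is what forces competition.

```python
import numpy as np

def phi(x):
    # Assumed non-negative feature map (sigmoid), chosen for illustration.
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, m, d = 6, 8, 4                      # hypothetical: n sinks, m sources, d channels
q = phi(rng.standard_normal((n, d)))   # phi(Q)
k = phi(rng.standard_normal((m, d)))   # phi(K)

I = q @ k.sum(axis=0)                  # (n,) inflow of each sink before conservation
O = k @ q.sum(axis=0)                  # (m,) outflow of each source before conservation

# After normalisation, every sink receives 1 and every source emits 1.
inflow_after = (q / I[:, None]) @ k.sum(axis=0)
outflow_after = (k / O[:, None]) @ q.sum(axis=0)

# Conserved flows: sources compete for O_hat, sinks compete for I_hat.
O_hat = k @ (q / I[:, None]).sum(axis=0)   # (m,) information provided by each source
I_hat = q @ (k / O[:, None]).sum(axis=0)   # (n,) information received by each sink
```

Note that `O_hat` sums to `n` and `I_hat` sums to `m` regardless of the inputs, so a larger share for one source or sink leaves less for the others.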

(3) Overall design

Based on the above results, we design the Flow-Attention mechanism, which consists of three parts: competition, aggregation, and allocation. Competition introduces the competition mechanism among sources to highlight important information; Aggregation achieves linear complexity based on the associative law of matrix multiplication; Allocation introduces the competition mechanism among sinks to control the information passed to the next layer. All operations in this process have linear complexity. Moreover, the design of Flow-Attention relies only on the conservation principle in network flow to re-organize the information flow, so it introduces no new inductive biases and preserves the universality of the model. Flowformer is obtained by replacing the quadratic-complexity attention in the standard Transformer with Flow-Attention.
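The three parts can be sketched in a few lines of NumPy. This is a simplified single-head sketch based on our reading of the description above, not the official implementation. It assumes a sigmoid feature map, Softmax over the conserved outflow for competition, and a sigmoid gate on the conserved inflow for allocation. The released code may differ in details such as scaling factors, numerical-stability terms, and multi-head handling. A quadratic reference version is included only to check that the linear version computes the same thing.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def flow_attention(Q, K, V):
    """Linear-complexity sketch: no n x m matrix is ever formed."""
    q, k = sigmoid(Q), sigmoid(K)                 # assumed feature map phi
    I = q @ k.sum(axis=0)                         # inflow of each sink
    O = k @ q.sum(axis=0)                         # outflow of each source
    I_hat = q @ (k / O[:, None]).sum(axis=0)      # conserved inflow
    O_hat = k @ (q / I[:, None]).sum(axis=0)      # conserved outflow
    V_hat = softmax(O_hat)[:, None] * V           # competition among sources
    A = (q / I[:, None]) @ (k.T @ V_hat)          # aggregation via associativity
    return sigmoid(I_hat)[:, None] * A            # allocation to sinks

def flow_attention_quadratic(Q, K, V):
    """Reference version with the explicit n x m flow-capacity matrix."""
    q, k = sigmoid(Q), sigmoid(K)
    W = q @ k.T
    I, O = W.sum(axis=1), W.sum(axis=0)
    I_hat = (W / O[None, :]).sum(axis=1)
    O_hat = (W / I[:, None]).sum(axis=0)
    V_hat = softmax(O_hat)[:, None] * V
    return sigmoid(I_hat)[:, None] * ((W / I[:, None]) @ V_hat)

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 4))
K = rng.standard_normal((12, 4))
V = rng.standard_normal((12, 5))
out = flow_attention(Q, K, V)
ref = flow_attention_quadratic(Q, K, V)
```

Every step in `flow_attention` is a sum over one sequence axis or a product with a `d`-sized matrix, so the cost grows linearly with the number of queries and keys.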

4. Experiments

This paper conducts extensive experiments on standard datasets, which:

Cover five major tasks: long sequences, vision, natural language, time series, and reinforcement learning;
Examine two types of attention mechanisms: normal and causal (for autoregressive tasks);
Cover inputs of various sequence lengths (20-4000);
Compare against a range of baselines, including classic models in each field, mainstream deep models, and the Transformer and its variants.

As shown in the table below, Flowformer performed well on all five tasks, verifying the versatility of the model. Please see the paper for detailed experimental results.


5. Analysis

To further explain how Flowformer works, we visualized the attention in the ImageNet classification task (for Flowformer, the corresponding quantity in Flow-Attention), and found the following:

With kernel-method decomposition alone, as in Linear Transformer, the model's attention is distracted and fails to capture key regions effectively;
Both the classic Transformer and Flowformer accurately capture the key positions of the image, but the latter has an advantage in computational complexity;
cosFormer introduces a one-dimensional locality assumption into the attention mechanism, which works very well on language tasks. On images (where 2D data is unfolded into a 1D sequence), however, it cannot adapt to vision tasks unless the locality assumption is extended to two dimensions. This also confirms the advantage of Flowformer's design principle of "not introducing new inductive biases".

The above visualization shows that introducing competition into the attention mechanism design through Flow-Attention can effectively avoid trivial attention. More visualization experiments can be found in the paper.

6. Summary

The Flowformer proposed in this article brings the conservation principle of network flow into the design and naturally introduces a competition mechanism into the attention computation. This effectively avoids the trivial attention problem and maintains the universality of the standard Transformer while achieving linear complexity. Flowformer has achieved excellent results on five major tasks: long sequences, vision, natural language, time series, and reinforcement learning. In addition, its design principle of "no special inductive biases" is also instructive for research on general-purpose foundation architectures. In future work, we will further explore the potential of Flowformer for large-scale pre-training.
