
Task-universal! Tsinghua proposes the Flowformer backbone network with linear complexity | ICML 2022

王林, 2023-04-16 19:25:01

Task universality is one of the core goals of foundation model research and a necessary step on the path from deep learning to advanced intelligence. In recent years, thanks to the general-purpose modeling capability of the attention mechanism, Transformers have performed well in many fields and are increasingly becoming a universal architecture. However, the computation of the standard attention mechanism grows quadratically with sequence length, which severely hinders its application to long-sequence modeling and large models.

To address this, a team from the School of Software at Tsinghua University investigated this key issue in depth and proposed Flowformer, a task-universal backbone network that reduces the complexity to linear while maintaining the universality of the standard Transformer. The paper was accepted at ICML 2022.


Authors: Wu Haixu, Wu Jialong, Xu Jiehui, Wang Jianmin, Long Mingsheng

Link: https://arxiv.org/pdf/2202.06258.pdf

Code: https://github.com/thuml/Flowformer

Compared with the standard Transformer, the Flowformer model proposed in this article has the following characteristics:

  • Linear complexity: it can handle input sequences thousands of tokens long;
  • No new inductive biases: it maintains the general modeling capability of the original attention mechanism;
  • Task-universal: it achieves excellent results on five major task types: long sequences, vision, natural language, time series, and reinforcement learning.

1. Problem analysis

The input of the standard attention mechanism consists of three parts: queries $Q \in \mathbb{R}^{n \times d}$, keys $K \in \mathbb{R}^{m \times d}$, and values $V \in \mathbb{R}^{m \times d}$. It is computed as

$$R = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V,$$

where $\mathrm{Softmax}(QK^\top/\sqrt{d})$ is the attention weight matrix and the final result $R$ is a weighted fusion of $V$. The computational complexity of this process is $O(nmd)$, which is quadratic in the sequence length when $n = m$.

The matrix chain multiplication problem has been studied extensively in classical algorithms. For the attention mechanism in particular, the associativity of matrix multiplication offers an optimization: computing $Q(K^\top V)$ instead of $(QK^\top)V$ reduces the original quadratic complexity to linear. However, the Softmax function in the attention mechanism prevents associativity from being applied directly, so removing Softmax is the key to achieving linear complexity. At the same time, much recent work has shown that Softmax plays a key role in avoiding trivial attention. In summary, we seek a model design that achieves the following goals: (1) remove the Softmax function; (2) avoid trivial attention; (3) maintain the universality of the model.
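
The regrouping argument above can be checked numerically. The following is a minimal NumPy sketch with Softmax deliberately left out, since associativity only applies once Softmax is removed; the sizes `n`, `m`, `d` and the random inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 512, 512, 16          # illustrative sizes (assumption)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))

# Left-to-right order: materialises an n x m matrix, cost O(n * m * d).
quadratic = (Q @ K.T) @ V

# Regrouped order: materialises only a d x d matrix, cost O((n + m) * d^2).
linear = Q @ (K.T @ V)

# Both orders give the same result up to floating-point rounding.
same_result = np.allclose(quadratic, linear)
```

With `d` fixed, the regrouped order scales linearly in the sequence length, which is why removing Softmax matters.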

2. Motivation

For goal (1), previous work often replaces the Softmax function with a kernel method, approximating the attention computation as $\mathrm{Softmax}(QK^\top)V \approx \phi(Q)\left(\phi(K)^\top V\right)$, where $\phi(\cdot)$ is a non-linear function. However, directly removing Softmax leads to trivial attention. To meet goal (2), previous work therefore had to introduce inductive biases, such as the locality assumption in cosFormer. These biases limit the universality of the model and thus fail goal (3).

Competition mechanism in Softmax

To meet the above goals, we start from the basic properties of Softmax. Softmax was originally proposed to extend the "winner-take-all" maximum operation into a differentiable form. Thanks to this inherent competition mechanism, it differentiates the attention weights among tokens and thereby avoids trivial attention. Based on these considerations, we try to introduce a competition mechanism into the attention design, so as to avoid the trivial attention caused by kernel-method decomposition.
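
The competition inside Softmax can be illustrated with a small sketch. The `temperature` parameter and the example scores below are assumptions added for illustration; they are not part of the article.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Numerically stable softmax over a 1-D array."""
    z = (x - np.max(x)) / temperature
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])

soft = softmax(scores)                      # smooth, differentiable weights
sharp = softmax(scores, temperature=0.01)   # approaches winner-take-all

# Raising one score lowers every other weight: the weights share a fixed
# budget of 1, so tokens compete for it.
boosted = softmax(np.array([1.0, 2.0, 5.0]))
```

Because the weights always sum to 1, a gain for one token is necessarily a loss for the others, which is the "fixed resources cause competition" intuition the article builds on.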

Competition mechanism in network flow

We note that conservation is an important phenomenon in the classic flow network model from graph theory: the inflow of each node equals its outflow. Inspired by the observation that "fixed resources inevitably cause competition", in this work we re-analyze the information flow in the classic attention mechanism from the perspective of network flow, and introduce competition into the attention design through the conservation property, so as to avoid trivial attention.

3. Flowformer

3.1 Attention mechanism from the perspective of network flow

Inside the attention mechanism, the information flow can be described as follows: information from the sources (corresponding to the values V) is aggregated to the sinks (corresponding to the results R) according to the learned flow capacity (corresponding to the attention weights).


Outside the attention mechanism, the information of the sources (V) comes from the previous layer of the network, and the information of the sinks (R) is passed on to the subsequent feed-forward layer.


3.2 Flow-Attention

Based on the above observations, we can control the interaction between the attention mechanism and the external network from two perspectives, inflow and outflow, to achieve "fixed resources". This induces competition within the sources and within the sinks respectively, which avoids trivial attention. Without loss of generality, we set the amount of information exchanged between the attention mechanism and the external network to the default value 1.


(1) Conservation of inflow to the sinks (R):

Let $\phi(\cdot)$ denote the non-negative kernel function that replaces Softmax. Before conservation, it is easy to see that the amount of information flowing into the $i$-th sink is $I_i = \phi(Q_i) \sum_{j=1}^{m} \phi(K_j)^\top$. To fix the amount of information flowing into each sink to 1, we introduce $1/I$ as a normalization in the calculation of the information flow (attention weights). After normalization, the inflow of the $i$-th sink is $\frac{\phi(Q_i)}{I_i} \sum_{j=1}^{m} \phi(K_j)^\top = 1$.

At this point, because the inflow of the sinks is conserved, the sources (V) naturally compete with each other. Computing the amount of information provided by each source (V) gives

$$\widehat{O}_j = \phi(K_j) \sum_{i=1}^{n} \frac{\phi(Q_i)^\top}{I_i},$$

which is the amount of information provided by each source under competition and also represents the importance of each source.

(2) Conservation of outflow from the sources (V):

Similarly, before conservation, the amount of information flowing out of the $j$-th source is $O_j = \phi(K_j) \sum_{i=1}^{n} \phi(Q_i)^\top$. To fix the amount of information flowing out of each source to 1, we introduce $1/O$ as a normalization in the calculation of the information flow (attention weights). After normalization, the outflow of the $j$-th source is $\frac{\phi(K_j)}{O_j} \sum_{i=1}^{n} \phi(Q_i)^\top = 1$. At this point, because the outflow of the sources is conserved, the sinks (R) naturally compete with each other. Computing the amount of information received by each sink (R) gives

$$\widehat{I}_i = \phi(Q_i) \sum_{j=1}^{m} \frac{\phi(K_j)^\top}{O_j},$$

which is the amount of information each result finally receives under competition.
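
The two conservation steps can be verified numerically. The sketch below is not the authors' code: it assumes a sigmoid as the non-negative kernel function `phi`, and the sizes and random inputs are illustrative.

```python
import numpy as np

def phi(x):
    # Non-negative feature map; sigmoid is an assumption made for this sketch.
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, m, d = 6, 8, 4                      # n sinks, m sources (illustrative)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((m, d))
phi_q, phi_k = phi(Q), phi(K)

# Flows before conservation.
I = phi_q @ phi_k.sum(axis=0)          # inflow of each sink, shape (n,)
O = phi_k @ phi_q.sum(axis=0)          # outflow of each source, shape (m,)

# After normalising by I (resp. O), every sink receives (resp. every
# source emits) exactly one unit of information.
inflow_after = (phi_q / I[:, None]) @ phi_k.sum(axis=0)
outflow_after = (phi_k / O[:, None]) @ phi_q.sum(axis=0)

# Conserved flows: what each source provides and each sink receives
# once the other side has been fixed to 1.
O_hat = phi_k @ (phi_q / I[:, None]).sum(axis=0)   # shape (m,)
I_hat = phi_q @ (phi_k / O[:, None]).sum(axis=0)   # shape (n,)
```

Since each of the `n` sinks receives one unit, the conserved outflows of the sources sum to `n`; this shared budget is what makes the sources compete. Every quantity here is computed with sums over one axis followed by a matrix-vector product, so no `n x m` matrix is formed.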

(3) Overall design

Based on the above results, we design the Flow-Attention mechanism, which consists of three parts: Competition, Aggregation, and Allocation. Here $\phi$ is the kernel function, $I$ is the inflow of the sinks before conservation, and $\widehat{O}$ and $\widehat{I}$ are the amounts of information provided by each source and received by each sink under competition, as obtained above.

  • Competition: $\widehat{V} = \mathrm{Softmax}(\widehat{O}) \odot V$ introduces competition among the sources to highlight important information;
  • Aggregation: $A = \frac{\phi(Q)}{I}\left(\phi(K)^\top \widehat{V}\right)$ achieves linear complexity through the associativity of matrix multiplication;
  • Allocation: $R = \mathrm{Sigmoid}(\widehat{I}) \odot A$ introduces competition among the sinks to control the information passed to the next layer.

All operations in this process have linear complexity. Moreover, the design of Flow-Attention relies only on the conservation principle of network flow to reorganize the information flow. It therefore introduces no new inductive biases, which preserves the universality of the model. Flowformer is obtained by replacing the quadratic-complexity attention in the standard Transformer with Flow-Attention.
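
The Competition, Aggregation, and Allocation steps can be put together in a short single-head, non-causal NumPy sketch. It is an illustration under stated assumptions rather than the official implementation (which is linked above): the kernel function is assumed to be a sigmoid, and the function name `flow_attention` and the input sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def flow_attention(Q, K, V):
    """Single-head, non-causal sketch. Q: (n, d), K: (m, d), V: (m, d_v)."""
    phi_q, phi_k = sigmoid(Q), sigmoid(K)       # sigmoid kernel (assumption)

    # Inflow of sinks and outflow of sources before conservation.
    I = phi_q @ phi_k.sum(axis=0)               # (n,)
    O = phi_k @ phi_q.sum(axis=0)               # (m,)

    # Conserved flows under competition.
    I_hat = phi_q @ (phi_k / O[:, None]).sum(axis=0)   # (n,)
    O_hat = phi_k @ (phi_q / I[:, None]).sum(axis=0)   # (m,)

    # Competition: sources compete through a softmax over their conserved outflow.
    V_hat = softmax(O_hat)[:, None] * V         # (m, d_v)

    # Aggregation: regrouped product, never forms an n x m matrix.
    A = (phi_q / I[:, None]) @ (phi_k.T @ V_hat)       # (n, d_v)

    # Allocation: each sink gates how much it passes on.
    return sigmoid(I_hat)[:, None] * A          # (n, d_v)

rng = np.random.default_rng(0)
n, m, d = 10, 12, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))
R = flow_attention(Q, K, V)
```

Every step is either an elementwise operation, a sum over one axis, or a product involving a `d`-sized dimension, so the cost grows linearly with `n` and `m`. Multi-head handling, the causal variant, and any rescaling used in the released code are omitted here.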

4. Experiments

This paper conducts extensive experiments on standard data sets:

  • Covers five major task types: long sequences, vision, natural language, time series, and reinforcement learning;
  • Examines two types of attention: normal and autoregressive (causal);
  • Covers inputs of various sequence lengths (20-4000);
  • Compares against a range of baselines, including classic models in each field, mainstream deep models, and the Transformer and its variants.


As shown in the table below, Flowformer performed well on all five tasks, verifying the versatility of the model. Please see the paper for detailed experimental results.


5. Analysis

To further explain how Flowformer works, we ran a visualization experiment on the attention (for Flowformer, its Flow-Attention counterpart) in the ImageNet classification task, from which we find:

  • With kernel-method decomposition alone, as in Linear Transformer, the model's attention is scattered and fails to capture key regions effectively;
  • Both the classic Transformer and Flowformer accurately capture the key positions of the image, but the latter has an advantage in computational complexity;
  • cosFormer introduces a one-dimensional locality assumption into the attention mechanism, which works very well on language tasks. On images, however, where 2D data is unfolded into a 1D sequence, it cannot adapt to vision tasks unless the locality assumption is extended to two dimensions. This also confirms the advantage of Flowformer's design principle of introducing no new inductive biases.


The above visualization shows that introducing competition into the attention mechanism design through Flow-Attention can effectively avoid trivial attention. More visualization experiments can be found in the paper.

6. Summary

The Flowformer proposed in this paper introduces the conservation principle of network flow into the design and thereby naturally brings a competition mechanism into the attention computation. This effectively avoids the trivial attention problem and achieves linear complexity while maintaining the universality of the standard Transformer. Flowformer achieves excellent results on five major task types: long sequences, vision, natural language, time series, and reinforcement learning. In addition, its design principle of introducing no task-specific inductive biases offers useful guidance for research on general-purpose foundation architectures. In future work, we will further explore the potential of Flowformer for large-scale pre-training.
