
Linearizing Attention


Large Language Models (LLMs) excel at sequence modeling, but their softmax attention mechanism is a computational bottleneck: its cost grows quadratically with sequence length. This article surveys alternatives that achieve linear time complexity.

Attention Fundamentals

Assuming familiarity with LLMs like ChatGPT and transformers, we focus on attention, the core of these models. Unlike RNNs, which compress past states into a hidden vector, attention selectively retrieves relevant past data for each new query. Transformers use key (K), query (Q), and value (V) embeddings. The attention mechanism matches queries against keys to retrieve values:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

Softmax converts the similarity scores to probabilities, similar to a (soft) k-nearest-neighbors lookup.

The computational cost of a single attention layer is:

$$O(N^2 d)$$

The quadratic dependence on sequence length N becomes prohibitive for long sequences (N ≫ 100k).
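To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head softmax attention (no masking); the function name and setup are illustrative, not from any particular library. The N × N score matrix is where the O(N²) term comes from.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Naive single-head attention (no masking): O(N^2 * d) time, O(N^2) memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, N) score matrix -- the quadratic bottleneck
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                              # (N, d)

# Example: 512 tokens with 64-dimensional heads
N, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)  # (512, 64)
```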

Linear Attention: A Solution?

Linear attention, proposed by Katharopoulos et al. (2020), replaces the softmax exponential with a decomposable kernel function, enabling linear-time computation. The transformation is shown below:

$$y_n = \frac{\sum_{i=1}^{n} \exp\!\big(q_n^\top k_i\big)\, v_i}{\sum_{i=1}^{n} \exp\!\big(q_n^\top k_i\big)} \;\approx\; \frac{\phi(q_n)^\top \sum_{i=1}^{n} \phi(k_i)\, v_i^\top}{\phi(q_n)^\top \sum_{i=1}^{n} \phi(k_i)}$$

The feature map ϕ(x) = elu(x) + 1 serves as a cheap, positive approximation of the exponential. The computational cost becomes:

$$O(N d^2)$$

This is linear in sequence length, and it is a large saving over O(N²d) whenever N ≫ d, the common scenario in LLMs. A recurrent view is:

$$S_n = S_{n-1} + \phi(k_n)\, v_n^\top, \qquad z_n = z_{n-1} + \phi(k_n), \qquad y_n = \frac{\phi(q_n)^\top S_n}{\phi(q_n)^\top z_n}$$

The softmax exponential cannot be separated into such a product, which is why standard attention does not admit this recurrence. During decoding, only the running state S(n-1) (and the normalizer z(n-1)) needs to be tracked, giving O(d²) cost per generated token. However, the fixed-size state S(n-1) limits how much context can be retained.
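A minimal sketch of the recurrent form, assuming the elu(x) + 1 feature map; the names are illustrative, and this is not the optimized implementation from the paper.

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1: a cheap, positive feature map approximating exp
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(Q, K, V, eps=1e-6):
    """Causal linear attention in its recurrent form: O(N * d^2) total.

    Q, K, V: arrays of shape (N, d). Returns outputs of shape (N, d).
    """
    N, d = Q.shape
    S = np.zeros((d, d))   # running sum of phi(k_i) v_i^T
    z = np.zeros(d)        # running sum of phi(k_i), used for normalization
    out = np.empty((N, d))
    for n in range(N):
        phi_q, phi_k = elu_plus_one(Q[n]), elu_plus_one(K[n])
        S += np.outer(phi_k, V[n])
        z += phi_k
        out[n] = (phi_q @ S) / (phi_q @ z + eps)
    return out
```

The loop carries only S and z between steps, which is exactly the O(d²)-per-token decoding cost mentioned above.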

Gated Linear Attention: Strategic Memory

Gated linear attention addresses the memory limitation by selectively retaining information. The key change is in the formulation of S_n:

$$S_n = G_n \odot S_{n-1} + k_n v_n^\top$$

where G_n is a gating term and ⊙ denotes elementwise multiplication. Various gating functions (G) exist, each leading to different models:

[Table: gating-function choices and the models they correspond to]

Because the gating function depends only on the current token, the recurrence still admits efficient parallel processing during training, as sketched below.
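As an illustration of the idea, here is a sketch of a gated linear attention recurrence, assuming a simple sigmoid gate computed from the current token by a hypothetical projection W_g; actual models differ in how the gate is parameterized.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_attention(X, W_q, W_k, W_v, W_g):
    """Gated linear attention recurrence (illustrative parameterization).

    X: (N, d) token embeddings. W_q, W_k, W_v, W_g: (d, d) projections.
    The per-token gate g_n in (0, 1)^d decays each row of the state S.
    """
    N, d = X.shape
    S = np.zeros((d, d))
    out = np.empty((N, d))
    for n in range(N):
        q, k, v = X[n] @ W_q, X[n] @ W_k, X[n] @ W_v
        g = sigmoid(X[n] @ W_g)              # gate depends only on the current token
        S = g[:, None] * S + np.outer(k, v)  # selectively decay old memory, write new info
        out[n] = q @ S
    return out
```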

State Space Models: A Convolutional Approach

State Space Models (SSMs) offer a different perspective, treating sequences like CNNs process images. The model is a discrete linear time-invariant system:

$$h_n = \bar{A}\, h_{n-1} + \bar{B}\, x_n, \qquad y_n = C\, h_n$$

Because the parameters do not change over time, unrolling the recurrence relates it to a convolution with a fixed kernel:

$$y = x * \bar{K}, \qquad \bar{K} = \big(C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^{2}\bar{B},\; \ldots\big)$$
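A small sketch showing that the two views agree numerically, using randomly chosen, purely illustrative discretized matrices Ā, B̄, C and a scalar input sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_state = 16, 4                                  # sequence length, hidden-state size
A = rng.standard_normal((d_state, d_state)) * 0.3   # discretized A-bar (kept small for stability)
B = rng.standard_normal((d_state, 1))               # discretized B-bar
C = rng.standard_normal((1, d_state))
x = rng.standard_normal(N)                          # scalar input sequence

# 1) Recurrent view: h_n = A h_{n-1} + B x_n,  y_n = C h_n
h = np.zeros((d_state, 1))
y_rec = np.empty(N)
for n in range(N):
    h = A @ h + B * x[n]
    y_rec[n] = (C @ h).item()

# 2) Convolutional view: y = x * K, with K = (CB, CAB, CA^2B, ...)
K = np.array([(C @ np.linalg.matrix_power(A, i) @ B).item() for i in range(N)])
y_conv = np.array([sum(K[i] * x[n - i] for i in range(n + 1)) for n in range(N)])

print(np.allclose(y_rec, y_conv))  # True: both views compute the same outputs
```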

H3 (Fu et al., 2022) uses two complementary SSM layers:

[Figure: the H3 block architecture]

Selective State Space Models: Data-Dependent Dynamics

SSMs' fixed parameters limit adaptability. Selective SSMs address this by making the system data-dependent:

$$h_n = \bar{A}_n\, h_{n-1} + \bar{B}_n\, x_n, \qquad y_n = C_n\, h_n$$

where Ā_n, B̄_n, and C_n are computed from the current input x_n.
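A minimal sketch of a selective scan, assuming a simplified diagonal parameterization with exponential discretization; the projections (W_B, W_C, w_dt) and shapes here are illustrative, not Mamba's exact design.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def selective_ssm(X, A, W_B, W_C, w_dt):
    """Simplified selective SSM scan (diagonal, per-channel states).

    X:    (N, d) input sequence.
    A:    (d, s) fixed negative matrix (continuous-time decay per channel/state).
    W_B:  (d, s) projection producing the input-dependent B_n from x_n.
    W_C:  (d, s) projection producing the input-dependent C_n from x_n.
    w_dt: (d,)   per-channel weights producing the input-dependent step size.
    """
    N, d = X.shape
    s = A.shape[1]
    h = np.zeros((d, s))                         # one s-dimensional state per channel
    out = np.empty((N, d))
    for n in range(N):
        x = X[n]
        dt = softplus(x * w_dt)[:, None]         # (d, 1): how strongly this token updates the state
        B_n = (x @ W_B)[None, :]                 # (1, s): input-dependent write direction
        C_n = x @ W_C                            # (s,):   input-dependent read-out
        A_bar = np.exp(dt * A)                   # (d, s): discretized, data-dependent decay
        h = A_bar * h + (dt * B_n) * x[:, None]  # selectively forget and write
        out[n] = h @ C_n                         # (d,):   read the state
    return out
```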

Mamba utilizes selective SSMs together with output gating and a short input convolution:

[Figure: the Mamba block architecture]

Conclusion

This article traces the evolution of efficient sequence modeling, highlighting the trade-off between computational efficiency and memory capacity. Softmax's quadratic complexity contrasts with linear attention's efficiency, but the latter's limited memory leads to gated linear attention and SSMs. The progression towards data-dependent models (gated linear attention and selective SSMs) emphasizes the importance of adaptive information retention. Further reading is suggested in the cited papers.

References:

Katharopoulos et al. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention."
Yang et al. (2023). "Gated Linear Attention Transformers with Hardware-Efficient Training."
Fu et al. (2022). "Hungry Hungry Hippos: Towards Language Modeling with State Space Models."
Gu & Dao (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces."
Waleffe et al. (2024). "An Empirical Study of Mamba-based Language Models."


