Linearizing Attention
Large Language Models (LLMs) are remarkably capable, but their softmax attention mechanism becomes a computational bottleneck at long sequence lengths. This article walks through the main alternatives that bring the cost down to linear time.
Assuming familiarity with LLMs like ChatGPT and transformers, we focus on attention, the core of these models. Unlike RNNs, which compress past states into a hidden vector, attention selectively retrieves relevant past data for each new query. Transformers use key (K), query (Q), and value (V) embeddings. The attention mechanism matches queries against keys to retrieve values:
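In the standard causal form, writing the sum over past positions j ≤ i and omitting the usual 1/\sqrt{d} scaling for brevity:

    \mathrm{Attention}(Q, K, V)_i \;=\; \frac{\sum_{j \le i} \exp(q_i^\top k_j)\, v_j}{\sum_{j \le i} \exp(q_i^\top k_j)}

Each output is a similarity-weighted average of the past values, with weights given by the exponentiated query-key dot products.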
Softmax turns the similarity scores into a probability distribution over past positions, so attention behaves like a soft k-nearest-neighbor lookup over the keys.
The computational cost of a single attention layer is:
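    O(N^2 d)\ \text{time}, \qquad O(N^2)\ \text{memory for the attention matrix}

(for sequence length N and head dimension d; the cost is dominated by forming the N × N score matrix QK^\top and multiplying it with V).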
This quadratic dependence on sequence length becomes prohibitive for long sequences (N on the order of 100k and beyond).
Linear attention, proposed by Katharopoulos et al., cleverly rewrites the softmax exponential as a kernel function, enabling linear computation. The transformation is shown below:
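    \exp(q_i^\top k_j) \;\approx\; \phi(q_i)^\top \phi(k_j)
    \;\Rightarrow\;
    \mathrm{Attention}(Q, K, V)_i = \frac{\sum_{j \le i} \phi(q_i)^\top \phi(k_j)\, v_j}{\sum_{j \le i} \phi(q_i)^\top \phi(k_j)} = \frac{\phi(q_i)^\top \sum_{j \le i} \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_{j \le i} \phi(k_j)}

Because \phi(q_i) no longer depends on j, it can be pulled out of the sums: the key-value summary \sum_{j} \phi(k_j)\, v_j^\top and the normalizer \sum_{j} \phi(k_j) are computed once and reused for every query.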
The feature map \phi(x) = \mathrm{elu}(x) + 1 approximates the exponential while keeping all similarities positive. The computational cost becomes:
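    O(N d^2)\ \text{time}, \qquad O(d^2)\ \text{for the running state}

since the key-value summaries are shared across all queries instead of being recomputed per position.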
This is linear in sequence length, O(Nd²), and since N ≫ d in typical LLM workloads it is far cheaper than the quadratic form. The same computation also admits a recurrent view:
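    S_n = S_{n-1} + \phi(k_n)\, v_n^\top, \qquad z_n = z_{n-1} + \phi(k_n), \qquad o_n = \frac{\phi(q_n)^\top S_n}{\phi(q_n)^\top z_n}

where S_n \in \mathbb{R}^{d \times d} is the running key-value summary and z_n \in \mathbb{R}^{d} the running normalizer (following Katharopoulos et al., 2020).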
The softmax exponential cannot be factored into separate per-query and per-key terms, which is exactly what blocks this linearization in standard attention. During decoding, only the running state S_{n-1} (and the normalizer z_{n-1}) needs to be tracked, giving O(d²) work per generated token. However, the fixed-size state limits how much context can be retained.
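A minimal NumPy sketch of this decoding loop, purely illustrative (real implementations batch over heads and fuse these steps into custom kernels), using the elu(x) + 1 feature map from above:

    import numpy as np

    def elu_plus_one(x):
        # Feature map phi(x) = elu(x) + 1; keeps all similarities positive.
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention_decode(queries, keys, values):
        """Process tokens one at a time, carrying only an O(d^2) state."""
        N, d = queries.shape
        S = np.zeros((d, d))      # running sum of phi(k_j) v_j^T
        z = np.zeros(d)           # running sum of phi(k_j), used for normalization
        outputs = np.zeros((N, d))
        for n in range(N):
            phi_k = elu_plus_one(keys[n])
            phi_q = elu_plus_one(queries[n])
            S += np.outer(phi_k, values[n])                 # S_n = S_{n-1} + phi(k_n) v_n^T
            z += phi_k                                      # z_n = z_{n-1} + phi(k_n)
            outputs[n] = (phi_q @ S) / (phi_q @ z + 1e-6)   # o_n, small eps for stability
        return outputs

The per-token work touches only the d × d state, never the full history, which is exactly why the memory of a linear-attention model is bounded.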
Gated linear attention addresses the memory limitation by selectively retaining information. The key change is in the formulation of S_n:
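    S_n = G(x_n) \odot S_{n-1} + \phi(k_n)\, v_n^\top

The previous state is decayed by a gate G(x_n) computed from the current input x_n (for example a per-dimension decay in (0, 1)) before the new key-value outer product is added; \odot denotes element-wise multiplication. This schematic form follows Yang et al. (2023); several gated variants drop the feature map \phi and use k_n directly.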
Various choices of the gating function G exist, each leading to a different model.
Because the gate depends only on the current token, not on the running state, the recurrence can still be unrolled and computed efficiently in parallel during training.
State Space Models (SSMs) offer a different perspective, processing sequences much as CNNs process images: with fixed, input-independent filters. The model is a discrete linear time-invariant (LTI) system:
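    h_n = A\, h_{n-1} + B\, x_n, \qquad y_n = C\, h_n

where h_n is the hidden state, x_n the input, y_n the output, and A, B, C are learned but input-independent matrices (typically obtained by discretizing a continuous-time system with a step size \Delta).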
This relates to convolution as:
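    y_n = \sum_{m=0}^{n} C A^{m} B\, x_{n-m}, \qquad \bar{K} = (CB,\; CAB,\; CA^2B,\; \ldots)

Unrolling the recurrence shows that the output is a convolution of the input with a kernel built from powers of A, so during training the entire sequence can be processed with one long convolution instead of a step-by-step scan.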
H3 uses two complementary SSM layers: a shift SSM, which gives the layer a local memory of recent tokens, and a diagonal SSM, which carries long-range state; the two are combined multiplicatively with the projected inputs.
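Roughly, in the notation of Fu et al. (2022), with Q, K, V linear projections of the input and \odot element-wise multiplication, one H3 layer computes:

    y = Q \odot \mathrm{SSM}_{\mathrm{diag}}\!\big(\mathrm{SSM}_{\mathrm{shift}}(K) \odot V\big)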
SSMs' fixed parameters limit adaptability. Selective SSMs address this by making the system data-dependent:
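    B_n = B(x_n), \quad C_n = C(x_n), \quad \Delta_n = \Delta(x_n)
    h_n = \bar{A}_n\, h_{n-1} + \bar{B}_n\, x_n, \qquad y_n = C_n\, h_n

Here \bar{A}_n and \bar{B}_n are obtained by discretizing A and B_n with the input-dependent step size \Delta_n, so the update can choose what to store and what to ignore. This breaks time invariance: the convolutional form no longer applies, and the model is computed with a parallel scan instead.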
Mamba combines a selective SSM with a short causal convolution and multiplicative output gating inside each block.
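The sketch below is a deliberately simplified, PyTorch-style rendering of that block structure, not the reference implementation: it uses a naive sequential scan and a first-order (Euler-style) discretization, and names such as d_state, expand, and d_conv are illustrative placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimplifiedMambaBlock(nn.Module):
        """Schematic Mamba-style block: project -> causal conv -> selective SSM -> gate -> project."""
        def __init__(self, d_model: int, d_state: int = 16, expand: int = 2, d_conv: int = 4):
            super().__init__()
            d_inner = expand * d_model
            self.in_proj = nn.Linear(d_model, 2 * d_inner)            # main branch and gate branch
            self.conv = nn.Conv1d(d_inner, d_inner, d_conv,
                                  padding=d_conv - 1, groups=d_inner)  # depthwise causal conv
            # Selection: delta, B, C are computed from the current token.
            self.to_delta = nn.Linear(d_inner, d_inner)
            self.to_B = nn.Linear(d_inner, d_state)
            self.to_C = nn.Linear(d_inner, d_state)
            self.A_log = nn.Parameter(torch.zeros(d_inner, d_state))   # input-independent A (toy init)
            self.out_proj = nn.Linear(d_inner, d_model)

        def forward(self, x):                                 # x: (batch, length, d_model)
            batch, L, _ = x.shape
            u, gate = self.in_proj(x).chunk(2, dim=-1)
            u = self.conv(u.transpose(1, 2))[:, :, :L].transpose(1, 2)  # truncate to keep causality
            u = F.silu(u)
            delta = F.softplus(self.to_delta(u))              # input-dependent step size
            Bmat, Cmat = self.to_B(u), self.to_C(u)           # input-dependent B_n, C_n
            A = -torch.exp(self.A_log)                        # negative A for a stable decay
            # Naive sequential selective scan (the real kernel uses a parallel scan).
            h = torch.zeros(batch, u.shape[-1], A.shape[-1], device=x.device)
            ys = []
            for n in range(L):
                dA = torch.exp(delta[:, n].unsqueeze(-1) * A)             # discretized A_n
                dB = delta[:, n].unsqueeze(-1) * Bmat[:, n].unsqueeze(1)  # simplified Euler B_n
                h = dA * h + dB * u[:, n].unsqueeze(-1)                   # h_n = A_n h_{n-1} + B_n x_n
                ys.append((h * Cmat[:, n].unsqueeze(1)).sum(-1))          # y_n = C_n h_n
            y = torch.stack(ys, dim=1)
            return self.out_proj(y * F.silu(gate))            # output gating, then project back

The gate branch plays the same role as the multiplicative interactions in H3, while the depthwise convolution gives each token a short local context before the state-space recurrence.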
This article traces the evolution of efficient sequence modeling, highlighting the trade-off between computational efficiency and memory capacity. Softmax's quadratic complexity contrasts with linear attention's efficiency, but the latter's limited memory leads to gated linear attention and SSMs. The progression towards data-dependent models (gated linear attention and selective SSMs) emphasizes the importance of adaptive information retention. Further reading is suggested in the cited papers.
References:
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.
Yang, S., Wang, B., Shen, Y., Panda, R., and Kim, Y. (2023). Gated Linear Attention Transformers with Hardware-Efficient Training.
Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Ré, C. (2022). Hungry Hungry Hippos: Towards Language Modeling with State Space Models.
Gu, A., and Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
Waleffe, R., et al. (2024). An Empirical Study of Mamba-based Language Models.