Linearizing Attention
Large Language Models (LLMs) are remarkably capable, but their softmax attention mechanism becomes a computational bottleneck at long sequence lengths. This article walks through the main alternatives that bring the cost down to linear time.
Assuming familiarity with LLMs like ChatGPT and transformers, we focus on attention, the core of these models. Unlike RNNs, which compress past states into a hidden vector, attention selectively retrieves relevant past data for each new query. Transformers use key (K), query (Q), and value (V) embeddings. The attention mechanism matches queries against keys to retrieve values:
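In the standard causal form, writing the sum over past positions j ≤ i and omitting the usual 1/\sqrt{d} scaling for brevity:

    \mathrm{Attention}(Q, K, V)_i \;=\; \frac{\sum_{j \le i} \exp(q_i^\top k_j)\, v_j}{\sum_{j \le i} \exp(q_i^\top k_j)}

Each output is a similarity-weighted average of the past values, with weights given by the exponentiated query-key dot products.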
Softmax turns the similarity scores into a probability distribution over past positions, so attention behaves like a soft k-nearest-neighbor lookup over the keys.
The computational cost of a single attention layer is:
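    O(N^2 d)\ \text{time}, \qquad O(N^2)\ \text{memory for the attention matrix}

(for sequence length N and head dimension d; the cost is dominated by forming the N × N score matrix QK^\top and multiplying it with V).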
This quadratic dependence on sequence length becomes prohibitive for long sequences (N on the order of 100k and beyond).
Linear attention, proposed by Katharopoulos et al., cleverly rewrites the softmax exponential as a kernel function, enabling linear computation. The transformation is shown below:
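    \exp(q_i^\top k_j) \;\approx\; \phi(q_i)^\top \phi(k_j)
    \;\Rightarrow\;
    \mathrm{Attention}(Q, K, V)_i = \frac{\sum_{j \le i} \phi(q_i)^\top \phi(k_j)\, v_j}{\sum_{j \le i} \phi(q_i)^\top \phi(k_j)} = \frac{\phi(q_i)^\top \sum_{j \le i} \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_{j \le i} \phi(k_j)}

Because \phi(q_i) no longer depends on j, it can be pulled out of the sums: the key-value summary \sum_{j} \phi(k_j)\, v_j^\top and the normalizer \sum_{j} \phi(k_j) are computed once and reused for every query.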
The feature map \phi(x) = \mathrm{elu}(x) + 1 approximates the exponential while keeping all similarities positive. The computational cost becomes:
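    O(N d^2)\ \text{time}, \qquad O(d^2)\ \text{for the running state}

since the key-value summaries are shared across all queries instead of being recomputed per position.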
This is linear in sequence length, O(Nd²), and since N ≫ d in typical LLM workloads it is far cheaper than the quadratic form. The same computation also admits a recurrent view:
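    S_n = S_{n-1} + \phi(k_n)\, v_n^\top, \qquad z_n = z_{n-1} + \phi(k_n), \qquad o_n = \frac{\phi(q_n)^\top S_n}{\phi(q_n)^\top z_n}

where S_n \in \mathbb{R}^{d \times d} is the running key-value summary and z_n \in \mathbb{R}^{d} the running normalizer (following Katharopoulos et al., 2020).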
The softmax exponential cannot be factored into separate per-query and per-key terms, which is exactly what blocks this linearization in standard attention. During decoding, only the running state S_{n-1} (and the normalizer z_{n-1}) needs to be tracked, giving O(d²) work per generated token. However, the fixed-size state limits how much context can be retained.
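A minimal NumPy sketch of this decoding loop, purely illustrative (real implementations batch over heads and fuse these steps into custom kernels), using the elu(x) + 1 feature map from above:

    import numpy as np

    def elu_plus_one(x):
        # Feature map phi(x) = elu(x) + 1; keeps all similarities positive.
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention_decode(queries, keys, values):
        """Process tokens one at a time, carrying only an O(d^2) state."""
        N, d = queries.shape
        S = np.zeros((d, d))      # running sum of phi(k_j) v_j^T
        z = np.zeros(d)           # running sum of phi(k_j), used for normalization
        outputs = np.zeros((N, d))
        for n in range(N):
            phi_k = elu_plus_one(keys[n])
            phi_q = elu_plus_one(queries[n])
            S += np.outer(phi_k, values[n])                 # S_n = S_{n-1} + phi(k_n) v_n^T
            z += phi_k                                      # z_n = z_{n-1} + phi(k_n)
            outputs[n] = (phi_q @ S) / (phi_q @ z + 1e-6)   # o_n, small eps for stability
        return outputs

The per-token work touches only the d × d state, never the full history, which is exactly why the memory of a linear-attention model is bounded.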
Gated linear attention addresses the memory limitation by selectively retaining information. The key change is in the formulation of S_n:
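    S_n = G(x_n) \odot S_{n-1} + \phi(k_n)\, v_n^\top

The previous state is decayed by a gate G(x_n) computed from the current input x_n (for example a per-dimension decay in (0, 1)) before the new key-value outer product is added; \odot denotes element-wise multiplication. This schematic form follows Yang et al. (2023); several gated variants drop the feature map \phi and use k_n directly.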
Various choices of the gating function G exist, each leading to a different model.
Because the gate depends only on the current token, not on the running state, the recurrence can still be unrolled and computed efficiently in parallel during training.
State Space Models (SSMs) offer a different perspective, processing sequences much as CNNs process images: with fixed, input-independent filters. The model is a discrete linear time-invariant (LTI) system:
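    h_n = A\, h_{n-1} + B\, x_n, \qquad y_n = C\, h_n

where h_n is the hidden state, x_n the input, y_n the output, and A, B, C are learned but input-independent matrices (typically obtained by discretizing a continuous-time system with a step size \Delta).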
This relates to convolution as:
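    y_n = \sum_{m=0}^{n} C A^{m} B\, x_{n-m}, \qquad \bar{K} = (CB,\; CAB,\; CA^2B,\; \ldots)

Unrolling the recurrence shows that the output is a convolution of the input with a kernel built from powers of A, so during training the entire sequence can be processed with one long convolution instead of a step-by-step scan.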
H3 uses two complementary SSM layers: a shift SSM, which gives the layer a local memory of recent tokens, and a diagonal SSM, which carries long-range state; the two are combined multiplicatively with the projected inputs.
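Roughly, in the notation of Fu et al. (2022), with Q, K, V linear projections of the input and \odot element-wise multiplication, one H3 layer computes:

    y = Q \odot \mathrm{SSM}_{\mathrm{diag}}\!\big(\mathrm{SSM}_{\mathrm{shift}}(K) \odot V\big)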
SSMs' fixed parameters limit adaptability. Selective SSMs address this by making the system data-dependent:
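    B_n = B(x_n), \quad C_n = C(x_n), \quad \Delta_n = \Delta(x_n)
    h_n = \bar{A}_n\, h_{n-1} + \bar{B}_n\, x_n, \qquad y_n = C_n\, h_n

Here \bar{A}_n and \bar{B}_n are obtained by discretizing A and B_n with the input-dependent step size \Delta_n, so the update can choose what to store and what to ignore. This breaks time invariance: the convolutional form no longer applies, and the model is computed with a parallel scan instead.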
Mamba combines a selective SSM with a short causal convolution and multiplicative output gating inside each block.
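The sketch below is a deliberately simplified, PyTorch-style rendering of that block structure, not the reference implementation: it uses a naive sequential scan and a first-order (Euler-style) discretization, and names such as d_state, expand, and d_conv are illustrative placeholders.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimplifiedMambaBlock(nn.Module):
        """Schematic Mamba-style block: project -> causal conv -> selective SSM -> gate -> project."""
        def __init__(self, d_model: int, d_state: int = 16, expand: int = 2, d_conv: int = 4):
            super().__init__()
            d_inner = expand * d_model
            self.in_proj = nn.Linear(d_model, 2 * d_inner)            # main branch and gate branch
            self.conv = nn.Conv1d(d_inner, d_inner, d_conv,
                                  padding=d_conv - 1, groups=d_inner)  # depthwise causal conv
            # Selection: delta, B, C are computed from the current token.
            self.to_delta = nn.Linear(d_inner, d_inner)
            self.to_B = nn.Linear(d_inner, d_state)
            self.to_C = nn.Linear(d_inner, d_state)
            self.A_log = nn.Parameter(torch.zeros(d_inner, d_state))   # input-independent A (toy init)
            self.out_proj = nn.Linear(d_inner, d_model)

        def forward(self, x):                                 # x: (batch, length, d_model)
            batch, L, _ = x.shape
            u, gate = self.in_proj(x).chunk(2, dim=-1)
            u = self.conv(u.transpose(1, 2))[:, :, :L].transpose(1, 2)  # truncate to keep causality
            u = F.silu(u)
            delta = F.softplus(self.to_delta(u))              # input-dependent step size
            Bmat, Cmat = self.to_B(u), self.to_C(u)           # input-dependent B_n, C_n
            A = -torch.exp(self.A_log)                        # negative A for a stable decay
            # Naive sequential selective scan (the real kernel uses a parallel scan).
            h = torch.zeros(batch, u.shape[-1], A.shape[-1], device=x.device)
            ys = []
            for n in range(L):
                dA = torch.exp(delta[:, n].unsqueeze(-1) * A)             # discretized A_n
                dB = delta[:, n].unsqueeze(-1) * Bmat[:, n].unsqueeze(1)  # simplified Euler B_n
                h = dA * h + dB * u[:, n].unsqueeze(-1)                   # h_n = A_n h_{n-1} + B_n x_n
                ys.append((h * Cmat[:, n].unsqueeze(1)).sum(-1))          # y_n = C_n h_n
            y = torch.stack(ys, dim=1)
            return self.out_proj(y * F.silu(gate))            # output gating, then project back

The gate branch plays the same role as the multiplicative interactions in H3, while the depthwise convolution gives each token a short local context before the state-space recurrence.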
This article traces the evolution of efficient sequence modeling, highlighting the trade-off between computational efficiency and memory capacity. Softmax's quadratic complexity contrasts with linear attention's efficiency, but the latter's limited memory leads to gated linear attention and SSMs. The progression towards data-dependent models (gated linear attention and selective SSMs) emphasizes the importance of adaptive information retention. Further reading is suggested in the cited papers.
References:
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.
Yang, S., Wang, B., Shen, Y., Panda, R., and Kim, Y. (2023). Gated Linear Attention Transformers with Hardware-Efficient Training.
Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Ré, C. (2022). Hungry Hungry Hippos: Towards Language Modeling with State Space Models.
Gu, A., and Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
Waleffe, R., et al. (2024). An Empirical Study of Mamba-based Language Models.