I wonder if Gemini 1.5 Pro uses this technology.
Google has made another big move, releasing the next-generation Transformer model: Infini-Transformer.
Infini-Transformer introduces an efficient way to scale Transformer-based large language models (LLMs) to infinitely long inputs without increasing memory and compute requirements. Using this technique, the researchers extended the context length of a 1B model to 1 million tokens; applied to an 8B model, it handles a 500K-length book summarization task.

The Transformer architecture has dominated generative AI since the groundbreaking paper "Attention is All You Need" was published in 2017, and Google has been iterating on its design at a rapid pace lately. A few days ago it updated the architecture with Mixture-of-Depths (MoD), which rethinks how the Transformer allocates compute; within days it has followed up with this new study.

Researchers in AI understand the importance of memory: it is a cornerstone of intelligence and what enables efficient computation for LLMs. However, Transformers and Transformer-based LLMs exhibit quadratic complexity in both memory usage and computation time, an inherent property of the attention mechanism. For example, for a 500B model with a batch size of 512 and a context length of 2048, the attention key-value (KV) states already occupy about 3 TB of memory (a rough version of this arithmetic is sketched after the contribution list below). Scaling a standard Transformer LLM to much longer sequences, such as 1 million tokens, therefore incurs enormous memory overhead, and deployment cost grows with context length.

To address this, Google has introduced an effective approach whose key component is a new attention technique called Infini-attention. A traditional Transformer with local attention discards old segments to free memory for new ones; Infini-attention instead adds a compressive memory that stores those old segments in compressed form. When producing output, it aggregates the current context with information retrieved from the compressive memory, so the model can recover the complete context history. This lets a Transformer LLM scale to infinitely long contexts with bounded memory and process extremely long inputs in a streaming manner.

Experiments show that the method outperforms the baselines on long-context language-modeling benchmarks while requiring more than 100x less memory, and it achieves better perplexity when trained with 100K sequence length. In addition, a 1B model fine-tuned on passkey-retrieval instances of 5K sequence length solved the task at 1M context length. Finally, the paper shows that an 8B model with Infini-attention, after continual pre-training and task fine-tuning, achieved a new SOTA result on a 500K-length book summarization task. The contributions of the paper are summarized as follows:
- It introduces a practical yet powerful attention mechanism, Infini-attention, which combines long-term compressive memory with local causal attention to effectively model both long-range and short-range context dependencies;
- Infini-attention makes only minimal changes to standard scaled dot-product attention and is designed to support plug-and-play continual pre-training and long-context adaptation;
- The approach enables Transformer LLMs to process extremely long inputs in a streaming manner, scaling to infinitely long contexts with bounded memory and compute resources.
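As a rough illustration of why the KV cache becomes the bottleneck, the sketch below estimates the key-value state size of a hypothetical large decoder. The helper `kv_cache_bytes` and every configuration value in it are illustrative assumptions, not the exact setup behind the paper's 3 TB figure; the point is simply that the footprint grows linearly with context length.

```python
# Back-of-the-envelope size of the attention KV cache in a standard Transformer decoder.
# All configuration values below are illustrative assumptions (not the paper's exact setup).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, batch, seq_len, bytes_per_elem=2):
    # 2x for keys and values, cached at every layer for every position in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * batch * seq_len * bytes_per_elem

cfg = dict(n_layers=96, n_kv_heads=96, head_dim=128, batch=512)  # ~2 bytes/elem assumes bf16
print(f"{kv_cache_bytes(seq_len=2_048, **cfg) / 2**40:.1f} TiB at a 2K context")
print(f"{kv_cache_bytes(seq_len=1_000_000, **cfg) / 2**40:.0f} TiB at a 1M context")
```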
- Paper link: https://arxiv.org/pdf/2404.07143.pdf
- Paper title: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Infini-attention enables Transformer LLMs to handle infinitely long inputs efficiently, with a bounded memory footprint and bounded computation. As shown in Figure 1 below, Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds both masked local attention and long-term linear attention into a single Transformer block.
This subtle but critical modification to the Transformer attention layer can extend the context window of existing LLMs to infinite lengths through continual pre-training and fine-tuning.
Infini-attention reuses the key, value, and query states of the standard attention computation for long-term memory consolidation and retrieval: instead of discarding old KV states as the standard attention mechanism does, it stores them in the compressive memory. When processing subsequent segments, Infini-attention uses the attention query states to retrieve values from that memory, and the final contextual output is computed by aggregating the retrieved long-term memory values with the local attention context.
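To make this concrete, here is a minimal NumPy sketch of a single Infini-attention head following the description above: old KV states are folded into an associative matrix memory instead of being discarded, queries retrieve from that memory, and a gate mixes the retrieved long-term values with local causal attention. The class `InfiniAttentionHead` and helpers `elu_plus_one` and `causal_softmax_attention` are hypothetical names; the ELU+1 feature map, additive memory update, and scalar gate follow the paper's linear-attention-style formulation as I understand it, but the shapes, gate initialization, and other details are assumptions for illustration, not the official implementation.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1: a positive feature map, as used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_softmax_attention(Q, K, V):
    # Standard masked (causal) scaled dot-product attention within one segment.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                          # (S, S)
    scores = np.where(np.tril(np.ones_like(scores, dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                               # (S, d_value)

class InfiniAttentionHead:
    """One attention head with a compressive memory (illustrative, not the official code)."""

    def __init__(self, d_key, d_value, beta=0.0):
        self.M = np.zeros((d_key, d_value))  # associative matrix memory
        self.z = np.zeros(d_key)             # normalization term
        self.beta = beta                     # a learned per-head gate in the paper; fixed here

    def __call__(self, Q, K, V):
        sQ, sK = elu_plus_one(Q), elu_plus_one(K)

        # 1) Retrieve long-term context from the compressive memory with the current queries.
        A_mem = (sQ @ self.M) / (sQ @ self.z + 1e-8)[:, None]        # (S, d_value)

        # 2) Local causal dot-product attention over the current segment only.
        A_dot = causal_softmax_attention(Q, K, V)                    # (S, d_value)

        # 3) Gate the long-term and local context streams together.
        g = 1.0 / (1.0 + np.exp(-self.beta))                         # sigmoid(beta)
        A = g * A_mem + (1.0 - g) * A_dot

        # 4) Fold this segment's KV states into memory instead of discarding them.
        #    (The paper also describes a "delta" update that first subtracts what is
        #    already retrievable; the simple additive form is used here.)
        self.M = self.M + sK.T @ V
        self.z = self.z + sK.sum(axis=0)
        return A

# Toy usage: stream two segments through the same head; the second segment can only
# "see" the first through the compressive memory.
rng = np.random.default_rng(0)
head = InfiniAttentionHead(d_key=64, d_value=64)
for segment in range(2):
    Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
    out = head(Q, K, V)
print(out.shape)  # (128, 64)
```

Because the only state carried across segments is the fixed-size matrix M and vector z, the per-head memory does not grow with the number of segments processed.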
As shown in Figure 2 below, the research team compared the Infini-attention-based Infini-Transformer with Transformer-XL. Like Transformer-XL, Infini-Transformer operates on a sequence of segments and computes standard causal dot-product attention within each segment, so the dot-product attention computation is local in a sense. However, where local attention discards the attention states of the previous segment when processing the next one, Infini-Transformer reuses the old KV attention states, maintaining the entire context history through compressive memory. Each attention layer of Infini-Transformer therefore carries both a global compressive state and a local fine-grained state.

Similar to multi-head attention (MHA), in addition to dot-product attention, Infini-attention maintains H parallel compressive memories, where H is the number of attention heads.

Table 1 below lists the context-memory footprint and effective context length of several models, expressed in terms of model parameters and input segment length. Infini-Transformer supports an infinite context window with a bounded memory footprint.

The study evaluates the Infini-Transformer models on long-context language modeling, 1M-length passkey context-block retrieval, and 500K-length book summarization, all of which involve extremely long input sequences. For language modeling, the researchers train the models from scratch, while for the passkey and book-summarization tasks they continually pre-train existing LLMs to demonstrate Infini-attention's plug-and-play long-context adaptability.

Long-context language modeling. The results in Table 2 show that Infini-Transformer outperforms the Transformer-XL and Memorizing Transformers baselines while keeping a context-memory footprint 114x smaller than the Memorizing Transformer model.

Passkey task. Table 3 shows that an Infini-Transformer fine-tuned on 5K-length inputs solves the passkey task at up to 1M context length. Input lengths in the experiment range from 32K to 1M tokens, and for each test subset the researchers place the passkey near the beginning, middle, or end of the input sequence. The experiments report both zero-shot and fine-tuned accuracy; after 400 steps of fine-tuning on 5K-length inputs, Infini-Transformer solves the task at up to 1M context length.

Summarization task. Table 4 compares Infini-Transformer with encoder-decoder models built specifically for summarization. The results show that Infini-Transformer surpasses the previous best results and reaches a new SOTA on BookSum by processing the entire text of each book. The researchers also plot the overall ROUGE scores on the BookSum validation split in Figure 4; the trend shows that Infini-Transformer's summarization metrics improve as more of the book text is provided as input.
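To complement the kind of accounting shown in Table 1, here is a rough per-head comparison of the cross-segment state each approach has to carry. All dimensions, the segment length, and the byte width are illustrative assumptions of this sketch, not values from the paper.

```python
# Rough per-head comparison of the state carried across segments.
# Dimensions, segment length, and dtype width below are illustrative assumptions.
d_key = d_value = 128          # per-head key/value dimension
segment_len = 2_048            # tokens per segment
n_segments = 500               # ~1M tokens of history
bytes_per_elem = 2             # e.g. bf16

# Standard causal attention: the KV cache covers every past token.
full_kv = 2 * d_key * segment_len * n_segments * bytes_per_elem

# Transformer-XL style: only the previous segment's KV states are kept per layer.
xl_cache = 2 * d_key * segment_len * bytes_per_elem

# Infini-attention: a fixed-size compressive memory (matrix M plus normalization z),
# regardless of how many segments have been processed.
infini_memory = (d_key * d_value + d_key) * bytes_per_elem

for name, size in [("full KV cache", full_kv),
                   ("Transformer-XL cache", xl_cache),
                   ("Infini-attention memory", infini_memory)]:
    print(f"{name:>24}: {size / 2**20:8.2f} MiB per head")
```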