Directly expands to infinite length, Google Infini-Transformer ends the context length debate

I wonder if Gemini 1.5 Pro uses this technology.

Google has made another big move, releasing Infini-Transformer, a next-generation Transformer architecture.

Infini-Transformer introduces an efficient way to scale Transformer-based large language models (LLMs) to infinitely long inputs with bounded memory and computation. Using this technique, the researchers extended the context length of a 1B model to 1 million tokens; applied to an 8B model, it handled a book summarization task with 500K-token inputs.

The Transformer architecture has dominated the field of generative artificial intelligence since the publication of the groundbreaking paper "Attention Is All You Need" in 2017. Google has been iterating on the Transformer design at a rapid pace lately: just a few days ago it released Mixture-of-Depths (MoD), which changes how a Transformer allocates compute across tokens, and within days it has followed up with this new study.

Researchers working in AI understand the importance of memory: it is a cornerstone of intelligence and enables efficient, context-specific computation for LLMs. However, Transformers and Transformer-based LLMs exhibit quadratic complexity in both memory footprint and computation time, an inherent property of the attention mechanism. For example, for a 500B model with a batch size of 512 and a context length of 2048, the attention key-value (KV) states have a memory footprint of 3TB. In practice, LLMs often need to be extended to much longer sequences (such as 1 million tokens), which under the standard Transformer architecture brings enormous memory overhead, and deployment costs keep rising as the context length grows.
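To see where a figure like 3TB comes from, here is a back-of-the-envelope estimate of the KV-cache size for a hypothetical dense decoder-only model; the layer count, hidden size, and precision below are illustrative assumptions, not the actual configuration of the 500B model mentioned above.

```python
# Rough KV-cache estimate for a dense decoder-only Transformer.
# All hyperparameters are illustrative assumptions, not the actual
# configuration of the 500B model cited in the text.
num_layers = 96        # assumed number of Transformer layers
hidden_size = 8192     # assumed model width (num_heads * head_dim)
batch_size = 512
context_len = 2048
bytes_per_elem = 2     # fp16 / bf16

# Each layer caches a key and a value vector of size hidden_size
# for every token of every sequence in the batch.
kv_bytes = 2 * num_layers * batch_size * context_len * hidden_size * bytes_per_elem
print(f"KV cache: {kv_bytes / 1e12:.1f} TB")  # ~3.3 TB with these assumptions

# The cache grows linearly with context length, so stretching the same
# setup to 1M tokens would multiply this footprint by roughly 500x.
```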

Against this backdrop, Google has introduced an effective approach whose key component is a new attention technique called Infini-attention. Conventional Transformers that rely on local attention discard old segments to free up memory for new ones; Infini-attention instead adds a compressive memory that stores the used old segments in compressed form. At output time, the current contextual information is aggregated with the information retrieved from the compressive memory, so the model can recover the complete context history.

This method enables Transformer LLMs to scale to infinitely long contexts with bounded memory and to process extremely long inputs in a streaming fashion.
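As a rough sketch of the core idea (not the paper's exact formulation), the snippet below shows how a fixed-size outer-product associative memory can absorb key-value pairs from many past segments and later return approximate values when probed with a query; the dimensions and the feature map are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_key, d_value = 64, 64

# A fixed-size "compressive memory": one d_key x d_value matrix,
# regardless of how many segments have been absorbed into it.
M = np.zeros((d_key, d_value))

def phi(x):
    # A simple non-negative feature map (an assumption here; it stands in
    # for the kernel used by linear-attention-style memories).
    return np.maximum(x, 0.0) + 1e-6

# Write several past segments into the memory via outer products.
for _ in range(10):                        # 10 old segments
    K = rng.standard_normal((128, d_key))  # 128 tokens per segment
    V = rng.standard_normal((128, d_value))
    M += phi(K).T @ V                      # the memory never grows

# Later, a query retrieves an approximate value from the compressed history.
q = rng.standard_normal(d_key)
retrieved = phi(q) @ M                     # shape: (d_value,)
print(retrieved.shape)
```

Because the memory is a fixed-size matrix, its cost does not depend on how many tokens have already been seen, which is what lets the overall context grow without bound.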

Experiments show that the method outperforms the baselines on long-context language modeling benchmarks while using over 100x fewer memory parameters. The model also achieves better perplexity when trained with a 100K sequence length. In addition, the study found that a 1B model fine-tuned on passkey instances of only 5K sequence length solved the task at 1M length. Finally, the paper shows that an 8B model with Infini-attention reached new SOTA results on a 500K-length book summarization task after continual pre-training and task fine-tuning.

The contributions of the paper are summarized as follows:

  • It introduces a practical yet powerful attention mechanism, Infini-attention, which combines long-term compressive memory with local causal attention and can efficiently model both long- and short-range context dependencies;
  • Infini-attention requires only minimal changes to standard scaled dot-product attention and is designed to support plug-and-play continual pre-training and long-context adaptation;
  • The approach enables Transformer LLMs to process extremely long inputs in a streaming fashion, scaling to infinitely long contexts with bounded memory and compute.
  • Paper link: https://arxiv.org/pdf/2404.07143.pdf
  • Paper title: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Method introduction

Infini-attention enables Transformer LLMs to efficiently handle infinitely long inputs with a bounded memory footprint and bounded computation. As shown in Figure 1 below, Infini-attention incorporates a compressive memory into the vanilla attention mechanism, combining masked local attention and a long-term linear attention mechanism within a single Transformer block.
This subtle but critical modification to the Transformer attention layer can extend the context window of existing LLMs to infinite length through continual pre-training and fine-tuning.

Infini-attention reuses the key, value, and query states of the standard attention computation for long-term memory consolidation and retrieval: instead of discarding old KV states as the standard attention mechanism does, it stores them in the compressive memory. When processing subsequent segments, Infini-attention uses the attention query states to retrieve values from that memory, and to compute the final contextual output it aggregates the long-term memory retrievals with the local attention context.
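Below is a minimal single-head NumPy sketch of one such segment step: it retrieves from the compressive memory with the current queries, computes ordinary causal attention within the segment, folds the segment's keys and values into the memory instead of discarding them, and blends the two outputs with a learned gate. The ELU+1 feature map, normalization term, and sigmoid gate follow the paper's linear-memory description, while the shapes, the single-head simplification, and the toy driver loop are assumptions of this sketch.

```python
import numpy as np

def elu1(x):
    # ELU(x) + 1: a non-negative feature map for the linear-attention-style memory.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_softmax_attention(q, k, v):
    # Standard masked (causal) scaled dot-product attention within a segment.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def infini_attention_step(M, z, q, k, v, beta):
    """One single-head Infini-attention step over a segment (a sketch).

    M:     (d_key, d_value) compressive memory carried across segments
    z:     (d_key,)         normalization term carried across segments
    q,k,v: (seg_len, d_*)   projections for the current segment
    beta:  scalar           learned gate mixing memory and local attention
    """
    sq, sk = elu1(q), elu1(k)

    # 1) Retrieve from the compressive memory with the current queries.
    a_mem = (sq @ M) / (sq @ z + 1e-6)[:, None]

    # 2) Local masked dot-product attention within the current segment.
    a_local = causal_softmax_attention(q, k, v)

    # 3) Fold the segment's KV states into the memory instead of discarding them.
    M = M + sk.T @ v
    z = z + sk.sum(axis=0)

    # 4) Blend long-term retrieval and local context with a learned gate.
    g = 1.0 / (1.0 + np.exp(-beta))            # sigmoid
    out = g * a_mem + (1.0 - g) * a_local
    return out, M, z

# Streaming usage: the memory persists across segments, the local KV does not.
rng = np.random.default_rng(0)
d_key = d_value = 64
M = np.zeros((d_key, d_value))
z = np.zeros(d_key)
for _ in range(8):                              # 8 consecutive segments
    q, k, v = (rng.standard_normal((32, d_key)) for _ in range(3))
    out, M, z = infini_attention_step(M, z, q, k, v, beta=0.0)
print(out.shape, M.shape)                       # (32, 64) (64, 64)
```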

As shown in Figure 2 below, the research team compared the Infini-attention-based Infini-Transformer with Transformer-XL. Like Transformer-XL, Infini-Transformer operates on a sequence of segments and computes standard causal dot-product attention within each segment, so the dot-product attention computation is local in that sense.
However, whereas local attention discards the attention states of the previous segment when processing the next one, Infini-Transformer reuses those old KV attention states, maintaining the entire context history through compressive storage. Each attention layer of Infini-Transformer therefore carries both a global compressive state and a local fine-grained state.

As in multi-head attention (MHA), Infini-attention maintains H parallel compressive memories alongside the dot-product attention, where H is the number of attention heads.
Table 1 below lists the context-memory footprint and effective context length of several models, expressed in terms of model parameters and input segment length. Infini-Transformer supports an unbounded context window with a bounded memory footprint.
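To make the footprint contrast concrete, the short calculation below compares the per-layer state that each approach has to keep around: Infini-attention's compressive memory is a fixed number of scalars per head, while a standard KV cache grows with every processed token. The head count and dimensions are illustrative assumptions.

```python
# Per-layer memory state, counted in scalars (illustrative dimensions).
H = 8                     # assumed attention heads
d_key = d_value = 128     # assumed per-head key/value dimensions

# Infini-attention: each head keeps a d_key x d_value memory matrix plus a
# d_key normalization vector, independent of how many tokens were processed.
infini_state = H * (d_key * d_value + d_key)

# Standard attention: the KV cache stores keys and values for every token.
def kv_cache_state(num_tokens):
    return 2 * H * d_key * num_tokens

for n in (2_048, 32_768, 1_000_000):
    print(f"{n:>9} tokens  Infini: {infini_state:>10,}  KV cache: {kv_cache_state(n):>14,}")
```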
Experiments

The study evaluates Infini-attention on tasks with extremely long input sequences: long-context language modeling, passkey context-block retrieval with inputs up to 1M tokens, and 500K-length book summarization. For language modeling, the researchers trained models from scratch, while for the passkey and book summarization tasks they continually pre-trained existing LLMs to demonstrate Infini-attention's plug-and-play long-context adaptability.

Long-context language modeling. The results in Table 2 show that Infini-Transformer outperforms the Transformer-XL and Memorizing Transformers baselines while storing 114x fewer memory parameters than the Memorizing Transformers model.
Passkey retrieval. Table 3 shows that an Infini-Transformer fine-tuned on 5K-length inputs solves the passkey task at context lengths of up to 1M tokens. The inputs in the experiment ranged from 32K to 1M tokens, and for each test subset the researchers controlled the position of the passkey so that it appeared near the beginning, middle, or end of the input sequence. Both zero-shot and fine-tuned accuracies are reported: after 400 steps of fine-tuning on 5K-length inputs, Infini-Transformer solves the task up to a 1M context length.
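For readers unfamiliar with the passkey setup, the helper below builds a toy version of such an input: a random passkey is buried near the beginning, middle, or end of a long stretch of filler text, and the model is asked to repeat it. The wording and format are assumptions for illustration, not the exact prompt used in the paper.

```python
import random

def make_passkey_prompt(num_filler_lines: int, position: str, seed: int = 0) -> str:
    """Build a toy passkey-retrieval prompt (illustrative format only)."""
    rng = random.Random(seed)
    passkey = rng.randint(10_000, 99_999)
    filler = ["The grass is green. The sky is blue. The sun is yellow."] * num_filler_lines
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key."

    # Place the needle near the beginning, middle, or end of the filler text.
    index = {"begin": 0, "middle": num_filler_lines // 2, "end": num_filler_lines - 1}[position]
    filler.insert(index, needle)

    return "\n".join(filler) + "\nWhat is the pass key? The pass key is"

prompt = make_passkey_prompt(num_filler_lines=2_000, position="middle")
print(len(prompt.split()), "words")   # a long haystack with a single 5-digit needle
```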
Summarization. Table 4 compares Infini-Transformer with encoder-decoder models built specifically for the summarization task. The results show that Infini-Transformer surpasses the previous best results and sets a new SOTA on BookSum by processing the entire text of each book.
The researchers also plotted the overall Rouge score on the BookSum validation split in Figure 4. The trend shows that Infini-Transformer's summarization metrics improve as the input length increases.

