The inference efficiency of large models has been improved by 3 times without loss, and the University of Waterloo, Peking University and other institutions released EAGLE

Large language models (LLMs) are increasingly used in many fields, but generating text with them is expensive and slow. The inefficiency stems from autoregressive decoding: generating each word (token) requires a full forward pass through an LLM with billions to hundreds of billions of parameters, which makes traditional autoregressive decoding slow.

Recently, the University of Waterloo, the Vector Institute in Canada, Peking University, and other institutions jointly released EAGLE, which aims to speed up LLM inference while keeping the distribution of the generated text unchanged. The method extrapolates the feature vectors of the LLM's second-to-top layer, which significantly improves generation efficiency.


  • Technical report: https://sites.google.com/view/eagle-llm
  • Code (Apache 2.0 license, commercial use permitted): https://github.com/SafeAILab/EAGLE

EAGLE has the following features:

  • 3 times faster than ordinary autoregressive decoding (13B);
  • 2 times faster than Lookahead decoding (13B);
  • 1.6 times faster than Medusa decoding (13B);
  • provably consistent with ordinary decoding in the distribution of generated text;
  • can be trained (in 1-2 days) and tested on RTX 3090 GPUs;
  • can be used in conjunction with other parallel techniques such as vLLM, DeepSpeed, Mamba, FlashAttention, quantization, and hardware optimization.

One way to speed up autoregressive decoding is speculative sampling. This technique uses a smaller draft model to guess the next several words via standard autoregressive generation. The original LLM then verifies these guesses in parallel, which requires only one forward pass. If the draft model predicts α words correctly, a single forward pass of the original LLM generates α+1 words.
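The draft-then-verify loop can be sketched in a few lines of Python. This is a simplified greedy variant (a drafted word is accepted only if it equals the original model's own top choice), not the probabilistic acceptance rule used when sampling. The `target_next` and `draft_next` callables and the toy models are hypothetical stand-ins for real LLMs.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical interface: context -> greedy next token


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[Token], num_draft: int) -> List[Token]:
    """Return the tokens produced by one draft-then-verify step (greedy variant)."""
    # 1. The draft model guesses num_draft tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(num_draft):
        drafted.append(draft_next(context + drafted))

    # 2. The target checks every drafted position. A real LLM would score all
    #    positions in one batched forward pass; this loop only imitates that.
    accepted: List[Token] = []
    for guess in drafted:
        expected = target_next(context + accepted)
        if guess != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(guess)

    # 3. All guesses were right: the same pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy models: the target counts upward mod 10; the draft agrees except right after a 3.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(target_next, draft_next, [0], num_draft=4))  # [1, 2, 3, 4]
```

Here the draft gets α = 3 words right before its first mistake, so one verification step yields α+1 = 4 words.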

In speculative sampling, the task of the draft model is to predict the next word based on the current word sequence. Accomplishing this task using a model with a significantly smaller number of parameters is extremely challenging and often yields suboptimal results. Furthermore, the draft model in the standard speculative sampling approach independently predicts the next word without leveraging the rich semantic information extracted by the original LLM, resulting in potential inefficiencies.

This limitation inspired the development of EAGLE. EAGLE makes use of the contextual features extracted by the original LLM (i.e., the feature vectors output by the model's second-to-top layer). EAGLE is built on the following first principle:

Feature vector sequences are compressible, so it is easier to predict subsequent feature vectors based on previous feature vectors.

EAGLE trains a lightweight plug-in called the autoregressive head that, together with the word embedding layer, predicts the next second-to-top-layer feature of the original model from the current feature sequence. The frozen classification head of the original LLM is then used to predict the next word. Features contain more information than word sequences, making the task of regressing features much simpler than the task of predicting words. In summary, EAGLE extrapolates at the feature level using a small autoregressive head, and then uses the frozen classification head to generate the predicted word sequence. Like similar work such as speculative sampling, Medusa, and Lookahead, EAGLE focuses on the latency of per-prompt inference rather than overall system throughput.
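The data flow described above can be sketched with NumPy. All weights are random stand-ins and all sizes are toy values chosen for illustration. The autoregressive head is reduced here to one linear map with a tanh, an assumption made to keep the sketch short; the actual head is a Transformer layer.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 50, 16  # toy sizes, chosen only for illustration

# Frozen parts of the "original LLM" (random stand-ins here).
embedding = rng.normal(size=(VOCAB, HIDDEN))  # word embedding layer
lm_head = rng.normal(size=(HIDDEN, VOCAB))    # classification head

# Trainable part: a simplified stand-in for the autoregressive head.
w_head = rng.normal(size=(2 * HIDDEN, HIDDEN)) * 0.1


def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()


def draft_step(feature: np.ndarray, token: int):
    """Predict the next feature from (current feature, sampled word), then a next-word distribution."""
    head_input = np.concatenate([feature, embedding[token]])
    next_feature = np.tanh(head_input @ w_head)  # feature-level extrapolation
    probs = softmax(next_feature @ lm_head)      # frozen LM head maps the feature to words
    return next_feature, probs


def draft_chain(feature: np.ndarray, token: int, steps: int) -> list:
    """Draft `steps` words without calling the original LLM again."""
    words = []
    for _ in range(steps):
        feature, probs = draft_step(feature, token)
        token = int(rng.choice(VOCAB, p=probs))
        words.append(token)
    return words


start_feature = rng.normal(size=HIDDEN)  # in practice taken from the LLM's forward pass
print(draft_chain(start_feature, token=3, steps=4))
```

Each draft step feeds both the predicted feature and the word that was actually sampled back into the head.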

EAGLE - a method to enhance the efficiency of large language model generation

[Figure: inputs and outputs of EAGLE compared with standard speculative sampling, Medusa, and Lookahead]

The figure above compares the inputs and outputs of EAGLE with those of standard speculative sampling, Medusa, and Lookahead. The figure below shows EAGLE's workflow. During the forward pass of the original LLM, EAGLE collects features from the second-to-top layer. The autoregressive head takes these features, together with the word embeddings of previously generated words, as input and predicts the next feature. The frozen classification head (LM Head) then maps that feature to a distribution over the next word, from which EAGLE samples. By sampling several times, EAGLE builds a tree of guesses, as shown on the right side of the figure below. In this example, EAGLE's three forward passes "guessed" a tree of 10 words.

[Figure: EAGLE workflow, with the tree-like generation process on the right]

EAGLE uses a lightweight autoregressive head to predict features of the original LLM. To keep the distribution of the generated text consistent, EAGLE then verifies the predicted tree. This verification takes a single forward pass of the original LLM. By repeating this cycle of prediction and verification, EAGLE generates text quickly.

The cost of training the autoregressive head is very small. EAGLE is trained on the ShareGPT dataset, which contains just under 70,000 dialogue rounds. The number of trainable parameters is also very small. As shown in blue in the figure above, most components are frozen. The only part that needs training is the autoregressive head, a single-layer Transformer structure with 0.24B-0.99B parameters. The head can be trained even with limited GPU resources. For example, the autoregressive head for Vicuna 33B can be trained in 24 hours on a server with eight RTX 3090 GPUs.

Why use word embeddings to predict features?

Medusa uses only the features of the second-to-top layer to predict the next word, the word after that, and so on. Unlike Medusa, EAGLE also feeds the embedding of the currently sampled word into the autoregressive head. This additional information helps EAGLE handle the inevitable randomness of the sampling process. Consider the example in the figure below, where the prompt is "I". The LLM gives the probabilities that "I" is followed by "am" or "always". Medusa does not consider which of "am" or "always" was sampled, and directly predicts the word after next given only "I". Its target is therefore the word that follows "I am" or "I always", given only "I". Because sampling is random, the same Medusa input "I" may correspond to different targets, "ready" or "begin", so there is no consistent mapping between inputs and outputs. In contrast, the input to EAGLE includes the word embedding of the sampled result, which ensures a consistent mapping between input and output. This distinction allows EAGLE to predict subsequent words more accurately, because it takes into account the context established by the sampling process.
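The ambiguity can be made concrete with a toy table built from the example sentences "I am ready" and "I always begin". The dictionaries and names below are illustrative only and do not model either method's actual network.

```python
from collections import defaultdict

# Toy continuation table taken from the example: "I am ready" and "I always begin".
second_word = {"am": "ready", "always": "begin"}

medusa_targets = defaultdict(set)  # input: the prompt only
eagle_targets = defaultdict(set)   # input: the prompt plus the sampled word

for sampled in ("am", "always"):   # both outcomes the LLM may sample after "I"
    target = second_word[sampled]  # the word the draft has to predict next
    medusa_targets[("I",)].add(target)
    eagle_targets[("I", sampled)].add(target)

# One input maps to two different targets: not a consistent mapping.
print(len(medusa_targets[("I",)]))
# Every input maps to exactly one target.
print({key: len(targets) for key, targets in eagle_targets.items()})
```

With only "I" as input, the training target is ambiguous; once the sampled word is part of the input, each input has a single target.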

[Figure: example with the prompt "I" followed by "am" or "always"]

Tree-like generation structure

Unlike other guess-and-verify frameworks such as speculative sampling, Lookahead, and Medusa, EAGLE adopts a tree-like generation structure in the "guessing" stage, which yields higher decoding efficiency. As shown in the figure, the generation process of standard speculative sampling and Lookahead is linear, or chained. Because Medusa cannot build up context during the guessing stage, it generates its tree through a Cartesian product, resulting in a fully connected graph between adjacent layers. This approach often produces meaningless combinations, such as "I am begin." In contrast, EAGLE creates a sparser tree structure. The sparse tree avoids meaningless sequences and focuses computing resources on more plausible word combinations.
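The difference between the two tree shapes can be illustrated with the same example words. The candidate lists and the `children` table are hypothetical; in EAGLE the per-branch candidates would come from the autoregressive head.

```python
from itertools import product

# Top-2 candidates per position, as context-free prediction would produce them.
first_words = ["am", "always"]
second_words = ["ready", "begin"]

# Cartesian-product tree: every first word is paired with every second word.
cartesian = [("I", a, b) for a, b in product(first_words, second_words)]

# Sparse tree: second-level candidates are drafted per branch, given the first word.
children = {"am": ["ready"], "always": ["begin"]}  # hypothetical draft output
sparse_tree = [("I", a, b) for a in first_words for b in children[a]]

print(len(cartesian), cartesian)      # 4 paths, including ("I", "am", "begin")
print(len(sparse_tree), sparse_tree)  # 2 paths, both plausible
```

The sparse tree leaves fewer candidate paths for the original LLM to verify, and none of them is the implausible "I am begin".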

[Figure: generation structures of speculative sampling, Lookahead, Medusa, and EAGLE]

Multiple rounds of speculative sampling

The standard speculative sampling method maintains the consistency of distribution during the process of "guessing words". In order to adapt to tree-like word guessing scenarios, EAGLE extends this method into a multi-round recursive form. Pseudocode for multiple rounds of speculative sampling is presented below. During the tree generation process, EAGLE records the probability corresponding to each sampled word. Through multiple rounds of speculative sampling, EAGLE ensures that the final generated distribution of each word is consistent with that of the original LLM.
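The multi-round idea can be sketched in Python as follows. This is a minimal sketch, assuming the k candidates at a tree node are drawn independently from the same recorded draft distribution. Function and variable names are illustrative and not taken from the EAGLE code base.

```python
import numpy as np


def multi_round_speculative_sample(p, candidates, draft_dists, rng):
    """Return one token distributed according to p.

    p           -- target (original LLM) distribution over the vocabulary
    candidates  -- tokens t_1..t_k, where t_i was sampled from draft_dists[i]
    draft_dists -- the draft distribution recorded for each candidate
    """
    p = np.asarray(p, dtype=float).copy()
    for token, q in zip(candidates, draft_dists):
        q = np.asarray(q, dtype=float)
        # Accept the candidate with probability min(1, p(t) / q(t)).
        if rng.random() < min(1.0, p[token] / q[token]):
            return int(token)
        # Rejected: remove the mass the draft already covered and renormalize.
        residual = np.maximum(p - q, 0.0)
        total = residual.sum()
        if total == 0.0:
            break  # numerical edge case; fall back to sampling from the current p
        p = residual / total
    # Every candidate was rejected: sample from the adjusted target distribution.
    return int(rng.choice(len(p), p=p))


# Empirical check on a 3-word vocabulary with k = 2 candidates per round.
rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # toy target distribution
q = np.array([0.2, 0.3, 0.5])  # toy draft distribution
counts = np.zeros(3)
for _ in range(10000):
    cands = rng.choice(3, size=2, p=q)
    counts[multi_round_speculative_sample(p, cands, [q, q], rng)] += 1
freqs = counts / counts.sum()
print(freqs)  # should be close to p
```

Each round is an ordinary speculative-sampling accept/reject step applied to whatever target mass remains after the earlier rejections, which is why the output still follows the original distribution.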

[Figure: pseudocode for multi-round speculative sampling]

More experimental results

The following figure shows EAGLE's speedup on Vicuna 33B across different tasks. The "coding" task, which involves a large amount of fixed templates, shows the best speedup.

[Figure: EAGLE speedup on Vicuna 33B across different tasks]

Everyone is welcome to try EAGLE and to share feedback and suggestions through GitHub issues: https://github.com/SafeAILab/EAGLE/issues
