Lossless acceleration up to 5x: EAGLE-2 lets an RTX 3060 generate faster than an A100

The AIxiv column is where this site publishes academic and technical content. Over the past few years, it has carried more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit a contribution or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

Li Yuhui: Master's student at the School of Intelligence, Peking University, advised by Zhang Hongyang and Zhang Chao. His research focuses on large model acceleration and alignment; he is currently seeking job opportunities as a member of the class of 2025.
Wei Fangyun: Researcher at Microsoft Research Asia; his research directions are embodied intelligence, image generation, and AI agents.

Zhang Chao: Researcher at the School of Intelligence, Peking University; his research directions are computer vision and machine learning.

Zhang Hongyang: Assistant Professor at the School of Computer Science, University of Waterloo, and the Vector Institute; his research directions are LLM acceleration and AI safety.

Autoregressive decoding has become the de facto standard for large language models (LLMs). Each forward pass must access all of the model's parameters yet yields only one token, making generation expensive and slow.

Today, a paper titled "EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees" proposes speculative sampling with a dynamic draft tree, which adjusts the draft tree's structure on the fly according to the draft model's confidence. It speeds up large language model inference by up to 5x while leaving the model's output distribution unchanged, guaranteeing losslessness.


  • Paper link: https://arxiv.org/pdf/2406.16858
  • Project link: https://github.com/SafeAILab/EAGLE
  • Demo link: https://huggingface.co/spaces/yuhuili/EAGLE-2

The acceleration achieved by EAGLE-2 on the multi-turn dialogue benchmark MT-bench (top: greedy generation; bottom: sampled generation):

With EAGLE-2, the inference speed of two RTX 3060s (about $300 each) can exceed that of an A100 (about $10,000).
Background

Speculative sampling uses a small model to generate drafts quickly; the original large language model then verifies the draft in a single forward pass and keeps the correct portion as output. This produces multiple tokens per forward pass while guaranteeing losslessness. EAGLE improves on speculative sampling: it performs autoregression at the more regular feature level rather than the token level, and additionally feeds in the sampling results (tokens one time step ahead) to eliminate uncertainty, significantly improving the draft model's accuracy.
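The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration, not EAGLE's actual implementation: `draft_model` and `target_model` are hypothetical stand-ins that map a token sequence to the next token, and verification is shown in its simplest greedy form (real speculative sampling uses a rejection-sampling rule over full distributions, and verifies the whole draft in one batched forward pass).

```python
# Hypothetical sketch of speculative sampling with greedy verification.
# draft_model / target_model are stand-ins for real LLM forward passes.

def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens with the small model, then verify them with the
    target model. Returns the tokens accepted in this step; every emitted
    token matches what plain autoregressive decoding with the target
    model would have produced, which is what makes the method lossless."""
    # 1. Draft phase: the small model autoregressively proposes k tokens.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify phase: the target model checks the draft; in a real
    #    system this is one batched forward pass, emulated here per token.
    accepted = []
    ctx = list(prefix)
    for t in draft:
        expected = target_model(ctx)
        if t == expected:               # draft token matches: accept it
            accepted.append(t)
            ctx.append(t)
        else:                           # first mismatch: emit the target's
            accepted.append(expected)   # own token and stop
            break
    else:
        # All k draft tokens passed: the verification pass also yields
        # one bonus token from the target model.
        accepted.append(target_model(ctx))
    return accepted
```

Note how a single step can emit anywhere from 1 to k+1 tokens, whereas plain autoregressive decoding always emits exactly one per target-model forward pass.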

To date, EAGLE ranks first on the third-party benchmark Spec-Bench (https://github.com/hemingkx/Spec-Bench/blob/main/Leaderboard.md).

Ideas

Methods such as EAGLE and Medusa use static draft trees, implicitly assuming that the acceptance rate of draft tokens is context-independent. Here is a simple example.
When the context is "10+2", the next token is hard to predict, so EAGLE adds two candidate tokens at this position to raise the draft's hit rate; only one of "10+2=" and "10+2+" can be correct. When the context is "10+2=", the next token is obviously "1", yet EAGLE's static draft structure still adds two candidates, "1" and "3". "10+2=3" cannot pass the large language model's verification, so computation is wasted. EAGLE-2 aims to solve this problem. As the figure below shows, when the context is "10+2=", EAGLE-2 adds only the single candidate "1" and spends the saved token budget on making the draft tree deeper, so that "10+2=12" passes the large language model's verification and EAGLE-2 can generate more tokens in one step.
The EAGLE-2 authors ran a simple test on the Alpaca dataset. The figure below shows the acceptance rates of draft tokens at different positions: P1–P6 in the left panel mark the positions, corresponding to the horizontal-axis coordinates in the right panel. The results show that acceptance rates of draft tokens even at the same position differ significantly, which suggests that a dynamic draft tree may achieve better results than a static one.
In the example above, EAGLE-2 chooses the draft tree's structure according to how hard each draft token is to predict. Computing this difficulty (the acceptance rate) exactly would require the original large language model's outputs, which defeats speculative sampling's purpose of reducing accesses to that model. Fortunately, the confidence of EAGLE's draft model is strongly positively correlated with the acceptance rate (difficulty). The figure below shows the average acceptance rate of draft tokens in different confidence intervals of the draft model; the red dotted line connects (0,0) and (1,1). It follows that the draft model's confidence is a valid approximation of the acceptance rate.


Method

EAGLE-2 has two phases, expansion and reranking. The expansion phase deepens and widens the draft tree; the reranking phase prunes the tree, discarding some nodes (tokens).

To guarantee losslessness, a draft token can only be accepted if all of its ancestor nodes are accepted. EAGLE-2 therefore defines a node's value as the product of its own acceptance rate and those of its ancestors, approximated by the product of the corresponding confidences.
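This value definition can be sketched directly. The encoding below (a parent-index array plus a per-node confidence array) is a hypothetical layout for illustration, not EAGLE-2's actual data structure:

```python
# Minimal sketch of the node-value definition: a node's value is the
# product of draft-model confidences along its root-to-node path, which
# approximates the probability that the node and all its ancestors are
# accepted. The parent-array encoding is a hypothetical layout.

def node_values(parent, confidence):
    """parent[i] is the index of node i's parent (-1 for the root);
    confidence[i] is the draft model's confidence in token i.
    Nodes are assumed to be listed parent-before-child."""
    values = []
    for i, p in enumerate(parent):
        v = confidence[i] if p == -1 else values[p] * confidence[i]
        values.append(v)
    return values
```

Because every confidence lies in [0, 1], a node's value can never exceed its parent's value, a property the reranking phase relies on.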

In the expansion phase, EAGLE-2 selects the m highest-value nodes (tokens) in the last layer of the draft tree for expansion. These tokens are fed into the draft model, and the draft model's outputs are attached as children of the input nodes, deepening and widening the draft tree. In the reranking phase, EAGLE-2 sorts the entire draft tree by value and keeps the top n nodes (tokens). Since draft-token confidences lie between 0 and 1, a node's value never exceeds its parent's; when two nodes have equal value, the shallower one is kept first. The tree retained after reranking is therefore guaranteed to be connected, preserving semantic coherence. The pruned tree is smaller, which reduces the computational cost of verification by the original large language model. To keep the computation correct, the attention mask must be adjusted so that each token sees only its ancestor nodes and is unaffected by other branches. A simple example follows below.
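The two selection rules can be sketched as follows. This is an illustrative simplification under the assumption that node values have already been computed; function names and the list-based encoding are hypothetical, not EAGLE-2's actual code:

```python
# Hypothetical sketch of EAGLE-2's two selection rules over node values.

def select_expand(leaf_values, m):
    """Expansion: indices of the m highest-value leaves in the last
    layer; only these are fed back into the draft model to grow the tree."""
    order = sorted(range(len(leaf_values)), key=lambda i: -leaf_values[i])
    return sorted(order[:m])

def rerank(values, depth, n):
    """Reranking: keep the n highest-value nodes of the whole tree.
    Ties go to shallower nodes; since a node's value never exceeds its
    parent's (confidences lie in [0, 1]), every kept node's ancestors
    are also kept, so the pruned tree stays connected."""
    order = sorted(range(len(values)), key=lambda i: (-values[i], depth[i]))
    return sorted(order[:n])
```

A useful consequence of the tie-breaking rule is that the retained nodes always form a subtree containing the root, which is exactly the connectivity property the text describes.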
The yellow boxes in the expansion stage are the nodes selected for expansion, and the green boxes are the draft model's predictions when those nodes are used as input. The blue boxes in the reranking stage are the retained nodes, which are then flattened into one dimension as input to the original large language model. EAGLE-2 adjusts the attention mask according to the tree structure: for example, "a" can see its ancestors "It" and "is", but cannot see "has" on another branch. EAGLE-2 also adjusts the position encodings to stay consistent with standard autoregressive decoding.
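The ancestor-only attention mask can be built mechanically from the tree's parent pointers. The sketch below is a minimal boolean-matrix illustration (a real implementation would produce an additive mask tensor for the model's attention layers); the parent-array encoding is an assumption for this example:

```python
# Sketch of a tree attention mask: position i may attend to position j
# only when j is i itself or one of i's ancestors, so sibling branches
# in the flattened draft tree cannot see each other.

def tree_attention_mask(parent):
    """parent[i] is the index of node i's parent in the flattened tree
    (-1 for the root). Returns an n x n boolean matrix where
    mask[i][j] is True iff token i may attend to token j."""
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i][j] = True
            j = parent[j]
    return mask
```

With the flattened order ["It", "is", "has", "a"] and "a" a child of "is", the row for "a" allows "It", "is", and "a" itself, but blocks "has", matching the example in the text.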

Experiment

EAGLE-2 was evaluated on the MT-bench, HumanEval, GSM8K, Alpaca, CNN/DM, and Natural Questions datasets and compared with six state-of-the-art speculative sampling methods (SpS, PLD, Medusa, Lookahead, Hydra, EAGLE).
[Table: speedup ratio and average acceptance length τ for each method across datasets and models]

In the table, Speedup is the speedup ratio and τ is the average acceptance length, i.e., the number of tokens the original large language model generates per forward pass. EAGLE-2 generates roughly 4–5 tokens per forward pass, versus 1 for autoregressive decoding, so it significantly accelerates generation, with speedups of 2.5x–5x. The speedup and acceptance length are highest on the code generation task (HumanEval) because code contains many deterministic templates, making drafts easier to hit. Across all tasks and large language models, EAGLE-2 achieves the highest speedup and average acceptance length, clearly outperforming the other methods.

Applications

EAGLE-2 is also used in industry and has been integrated into projects such as Intel/intel-extension-for-transformers.
