AIxiv is the column where this site publishes academic and technical content. Over the past few years, the column has covered more than 2,000 pieces of work from top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
- Li Yuhui: Master's student at the School of Intelligence, Peking University, advised by Zhang Hongyang and Zhang Chao. His research focuses on large model acceleration and alignment, and he is currently seeking job opportunities as a 2025 graduate.
- Wei Fangyun: Researcher at Microsoft Research Asia. His research focuses on embodied intelligence, image generation, and AI agents.
- Zhang Chao: Researcher at the School of Intelligence, Peking University. His research focuses on computer vision and machine learning.
- Zhang Hongyang: Assistant Professor at the School of Computer Science, University of Waterloo, and the Vector Institute. His research focuses on LLM acceleration and AI security.
Autoregressive decoding has become the de facto standard for large language models (LLMs). Each forward pass of an LLM requires access to all of its parameters but yields only one token, which makes generation expensive and slow. Today, a paper titled "EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees" proposes speculative sampling with dynamic draft trees: the structure of the draft tree is adjusted dynamically based on the confidence of the draft model. The method speeds up LLM inference by up to 5x without changing the output distribution of the LLM, so the acceleration is lossless.
- Paper link: https://arxiv.org/pdf/2406.16858
- Project link: https://github.com/SafeAILab/EAGLE
- Demo link: https://huggingface.co/spaces/yuhuili/EAGLE-2
The speedup of EAGLE-2 on the multi-turn dialogue dataset MT-bench (the upper figure shows greedy generation, the lower figure shows sampling generation):

With EAGLE-2, inference on two RTX 3060 GPUs ($300) can be faster than on an A100 ($10,000).

Speculative sampling uses a small model to generate drafts quickly. The original LLM then verifies the draft with a single forward pass and keeps the correct part as output. This produces multiple tokens at once and remains lossless. EAGLE improves on speculative sampling: it performs autoregression at the more regular feature level rather than at the token level, and it also feeds in the sampling results (the tokens one time step ahead) to eliminate uncertainty, which significantly improves the accuracy of the draft model. So far, EAGLE ranks first on the third-party benchmark Spec-Bench (https://github.com/hemingkx/Spec-Bench/blob/main/Leaderboard.md).

Methods such as EAGLE and Medusa use static draft trees, implicitly assuming that the acceptance rate of draft tokens does not depend on context. Here is a simple example.

When the context is "10+2", the next token is hard to predict. EAGLE adds two candidate tokens at this position to increase the draft hit rate, and only one of "10+2=" and "10+2+" is correct. When the context is "10+2=", the next token is obviously "1", but EAGLE uses a static draft structure and still adds two candidates, "1" and "3". "10+2=3" cannot pass verification by the LLM, so that candidate is wasted. EAGLE-2 aims to solve this problem. As shown in the figure below, when the context is "10+2=", EAGLE-2 adds only one candidate token, "1", and uses the saved token to make the draft tree deeper, so that "10+2=12" passes verification and EAGLE-2 can generate more tokens at once.

The authors of EAGLE-2 conducted a simple test on the Alpaca dataset.
The figure below shows the acceptance rate of draft tokens at different positions. P1-P6 in the left figure denote the positions and correspond to the horizontal axis of the right figure. The results show that the acceptance rates of draft tokens at the same position also differ significantly, which suggests that a dynamic draft tree may work better than a static one.

In the example above, EAGLE-2 determines the structure of the draft tree from how hard each draft token is to predict. Computing that difficulty (the acceptance rate) exactly requires the output of the original LLM, which defeats the purpose of speculative sampling: reducing the number of calls to the original LLM. Fortunately, the confidence of EAGLE's draft model is strongly positively correlated with the acceptance rate (difficulty). The figure below shows the average acceptance rate of draft tokens across confidence intervals of the draft model; the red dotted line connects (0,0) and (1,1). It follows that the draft model's confidence can serve as a valid approximation of the acceptance rate.
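The relationship between confidence and acceptance rate can be checked with a simple binning procedure. The following is a minimal sketch, not the authors' evaluation code: the function name `acceptance_by_confidence`, the number of bins, and the records are all hypothetical and made up for illustration only.

```python
def acceptance_by_confidence(records, num_bins=5):
    """Group (confidence, accepted) pairs into equal-width confidence bins.

    Returns a list of (bin_low, bin_high, acceptance_rate) for non-empty bins.
    """
    totals = [0] * num_bins
    hits = [0] * num_bins
    for confidence, accepted in records:
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        b = min(int(confidence * num_bins), num_bins - 1)
        totals[b] += 1
        hits[b] += int(accepted)
    return [
        (b / num_bins, (b + 1) / num_bins, hits[b] / totals[b])
        for b in range(num_bins)
        if totals[b] > 0
    ]


# Made-up records: (draft model confidence, was the token accepted by the LLM?)
records = [
    (0.05, False), (0.15, False),
    (0.30, False), (0.35, True),
    (0.55, True), (0.50, False),
    (0.70, True), (0.75, True),
    (0.95, True), (0.90, True),
]

rates = acceptance_by_confidence(records)
for low, high, rate in rates:
    print(f"confidence in [{low:.1f}, {high:.1f}): acceptance rate {rate:.2f}")
```

If the per-bin acceptance rate rises with the bin's confidence range, as it does along the diagonal described above, the confidence is a reasonable stand-in for the acceptance rate.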
EAGLE-2 has two stages, expansion and reranking. The expansion stage deepens and enlarges the draft tree; the reranking stage prunes it, discarding some nodes (tokens).

To remain lossless, a draft token can be accepted only if all of its ancestor nodes are accepted. EAGLE-2 therefore defines the value of a node as the product of the acceptance rates of the node and its ancestors, approximated by the product of their confidences.

In the expansion stage, EAGLE-2 selects the m nodes (tokens) with the highest value in the last layer of the draft tree for expansion. These tokens are fed into the draft model, and the outputs of the draft model are attached to the input nodes as children, deepening and enlarging the draft tree. In the reranking stage, EAGLE-2 reranks the entire draft tree by value and keeps the top n nodes (tokens). Because the confidence of a draft token lies between 0 and 1, a node's value never exceeds that of its parent; when two nodes have the same value, the shallower node is kept first. The draft tree retained after reranking is therefore guaranteed to be connected, which ensures semantic coherence. After reranking, the draft tree is smaller, which reduces the computation required for verification by the original LLM. To keep the results correct, the attention mask must be adjusted so that each token can see only its ancestor nodes and is not affected by other branches. Below is a simple example.

The yellow boxes in the Expand stage mark the nodes selected for expansion, and the green boxes are the predictions of the draft model when these nodes are used as input. The blue boxes in the Rerank stage mark the retained nodes, which are then flattened into one dimension as input to the original LLM. EAGLE-2 adjusts the attention mask according to the structure of the tree. For example, "a" can see only its ancestors "It" and "is", but cannot see "has", which is on another branch.
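The two stages and the tree attention mask described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the EAGLE-2 implementation: the `Node` class, the function names, and the `TOY_DRAFT` lookup table (which stands in for a real draft model, with made-up confidences) are all hypothetical. The sketch also assumes that the root token counts toward the n retained nodes and that each token may attend to itself as well as its ancestors.

```python
from dataclasses import dataclass


@dataclass
class Node:
    token: str
    parent: int    # index of the parent in the node list, -1 for the root
    value: float   # product of confidences along the path from the root
    depth: int


# Hypothetical stand-in for a draft model: token -> [(next token, confidence)].
TOY_DRAFT = {
    "It": [("is", 0.6), ("has", 0.3)],
    "is": [("a", 0.8), ("the", 0.1)],
    "has": [("a", 0.5), ("been", 0.4)],
}


def toy_draft(nodes, i):
    return TOY_DRAFT.get(nodes[i].token, [])


def expand(nodes, last_layer, draft_fn, m):
    """Expand the m highest-value nodes of the last layer; return the new layer."""
    chosen = sorted(last_layer, key=lambda i: -nodes[i].value)[:m]
    new_layer = []
    for i in chosen:
        for token, confidence in draft_fn(nodes, i):
            child = Node(token, i, nodes[i].value * confidence, nodes[i].depth + 1)
            nodes.append(child)
            new_layer.append(len(nodes) - 1)
    return new_layer


def rerank(nodes, n):
    """Keep the n highest-value nodes; on ties the shallower node wins."""
    order = sorted(range(len(nodes)), key=lambda i: (-nodes[i].value, nodes[i].depth))
    return sorted(order[:n])  # creation order, so parents precede children


def ancestor_mask(nodes, kept):
    """mask[r][c] is True when kept[c] is kept[r] itself or one of its ancestors."""
    column = {node_index: c for c, node_index in enumerate(kept)}
    mask = [[False] * len(kept) for _ in kept]
    for r, i in enumerate(kept):
        while i in column:  # walk up the tree until we pass the root
            mask[r][column[i]] = True
            i = nodes[i].parent
    return mask


nodes = [Node("It", -1, 1.0, 0)]
layer = expand(nodes, [0], toy_draft, m=2)    # adds "is" and "has"
layer = expand(nodes, layer, toy_draft, m=1)  # only "is" is expanded further
kept = rerank(nodes, n=4)
mask = ancestor_mask(nodes, kept)

print([nodes[i].token for i in kept])
for row in mask:
    print(row)
```

In this toy run the low-value node "the" is pruned, and the mask row for "a" marks "It", "is", and "a" itself while leaving "has" masked out, which matches the example in the text.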
EAGLE-2 also adjusts the position encoding to stay consistent with standard autoregressive decoding.

Experiments were conducted on the MT-bench, HumanEval, GSM8K, Alpaca, CNN/DM, and Natural Questions datasets, comparing EAGLE-2 with six advanced speculative sampling methods (SpS, PLD, Medusa, Lookahead, Hydra, EAGLE).
In the table, Speedup is the speedup ratio and τ is the average acceptance length, that is, the number of tokens the original LLM generates per forward pass. EAGLE-2 generates about 4-5 tokens per forward pass, whereas autoregressive decoding generates one. EAGLE-2 therefore significantly accelerates LLM generation, with a speedup of 2.5x-5x. Speedup and acceptance length are highest on the code generation task (the HumanEval dataset), because code contains many deterministic templates and drafts are easier to hit. Across all tasks and LLMs, EAGLE-2 achieves the highest speedup and average acceptance length, significantly better than the other methods.

EAGLE-2 is also used in industry and has been integrated into Intel/intel-extension-for-transformers, among others.
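As a small worked example of the average acceptance length τ defined above, the sketch below computes it from a log of tokens generated per forward pass. The function name and the numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def average_acceptance_length(tokens_per_pass):
    """Average number of tokens produced per forward pass of the original LLM."""
    return sum(tokens_per_pass) / len(tokens_per_pass)


# Made-up log: tokens generated in each verification pass of the original LLM.
tokens_per_pass = [5, 3, 6, 4, 4, 5]

tau = average_acceptance_length(tokens_per_pass)
total_tokens = sum(tokens_per_pass)

print(f"tau = {tau}")
print(f"forward passes used: {len(tokens_per_pass)}")
print(f"forward passes needed by autoregressive decoding: {total_tokens}")
```

Note that τ only counts forward passes of the original LLM. The wall-clock speedup also depends on the cost of running the draft model, which this sketch ignores.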