What is the speculative decoding that GPT-4 might also be using? A look at its past, present, and applications
As is well known, inference with large language models (LLMs) usually relies on autoregressive sampling, and this process is quite slow. To address this problem, speculative decoding has emerged as a new sampling paradigm for LLM inference. At each sampling step, the method first predicts several candidate tokens and then verifies them in parallel. Unlike autoregressive decoding, speculative decoding can decode multiple tokens in a single step and thereby speeds up inference.
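For reference, here is a minimal sketch of the autoregressive baseline being accelerated, where every new token costs one full forward pass of the model; `next_token_logits` is a toy stand-in rather than a real model API:

```python
# Minimal sketch of the standard autoregressive baseline: one forward pass per
# generated token. `next_token_logits` is a toy stand-in, not a real model API.
import random

VOCAB_SIZE = 8

def next_token_logits(prefix):
    random.seed(hash(tuple(prefix)))          # deterministic toy "model"
    return [random.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]

def greedy_autoregressive(prefix, max_new_tokens, eos_id=0):
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)    # one model call per new token
        next_id = max(range(VOCAB_SIZE), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

print(greedy_autoregressive([3, 5], max_new_tokens=10))
```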
Although speculative decoding shows great potential in many respects, it also raises key questions that require in-depth study. First, how should an appropriate approximate model be selected or designed to balance the accuracy of its guesses against the efficiency of generation? Second, how can the verification criteria preserve both the diversity and the quality of the generated results? Finally, how should the inference processes of the approximate model and the target large model be aligned to improve prediction accuracy?
Researchers from Hong Kong Polytechnic University, Peking University, MSRA, and Alibaba have conducted a comprehensive survey of speculative decoding, which Machine Heart summarizes here.
The article first reviews the early research on speculative decoding in detail and traces its development through a timeline (see Figure 2).
Blockwise Decoding is a method that integrates additional feed-forward network (FFN) heads on the Transformer decoder, allowing multiple tokens to be generated in a single step.
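A rough sketch of that idea, assuming a blockwise-style setup where k extra feed-forward heads sit on top of the decoder's final hidden state and each guesses one of the next k tokens in one forward pass (the class, sizes, and head architecture are illustrative, not taken from the original paper):

```python
import torch
import torch.nn as nn

class BlockwisePredictionHeads(nn.Module):
    """Illustrative sketch: k feed-forward heads on top of the decoder's last
    hidden state, each guessing one of the next k tokens in parallel."""
    def __init__(self, hidden_size: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.GELU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(k)
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: [batch, hidden_size] -> logits: [batch, k, vocab_size]
        return torch.stack([head(last_hidden) for head in self.heads], dim=1)

heads = BlockwisePredictionHeads(hidden_size=16, vocab_size=100, k=4)
dummy_hidden = torch.randn(2, 16)             # stand-in for a decoder output
print(heads(dummy_hidden).shape)              # torch.Size([2, 4, 100])
```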
To further exploit the potential of blockwise decoding, speculative decoding was proposed. This scheme employs an independent approximate model, usually a specialized non-autoregressive Transformer, that can carry out the generation task both efficiently and accurately.
After the emergence of speculative decoding, researchers proposed the "Speculative Sampling" algorithm, which extended speculative decoding with lossless acceleration of nucleus sampling.
Overall, these pioneering attempts at speculative decoding have established the Draft-then-Verify paradigm and demonstrated its great potential for LLM acceleration.
## Formulas and Definitions
This article proposes an organizational framework to classify related research, as shown in Figure 3 below.
Building on prior work, the article then gives a formal definition of the speculative decoding algorithm:

Speculative decoding is a draft-first, verify-later decoding paradigm. At each decoding step, it first generates multiple candidate tokens and then uses the target large language model to evaluate all of them in parallel, thereby speeding up inference. The detailed speculative decoding procedure is laid out in Algorithm 2 of the paper.
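To make the definition concrete, here is a minimal sketch of one draft-then-verify step under greedy verification; both models are toy stand-ins, and for simplicity the target model is called once per drafted position, whereas a real implementation would score all positions in a single parallel forward pass:

```python
import random

VOCAB = 16

def toy_logits(prefix, salt):
    random.seed(hash(tuple(prefix)) + salt)   # deterministic toy distributions
    return [random.gauss(0.0, 1.0) for _ in range(VOCAB)]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def draft_next(prefix):                       # small, fast approximate model
    return argmax(toy_logits(prefix, salt=1))

def target_next(prefix):                      # large target model
    return argmax(toy_logits(prefix, salt=2))

def speculative_step(tokens, k=4):
    # 1) Drafting: the approximate model proposes k tokens autoregressively.
    drafted, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2) Verification: the target model checks each drafted position.
    #    (A real implementation scores all k positions in one parallel pass.)
    accepted, ctx = [], list(tokens)
    for t in drafted:
        correct = target_next(ctx)
        if correct != t:
            accepted.append(correct)          # fix the first mismatch and stop
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target_next(ctx))     # bonus token if all drafts pass
    return tokens + accepted

print(speculative_step([1, 2, 3], k=4))
```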
The article then delves into the two basic sub-steps integral to this paradigm: generation and verification.
## Generation
At each decoding step, the speculative decoding algorithm first generates multiple candidate tokens as guesses about the target large language model's output. The article divides the generation approaches into two categories, independent drafting and self-drafting, and summarizes their formulations in Table 1 below.
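The distinction mainly concerns where the guesses come from. A hypothetical interface contrasting the two styles (class and function names are illustrative, not from the paper) might look like this:

```python
from typing import List, Protocol

class Drafter(Protocol):
    def draft(self, prefix: List[int], k: int) -> List[int]:
        """Propose k candidate tokens for the target model to verify."""

class IndependentDrafter:
    """Independent drafting: a separate small model produces the guesses."""
    def __init__(self, small_model):
        self.small_model = small_model        # callable: token list -> next id
    def draft(self, prefix, k):
        ctx = list(prefix)
        for _ in range(k):
            ctx.append(self.small_model(ctx))
        return ctx[len(prefix):]

class SelfDrafter:
    """Self-drafting: the target model drafts for itself, e.g. via extra
    prediction heads or early-exit layers (represented here by heads_fn)."""
    def __init__(self, heads_fn):
        self.heads_fn = heads_fn              # callable: (prefix, k) -> k ids
    def draft(self, prefix, k):
        return self.heads_fn(prefix, k)

print(IndependentDrafter(lambda toks: (sum(toks) + 1) % 100).draft([1, 2, 3], k=4))
```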
## Verification

At each decoding step, the tokens produced by the approximate model are verified in parallel to ensure that the output quality remains consistent with that of the target large language model. This process also determines the number of tokens accepted at each step, an important factor affecting the speedup. A summary of the various verification criteria is given in Table 2 below, including those that support greedy decoding and nucleus sampling in large language model inference.
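As one example of a lossless criterion, the speculative sampling rule accepts a drafted token x with probability min(1, p(x)/q(x)), where q is the draft model's distribution and p is the target model's, and on rejection resamples from the normalized residual max(0, p - q). Below is a minimal sketch of that check for a single token, using toy distributions rather than a real model:

```python
import random

def verify_token(x, p, q, rng=random):
    """Speculative sampling check for one drafted token.

    x: drafted token id; p: target-model distribution; q: draft-model
    distribution (both plain lists of probabilities over the vocabulary).
    Returns (accepted, token): the draft if accepted, otherwise a token
    resampled from the normalized residual max(0, p - q).
    """
    accept_prob = min(1.0, p[x] / q[x]) if q[x] > 0 else 0.0
    if rng.random() < accept_prob:
        return True, x
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)                     # > 0 whenever rejection is possible
    weights = [r / total for r in residual]
    return False, rng.choices(range(len(p)), weights=weights, k=1)[0]

# Toy usage: the draft model over-weights token 1, the target prefers token 2.
p = [0.1, 0.3, 0.4, 0.2]
q = [0.1, 0.5, 0.2, 0.2]
print(verify_token(1, p, q))
```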
The generation and verification sub-steps iterate until a termination condition is met, that is, the [EOS] token is decoded or the sequence reaches its maximum length. In addition, the article introduces the token tree verification algorithm, an effective strategy for progressively improving token acceptance.

## Model Alignment

Improving the accuracy of the guesses is key to accelerating speculative decoding: the closer the approximate model's predictions are to the target large language model's behavior, the higher the acceptance rate of its drafted tokens. To this end, existing work explores various knowledge distillation (KD) strategies to align the approximate model's outputs with those of the target model. Blockwise decoding first applied sequence-level knowledge distillation (Seq-KD) for model alignment, training the approximate model on sentences generated by the target large language model. Seq-KD is also an effective strategy for improving the generation quality of parallel decoding. The main characteristics of existing speculative decoding methods are summarized in Table 3 below, including the type of approximate model or generation strategy, the model alignment method, the supported evaluation strategies, and the degree of acceleration.
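A rough sketch of the Seq-KD idea under toy assumptions: sequences sampled from the target model serve as training data, and the draft model is fine-tuned on them with a standard next-token loss (the model, sizes, and hyperparameters here are illustrative stand-ins, not from the surveyed papers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Seq-KD sketch: fine-tune a small draft model on sequences produced by the
# target model so that its guesses align better. Both models are toy stand-ins.
VOCAB, HIDDEN = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)
    def forward(self, ids):                   # ids: [batch, seq]
        hidden, _ = self.rnn(self.emb(ids))
        return self.out(hidden)               # logits: [batch, seq, vocab]

draft_model = TinyLM()
optimizer = torch.optim.Adam(draft_model.parameters(), lr=1e-3)

# Pretend these sequences were sampled from the large target model.
target_generated = torch.randint(0, VOCAB, (8, 16))

for _ in range(3):                            # a few Seq-KD updates
    inputs, labels = target_generated[:, :-1], target_generated[:, 1:]
    logits = draft_model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final Seq-KD loss: {loss.item():.3f}")
```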
## Application

In addition to serving as a general paradigm, recent work shows that certain variants of speculative decoding are remarkably effective on specific tasks. Other research applies the paradigm to latency problems unique to particular application scenarios, thereby achieving inference acceleration. For example, some researchers argue that speculative decoding is especially well suited to tasks where the model's input and output are highly similar, such as grammatical error correction and retrieval-augmented generation. Beyond these works, RaLMSpec (Zhang et al., 2023b) uses speculative decoding to accelerate retrieval-augmented language models (RaLMs).

## Opportunities and Challenges

Question 1: How should the accuracy of the drafted content be balanced against the efficiency of generating it? Although some progress has been made, there is still considerable room for improvement in aligning the approximate model with what the target large language model generates. Beyond model alignment, other factors such as generation quality and the choice of draft length also affect prediction accuracy and deserve further exploration.

Question 2: How can speculative decoding be combined with other leading techniques? As a general decoding paradigm, speculative decoding has already been combined with other advanced techniques and has demonstrated its potential. Beyond accelerating plain-text large language models, applying speculative decoding to multimodal inference, such as image synthesis, text-to-speech synthesis, and video generation, is an interesting and valuable direction for future research.

Please refer to the original paper for more details.