Why does In-Context Learning, driven by GPT, work? The model performs gradient descent in secret
Following BERT, researchers noticed the potential of large-scale pre-trained models, and many new pre-training tasks, model architectures, and training strategies have since been proposed. However, BERT-style models have two major shortcomings: over-reliance on labeled data, and a tendency to overfit.
Specifically, current language models usually follow a two-stage framework: pre-training, then fine-tuning on downstream tasks. Fine-tuning requires a large number of samples for the downstream task, otherwise performance suffers, yet labeling data is expensive. Moreover, with only limited labeled data the model can merely fit the training distribution; too little data easily leads to overfitting, which reduces the model's generalization ability.
Large-scale pre-trained language models, and GPT-3 in particular as the pioneer of large models, have shown surprising ICL (In-Context Learning) capabilities. Unlike fine-tuning, which requires additional parameter updates, ICL only needs a few demonstration "input-label" pairs, after which the model can predict labels even for unseen inputs. On many downstream tasks, a large GPT model can achieve quite good performance, even surpassing some smaller models trained with supervised fine-tuning.
Why does ICL perform so well? In the more than 70-page GPT-3 paper "Language Models are Few-Shot Learners", OpenAI explored ICL, with the goal of letting GPT-3 solve problems using less domain data and without fine-tuning.
As shown in the figure below, ICL comes in three settings: few-shot learning, where the input contains several examples and a task description; one-shot learning, where the input contains only one example and a task description; and zero-shot learning, where no examples are allowed, only a task description. The results show that ICL requires no backpropagation: simply placing a small number of labeled samples in the context of the input text is enough to induce GPT-3 to output answers.
[Figure: GPT-3 in-context learning]
Experiments show that GPT-3 performs very well in the few-shot setting.
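To make the three settings concrete, below is a minimal sketch of how such prompts can be assembled from demonstration "input-label" pairs. The task, examples, and prompt format are invented for illustration, and no actual model call is shown:

```python
# Minimal sketch: assembling zero-/one-/few-shot ICL prompts from
# demonstration "input-label" pairs. Task and examples are invented
# for illustration; the exact format GPT-3 expects may differ.
task = "Classify the sentiment of the sentence as positive or negative."
demos = [
    ("The movie was fantastic.", "positive"),
    ("I wasted two hours of my life.", "negative"),
]
query = "A thoroughly enjoyable read."

def build_prompt(task: str, demos: list, query: str, shots: int) -> str:
    """shots=0 -> zero-shot, shots=1 -> one-shot, shots>=2 -> few-shot."""
    parts = [task]
    for text, label in demos[:shots]:
        parts.append(f"Input: {text}\nLabel: {label}")
    parts.append(f"Input: {query}\nLabel:")  # the model fills in the label
    return "\n\n".join(parts)

print(build_prompt(task, demos, query, shots=2))  # a few-shot prompt
```

No parameters are updated anywhere in this process; the demonstrations only ever enter through the context.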
Why can GPT learn in-context?

Although ICL has achieved great success in terms of performance, its working mechanism remains an open research problem. To better understand how ICL works, we next introduce a study from Peking University, Tsinghua University and other institutions that explains it.
The study interprets the language model as a meta-optimizer and ICL as a meta-optimization process, that is, as an implicit fine-tuning, attempting to establish a connection between GPT-based ICL and explicit fine-tuning. Theoretically, the study finds that Transformer attention has a dual form based on gradient-descent optimization.
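Concretely, under the relaxed (linear) attention approximation, attending over the demonstration tokens X′ together with the query-context tokens X decomposes into a zero-shot term plus a demonstration-induced update. The notation below is a sketch and may differ from the paper's:

$$
\mathrm{Attn}(q) = W_V\,[X';\,X]\,\bigl(W_K\,[X';\,X]\bigr)^{\top} q
= \underbrace{W_V X (W_K X)^{\top}}_{W_{\mathrm{ZSL}}}\, q
+ \underbrace{W_V X' (W_K X')^{\top}}_{\Delta W_{\mathrm{ICL}}}\, q
= \bigl(W_{\mathrm{ZSL}} + \Delta W_{\mathrm{ICL}}\bigr)\, q
$$

Since $\Delta W_{\mathrm{ICL}}$ is a sum of outer products $(W_V x'_i)(W_K x'_i)^{\top}$ over the demonstration tokens, it has the same algebraic form as a gradient-descent weight update, which is the duality the study refers to.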
On this basis, the study proposes a new perspective to explain ICL: GPT first produces meta-gradients from the demonstration examples, and then applies these meta-gradients to the original GPT through attention to build the ICL model.
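The decomposition behind this perspective is easy to verify numerically for the linear-attention case. The following sketch uses random matrices and made-up dimensions; it only checks the algebra, not any real GPT:

```python
import numpy as np

# Numerical check of the linear-attention duality: attending over the
# concatenated sequence [demonstrations; query context] equals applying
# zero-shot weights W_zsl plus a demonstration-induced update dW_icl.
# Dimensions and random matrices are illustrative, not from the paper.
rng = np.random.default_rng(0)
d, n_ctx, n_demo = 8, 5, 3

W_K = rng.normal(size=(d, d))          # key projection
W_V = rng.normal(size=(d, d))          # value projection
X = rng.normal(size=(d, n_ctx))        # query-context token representations
X_demo = rng.normal(size=(d, n_demo))  # demonstration token representations
q = rng.normal(size=(d,))              # attention query vector

# Linear attention over the full sequence [X_demo; X]
X_all = np.concatenate([X_demo, X], axis=1)
out_icl = (W_V @ X_all) @ (W_K @ X_all).T @ q

# Dual form: zero-shot weights plus a meta-gradient-like update
W_zsl = (W_V @ X) @ (W_K @ X).T
dW_icl = (W_V @ X_demo) @ (W_K @ X_demo).T
out_dual = (W_zsl + dW_icl) @ q

print(np.allclose(out_icl, out_dual))  # True: the two computations coincide
```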
As shown in Figure 1, ICL and explicit fine-tuning share a dual optimization form based on gradient descent. The only difference is that ICL produces meta-gradients through forward computation, while fine-tuning computes gradients through backpropagation. It is therefore reasonable to understand ICL as a kind of implicit fine-tuning.

ICL performs implicit fine-tuning

The study first qualitatively analyzes Transformer attention in a relaxed, linear-attention form to find its duality with gradient-descent-based optimization. It then compares ICL with explicit fine-tuning and establishes a link between these two forms of optimization. Based on these theoretical findings, the authors propose to understand ICL as an implicit fine-tuning.

First, the study treats Transformer attention as meta-optimization and interprets ICL as a meta-optimization process: (1) a Transformer-based pre-trained language model serves as the meta-optimizer; (2) meta-gradients are produced from the demonstration examples through forward computation; (3) the meta-gradients are applied to the original language model through attention, building the ICL model.

Next comes the comparison between ICL and fine-tuning. Across a range of settings, the study finds that ICL shares many properties with fine-tuning, organized into four aspects: both perform a form of gradient descent; both use the same training information; both see the training examples in the same causal order; and both act on the attention mechanism. Given all these common properties, the study argues that it is reasonable to understand ICL as an implicit fine-tuning. In the remainder of the paper, the study empirically compares ICL and fine-tuning from multiple aspects to provide quantitative results supporting this understanding.

Experimental results

In addition, inspired by the meta-optimization view, the study designs a momentum-based attention by analogy with the momentum-based gradient descent algorithm; it consistently outperforms vanilla attention (a sketch of one possible implementation appears after the results below).

Table 2 shows the validation accuracy under the ZSL (Zero-Shot Learning), ICL, and fine-tuning (FT) settings on six classification datasets. Both ICL and fine-tuning achieve considerable improvements over ZSL, which means the optimizations they perform help these downstream tasks. Furthermore, the study finds that ICL performs better than fine-tuning in few-shot scenarios.
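The paper gives the exact formulation of the momentum-based attention; the sketch below is one plausible reading, assuming the momentum term is an exponentially decayed sum over past value vectors with a decay coefficient eta, which is an assumption here rather than the paper's verbatim definition:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def momentum_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray,
                       eta: float = 0.9) -> np.ndarray:
    """Vanilla attention plus a momentum-like term over past values.

    Sketch by analogy with momentum gradient descent; `eta` and the
    exact EMA form are assumptions, not the paper's verbatim definition.
    q: (d,), K: (t, d), V: (t, d).
    """
    d = K.shape[1]
    out = softmax(q @ K.T / np.sqrt(d)) @ V      # standard attention output
    decay = eta ** np.arange(V.shape[0], 0, -1)  # older values decay more
    return out + decay @ V                       # add EMA of value vectors

# Tiny usage example with random tensors
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(8,)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(momentum_attention(q, K, V).shape)  # (8,)
```

The design intuition is that, if attention values play the role of meta-gradients, then accumulating past values should help in the same way momentum helps gradient descent.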
Table 3 reports the Rec2FTP scores of two GPT models on six datasets. On average, ICL correctly predicts 87.64% of the examples that fine-tuning is able to correct from the ZSL baseline. These results indicate that, at the prediction level, ICL covers most of the correct behaviors of fine-tuning.

Table 3 also shows the average SimAOU scores over examples and layers for the two GPT models on the six datasets. For comparison, the study also provides a baseline metric (Random SimAOU) that computes the similarity between ICL updates and randomly generated updates. As the table shows, ICL updates are far more similar to fine-tuning updates than to random updates, meaning that at the representation level, ICL tends to change attention outputs in the same direction as fine-tuning does.

Finally, Table 3 shows the average SimAM scores over examples and layers for the two GPT models on the six datasets. As the baseline metric for SimAM, ZSL SimAM computes the similarity between ICL attention weights and ZSL attention weights. Comparing the two metrics, the study finds that ICL is more inclined to generate attention weights similar to those of fine-tuning than to those of ZSL. Thus, at the level of attention behavior as well, ICL behaves like fine-tuning.

To explore the similarity between ICL and fine-tuning more thoroughly, the study compares SimAOU and SimAM scores across layers. Randomly sampling 50 validation examples from each dataset, it draws the SimAOU and SimAM boxplots shown in Figures 2 and 3 below. The figures show that SimAOU and SimAM fluctuate at lower layers and become more stable at higher layers. This phenomenon suggests that the meta-optimization performed by ICL has a forward accumulation effect: as accumulation increases, ICL behaves more like fine-tuning at higher layers.
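To make the two comparison metrics concrete, here is a minimal sketch of SimAOU- and SimAM-style scores as cosine similarities. The tensor names and shapes are illustrative assumptions, not the paper's exact extraction protocol:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim_aou(h_zsl, h_icl, h_ft) -> float:
    """SimAOU-style score: similarity between the attention-output update
    of ICL and that of fine-tuning, both relative to the ZSL baseline.
    Inputs are flattened attention-output vectors for one example/layer."""
    return cosine(h_icl - h_zsl, h_ft - h_zsl)

def sim_am(a_icl, a_ft) -> float:
    """SimAM-style score: similarity between attention weight maps
    under ICL and under fine-tuning (flattened before comparison)."""
    return cosine(a_icl.ravel(), a_ft.ravel())

# Tiny usage example with random vectors standing in for real activations
rng = np.random.default_rng(0)
h_zsl, h_icl, h_ft = (rng.normal(size=64) for _ in range(3))
print(sim_aou(h_zsl, h_icl, h_ft))
```

A positive SimAOU means the ICL update points in the same direction as the fine-tuning update, which is exactly what the reported results indicate.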
Summary

In conclusion, this article aims to explain the working mechanism of GPT-based ICL. Theoretically, the study derives a dual form for ICL and proposes to understand ICL as a meta-optimization process. Furthermore, it establishes a link between ICL and a specific fine-tuning setting, finding it reasonable to regard ICL as an implicit fine-tuning. To support this understanding, the study comprehensively compares the behavior of ICL with that of explicit fine-tuning on real tasks, and the results show that ICL behaves similarly to explicit fine-tuning. Finally, inspired by the meta-optimization view, the study designs a momentum-based attention that achieves consistent performance improvements. The authors hope this work helps more people gain insight into ICL applications and model design.