


Why does GPT-driven in-context learning work? The model secretly performs gradient descent
Following BERT, researchers recognized the potential of large-scale pre-trained models and proposed a variety of pre-training tasks, model architectures, and training strategies. However, BERT-style models typically suffer from two major shortcomings: over-reliance on labeled data, and a tendency to overfit.
Specifically, current language models generally follow a two-stage framework: pre-training, then fine-tuning on downstream tasks. Fine-tuning, however, requires a large number of labeled samples to work well, and labeling data is expensive. Moreover, with limited labeled data the model can only fit the training distribution; when data is scarce, this easily leads to overfitting and reduces the model's ability to generalize.
As pioneers among large models, large-scale pre-trained language models, especially GPT-3, have shown surprising in-context learning (ICL) capabilities. Unlike fine-tuning, which requires additional parameter updates, ICL needs only a few demonstration "input-label" pairs, after which the model can predict labels even for unseen inputs. On many downstream tasks, a large GPT model achieves quite good performance, even surpassing some smaller models trained with supervised fine-tuning.
Why does ICL perform so well? In the 70-plus-page paper "Language Models are Few-Shot Learners", OpenAI explored ICL, with the goal of letting GPT-3 solve problems using less domain data and without any fine-tuning.
As shown in the figure below, ICL comes in three flavors: few-shot learning, where several examples and a task description are provided; one-shot learning, where only one example and a task description are provided; and zero-shot learning, where no examples are allowed, only a task description. The results show that ICL requires no backpropagation: simply placing a small number of labeled samples in the context of the input text is enough to induce GPT-3 to output answers.
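The three prompt formats above can be sketched as plain string construction. The task, template, and labels below are illustrative examples of ours, not taken from the paper:

```python
# Sketch of the three ICL prompt formats for a sentiment task.
# The task description, demonstrations, and labels are illustrative.

def build_prompt(task_description, demonstrations, query):
    """Concatenate a task description, optional demonstrations, and the query."""
    lines = [task_description]
    for text, label in demonstrations:            # empty list => zero-shot
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")  # the model completes the label
    return "\n\n".join(lines)

demos = [("A wonderful film.", "positive"), ("Dull and far too long.", "negative")]
task = "Classify the sentiment of each review."
few_shot  = build_prompt(task, demos, "I loved it.")       # several examples
one_shot  = build_prompt(task, demos[:1], "I loved it.")   # one example
zero_shot = build_prompt(task, [], "I loved it.")          # task description only
print(few_shot)
```

No weights are updated at any point; the demonstrations exist only inside the prompt string.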
GPT-3 in-context learning
Experiments show that GPT-3 performs very well in the few-shot setting:
Although ICL has achieved great success in terms of performance, its working mechanism remains an open problem. To better understand how ICL works, we next introduce a study from Peking University, Tsinghua University, and other institutions that attempts to explain it.
- Paper address: https://arxiv.org/pdf/2212.10559v2.pdf
- Project address: https://github.com/microsoft/LMOps
To better understand how ICL works, the study interprets the language model as a meta-optimizer and ICL as a meta-optimization process, i.e., a kind of implicit fine-tuning, and attempts to establish a link between GPT-based ICL and explicit fine-tuning. Theoretically, the study finds that Transformer attention has a dual form of gradient-descent-based optimization.
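That dual form can be sketched as follows. This is our rendering of the paper's derivation for relaxed (softmax-free) linear attention, and the notation may differ slightly from the original: with demonstration tokens X', query-text tokens X, and attention query q,

```latex
% Relaxed linear attention over the concatenation [X'; X]:
F_{\mathrm{ICL}}(q)
  = W_V\,[X';\,X]\,\bigl(W_K\,[X';\,X]\bigr)^{\top} q
  = \underbrace{W_V X (W_K X)^{\top}}_{W_{\mathrm{ZSL}}}\, q
    + \underbrace{W_V X' (W_K X')^{\top}}_{\Delta W_{\mathrm{ICL}}}\, q
% Compare gradient descent on a linear layer, whose weight update is likewise
% a sum of outer products of error signals e_i and inputs x_i:
\Delta W_{\mathrm{GD}} = \sum_i e_i\, x_i^{\top}
```

The demonstrations thus contribute an additive, outer-product-shaped update ΔW_ICL on top of the zero-shot weights W_ZSL, mirroring the form of a gradient-descent update, even though no backward pass is ever run.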
Based on this, the study proposes a new perspective to explain ICL: GPT first generates meta-gradients from the demonstration examples, and then applies these meta-gradients to the original GPT through attention, thereby realizing ICL.
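The decomposition described above can be checked numerically. The following sketch (ours, not the paper's code) verifies that relaxed linear attention over the concatenated [demonstrations; query tokens] equals the zero-shot term plus the demonstration-induced update:

```python
# Numerical check (illustrative) that relaxed linear attention over
# [demonstrations; query tokens] splits into a zero-shot term plus an
# implicit update contributed by the demonstrations.
import numpy as np

rng = np.random.default_rng(0)
d, n_query, n_demo = 8, 5, 3
W_V = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
X  = rng.standard_normal((d, n_query))   # query-text tokens
Xp = rng.standard_normal((d, n_demo))    # demonstration tokens
q  = rng.standard_normal(d)              # attention query

# Linear (softmax-free) attention over the concatenated context:
ctx = np.concatenate([Xp, X], axis=1)
full = W_V @ ctx @ (W_K @ ctx).T @ q

# Zero-shot weights plus the demonstration-induced "meta-gradient" update:
W_zsl  = W_V @ X  @ (W_K @ X).T
dW_icl = W_V @ Xp @ (W_K @ Xp).T
split = (W_zsl + dW_icl) @ q

print(np.allclose(full, split))  # → True
```

The identity holds exactly because ctx·ctxᵀ = X·Xᵀ + X'·X'ᵀ; the softmax in real attention makes this only an approximation, which is why the paper's analysis uses the relaxed linear form.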
As shown in Figure 1, ICL and explicit fine-tuning share a dual form of gradient-descent-based optimization. The only difference is that ICL produces meta-gradients through forward computation, while fine-tuning computes gradients through backpropagation. It is therefore reasonable to understand ICL as a kind of implicit fine-tuning.
The study first performs a qualitative analysis of Transformer attention in a relaxed, linear-attention form to uncover its duality with gradient-descent-based optimization. It then compares ICL with explicit fine-tuning and establishes a link between these two forms of optimization. Based on these theoretical findings, the authors propose to understand ICL as an implicit fine-tuning.
First, the study treats Transformer attention as meta-optimization and interprets ICL as a meta-optimization process: (1) a Transformer-based pre-trained language model serves as the meta-optimizer; (2) meta-gradients are generated from the demonstration examples through forward computation; (3) the meta-gradients are applied to the original language model through attention to realize ICL.
Next comes the comparison of ICL and fine-tuning. Across a range of settings, the study found that ICL shares many properties with fine-tuning, organized along four dimensions: both perform gradient descent; both use the same training information; both see the training examples in the same causal order; and both act on attention. Given all these common properties, the study argues that it is reasonable to understand ICL as an implicit fine-tuning. In the remainder of the paper, ICL and fine-tuning are compared empirically from multiple angles to provide quantitative evidence for this understanding.
Experimental results
In addition, inspired by the meta-optimization view, the study designed a momentum-based attention by analogy with the momentum-based gradient descent algorithm.
It consistently outperforms vanilla attention.
Table 2 shows the validation accuracy in the ZSL (zero-shot learning), ICL, and fine-tuning (FT) settings on six classification datasets. Both ICL and fine-tuning achieve considerable improvements over ZSL, which means their optimizations help on these downstream tasks. Moreover, the study found that ICL performs better than fine-tuning in few-shot scenarios.
The Rec2FTP scores of the two GPT models on the six datasets are shown in Table 3. On average, ICL correctly predicts 87.64% of the examples on which fine-tuning corrects ZSL. These results indicate that, at the prediction level, ICL covers most of the correct behavior of fine-tuning.
Table 3 also shows the SimAOU scores, averaged over examples and layers, for the two GPT models on the six datasets. For comparison, the study also provides a baseline metric (Random SimAOU) that measures the similarity between ICL updates and randomly generated updates. As the table shows, ICL updates are far more similar to fine-tuning updates than to random updates, which means that at the representation level, ICL tends to change the attention outputs in the same direction as fine-tuning does.
Finally, Table 3 shows the SimAM scores, averaged over examples and layers, for the two models on the six datasets. As a baseline for SimAM, ZSL SimAM computes the similarity between ICL attention weights and ZSL attention weights. Comparing the two metrics, the study found that ICL is more inclined to generate attention weights similar to those of fine-tuning than to those of ZSL. Thus, at the level of attention behavior as well, ICL behaves like fine-tuning.
To explore the similarity between ICL and fine-tuning more thoroughly, the study compared SimAOU and SimAM scores across layers. Randomly sampling 50 validation examples from each dataset, it drew the SimAOU and SimAM box plots shown in Figures 2 and 3 below. The scores fluctuate at lower layers and become more stable at higher layers. This phenomenon suggests that the meta-optimization performed by ICL has a forward accumulation effect: as the accumulation grows, ICL behaves more like fine-tuning at higher layers.
In conclusion, this article aims to explain the working mechanism of GPT-based ICL.
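The metrics above can be sketched in a few lines. The formulas follow the paper's descriptions (Rec2FTP counts how many of the examples fine-tuning fixes over ZSL are also fixed by ICL; SimAOU is the cosine similarity of attention-output updates relative to ZSL), but the function names and toy inputs are ours:

```python
# Illustrative computation of two of the paper's comparison metrics.
import numpy as np

def rec2ftp(zsl_correct, ft_correct, icl_correct):
    """Fraction of examples that fine-tuning fixes over ZSL which ICL also fixes."""
    fixed_by_ft = ft_correct & ~zsl_correct        # FT corrects ZSL here
    covered = fixed_by_ft & icl_correct            # ICL gets these right too
    return covered.sum() / max(fixed_by_ft.sum(), 1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sim_aou(attn_icl, attn_ft, attn_zsl):
    """Similarity of attention-output updates: (ICL - ZSL) vs (FT - ZSL)."""
    return cosine(attn_icl - attn_zsl, attn_ft - attn_zsl)

# Toy per-example correctness flags for four validation examples:
zsl = np.array([1, 0, 0, 0], dtype=bool)
ft  = np.array([1, 1, 1, 0], dtype=bool)
icl = np.array([1, 1, 0, 0], dtype=bool)
print(rec2ftp(zsl, ft, icl))  # → 0.5 (ICL recovers 1 of the 2 examples FT fixes)
```

Random SimAOU is the same `sim_aou` computed against a randomly drawn update instead of the fine-tuning one, and SimAM applies the same cosine comparison to attention weights rather than attention outputs.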
Theoretically, the study works out the dual form of ICL and proposes to understand ICL as a meta-optimization process. Furthermore, it establishes a link between ICL and a specific fine-tuning setting, finding it reasonable to regard ICL as an implicit fine-tuning. To support this view, the study comprehensively compares the behavior of ICL with that of fine-tuning on real tasks, and the results show that ICL behaves similarly to explicit fine-tuning. Moreover, inspired by the meta-optimization view, the study designed a momentum-based attention that achieves consistent performance improvements. The authors hope this work helps more people gain insight into ICL applications and model design.
ICL performs implicit fine-tuning
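The momentum-based attention mentioned above can be sketched as follows. This is a minimal illustration of the idea, assuming the momentum term is a scaled sum of the value vectors of earlier positions added to the vanilla attention output; the hyperparameter `eta` and the exact form here are our assumptions, so consult the paper for the precise formulation:

```python
# Sketch of momentum-based attention: the vanilla attention output is
# augmented with a scaled running sum of earlier value vectors, by analogy
# with momentum SGD. The exact form and eta are illustrative assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def momentum_attention(q, K, V, eta=0.1):
    """q: (d,); K, V: (t, d). Attention over positions 0..t-1 plus momentum."""
    attn_out = softmax(K @ q / np.sqrt(K.shape[1])) @ V   # vanilla attention
    momentum = V[:-1].sum(axis=0) if len(V) > 1 else np.zeros_like(attn_out)
    return attn_out + eta * momentum

rng = np.random.default_rng(0)
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
q = rng.standard_normal(8)
out = momentum_attention(q, K, V)
```

With `eta=0` this reduces exactly to vanilla scaled-dot-product attention, which is the property the paper's ablation relies on when comparing the two.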


