Why does In-Context Learning, driven by GPT, work? The model performs gradient descent in secret

PHPz

Apr 25, 2023 pm 10:16 PM

Since BERT, researchers have recognized the potential of large-scale pre-trained models and have proposed a variety of pre-training tasks, model architectures, and training strategies. However, BERT-style models usually have two major shortcomings: over-reliance on labeled data and a tendency to overfit.

Specifically, current language models tend to use a two-stage framework: pre-training followed by fine-tuning on downstream tasks. Fine-tuning needs a large number of labeled samples to work well, yet labeling data is expensive. When labeled data is limited, the model can only fit the small training distribution, which easily leads to overfitting and reduces its ability to generalize.

Large-scale pre-trained language models, especially GPT-3, have shown surprising ICL (In-Context Learning) capabilities. Unlike fine-tuning, which requires additional parameter updates, ICL needs only a few demonstration "input-label" pairs, after which the model can predict labels even for unseen inputs. On many downstream tasks, a large GPT model can achieve quite good performance, even surpassing some smaller models trained with supervised fine-tuning.

Why does ICL perform so well? OpenAI explored ICL in its more than 70-page paper "Language Models are Few-Shot Learners". The goal was to let GPT-3 solve problems with less domain data and without fine-tuning.

As shown in the figure below, ICL covers three settings: few-shot learning, which allows several examples and a task description as input; one-shot learning, which allows only one example and a task description; and zero-shot learning, which allows no examples, only a task description. The results show that ICL does not require backpropagation: placing a small number of labeled samples in the context of the input text is enough to induce GPT-3 to output answers.
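The three settings differ only in how many demonstrations are placed in the prompt. A minimal sketch of that idea follows; the `build_prompt` helper, the "Input:/Label:" template, and the example reviews are hypothetical illustrations, not the format used in the GPT-3 paper.

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble an in-context learning prompt as plain text.

    demonstrations is a list of (input, label) pairs; it is empty for zero-shot.
    The template is an illustrative assumption, not a prescribed format.
    """
    blocks = [task_description]
    for text, label in demonstrations:
        blocks.append(f"Input: {text}\nLabel: {label}")
    # The query is left unlabeled so that the model completes the label.
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


task = "Classify the sentiment of each review as positive or negative."
demos = [
    ("A wonderful film.", "positive"),
    ("Dull and far too long.", "negative"),
]
query = "I loved every minute."

zero_shot = build_prompt(task, [], query)        # task description only
one_shot = build_prompt(task, demos[:1], query)  # one demonstration
few_shot = build_prompt(task, demos, query)      # several demonstrations
```

No parameters change in any of the three cases; only the text fed to the model differs.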

Figure: GPT-3 in-context learning

Experiments show that GPT-3 performs very well in the few-shot setting.

Why can GPT learn in context?

Although ICL has achieved great success in terms of performance, its working mechanism remains an open problem. To better understand how ICL works, we next look at how a study from Peking University, Tsinghua University, and other institutions explains it.

Paper address: https://arxiv.org/pdf/2212.10559v2.pdf
Project address: https://github.com/microsoft/LMOps

To summarize in the words of one netizen: "This work shows that GPT naturally learns to use internal optimizations to perform certain runs. The research also provides empirical evidence that In-Context Learning and explicit fine-tuning perform similarly on multiple levels."

To better understand how ICL works, the study interprets the language model as a meta-optimizer and ICL as a meta-optimization process, treats ICL as a form of implicit fine-tuning, and attempts to establish a link between GPT-based ICL and fine-tuning. Theoretically, the study finds that Transformer attention has a dual form of gradient-descent-based optimization.

Based on this, the study proposes a new perspective to explain ICL: GPT first generates meta-gradients from the demonstration examples, and then applies these meta-gradients to the original GPT to build an ICL model.

As shown in Figure 1, ICL and explicit fine-tuning share a dual optimization form based on gradient descent. The only difference is that ICL produces meta-gradients by forward computation, while fine-tuning computes gradients by backpropagation. Therefore, it is reasonable to understand ICL as some kind of implicit fine-tuning.

ICL performs implicit fine-tuning

The study first conducts a qualitative analysis of Transformer attention in a relaxed linear-attention form to find its duality with gradient-descent-based optimization. It then compares ICL with explicit fine-tuning and establishes a link between these two forms of optimization. Based on these theoretical findings, the authors propose to understand ICL as implicit fine-tuning.
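The duality can be checked numerically on a toy linear layer. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes a squared-error loss, a single gradient step, and softmax-free (linear) attention, and all variable names are hypothetical. The weight update from gradient descent is a sum of outer products of error signals and training inputs, so applying it to a query gives the same result as linear attention with the error signals as values and the training inputs as keys.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 4, 3, 5

W0 = rng.normal(size=(d_out, d_in))  # initial weights of a linear layer
X = rng.normal(size=(d_in, n))       # training inputs, one per column
Y = rng.normal(size=(d_out, n))      # training targets, one per column
q = rng.normal(size=d_in)            # test input (the "query")
lr = 0.1                             # learning rate (assumed value)

# One gradient-descent step on 0.5 * sum_i ||W x_i - y_i||^2.
E = -lr * (W0 @ X - Y)               # error signals e_i, one per column
delta_W = E @ X.T                    # sum_i of outer products e_i x_i^T
out_gd = (W0 + delta_W) @ q


def linear_attention(V, K, q):
    """Attention without softmax: sum_i v_i * (k_i . q)."""
    return V @ (K.T @ q)


# Same result, written as attention over the training examples.
out_attn = W0 @ q + linear_attention(E, X, q)

assert np.allclose(out_gd, out_attn)
```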

First, the study regards Transformer attention as meta-optimization and interprets ICL as a meta-optimization process: (1) a Transformer-based pre-trained language model serves as the meta-optimizer; (2) meta-gradients are generated from the demonstration examples through forward computation; (3) the meta-gradients are applied to the original language model through attention to build an ICL model.
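Under the relaxed linear-attention view, these three steps can be sketched for a single attention head. This is a toy illustration with random matrices, assuming softmax and scaling are dropped; the names `W_zsl` and `delta_W_icl` are labels chosen here for the zero-shot weights and the update contributed by the demonstrations.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_demo, n_query = 8, 6, 3

W_V = rng.normal(size=(d, d))            # value projection
W_K = rng.normal(size=(d, d))            # key projection
X_demo = rng.normal(size=(d, n_demo))    # demonstration tokens, one per column
X_query = rng.normal(size=(d, n_query))  # query-text tokens, one per column
q = rng.normal(size=d)                   # attention query vector

# Linear attention over demonstrations and query text together.
X_all = np.concatenate([X_demo, X_query], axis=1)
out_icl = (W_V @ X_all) @ (W_K @ X_all).T @ q

# The same output split into a zero-shot part and a demonstration-driven update.
W_zsl = (W_V @ X_query) @ (W_K @ X_query).T      # no demonstrations in context
delta_W_icl = (W_V @ X_demo) @ (W_K @ X_demo).T  # built by forward computation only
out_split = W_zsl @ q + delta_W_icl @ q

assert np.allclose(out_icl, out_split)
```

The update term is computed without any backward pass, which is the sense in which the demonstrations act like a weight update applied through attention.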

Next comes a comparison of ICL and fine-tuning. Across a range of settings, the study finds that ICL shares many properties with fine-tuning. The authors organize these commonalities into four aspects: both perform gradient descent; both use the same training information; both follow the same causal order of training examples; and both revolve around attention.

Considering all these common properties of ICL and fine-tuning, the study argues that it is reasonable to understand ICL as implicit fine-tuning. In the remainder of the paper, the authors empirically compare ICL and fine-tuning from multiple aspects to provide quantitative results that support this understanding.

Experimental results

The study conducted a series of experiments on real tasks to comprehensively compare the behavior of ICL and explicit fine-tuning. On six classification tasks, the authors compare pre-trained GPT models in the ICL and fine-tuning settings with respect to predictions, attention outputs, and attention scores. As expected, ICL is highly similar to explicit fine-tuning at the prediction, representation, and attention levels. These results strongly support the view that ICL performs implicit fine-tuning.

In addition, inspired by the meta-optimization view, the study designs a momentum-based attention by analogy with momentum-based gradient descent. It consistently outperforms vanilla attention.
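The article does not give the formula, so the following is only one plausible sketch of the idea: it assumes the momentum term is an exponentially decayed sum of the value vectors, added to ordinary softmax attention. The decay rate `eta` and the function names are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def vanilla_attention(V, K, q):
    """Scaled dot-product attention for a single query vector."""
    d = K.shape[0]
    return V @ softmax(K.T @ q / np.sqrt(d))


def momentum_attention(V, K, q, eta=0.5):
    """Vanilla attention plus a decayed sum of value vectors (assumed form).

    Columns of V are ordered oldest to newest; the newest column gets
    weight eta, the one before it eta**2, and so on.
    """
    n = V.shape[1]
    weights = eta ** np.arange(n, 0, -1)
    return vanilla_attention(V, K, q) + V @ weights


rng = np.random.default_rng(2)
d, n = 8, 5
V = rng.normal(size=(d, n))
K = rng.normal(size=(d, n))
q = rng.normal(size=d)

out_vanilla = vanilla_attention(V, K, q)
out_momentum = momentum_attention(V, K, q, eta=0.5)
```

With `eta=0` the extra term vanishes and the function reduces to vanilla attention, which makes the momentum contribution easy to ablate.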

Table 2 shows the validation accuracy in ZSL (Zero-Shot Learning), ICL and fine-tuning (FT) settings on six classification datasets. Both ICL and fine-tuning achieve considerable improvements compared to ZSL, which means that the optimizations made help these downstream tasks. Furthermore, the study found that ICL performed better than fine-tuning in few-shot scenarios.

The Rec2FTP scores of the two GPT models on the six datasets are shown in Table 3. On average, ICL correctly predicts 87.64% of the examples that ZSL gets wrong and fine-tuning corrects. These results indicate that, at the prediction level, ICL covers most of the correct behavior of fine-tuning.
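Read literally, the metric is a recall: among examples that zero-shot gets wrong and fine-tuning gets right, the fraction that ICL also gets right. The sketch below follows that reading; the function name `rec2ftp` and the toy correctness flags are hypothetical, and the paper's exact definition may differ in details.

```python
import numpy as np


def rec2ftp(zsl_correct, ft_correct, icl_correct):
    """Fraction of examples fixed by fine-tuning that ICL also gets right."""
    zsl = np.asarray(zsl_correct, dtype=bool)
    ft = np.asarray(ft_correct, dtype=bool)
    icl = np.asarray(icl_correct, dtype=bool)
    fixed_by_ft = ~zsl & ft  # wrong under ZSL, right after fine-tuning
    if fixed_by_ft.sum() == 0:
        return float("nan")  # undefined when fine-tuning fixes nothing
    return float((fixed_by_ft & icl).sum() / fixed_by_ft.sum())


# Toy per-example correctness flags (1 = correct prediction).
zsl = [0, 0, 0, 1, 0]
ft = [1, 1, 1, 1, 0]
icl = [1, 1, 0, 1, 1]

score = rec2ftp(zsl, ft, icl)  # fine-tuning fixes 3 examples, ICL covers 2 of them
```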

Table 3 also shows the SimAOU scores of the two GPT models on the six datasets, averaged over examples and layers. For comparison, the study provides a baseline metric (Random SimAOU) that calculates the similarity between ICL updates and randomly generated updates. As the table shows, ICL updates are more similar to fine-tuning updates than to random updates, which means that at the representation level, ICL tends to change attention outputs in the same direction as fine-tuning does.
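A comparison of this kind can be sketched as follows. The use of cosine similarity, the definition of an "update" as the difference from the zero-shot attention output, and the synthetic vectors are all assumptions made for illustration; the article itself only says that the updates are compared for similarity.

```python
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def sim_aou(h_zsl, h_icl, h_ft):
    """Similarity between the ICL update and the fine-tuning update."""
    return cosine(h_icl - h_zsl, h_ft - h_zsl)


def random_sim_aou(h_zsl, h_icl, rng):
    """Baseline: similarity between the ICL update and a random update."""
    return cosine(h_icl - h_zsl, rng.normal(size=h_zsl.shape))


# Synthetic attention outputs in which ICL moves roughly the way fine-tuning does.
rng = np.random.default_rng(3)
d = 64
h_zsl = rng.normal(size=d)
direction = rng.normal(size=d)
h_ft = h_zsl + direction
h_icl = h_zsl + 0.8 * direction + 0.1 * rng.normal(size=d)

score = sim_aou(h_zsl, h_icl, h_ft)
baseline = random_sim_aou(h_zsl, h_icl, rng)
```

On this synthetic data the score is close to 1 while the random baseline stays near 0, mirroring the qualitative pattern the article reports.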

Finally, Table 3 also shows the SimAM scores of the two GPT models on the six datasets, averaged over examples and layers. As the baseline for SimAM, ZSL SimAM calculates the similarity between ICL attention weights and ZSL attention weights. Comparing the two metrics, the study finds that ICL is more inclined than ZSL to generate attention weights similar to those of fine-tuning. So at the level of attention behavior as well, the study shows that ICL behaves like fine-tuning.

To explore the similarities between ICL and fine-tuning more thoroughly, this study compared SimAOU and SimAM scores for different layers. By randomly sampling 50 validation examples from each dataset, SimAOU and SimAM boxplots were drawn as shown in Figure 2 and Figure 3 below, respectively.

The figures show that SimAOU and SimAM fluctuate at lower layers and become more stable at higher layers. This suggests that the meta-optimization performed by ICL has a forward accumulation effect: as more is accumulated, ICL behaves more like fine-tuning at higher layers.

Summary

In conclusion, the paper aims to explain the working mechanism of GPT-based ICL. Theoretically, the study derives the dual form of ICL and proposes to understand ICL as a meta-optimization process. It further establishes a link between ICL and a specific fine-tuning setting, finding it reasonable to regard ICL as implicit fine-tuning. To support this understanding, the study comprehensively compares the behavior of ICL and fine-tuning on real tasks. The results show that ICL behaves similarly to explicit fine-tuning.

Furthermore, inspired by meta-optimization, the study designs a momentum-based attention that achieves consistent performance improvements. The authors hope that the study can help more people gain insight into ICL applications and model design.

Statement

This article is reproduced from 51CTO.COM. In case of any infringement, please contact admin@php.cn to have it removed.
