What is the source of Transformer's contextual learning capabilities?

Why does the Transformer perform so well? Where does the in-context learning capability it gives many large language models come from? In artificial intelligence, the Transformer has become the dominant model in deep learning, yet the theoretical basis for its excellent performance remains insufficiently studied.

Recently, researchers from Google AI, ETH Zurich, and Google DeepMind conducted a new study attempting to uncover this secret. They reverse-engineered trained Transformers and found optimization algorithms operating inside them. The paper is titled "Uncovering mesa-optimization algorithms in Transformers".


Paper link: https://arxiv.org/abs/2309.05858

The authors demonstrate that minimizing a generic autoregressive loss gives rise to an auxiliary gradient-based optimization algorithm operating in the forward pass of the Transformer, a phenomenon recently termed "mesa-optimization." Furthermore, the researchers found that the resulting mesa-optimization algorithm exhibits in-context few-shot learning capabilities, independent of model size. The new results therefore complement the few-shot learning abilities that have previously emerged in large language models.
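
The core idea can be illustrated in a few lines of NumPy. The following is a minimal sketch (not the authors' code; all variable names are ours): a single gradient-descent step on an in-context least-squares objective yields a prediction that can be written as a linear-attention-style weighted sum over the context tokens, which is exactly the kind of computation a forward pass can perform.

```python
# Minimal sketch: one gradient step on an in-context least-squares problem can
# be expressed as a sum of outer products over context tokens -- the same form
# of computation a linear self-attention layer performs.
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_star = rng.normal(size=(d, d))          # ground-truth mapping for this sequence
x_ctx = rng.normal(size=(8, d))           # in-context inputs
y_ctx = x_ctx @ W_star.T                  # in-context targets
x_query = rng.normal(size=d)              # query token

eta = 0.1
W0 = np.zeros((d, d))                     # initial internal model

# Explicit gradient step on L(W) = 0.5 * sum_j ||W x_j - y_j||^2
grad = sum(np.outer(W0 @ x - y, x) for x, y in zip(x_ctx, y_ctx))
W1 = W0 - eta * grad
pred_gd = W1 @ x_query

# The same prediction as a linear-attention-style sum: "values" (y_j - W0 x_j)
# weighted by the unnormalized attention scores <x_j, x_query>.
pred_attn = W0 @ x_query + eta * sum(
    (y - W0 @ x) * (x @ x_query) for x, y in zip(x_ctx, y_ctx)
)

assert np.allclose(pred_gd, pred_attn)    # the two computations coincide
```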

The researchers believe that the success of Transformers rests on an architectural bias toward implementing a mesa-optimization algorithm in the forward pass: (i) defining an internal learning objective, and (ii) optimizing it.


Figure 1: Illustration of the new hypothesis: optimizing the weights θ of an autoregressive Transformer fθ gives rise to a mesa-optimization algorithm implemented in the model's forward pass. As the input sequence s_1, ..., s_t is processed up to time step t, the Transformer (i) creates an internal training set of input-target pairs, (ii) defines an internal objective function over this dataset, which measures the performance of an internal model with weights W, and (iii) optimizes this objective and uses the learned model to generate the prediction for the next step.
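
Concretely, a natural way to write such an internal objective (our formalization of the figure's description, not an equation copied from the paper) is a running least-squares loss over the input-target pairs extracted from the context, which the forward pass then reduces with gradient steps:

```latex
% Hypothetical internal (mesa) objective at time step t, built from the
% input-target pairs (s_i, s_{i+1}) collected so far, and its gradient update.
L_t(W) = \frac{1}{2} \sum_{i=1}^{t-1} \left\lVert W s_i - s_{i+1} \right\rVert^2 ,
\qquad
W \leftarrow W - \eta \, \nabla_W L_t(W) .
```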

The contributions of this study include the following:

  • Generalizes the theory of von Oswald et al. and shows, in theory, how Transformers can autoregressively predict the next element of a sequence by optimizing an internally constructed objective with gradient-based methods.
  • Experimentally reverse-engineers Transformers trained on a simple sequence-modeling task and finds strong evidence that their forward pass implements a two-step algorithm: (i) early self-attention layers construct an internal training dataset by grouping and copying tokens, thereby implicitly defining internal objective functions, and (ii) deeper layers optimize these objectives to generate predictions.
  • As with LLMs, experiments show that models trained with a simple autoregressive objective can become in-context learners, and that the ad-hoc prompt adjustments known to be crucial for improving in-context learning in LLMs also improve performance in this controlled setting.
  • Motivated by the finding that attention layers implicitly attempt to optimize an internal objective function, the authors introduce the mesa layer, a new type of attention layer that solves a least-squares optimization problem exactly rather than taking only a single gradient step toward the optimum (a minimal sketch of this idea follows this list). Experiments show that a single mesa layer outperforms deep linear and softmax self-attention Transformers on simple sequential tasks while offering better interpretability.

  • In preliminary language-modeling experiments, replacing standard self-attention layers with the mesa layer yielded promising results, demonstrating the layer's strong in-context learning capabilities.
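
To make the mesa-layer idea concrete, here is a minimal sketch (our own illustration under the assumption that the layer solves a regularized least-squares problem over the context; names such as `mesa_head` and `lam` are ours, not the paper's): instead of moving the internal weights by one gradient step, the head computes the closed-form least-squares solution over the tokens seen so far and applies it to the query.

```python
# Hypothetical sketch of a "mesa layer" head: rather than taking one gradient
# step on the in-context least-squares problem, it solves the problem in closed
# form (ridge-regularized for numerical stability) and applies the solution.
import numpy as np

def mesa_head(x_ctx, y_ctx, x_query, lam=1e-3):
    """x_ctx: (n, d) context inputs, y_ctx: (n, d) context targets,
    x_query: (d,) query; returns the least-squares prediction for the query."""
    d = x_ctx.shape[1]
    # Solve argmin_W sum_j ||W x_j - y_j||^2 + lam * ||W||^2 in closed form.
    A = x_ctx.T @ x_ctx + lam * np.eye(d)       # (d, d) regularized covariance
    W = np.linalg.solve(A, x_ctx.T @ y_ctx).T   # (d, d) optimal internal weights
    return W @ x_query

# Usage on the same toy linear dynamics as above: the closed-form solution
# recovers W_star almost exactly, whereas a single gradient step would not.
rng = np.random.default_rng(1)
d = 4
W_star = rng.normal(size=(d, d))
x_ctx = rng.normal(size=(32, d))
y_ctx = x_ctx @ W_star.T
x_query = rng.normal(size=d)
print(np.allclose(mesa_head(x_ctx, y_ctx, x_query), W_star @ x_query, atol=1e-2))
```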

This work builds on recent results showing that Transformers explicitly trained to solve few-shot tasks in context can implement gradient descent (GD) algorithms. Here, the authors show that these results generalize to autoregressive sequence modeling, the typical approach to training LLMs.

The analysis begins with Transformers trained on simple linear dynamics, where each sequence is generated by a different ground-truth matrix W* to prevent memorization across sequences. In this simple setup, the researchers show how the Transformer creates a mesa dataset and uses preconditioned GD to optimize the mesa objective.
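
The training data for this setting is easy to reproduce in outline. Below is a minimal sketch (our own construction; dimensions and noise scale are assumed, not taken from the paper) of such a linear-dynamics dataset, where every sequence follows s_{t+1} = W* s_t with a freshly drawn W*:

```python
# Toy generator for the linear-dynamics setting: each sequence has its own
# random ground-truth matrix W_star, so the model cannot memorize one mapping
# and must instead infer it in-context.
import numpy as np

def sample_sequence(rng, seq_len=20, d=4, noise=0.01):
    W_star = rng.normal(size=(d, d)) / np.sqrt(d)   # fresh dynamics per sequence
    s = [rng.normal(size=d)]
    for _ in range(seq_len - 1):
        s.append(W_star @ s[-1] + noise * rng.normal(size=d))
    return np.stack(s)                              # (seq_len, d)

rng = np.random.default_rng(0)
batch = np.stack([sample_sequence(rng) for _ in range(64)])  # (64, 20, 4)
```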


A deep Transformer is then trained on a token structure that aggregates adjacent sequence elements. Interestingly, this simple preprocessing results in very sparse weight matrices (fewer than 1% of the weights are non-zero), which allows the learned algorithm to be reverse-engineered.


For a single layer of linear self-attention, the weights correspond to one gradient-descent step. For deep Transformers, interpretability becomes harder, so the study relies on linear probing and examines whether hidden activations can predict autoregressive targets or preconditioned inputs.

Interestingly, the predictability under both probes increases gradually with network depth. This finding suggests that a preconditioned GD algorithm is hidden inside the model.
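
A linear probe of this kind is straightforward to set up. The sketch below is our own illustration (`hidden_acts` and `targets` are placeholders for activations and probe targets read out of a trained model): it fits a ridge-regression readout and reports how well a layer's activations linearly predict the target.

```python
# Minimal linear-probing sketch: fit a ridge-regression readout from a layer's
# hidden activations to the probe targets and report an R^2-style score.
import numpy as np

def probe_r2(hidden_acts, targets, lam=1e-2):
    """hidden_acts: (n, d_hidden), targets: (n, d_out) -> R^2 of a linear probe."""
    H, Y = hidden_acts, targets
    d = H.shape[1]
    W = np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ Y)   # ridge solution
    residual = Y - H @ W
    return 1.0 - residual.var() / Y.var()

# e.g. probe every layer of a depth-L model:
# scores = [probe_r2(acts_per_layer[l], next_tokens) for l in range(L)]
```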


Figure 2: Reverse engineering of a trained linear self-attention layer.

The study found that the trained layer can be fitted perfectly when all degrees of freedom in the construction are used, including not only a learned learning rate η but also a set of learned initial weights W_0. Importantly, as shown in Figure 2, the learned one-step algorithm still falls far short of a single mesa layer.
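
Written out, the best-fitting description of the trained layer (in our notation, following the construction described above) is a single gradient step on the internal objective from a learned starting point:

```latex
% One-step description of the trained linear self-attention layer (our notation):
% a single gradient step from learned initial weights W_0 with learned rate eta.
W_1 = W_0 - \eta \, \nabla_W L_t(W) \Big|_{W = W_0}
```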

Under simple weight configurations, it is easy to verify through basic optimization that this layer can solve the task optimally. This result demonstrates that hard-coded inductive biases are beneficial for mesa-optimization.

Turning to the multi-layer case with these theoretical insights in hand, the authors first analyze deep Transformers that use only linear and softmax self-attention. They format the input according to a 4-channel token structure, which corresponds to the choice W_0 = 0.

As with the single-layer model, the authors see clear structure in the weights of the trained model. As a first reverse-engineering analysis, the study exploits this structure and builds an algorithm (RevAlg-d, where d denotes the number of layers) containing 16 parameters per layer and head (instead of 3,200). The authors found that this compressed yet complex expression describes the trained model; in particular, it allows interpolating between the actual Transformer weights and the RevAlg-d weights in an almost lossless manner.

While the RevAlg-d expression explains the trained multi-layer Transformer, it is hard to interpret directly as a mesa-optimization algorithm. The authors therefore employ linear regression probing (Alain & Bengio, 2017; Akyürek et al., 2023) to look for the signatures of the hypothesized mesa-optimization algorithm.

For the deep linear self-attention Transformer shown in Figure 3, both probe targets can be linearly decoded, and decoding performance increases with sequence length and network depth. The study thus uncovers a base optimization algorithm that descends the original mesa-objective L_t(W) layer by layer while improving the condition number of the mesa-optimization problem, leading to a rapid decrease of L_t(W). Performance also improves markedly as depth increases.

In other words, the rapid descent of the autoregressive mesa-objective L_t(W) appears to be achieved by optimizing over progressively better-preconditioned data.
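
Viewed this way, each layer's contribution can be summarized schematically (our notation, not an equation from the paper) as a preconditioned gradient step on the mesa-objective:

```latex
% Schematic layer-by-layer update: layer k takes a preconditioned gradient
% step on the mesa-objective L_t(W); P_k is a hypothetical preconditioner.
W^{(k+1)} = W^{(k)} - \eta_k \, P_k \, \nabla_W L_t\!\left(W^{(k)}\right),
\qquad W^{(0)} = 0
```

Here P_k stands for a preconditioning matrix that improves the condition number of the problem, consistent with the probing results above.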


Figure 3: Reverse engineering of multi-layer Transformers trained on constructed token inputs.

This shows that when the Transformer is trained on constructed tokens, it predicts via mesa-optimization. Interestingly, when the sequence elements are given directly, the Transformer constructs the tokens itself by grouping adjacent elements, which the research team calls "creating the mesa dataset".
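
The grouping step amounts to pairing each element with its predecessor so that every token carries an (input, target) example. A minimal sketch of such a construction (our own illustration of the idea, not the paper's exact token format) is:

```python
# Hypothetical "mesa dataset" construction: pair each sequence element with its
# predecessor so that token t carries the in-context example (s_{t-1} -> s_t).
import numpy as np

def build_mesa_tokens(seq):
    """seq: (T, d) sequence -> (T-1, 2*d) tokens of concatenated (input, target) pairs."""
    return np.concatenate([seq[:-1], seq[1:]], axis=-1)

seq = np.arange(12.0).reshape(6, 2)        # toy sequence, T=6, d=2
tokens = build_mesa_tokens(seq)            # shape (5, 4)
```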


Conclusion

The central finding of this study is that when Transformer models are trained on sequence-prediction tasks under a standard autoregressive objective, gradient-based inference algorithms emerge in their forward pass. Recent multi-task and meta-learning results can therefore also be applied to traditional self-supervised LLM training settings.

In addition, the study found that the learned autoregressive inference algorithms can be repurposed, without retraining, to solve supervised in-context learning tasks, allowing the results to be interpreted within a unified framework.


So what is the relationship between these results and in-context learning? According to the study, once the Transformer model has been trained on autoregressive sequence tasks, it acquires an appropriate mesa-optimization ability and can therefore perform few-shot in-context learning without any fine-tuning.
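
In practice this means a regression-style few-shot task can simply be packed into a sequence and handed to the trained model as-is. The sketch below is our own illustration (`trained_transformer` is a placeholder for a model trained on the autoregressive task described above):

```python
# Hypothetical few-shot evaluation: pack labelled (x, y) pairs followed by a
# query x into one sequence; a model that mesa-optimizes in its forward pass
# can read off the prediction for the query without any fine-tuning.
import numpy as np

def pack_few_shot_prompt(x_shots, y_shots, x_query):
    """Interleave example inputs and targets, then append the query input."""
    interleaved = np.stack([v for x, y in zip(x_shots, y_shots) for v in (x, y)])
    return np.concatenate([interleaved, x_query[None]], axis=0)

# prompt = pack_few_shot_prompt(x_shots, y_shots, x_query)
# prediction = trained_transformer(prompt)[-1]   # read the final output token
```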


The study hypothesizes that mesa-optimization also exists in LLMs, underpinning their in-context learning abilities. Interestingly, it also observes that effectively adapting prompts for LLMs can lead to substantial improvements in in-context learning capabilities.


Interested readers can read the original paper to learn more about the research.
