Why does the Transformer perform so well? Where does the in-context learning capability it gives many large language models come from? The Transformer has become the dominant architecture in deep learning, but the theoretical basis for its excellent performance remains insufficiently studied.
Recently, researchers from Google AI, ETH Zurich, and Google DeepMind conducted a new study attempting to uncover this secret. They reverse-engineered trained Transformers and found optimization algorithms hidden inside them. The paper is titled "Uncovering mesa-optimization algorithms in Transformers".
Paper link: https://arxiv.org/abs/2309.05858
The authors demonstrate that minimizing a generic autoregressive loss gives rise to an auxiliary gradient-based optimization algorithm operating in the forward pass of the Transformer. This phenomenon has recently been termed "mesa-optimization." Furthermore, the researchers found that the resulting mesa-optimization algorithm exhibits in-context few-shot learning capabilities, independently of model scale. The new results therefore complement the principles of few-shot learning previously observed in large language models.
The researchers believe that the success of Transformers rests on an architectural bias toward implementing a mesa-optimization algorithm in the forward pass: (i) defining an internal learning objective, and (ii) optimizing it.
Figure 1: Illustration of the new hypothesis: optimizing the weights θ of an autoregressive Transformer f_θ gives rise to a mesa-optimization algorithm implemented in the model's forward pass. As the input sequence s_1, . . . , s_t is processed up to time step t, the Transformer (i) creates an internal training set of input-target association pairs, (ii) defines an internal objective function over this dataset, which measures the performance of an internal model with weights W, and (iii) optimizes this objective and uses the learned model to generate a prediction of the future.
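The three steps in Figure 1 can be sketched in a few lines of NumPy. Everything below is illustrative: the function name, shapes, step size, and the zero initialization are assumptions for the sketch, not the paper's actual construction.

```python
import numpy as np

def mesa_forward_step(s, eta=0.01):
    """Predict the next element of a sequence s with shape (t, d) by
    (i) building an internal training set of input-target pairs,
    (ii) defining a squared-error mesa-objective over it, and
    (iii) taking one gradient step on that objective from W = 0."""
    X, Y = s[:-1], s[1:]                    # (i) input-target association pairs
    W = np.zeros((s.shape[1], s.shape[1]))  # internal model weights W
    grad = (X @ W.T - Y).T @ X              # (ii) dL/dW for L(W) = 0.5*||X W^T - Y||_F^2
    W = W - eta * grad                      # (iii) one gradient-descent step
    return W @ s[-1]                        # use the learned model to predict s_{t+1}
```

Starting from W = 0, the single step reduces to W = η·YᵀX, which is exactly the kind of update a linear self-attention layer can express.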
The contributions of this study include the following:
- Generalizes the theory of von Oswald et al. and shows that, in principle, Transformers can predict the next element of a sequence by optimizing an internally constructed objective with gradient-based methods.
- Experimentally reverse-engineers Transformers trained on a simple sequence-modeling task and finds strong evidence that their forward pass implements a two-step algorithm: (i) early self-attention layers implicitly build an internal training dataset by grouping and copying tokens, thereby defining internal objective functions, and (ii) deeper layers optimize these objectives to generate predictions.
- Shows that, much like LLMs, simple autoregressively trained models also become in-context learners, and that on-the-fly prompt adjustments, which are crucial for improving in-context learning in LLMs, improve performance in this setting as well.
- Inspired by the finding that attention layers attempt to implicitly optimize an internal objective function, the authors introduce the mesa layer, a new type of attention layer that exactly solves a least-squares optimization problem rather than merely taking a single gradient step toward the optimum. Experiments demonstrate that a single mesa layer outperforms deep linear and softmax self-attention Transformers on simple sequential tasks while offering more interpretability.
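The contrast between a single gradient step and an exact least-squares solve can be made concrete. The sketch below is illustrative, not the paper's mesa-layer implementation: `one_gd_step` mimics what a linear self-attention layer can express, while `mesa_solve` returns the ridge-regularized closed-form optimum that the mesa layer targets; the function names and the regularizer λ are assumptions.

```python
import numpy as np

def one_gd_step(X, Y, eta=0.01):
    """Linear self-attention analogue: one gradient step on
    0.5*||X W^T - Y||_F^2, starting from W = 0."""
    return eta * Y.T @ X

def mesa_solve(X, Y, lam=1e-3):
    """Mesa-layer analogue: the exact regularized least-squares
    optimum W = Y^T X (X^T X + lam*I)^{-1}."""
    d = X.shape[1]
    return Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(d))
```

On noiseless linear data, the closed-form solution recovers the generating matrix almost exactly, while a single gradient step leaves a large residual; that gap is what the mesa layer is designed to close.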
- In preliminary language-modeling experiments, replacing standard self-attention layers with the mesa layer yielded promising results, demonstrating the layer's strong in-context learning capabilities.
This builds on recent work showing that Transformers explicitly trained to solve few-shot tasks in context can implement gradient descent (GD) algorithms. Here, the authors show that these results generalize to autoregressive sequence modeling, the typical approach used to train LLMs.
First, the authors analyze Transformers trained on simple linear dynamics, where each sequence is generated by a different W* to prevent memorization across sequences. In this simple setup, they show how the Transformer creates a mesa dataset and then uses preconditioned GD to optimize the mesa objective.
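The data-generating process described here can be sketched as follows. This is a hypothetical minimal setup; the exact dimensions, noise scale, and stability normalization in the paper may differ.

```python
import numpy as np

def sample_sequence(T=20, d=4, noise=0.01, rng=None):
    """One training sequence from its own linear dynamics
    s_{t+1} = W* s_t + noise, with a fresh W* per call so that
    nothing can be memorized across sequences."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.normal(size=(d, d))
    # Rescale to spectral radius 0.9 so trajectories stay bounded.
    Wstar = 0.9 * A / np.max(np.abs(np.linalg.eigvals(A)))
    s = [rng.normal(size=d)]
    for _ in range(T - 1):
        s.append(Wstar @ s[-1] + noise * rng.normal(size=d))
    return np.stack(s), Wstar
```

Because W* is resampled for every sequence, the only way for a model to predict well is to infer W* in context, which is precisely what makes mesa-optimization the natural solution.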
The researchers first train a deep Transformer on a token construction that aggregates adjacent sequence elements. Interestingly, this simple preprocessing yields a very sparse weight matrix (less than 1% of the weights are non-zero), allowing the learned algorithm to be reverse-engineered.
For a single-layer linear self-attention model, the weights correspond to one step of gradient descent. For deep Transformers, interpretability becomes harder, so the study relies on linear probing, examining whether hidden activations can predict autoregressive targets or preconditioned inputs. Interestingly, the predictability of both probe targets increases gradually with network depth. This finding suggests that preconditioned GD is hidden inside the model.
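The probing methodology can be illustrated with a small sketch. Note the hedge: the real experiments probe the hidden activations of a trained Transformer; here a random feature matrix stands in for those activations, purely to show the mechanics of fitting a linear probe and scoring it with R².

```python
import numpy as np

def probe_r2(H, targets):
    """Linear probe: least-squares fit of targets from activations H,
    scored by the coefficient of determination R^2 (1 = fully decodable)."""
    W, *_ = np.linalg.lstsq(H, targets, rcond=None)
    resid = targets - H @ W
    return 1.0 - np.sum(resid**2) / np.sum((targets - targets.mean(axis=0))**2)
```

Applied layer by layer, such probes trace how decodable the hypothesized quantities (targets and preconditioned inputs) become as depth increases.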
Figure 2: Reverse engineering of a trained linear self-attention layer.
The study found that the trained layer can be fitted perfectly when all degrees of freedom are used in the construction, including not only a learned learning rate η but also a set of learned initial weights W_0. Importantly, as shown in Figure 2, the learned one-step algorithm still performs far worse than a single mesa layer.
With a simple weight configuration, basic optimization readily finds that this layer can solve the task studied here optimally. This result demonstrates that hard-coded inductive biases benefit mesa optimization.
Armed with theoretical insights into the multi-layer case, the authors first analyze deep linear and softmax attention-only Transformers. The inputs are formatted according to a four-channel structure, which corresponds to the choice W_0 = 0.
As with the single-layer model, the authors see clear structure in the weights of the trained model. As a first reverse-engineering analysis, the study exploits this structure and builds an algorithm (RevAlg-d, where d denotes the number of layers) with 16 parameters per layer and head (instead of 3,200). The authors found that this compressed but convoluted expression describes the trained model; in particular, it allows almost lossless interpolation between the actual Transformer weights and the RevAlg-d weights.
While the RevAlg-d expression explains the trained multi-layer Transformer, it is hard to interpret directly as a mesa-optimization algorithm. The authors therefore employed linear regression probing (Alain & Bengio, 2017; Akyürek et al., 2023) to look for signatures of the hypothesized mesa-optimization algorithm.
On the deep linear self-attention Transformer shown in Figure 3, both probe targets can be linearly decoded, and decoding performance increases with sequence length and network depth. The study thus uncovers a base optimization algorithm that descends layer by layer on the original mesa-objective L_t(W) while improving the condition number of the mesa optimization problem, leading to a rapid decrease of L_t(W). Performance also improves markedly with depth.
The rapid descent of the autoregressive objective L_t(W) can therefore be attributed to this optimization, operating on better-preconditioned data.
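The layer-by-layer picture can be written down directly: treat each layer as one more gradient step on the mesa-objective L_t(W) = 0.5·||X Wᵀ − Y||²_F and watch the objective fall with depth. This is an illustrative stand-in, not the trained network; the conservative step size (below 2/λ_max of XᵀX) is an assumption that guarantees descent.

```python
import numpy as np

def mesa_descent(X, Y, depth=8):
    """Iterate `depth` gradient steps on L(W) = 0.5*||X W^T - Y||_F^2,
    one step per 'layer', recording the objective before each step."""
    W = np.zeros((Y.shape[1], X.shape[1]))
    eta = 1.0 / np.trace(X.T @ X)   # safely below 2/lambda_max, so descent is guaranteed
    losses = []
    for _ in range(depth):
        losses.append(0.5 * np.sum((X @ W.T - Y) ** 2))
        W = W - eta * (X @ W.T - Y).T @ X
    return W, losses
```

The monotone drop in `losses` with depth mirrors what the probes detect in the trained model; preconditioning (absent from this sketch) accelerates that drop by improving the condition number.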
Figure 3: Reverse engineering of multi-layer Transformers trained on constructed token inputs.
This shows that when the Transformer is trained on constructed tokens, it predicts via mesa optimization. Interestingly, when sequence elements are given directly, the Transformer constructs the tokens itself by grouping elements, which the research team calls "creating the mesa dataset."
Conclusion
The finding of this study is that when a Transformer is trained on sequence-prediction tasks under a standard autoregressive objective, gradient-based inference algorithms emerge in its forward pass. Recent multi-task and meta-learning results can therefore also be applied to the conventional self-supervised LLM training setting.
In addition, the study found that the learned autoregressive inference algorithms can be re-purposed, without retraining, to solve supervised in-context learning tasks, allowing the results to be interpreted within a unified framework.
So what does this have to do with in-context learning? According to the study, after a Transformer is trained on autoregressive sequence tasks, it acquires suitable mesa optimization and can therefore perform few-shot in-context learning without any fine-tuning.
The study hypothesizes that mesa optimization also exists in LLMs, improving their in-context learning abilities. Interestingly, it also observes that effective prompt adaptation for LLMs can lead to substantial improvements in in-context learning.
Interested readers can consult the original paper to learn more about the research.
The above is the detailed content of What is the source of Transformer's contextual learning capabilities?. For more information, please follow other related articles on the PHP Chinese website!

