Large models under self-reward: Llama2 optimizes itself through Meta learning, surpassing the performance of GPT-4

Is Artificial Intelligence Feedback (AIF) going to replace RLHF?


In the field of large models, fine-tuning is an important step in improving model performance. As the number of open-source large models grows, many fine-tuning methods have been summarized, some of which have achieved good results.

Recently, researchers from Meta and New York University used a "self-rewarding" method that lets a large model generate its own fine-tuning data, and the results have been striking.

In the new method, the authors fine-tuned Llama 2 70B over three iterations, and the resulting model outperformed a number of prominent existing large models on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4.
The paper therefore attracted attention just a few hours after it was posted on arXiv.

Although the code has not yet been open-sourced, the method is clearly described in the paper and should not be difficult to reproduce.


It is well known that tuning large language models (LLMs) with human preference data can greatly improve the instruction-following performance of pre-trained models. For the GPT series, OpenAI proposed the standard approach of reinforcement learning from human feedback (RLHF): a reward model is learned from human preferences, then frozen and used to train the LLM with reinforcement learning. This approach has been hugely successful.

A more recent idea is to avoid training a reward model entirely and to use human preferences to train the LLM directly, as in direct preference optimization (DPO). In both cases, tuning is bottlenecked by the size and quality of the human preference data, and in the case of RLHF it is also bottlenecked by the quality of the frozen reward model trained from that data.
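For reference, DPO optimizes the policy directly on preference pairs; the standard objective from the DPO paper (Rafailov et al., 2023), written here for context rather than taken from this article, is

L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_l) ∼ D} [ log σ( β · log(π_θ(y_w | x) / π_ref(y_w | x)) − β · log(π_θ(y_l | x) / π_ref(y_l | x)) ) ]

where y_w and y_l are the preferred and dispreferred responses to prompt x, π_ref is a frozen reference model, σ is the sigmoid function, and β controls how far the policy may move from the reference.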

In the new work from Meta, the authors propose a self-improving reward model that is not frozen but is continuously updated during LLM tuning, avoiding this bottleneck.

The key to this approach is to develop an agent with all the capabilities required during training, rather than splitting them into a separate reward model and language model, in the same way that pre-training and multi-task training on instruction-following tasks allow task transfer by training many tasks simultaneously.

The authors therefore introduce self-rewarding language models: agents that both act as instruction-following models, generating responses for given prompts, and can also generate and evaluate new instruction-following examples to add to their own training set.

The new approach trains these models with a framework similar to iterative DPO. Starting from a seed model, as shown in Figure 1, each iteration includes a self-instruction creation step in which the model generates candidate responses for newly created prompts and then assigns rewards to them itself. The latter is achieved via LLM-as-a-Judge prompting, which can itself be viewed as an instruction-following task. A preference dataset is built from the generated data, and the next iteration of the model is trained with DPO, as sketched below.
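As a rough illustration of this loop, the sketch below shows one possible shape for a single self-rewarding iteration. The callables (generate_prompts, generate_response, judge_score, dpo_train), their names, and the parameter values are illustrative assumptions, not the authors' actual code.

from typing import Callable, List, Tuple

# A minimal sketch of one self-rewarding iteration. All four callables are assumed
# to be backed by the *same* current model M_t; their names are hypothetical.
def self_rewarding_iteration(
    generate_prompts: Callable[[int], List[str]],            # model writes new prompts from seed examples
    generate_response: Callable[[str], str],                  # model answers a prompt
    judge_score: Callable[[str, str], float],                 # model scores (prompt, response), e.g. on a 0-5 rubric
    dpo_train: Callable[[List[Tuple[str, str, str]]], None],  # trains M_{t+1} on (prompt, chosen, rejected) triples
    num_prompts: int = 1000,
    candidates_per_prompt: int = 4,
) -> None:
    preference_pairs: List[Tuple[str, str, str]] = []

    for prompt in generate_prompts(num_prompts):
        # The same model generates several candidate responses per prompt...
        candidates = [generate_response(prompt) for _ in range(candidates_per_prompt)]
        # ...and then scores its own candidates via an LLM-as-a-Judge style prompt.
        scores = [judge_score(prompt, response) for response in candidates]

        best, worst = max(scores), min(scores)
        if best > worst:  # skip prompts where the judge cannot distinguish the candidates
            chosen = candidates[scores.index(best)]
            rejected = candidates[scores.index(worst)]
            preference_pairs.append((prompt, chosen, rejected))

    # Train the next iteration of the model with DPO on the self-generated preference pairs.
    dpo_train(preference_pairs)

In this sketch the chosen/rejected pair is formed from the highest- and lowest-scoring candidates, and prompts where all candidates receive the same score are simply skipped.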


  • Paper title: Self-Rewarding Language Models

  • Paper link: https://arxiv.org/abs/2401.10020

Self-rewarding language models

The method proposed by the authors first assumes access to a base pre-trained language model and a small amount of human-annotated seed data. The goal is then to build a model with two skills at once:

1. Instruction following: given a prompt describing a user request, the ability to generate a high-quality, helpful (and harmless) response.

2. Self-instruction creation: the ability to generate and evaluate new instruction-following examples to add to its own training set.

These skills are used to enable the model to perform self-alignment, i.e., they are the components used to iteratively train itself using Artificial Intelligence Feedback (AIF).

Self-instruction creation involves generating candidate responses and then letting the model itself judge their quality; that is, it acts as its own reward model, replacing the need for an external one. This is implemented via the LLM-as-a-Judge mechanism [Zheng et al., 2023b], i.e., by formulating response evaluation as an instruction-following task. The self-created AIF preference data is then used as the training set.
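To make the LLM-as-a-Judge step concrete, below is a hedged sketch of what such an evaluation prompt and score extraction could look like. The rubric wording, the 0-5 scale phrasing, and the parsing regex are illustrative assumptions rather than the paper's exact prompt.

import re
from typing import Optional

# Illustrative judge template in the spirit of LLM-as-a-Judge (not the paper's exact wording):
# the same model is asked to grade a candidate response on a small additive rubric.
JUDGE_TEMPLATE = """Review the user's question and the candidate response below.
Award up to 5 points in total for relevance, coverage, helpfulness, clarity, and expert quality.
End your review with the line: "Score: <total points>".

User: {prompt}

Response: {response}
"""

def extract_score(judge_output: str) -> Optional[float]:
    """Parse the numeric score from the judge model's free-form critique."""
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judge_output)
    return float(match.group(1)) if match else None

# Example: extract_score("Relevant and clear. Score: 4") returns 4.0.

Because judging is expressed as just another instruction, the same weights that improve at following instructions can also improve at judging, which is exactly the coupling the self-rewarding loop relies on.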

During the fine-tuning process, the same model is therefore used in both roles: as a "learner" and as a "judge". Building on this emerging judge role, the model can further improve its performance across fine-tuning iterations.

The overall self-alignment process is iterative, building a series of models, each an improvement over the last. Importantly, because the model can both improve its generative ability and use that same generative mechanism as its own reward model, the reward model itself improves across iterations, in contrast to standard approaches where the reward model is fixed.

The researchers argue that this raises the ceiling on how much such models can improve themselves in the future and removes a restrictive bottleneck.

Figure 1 shows an overview of the method.


Experiment


In the experiments, the researchers used Llama 2 70B as the base pre-trained model. They found that self-rewarding LLM alignment not only improved instruction-following performance but also improved reward-modeling ability compared to the baseline seed model.

This means that during iterative training, the model can provide itself with a higher-quality preference dataset at each iteration than at the previous one. While this effect is likely to saturate in practice, it raises the intriguing possibility that the resulting reward model (and thus the LLM) can be better than one trained solely on the original human-written seed data.

In terms of instruction-following ability, the experimental results are shown in Figure 3.

The researchers also evaluated the self-rewarding models on the AlpacaEval 2 leaderboard; the results are shown in Table 1. They observed the same trend as in the head-to-head evaluation: the win rate against GPT-4 Turbo rose with each training iteration, from 9.94% in iteration 1, to 15.38% in iteration 2, to 20.44% in iteration 3. The iteration 3 model also outperforms many existing models, including Claude 2, Gemini Pro, and GPT-4 0613.

The reward-modeling evaluation results are shown in Table 2. The conclusions include:


  • EFT (Evaluation Fine-Tuning data) improves on the SFT baseline: using IFT (Instruction Fine-Tuning data) + EFT improved all five measurements compared to IFT alone. For example, pairwise accuracy agreement with humans rose from 65.1% to 78.7% (see the sketch below for how pairwise accuracy is computed).
  • Reward-modeling ability improves through self-training. After a round of self-rewarding training, the model becomes better at providing rewards to itself for the next iteration, and its instruction-following ability also improves.

  • The LLM-as-a-Judge prompt matters. The researchers tried various prompt formats and found that the LLM-as-a-Judge prompt achieved higher pairwise accuracy when used with the SFT baseline.
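For clarity, pairwise accuracy here measures how often the model's ranking of a response pair agrees with the human label; a minimal sketch of that computation, with an assumed data shape, is:

from typing import List, Tuple

def pairwise_accuracy(scored_pairs: List[Tuple[float, float]]) -> float:
    """scored_pairs holds (model_score_for_human_preferred, model_score_for_human_rejected)
    for each human-labeled pair; accuracy is the fraction the model ranks the same way."""
    if not scored_pairs:
        return 0.0
    agreements = sum(1 for preferred, rejected in scored_pairs if preferred > rejected)
    return agreements / len(scored_pairs)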

The authors conclude that the self-rewarding training method not only improves the model's instruction-following ability but also improves its reward-modeling ability across iterations.

Although this is only a preliminary study, it appears to be an exciting direction: such models can assign rewards better in future iterations, improving instruction following and creating a virtuous cycle.

The method also opens the door to more sophisticated judging approaches. For example, a large model could verify the accuracy of its answers by searching a database, producing more accurate and reliable output.

Reference: https://www.reddit.com/r/MachineLearning/comments/19atnu0/r_selfrewarding_language_models_meta_2024/
