Significantly improving GPT-4/Llama2 performance without RLHF: Peking University team proposes Aligner, a new alignment paradigm

PHPz

Feb 07, 2024 pm 10:06 PM

Background

Large language models (LLMs) have demonstrated powerful capabilities, but they can also produce unpredictable and harmful output, such as offensive responses, false information, and leaked private data, causing harm to users and society. Ensuring that the behavior of these models aligns with human intentions and values is an urgent challenge.

Although reinforcement learning from human feedback (RLHF) offers a solution, it faces multiple challenges: a complex training architecture, high sensitivity to hyperparameters, and reward models that are unstable across data sets. These factors make RLHF difficult to implement, to make effective, and to reproduce. To overcome these challenges, the Peking University team proposed a new efficient alignment paradigm, Aligner, whose core is to learn the correction residual between aligned and unaligned answers, thereby bypassing the cumbersome RLHF process. Drawing on the ideas of residual learning and scalable oversight, Aligner simplifies the alignment process. It uses a Seq2Seq model to learn implicit residuals and optimizes alignment through copy and residual-correction steps.

Unlike RLHF, which requires training multiple models, Aligner achieves alignment simply by adding a module after the model to be aligned. Furthermore, the computational resources required depend primarily on the desired alignment effect rather than on the size of the upstream model. Experiments show that Aligner-7B significantly improves the helpfulness and safety of GPT-4, increasing helpfulness by 17.5% and safety by 26.9%. These results show that Aligner is an efficient and effective alignment method, providing a feasible way to improve model performance.

In addition, using the Aligner framework, the author enhances the performance of a strong model (Llama2-70B) through the supervision signal of a weak model (Aligner-13B), achieving weak-to-strong generalization and providing a practical solution for superalignment.

Paper address: https://arxiv.org/abs/2402.02416

Project homepage & open source address: https://aligner2024.github.io

Title: Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction

What is Aligner?

Aligner is based on one core insight: correcting an unaligned answer is easier than generating an aligned answer.

As an efficient alignment method, Aligner has the following excellent features:

As an autoregressive Seq2Seq model, Aligner is trained on the Query-Answer-Correction (Q-A-C) data set to learn the difference between aligned and unaligned answers, thereby achieving more accurate model alignment. For example, when aligning a 70B LLM, Aligner-7B massively reduces the number of trained parameters: 16.67 times fewer than DPO and 30.7 times fewer than RLHF.

The Aligner paradigm realizes weak-to-strong generalization. It uses the supervision signal of an Aligner model with a small number of parameters to fine-tune LLMs with a much larger number of parameters, which significantly improves the performance of the strong model. For example, fine-tuning Llama2-70B under Aligner-13B supervision improved its helpfulness and safety by 8.2% and 61.6%, respectively.

Because Aligner is plug-and-play and insensitive to model parameters, it can align models whose parameters are not accessible, such as GPT-3.5, GPT-4, and Claude 2. With just one training session, Aligner-7B aligns and improves the helpfulness and safety of 11 models, including closed-source, open-source, safety-aligned, and non-safety-aligned models. Among them, Aligner-7B improves the helpfulness and safety of GPT-4 by 17.5% and 26.9% respectively.

Aligner overall performance

The author shows that Aligners of various sizes (7B, 13B, 70B) improve performance for both API-based models and open-source models (including safety-aligned and non-safety-aligned ones). In general, as the Aligner becomes larger, its performance gradually improves and the density of information it can provide during correction gradually increases, which also makes the corrected answers safer and more helpful.

How to train an Aligner model?

1. Query-Answer (Q-A) data collection

The author obtains queries from various open-source data sets, including Stanford Alpaca, conversations shared on ShareGPT, HH-RLHF, and others. These questions go through deduplication and quality filtering before being used to generate answers and corrected answers. Uncorrected answers were generated using various open-source models such as Alpaca-7B, Vicuna-(7B,13B,33B), Llama2-(7B,13B)-Chat, and Alpaca2-(7B,13B).
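The deduplication and quality-filtering step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the normalization rule, the length thresholds, and the function name `clean_queries` are all assumptions, since the article does not specify the actual filters.

```python
import re

def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so near-identical queries compare equal."""
    return re.sub(r"\s+", " ", query.strip().lower())

def clean_queries(queries, min_chars=5, max_chars=2000):
    """Drop duplicates and queries outside a plausible length range.

    min_chars and max_chars are hypothetical thresholds, not values from the paper.
    """
    seen = set()
    kept = []
    for q in queries:
        key = normalize(q)
        if not (min_chars <= len(key) <= max_chars):
            continue  # quality filter: too short or too long
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append(q.strip())
    return kept

raw = [
    "How do I bake bread?",
    "how do I  bake bread? ",
    "Hi",
    "Explain residual learning.",
]
cleaned = clean_queries(raw)
print(cleaned)
```

Here the second query is dropped as a duplicate of the first and "Hi" is dropped by the length filter. A real pipeline would likely use stronger near-duplicate detection and a learned or rule-based quality score.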

2. Answer correction

The author uses GPT-4, Llama2-70B-Chat, and human annotation to correct the answers in the Q-A data set according to the 3H criteria for large language models (helpfulness, safety, honesty).

Answers that already meet the criteria are left as is. The correction process is based on a set of well-defined principles that establish constraints for the training of Seq2Seq models, with a focus on making answers more helpful and safer. The distribution of answers changes significantly before and after the correction. The following figure shows the impact of the correction on the data set:

[Figure: distribution of answers in the data set before and after correction]

3. Model training

Based on the above process, the author constructed a new correction data set $\mathcal{M} = \{(x^{(i)}, y_o^{(i)}, y_c^{(i)})\}_{i=1}^{N}$, where $x$ represents the user's question, $y_o$ is the original answer to the question, and $y_c$ is the corrected answer based on established principles.

The model training process is relatively simple. The authors train a conditional Seq2Seq model $\mu_\phi(y_c \mid y_o, x)$, parameterized by $\phi$, such that the original answers $y_o$ are redistributed to the aligned answers $y_c$.

The aligned answer generation process based on the upstream large language model $\pi_\theta$ is:

$$\pi'(y_c \mid x) = \mu_\phi(y_c \mid y_o, x)\, \pi_\theta(y_o \mid x)$$

The training loss is as follows:

$$-\mathbb{E}_{\mathcal{M}}\left[\log \pi'(y_c \mid x)\right] = -\mathbb{E}_{\mathcal{M}}\left[\log \mu_\phi(y_c \mid y_o, x)\right] - \mathbb{E}_{\mathcal{M}}\left[\log \pi_\theta(y_o \mid x)\right]$$

The second term does not depend on the Aligner parameters $\phi$, so the training objective of Aligner can be derived as:

$$\min_{\phi} \mathcal{L}_{\text{Aligner}}(\phi, \mathcal{M}) = -\mathbb{E}_{\mathcal{M}}\left[\log \mu_\phi(y_c \mid y_o, x)\right]$$
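In practice, an objective of this kind is a negative log-likelihood of the corrected answer, conditioned on the question and the original answer. The sketch below shows one common way to compute it in plain Python: the loss is averaged over correction tokens only, while question and original-answer tokens act as context. The masking convention and the names `aligner_nll` and `is_correction` are assumptions for illustration; the article does not describe the implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def aligner_nll(step_logits, target_ids, is_correction):
    """Average negative log-likelihood over correction tokens only.

    step_logits   : one list of vocabulary logits per sequence position
    target_ids    : the token id the model should predict at each position
    is_correction : True where the target token belongs to the corrected answer;
                    question and original-answer tokens are conditioning context
                    and contribute no loss (an assumption of this sketch)
    """
    total, count = 0.0, 0
    for logits, target, in_corr in zip(step_logits, target_ids, is_correction):
        if not in_corr:
            continue
        total -= log_softmax(logits)[target]
        count += 1
    return total / max(count, 1)

# Toy example with a 3-token vocabulary:
# two context positions followed by two correction positions.
logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 0, 1, 2]
mask = [False, False, True, True]
loss = aligner_nll(logits, targets, mask)
print(loss)
```

With uniform logits on the two correction positions, the loss equals log 3, the entropy of a uniform guess over three tokens. The context positions are ignored regardless of their logits.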

The following figure dynamically shows the intermediate process of Aligner:

[Figure: animation of the intermediate process of Aligner]

It is worth noting that neither the training nor the inference stage of Aligner requires access to the parameters of the upstream model. At inference time, Aligner only needs the user's question and the initial answer generated by the upstream large language model, and then generates an answer that is more consistent with human values.

Correcting existing answers rather than answering directly allows Aligner to align with human values easily, which significantly reduces the demands on model capability.
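Because only the question and the draft answer are needed, the inference pipeline can be sketched with both models treated as black-box text functions. The prompt template, the function names, and the toy stand-in models below are hypothetical and exist only to make the data flow concrete; the real Aligner input format may differ.

```python
from typing import Callable

# Hypothetical prompt template; the real Aligner input format may differ.
ALIGNER_TEMPLATE = (
    "Edit the following Question-Answer pair to make it more helpful and harmless.\n"
    "Question: {question}\nAnswer: {answer}\nCorrection:"
)

def aligned_answer(
    question: str,
    upstream: Callable[[str], str],
    aligner: Callable[[str], str],
) -> str:
    """Run the upstream model, then let the Aligner rewrite its answer.

    Both models are plain text-in/text-out callables, so the upstream model
    may be an API whose weights are unavailable.
    """
    draft = upstream(question)
    prompt = ALIGNER_TEMPLATE.format(question=question, answer=draft)
    return aligner(prompt)

# Stand-in models for illustration only.
def toy_upstream(question: str) -> str:
    return "Sure, here is a rough answer."

def toy_aligner(prompt: str) -> str:
    draft = prompt.split("Answer: ", 1)[1].split("\nCorrection:", 1)[0]
    return draft + " (revised to be safer and more helpful)"

result = aligned_answer("How do I stay safe online?", toy_upstream, toy_aligner)
print(result)
```

Swapping `toy_upstream` for a call to any hosted or local model leaves the rest of the pipeline unchanged, which is the plug-and-play property described above.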

Aligner vs existing alignment paradigm

Aligner vs SFT

In contrast to Aligner, SFT directly creates a cross-domain mapping from the query semantic space to the answer semantic space. Learning this mapping relies on the upstream model to infer and simulate various contexts in the semantic space, which is much more difficult than learning a correction signal.

The Aligner training paradigm can be considered a form of residual learning (residual correction). The author created a "copy and correct" learning paradigm in Aligner. Thus, Aligner essentially creates a residual mapping from the answer semantic space to the corrected-answer semantic space, where the two semantic spaces are distributionally closer.

To this end, the author constructed Q-A-A data in different proportions from the Q-A-C training data set and trained Aligner to learn the identity mapping (also called the copy mapping); this is called the warm-up step. On this basis, the entire Q-A-C training data set is then used for training. The residual learning paradigm is also used in ResNet to address the vanishing-gradient problem caused by stacking very deep neural networks. Experimental results show that the model achieves the best performance when the warm-up ratio is 20%.
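The two-stage schedule can be illustrated with a short sketch. It assumes the warm-up ratio means the fraction of Q-A-C triples that are converted into Q-A-A identity examples; the function names and the sampling procedure are hypothetical.

```python
import random

def build_warmup_set(qac, ratio=0.2, seed=0):
    """Sample a fraction of Q-A-C triples and turn them into Q-A-A identity triples.

    qac   : list of (query, answer, correction) tuples
    ratio : fraction used for warm-up; 0.2 mirrors the 20% reported in the text
    """
    rng = random.Random(seed)
    k = int(len(qac) * ratio)
    sampled = rng.sample(qac, k)
    # Identity mapping: the training target is the original answer itself.
    return [(q, a, a) for q, a, _ in sampled]

def training_schedule(qac, ratio=0.2, seed=0):
    """Stage 1: copy the answer (warm-up). Stage 2: train on the full Q-A-C set."""
    return [
        ("warmup", build_warmup_set(qac, ratio, seed)),
        ("correct", list(qac)),
    ]

data = [(f"q{i}", f"a{i}", f"c{i}") for i in range(10)]
stages = training_schedule(data)
for name, examples in stages:
    print(name, len(examples))
```

The warm-up stage teaches the model to reproduce its input answer, so the second stage only has to learn the difference between the answer and its correction.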

Aligner vs RLHF

RLHF trains a reward model (RM) on a human preference data set and uses this reward model to fine-tune the LLM with the PPO algorithm, so that the LLM's behavior is consistent with human preferences.

Specifically, the reward model needs to map human preference data from a discrete space to a continuous numerical space for optimization. Compared with a Seq2Seq model, which generalizes strongly in text space, this kind of numerical reward model generalizes weakly in text space, which makes the effect of RLHF unstable across different models.

Aligner learns the difference (residual) between aligned and unaligned answers by training a Seq2Seq model, thereby avoiding the RLHF process and achieving performance that generalizes better than RLHF.

Aligner vs. Prompt Engineering

Prompt engineering is a common method for eliciting the capabilities of LLMs. However, it has some key problems: prompts are difficult to design, and different models need different designs. The final effect depends on the capabilities of the model; when they are not enough to solve the task, multiple iterations may be required, which wastes the context window. The limited context window of small models constrains the effect of prompt engineering, and for large models, occupying a long context greatly increases the cost of training.

Aligner itself can support the alignment of any model. After one training run, it can align 11 different types of models without occupying the context window of the original model. It is worth noting that Aligner can be seamlessly combined with existing prompt engineering methods to achieve a 1+1>2 effect.

In summary, Aligner shows the following significant advantages:

1. Aligner training is simpler. Compared with RLHF's complex reward model learning and the reinforcement learning (RL) fine-tuning process based on that model, Aligner's implementation is more direct and easier to operate. Given the many engineering tuning details involved in RLHF and the inherent instability and hyperparameter sensitivity of RL algorithms, Aligner greatly reduces engineering complexity.

2. Aligner needs less training data and still has a clear alignment effect. An Aligner-7B model trained on 20K data can improve the helpfulness of GPT-4 by 12% and its safety by 26%, and improve the helpfulness of the Vicuna-33B model by 29% and its safety by 45.3%, while RLHF requires more preference data and careful parameter tuning to achieve this effect.

3. Aligner does not need to touch the model weights. While RLHF has proven effective in model alignment, it relies on direct training of the model. Its applicability is limited for non-open-source, API-based models such as GPT-4 and for their fine-tuning requirements in downstream tasks. In contrast, Aligner does not require direct manipulation of the original parameters of the model and achieves flexible alignment by externalizing the alignment requirements in an independent alignment module.

4. Aligner is insensitive to model type. Under the RLHF framework, fine-tuning different models (such as Llama2 or Alpaca) requires not only re-collecting preference data but also adjusting training parameters in the reward model training and RL phases. Aligner can support the alignment of any model through one-time training. For example, trained only once on a correction data set, Aligner-7B can align 11 different models (including open-source models and API models such as GPT) and improve helpfulness and safety by 21.9% and 23.8% respectively.

5. Aligner's demand for training resources is more flexible. Fine-tuning a 70B model with RLHF is still extremely computationally demanding, requiring hundreds of GPU cards, because RLHF also needs to load a reward model, an actor model, and a critic model, each comparable in parameter count to the model being trained. Therefore, in terms of training resource consumption per unit time, RLHF actually requires more computing resources than pre-training.

In comparison, Aligner provides a more flexible training strategy, allowing users to flexibly choose the training scale of Aligner based on their actual computing resources. For example, for the alignment requirement of a 70B model, users can choose Aligner models of different sizes (7B, 13B, 70B, etc.) based on the actual available resources to achieve effective alignment of the target model.

This flexibility not only reduces the absolute demand for computing resources, but also provides users with the possibility of efficient alignment under limited resources.

Weak-to-strong Generalization

Weak-to-strong generalization asks whether the labels produced by a weak model can be used to train a strong model so that the strong model's performance improves. OpenAI uses this as an analogy for the superalignment problem. Specifically, they use ground-truth labels to train weak models.

OpenAI researchers conducted some preliminary experiments. For example, in a text classification task, the training data set was divided into two parts. The inputs and ground-truth labels of the first half are used to train the weak model, while the second half keeps only the inputs, with labels produced by the weak model. When training the strong model, only these weak labels provide the supervision signal.

The purpose of training a weak model with ground-truth labels is to give the weak model the ability to solve the corresponding task, while the inputs used to generate weak labels are different from the inputs used to train the weak model. This paradigm is similar to the concept of "teaching", that is, using weak models to guide strong models.
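The data split described above can be made concrete with a toy example. The threshold classifier below merely stands in for a weak model, and the numeric data set is invented for illustration; neither comes from the OpenAI experiments.

```python
def train_threshold_model(examples):
    """Toy 'weak model': predict 1 if the input exceeds the midpoint
    between the two class means seen in training."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > threshold)

def weak_to_strong_split(dataset):
    """Train the weak model on the first half with ground truth,
    then label the second half with the weak model only."""
    half = len(dataset) // 2
    first, second = dataset[:half], dataset[half:]
    weak_model = train_threshold_model(first)
    # Ground-truth labels of the second half are discarded on purpose.
    weak_labeled = [(x, weak_model(x)) for x, _ in second]
    return weak_model, weak_labeled

dataset = [(1, 0), (9, 1), (2, 0), (8, 1), (3, 0), (7, 1), (4, 0), (6, 1)]
weak_model, strong_train_set = weak_to_strong_split(dataset)
print(strong_train_set)
```

The strong model would then be trained on `strong_train_set`, whose labels come only from the weak model, on inputs the weak model never saw during its own training.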

The author proposes a novel weak-to-strong generalization paradigm based on the properties of Aligner.

The author's core idea is to let Aligner act as a "supervisor standing on the shoulders of giants." Unlike OpenAI's method of directly supervising the "giant", Aligner refines stronger models through weak-to-strong correction, providing more accurate labels in the process.

Specifically, during Aligner's training process, the correction data contains annotations from GPT-4, human annotators, and larger models. Subsequently, the author uses Aligner to generate weak labels (i.e. corrections) on a new Q-A data set, and then uses the weak labels to fine-tune the original model.

Experimental results show that this paradigm can further improve the alignment performance of the model.

Experimental results

Aligner vs SFT/RLHF/DPO

The author uses Aligner's Query-Answer-Correction training data set to fine-tune Alpaca-7B with the SFT, RLHF, and DPO methods respectively.

For performance evaluation, the open-source BeaverTails and HarmfulQA test prompt data sets are used. The answers generated by the fine-tuned models are compared, in terms of helpfulness and safety, with the answers obtained by using Aligner to correct the original Alpaca-7B model's answers. The results are as follows:

[Figure: helpfulness and safety comparison of Aligner with SFT/RLHF/DPO]

Experimental results show that Aligner has clear advantages over mature LLM alignment paradigms such as SFT, RLHF, and DPO, and is significantly ahead on both helpfulness and safety.

Analysis of specific experimental cases shows that models fine-tuned with the RLHF/DPO paradigm may tend to produce conservative answers in order to improve safety, yet fail to take safety into account while improving helpfulness, leading to an increase in dangerous information in their answers.

Aligner vs Prompt Engineering

Comparing the performance improvement of Aligner-13B and the CAI/Self-Critique methods on the same upstream model, the experimental results are shown in the figure below: Aligner-13B improves GPT-4 more than CAI/Self-Critique in both helpfulness and safety, which shows that the Aligner paradigm has clear advantages over commonly used prompt engineering methods.

It is worth noting that in the experiment, CAI prompts are used only at inference time to encourage the model to revise its own answers, which is also a form of Self-Refine.

[Figure: Aligner-13B compared with CAI/Self-Critique on the same upstream models]

In addition, the authors conducted further exploration. They used Aligner to correct the answers produced with the CAI method and directly compared the answers before and after Aligner's correction; the experimental results are shown in the figure below.

[Figure: direct comparison of answers before and after Aligner correction. Method A: CAI + Aligner; Method B: CAI only]

After Aligner applies a second revision to the CAI-revised answer, the answer improves significantly in helpfulness without losing safety. This shows that Aligner is not only highly competitive when used alone, but can also be combined with other existing alignment methods to further improve their performance.

Weak-to-strong Generalization

[Table: weak-to-strong generalization results]

Method: the weak-to-strong training data set consists of (q, a, a′) triples, where q represents a question from the Aligner training data set (50K), a represents the answer generated by the Alpaca-7B model, and a′ represents the aligned answer that Aligner-7B gives for (q, a). SFT uses only a′ as the ground-truth label, whereas in RLHF and DPO training, a′ is considered better than a.

The author used Aligner to correct the original answers on the new Q-A data set, used the corrected answers as weak labels, and used these weak labels as supervision signals to train a larger model. This process is similar to OpenAI's training paradigm.
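How the same triples feed the different training methods can be sketched as follows. The `Triple` class, the dictionary field names, and the choice to skip unchanged answers when building preference pairs are assumptions of this illustration, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    q: str        # question
    a: str        # answer from the upstream model
    a_prime: str  # Aligner's correction of (q, a), used as the weak label

def to_sft_examples(triples):
    """SFT uses only a' as the target for q."""
    return [{"prompt": t.q, "target": t.a_prime} for t in triples]

def to_preference_pairs(triples):
    """RLHF and DPO treat a' as preferred over a."""
    return [
        {"prompt": t.q, "chosen": t.a_prime, "rejected": t.a}
        for t in triples
        if t.a_prime != t.a  # skipping unchanged answers is an assumption of this sketch
    ]

triples = [
    Triple("q1", "draft answer", "corrected answer"),
    Triple("q2", "already fine", "already fine"),
]
sft = to_sft_examples(triples)
pairs = to_preference_pairs(triples)
print(len(sft), len(pairs))
```

The SFT view keeps both triples, while the preference view keeps only the one where the correction differs from the original answer, since an identical pair carries no preference signal.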

The author trains strong models based on weak labels through three methods: SFT, RLHF and DPO. The experimental results in the table above show that when the upstream model is fine-tuned through SFT, the weak labels of Aligner-7B and Aligner-13B improve the performance of the Llama2 series of strong models in all scenarios.

Outlook: Potential research directions of Aligner

As an innovative alignment method, Aligner has huge research potential. In the paper, the author proposed several Aligner application scenarios, including:

1. Application to multi-turn dialogue scenarios. In multi-turn conversations, the challenge of sparse rewards is particularly prominent. In question-and-answer (QA) conversations, supervision signals in scalar form are typically only available at the end of the conversation.

This sparsity problem is further amplified in multi-turn dialogue (such as continuous QA scenarios), making it difficult for reinforcement learning from human feedback (RLHF) to be effective. Investigating Aligner's potential to improve alignment in multi-turn dialogue is an area worthy of further exploration.

2. Aligning the reward model with human values. In the multi-stage process of building reward models based on human preferences and fine-tuning large language models (LLMs), it is a huge challenge to ensure that LLMs are aligned with specific human values (e.g. fairness, empathy).

Handing the value alignment task to an Aligner module outside the model, and training Aligner on a specific corpus, not only provides new ideas for value alignment but also enables Aligner to correct the output of the upstream model so that it reflects specific values.

3. Streaming and parallel processing of MoE-Aligner. By specializing and integrating Aligners, one can create a more powerful and comprehensive mixture-of-experts (MoE) Aligner that meets multiple mixed safety and value alignment needs. At the same time, further improving Aligner's parallel processing capabilities to reduce inference-time overhead is a feasible development direction.

4. Fusion during model training. By integrating an Aligner layer after a specific weight layer, real-time intervention in the output during model training can be achieved. This method not only improves alignment efficiency, but also helps optimize the model training process and achieve more efficient model alignment.

Team Introduction

This work was independently completed by Yang Yaodong's research team at the AI Security and Governance Center of the Institute for Artificial Intelligence, Peking University. The team works extensively on alignment techniques for large language models. Its work includes the open-source million-scale safety alignment preference data set BeaverTails (NeurIPS 2023) and the safety alignment algorithm SafeRLHF (ICLR 2024 Spotlight) for large language models, and these technologies have been adopted by multiple open-source models. The team also wrote the industry's first comprehensive survey of AI alignment, accompanied by the resource website www.alignmentsurvey.com, which systematically discusses the AI alignment problem from four perspectives: Learning from Feedback, Learning under Distribution Shift, Assurance, and Governance. The team's views on alignment and superalignment were featured on the cover of issue 5, 2024 of Sanlian Life Weekly.

Statement

This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.
