


Is the 'RL' in RLHF required? Some people use binary cross entropy to directly fine-tune LLM, and the effect is better.
Recently, unsupervised language models trained on large datasets have achieved surprising capabilities. However, these models are trained on data generated by humans with a variety of goals, priorities, and skill sets, some of which are not necessarily expected to be imitated.
Selecting a model’s desired responses and behaviors from its very broad knowledge and capabilities is critical to building safe, high-performance, and controllable AI systems. Many existing methods instill desired behaviors into language models by using carefully curated human preference sets that represent the types of behaviors that humans consider safe and beneficial. This preference learning stage occurs on large textual data sets. After an initial phase of large-scale unsupervised pre-training.
While the most straightforward preference learning method is supervised fine-tuning of high-quality responses demonstrated by humans, a relatively popular class of methods recently is from human (or artificial intelligence) feedback. Perform reinforcement learning (RLHF/RLAIF). The RLHF method matches a reward model to a dataset of human preferences and then uses RL to optimize a language model policy to produce responses that assign high rewards without excessively deviating from the original model.
While RLHF produces models with impressive conversational and coding capabilities, the RLHF pipeline is much more complex than supervised learning, involving training multiple language models and looping through training Sampling from language model policies incurs a large computational cost.
And a recent study shows that:The RL-based objective used by existing methods can be accurately optimized with a simple binary cross-entropy objective, thus greatly improving the Simplified preference learning pipeline. That is, it is entirely possible to directly optimize language models to adhere to human preferences without the need for explicit reward models or reinforcement learning.
Paper link: https://arxiv.org/pdf/2305.18290 .pdf
Researchers from Stanford University and other institutions proposed Direct Preference Optimization (DPO). This algorithm implicitly optimizes the existing RLHF algorithm. Same goal (reward maximization with KL - divergence constraints), but simple to implement and straightforward to train.
Experiments show that DPO is at least as effective as existing methods, including those based on RLHF of PPO.
DPO algorithm
Like existing algorithms, DPO also relies on theoretical preference models (such as the Bradley-Terry model) to measure a given How well the reward function fits empirical preference data. However, existing methods use a preference model to define a preference loss to train a reward model and then train a policy that optimizes the learned reward model, whereas DPO uses changes in variables to directly define the preference loss as a function of the policy. Given the human preference dataset for model responses, DPO can therefore optimize the policy using a simple binary cross-entropy objective without the need to explicitly learn a reward function or sample from the policy during training.
The DPO update increases the relative log probability of preferred responses versus non-preferred responses, but it includes a dynamic, per-sample importance weight to prevent model degradation, The researchers found that this degradation occurs for a naive probability ratio target.
In order to understand DPO mechanistically, it is useful to analyze the gradient of the loss function . The gradient with respect to parameter θ can be written as:
where is the reward implicitly defined by the language model
and the reference model
. Intuitively, the gradient of the loss function
increases the likelihood of the preferred completion y_w and decreases the likelihood of the non-preferred completion y_l.
Importantly, the weight of these samples is determined by the implicit reward modelThe evaluation of disliked completion is determined by β is the scale, that is, how incorrect the implicit reward model is in ranking completion, which is also a reflection of the KL constraint strength. Experiments demonstrate the importance of this weighting, as a naive version of this method without weighting coefficients leads to a degradation of the language model (Appendix Table 2).
In Chapter 5 of the paper, the researcher further explained the DPO method, provided theoretical support, and compared the advantages of DPO with the Actor-Critic algorithm for RLHF ( Such as PPO) issues. Specific details can be found in the original paper.
Experiment
In the experiment, the researchers evaluated the ability of DPO to train policies directly based on preferences.
First, in a well-controlled text generation environment, they considered the question: Compared with common preference learning algorithms such as PPO, DPO trades off reward maximization in the reference policy How efficient is KL-divergence minimization? We then evaluated DPO's performance on larger models and more difficult RLHF tasks, including summarization and dialogue.
It was ultimately found that with almost no hyperparameter tuning, DPO often performed as well as, or even better than, powerful baselines such as RLHF with PPO, while learning rewards The function returns the best N sampling trajectory results.
In terms of tasks, the researchers explored three different open-ended text generation tasks. In all experiments, the algorithm learns policies from the preference dataset .
In controlled emotion generation, x is the prefix of a movie review from the IMDb dataset and the policy must generate y with positive emotion. For comparative evaluation, the experiment uses a pre-trained sentiment classifier to generate preference pairs, where .
#For SFT, the researchers fine-tuned GPT-2-large until it converged on the comments on the training split of the IMDB dataset. In summary, x is a forum post from Reddit, and the strategy must generate a summary of the key points in the post. Building on previous work, experiments use the Reddit TL;DR summary dataset and human preferences collected by Stiennon et al. The experiments also used an SFT model fine-tuned based on human-written forum article summaries 2 and RLHF’s TRLX framework. The human preference dataset is a sample collected from a different but similarly trained SFT model by Stiennon et al.
Finally, in a single-turn conversation, x is a human question that can be anything from astrophysics to relationship advice. A policy must provide an engaging and helpful response to the user's query; the policy must provide an interesting and helpful response to the user's query; the experiment uses the Anthropic Helpful and Harmless conversation set, which contains between human and automated assistants of 170k conversations. Each text ends with a pair of responses generated by a large (albeit unknown) language model and a preference label representing the human-preferred response. In this case, no pretrained SFT model is available. Therefore, experiments fine-tune off-the-shelf language models only on preferred completions to form SFT models.
The researchers used two assessment methods. To analyze the efficiency of each algorithm in optimizing the constrained reward maximization goal, experiments evaluate each algorithm by its bounds on achieving rewards and KL-divergence from a reference strategy in a controlled emotion generation environment. Experiments can use ground-truth reward functions (sentiment classifiers), so this bound can be calculated. But in fact, the ground truth reward function is unknown. We therefore evaluate the algorithm's win rate by the win rate of the baseline strategy, and use GPT-4 as a proxy for human assessment of summary quality and response usefulness in summarization and single-round dialogue settings. For abstracts, the experiment uses the reference abstract in the test machine as the limit; for dialogue, the preferred response in the test data set is selected as the baseline. While existing research suggests that language models can be better automatic evaluators than existing metrics, the researchers conducted a human study that demonstrated the feasibility of using GPT-4 for evaluation. GPT-4 judged strongly with humans. The correlation between humans and GPT-4 is generally similar to or higher than the agreement between human annotators.
In addition to DPO, the researchers also evaluated several existing training language models to maintain parity with human preferences consistent. At its simplest, the experiments explore GPT-J’s zero-shot prompts on the summary task and Pythia-2.8B’s 2-shot prompts on the dialogue task. Additionally, experiments evaluate the SFT model and Preferred-FT. Preferred-FT is a model fine-tuned via supervised learning on completions y_w selected from SFT models (controlled sentiment and summarization) or general language models (single-turn dialogue). Another pseudo-supervised method is Unlikelihood, which simply optimizes the policy to maximize the probability assigned to y_w and minimize the probability assigned to y_l. The experiment uses an optional coefficient α∈[0,1] on “Unlikehood”. They also considered PPO, using a reward function learned from preference data, and PPO-GT. PPO-GT is an oracle learned from ground truth reward functions available in controlled emotion settings. In the emotion experiments, the team used two implementations of PPO-GT, an off-the-shelf version, and a modified version. The latter normalizes the rewards and further tunes the hyperparameters to improve performance (the experiments also used these modifications when running "Normal" PPO with learning rewards). Finally, we consider the best of N baselines, sample N responses from the SFT model (or Preferred-FT in conversational terms), and return the highest-scoring response based on a reward function learned from the preference dataset. This high-performance approach decouples reward model quality from PPO optimization, but is computationally impractical even for moderate N since it requires N sample completions per query at test time.
Figure 2 shows the reward KL bounds for various algorithms in the emotion setting.
Figure 3 shows that DPO converges to its optimal performance relatively quickly.
For more research details, please refer to the original paper.
The above is the detailed content of Is the 'RL' in RLHF required? Some people use binary cross entropy to directly fine-tune LLM, and the effect is better.. For more information, please follow other related articles on the PHP Chinese website!

Exploring the Inner Workings of Language Models with Gemma Scope Understanding the complexities of AI language models is a significant challenge. Google's release of Gemma Scope, a comprehensive toolkit, offers researchers a powerful way to delve in

Unlocking Business Success: A Guide to Becoming a Business Intelligence Analyst Imagine transforming raw data into actionable insights that drive organizational growth. This is the power of a Business Intelligence (BI) Analyst – a crucial role in gu

SQL's ALTER TABLE Statement: Dynamically Adding Columns to Your Database In data management, SQL's adaptability is crucial. Need to adjust your database structure on the fly? The ALTER TABLE statement is your solution. This guide details adding colu

Introduction Imagine a bustling office where two professionals collaborate on a critical project. The business analyst focuses on the company's objectives, identifying areas for improvement, and ensuring strategic alignment with market trends. Simu

Excel data counting and analysis: detailed explanation of COUNT and COUNTA functions Accurate data counting and analysis are critical in Excel, especially when working with large data sets. Excel provides a variety of functions to achieve this, with the COUNT and COUNTA functions being key tools for counting the number of cells under different conditions. Although both functions are used to count cells, their design targets are targeted at different data types. Let's dig into the specific details of COUNT and COUNTA functions, highlight their unique features and differences, and learn how to apply them in data analysis. Overview of key points Understand COUNT and COU

Google Chrome's AI Revolution: A Personalized and Efficient Browsing Experience Artificial Intelligence (AI) is rapidly transforming our daily lives, and Google Chrome is leading the charge in the web browsing arena. This article explores the exciti

Reimagining Impact: The Quadruple Bottom Line For too long, the conversation has been dominated by a narrow view of AI’s impact, primarily focused on the bottom line of profit. However, a more holistic approach recognizes the interconnectedness of bu

Things are moving steadily towards that point. The investment pouring into quantum service providers and startups shows that industry understands its significance. And a growing number of real-world use cases are emerging to demonstrate its value out


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

SublimeText3 English version
Recommended: Win version, supports code prompts!

SecLists
SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

SAP NetWeaver Server Adapter for Eclipse
Integrate Eclipse with SAP NetWeaver application server.

VSCode Windows 64-bit Download
A free and powerful IDE editor launched by Microsoft

EditPlus Chinese cracked version
Small size, syntax highlighting, does not support code prompt function