Is the "RL" in RLHF required? Some people use binary cross-entropy to fine-tune LLMs directly, with better results.
Recently, unsupervised language models trained on large datasets have acquired surprising capabilities. However, these models are trained on data generated by humans with a wide variety of goals, priorities, and skill sets, not all of which are desirable to imitate.
Selecting a model's desired responses and behaviors from its very broad knowledge and capabilities is critical to building safe, high-performance, and controllable AI systems. Many existing methods instill desired behaviors into language models using carefully curated sets of human preferences that represent the types of behavior humans consider safe and beneficial. This preference learning stage takes place after an initial phase of large-scale unsupervised pre-training on large text datasets.
While the most straightforward preference learning method is supervised fine-tuning on high-quality responses demonstrated by humans, a class of methods that has recently become popular is reinforcement learning from human (or AI) feedback (RLHF/RLAIF). RLHF methods fit a reward model to a dataset of human preferences and then use RL to optimize a language model policy so that it produces responses assigned high reward without drifting excessively far from the original model.
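As a rough illustration of the two ingredients named above, the sketch below shows a Bradley-Terry style loss for fitting a reward model to a single preference pair, and the KL-penalized reward that the RL step is then typically asked to maximize. The function names, the scalar inputs, and the value of `beta` are assumptions made for illustration; real pipelines work on batches of model outputs rather than hand-written numbers.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of one preference pair under a Bradley-Terry model.

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one.
    """
    return -log_sigmoid(reward_chosen - reward_rejected)


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_ref: float, beta: float) -> float:
    """Reward for one sampled response, minus a penalty for drifting from the reference.

    logp_policy and logp_ref are the log-probabilities of the same response
    under the trained policy and the original (reference) model.
    """
    return reward - beta * (logp_policy - logp_ref)


if __name__ == "__main__":
    # Illustrative numbers only.
    print(reward_model_loss(2.0, 0.0))   # reward model ranks the pair correctly
    print(reward_model_loss(0.0, 2.0))   # reward model ranks the pair incorrectly
    print(kl_penalized_reward(1.0, logp_policy=-3.0, logp_ref=-5.0, beta=0.1))
```

The second function is why the policy cannot simply chase reward: any response that becomes much more likely under the policy than under the reference model has its reward reduced in proportion to `beta`.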
While RLHF produces models with impressive conversational and coding abilities, the RLHF pipeline is considerably more complex than supervised learning: it involves training multiple language models and sampling from the language model policy inside the training loop, which incurs a significant computational cost.
A recent study shows that the RL-based objective used by existing methods can be optimized exactly with a simple binary cross-entropy objective, greatly simplifying the preference learning pipeline. That is, it is entirely possible to optimize language models directly to adhere to human preferences, without explicit reward models or reinforcement learning.
Paper link: https://arxiv.org/pdf/2305.18290.pdf
Researchers from Stanford University and other institutions proposed Direct Preference Optimization (DPO). This algorithm implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint), but it is simple to implement and straightforward to train.
Experiments show that DPO is at least as effective as existing methods, including PPO-based RLHF.
Like existing algorithms, DPO relies on a theoretical preference model (such as the Bradley-Terry model) to measure how well a given reward function fits empirical preference data. However, existing methods use the preference model to define a preference loss for training a reward model, and then train a policy that optimizes the learned reward model, whereas DPO uses a change of variables to define the preference loss directly as a function of the policy. Given a dataset of human preferences over model responses, DPO can therefore optimize the policy with a simple binary cross-entropy objective, without explicitly learning a reward function or sampling from the policy during training.
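To make the binary cross-entropy objective concrete, here is a minimal sketch of the DPO loss for one preference pair, following the form given in the DPO paper. It assumes each input is the summed token log-probability of a whole response under either the policy being trained or the frozen reference model; the function name and the default `beta` are illustrative choices, not taken from the article.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is log pi(y | x) for a full response: the first two under
    the policy, the last two under the reference model.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected

    # Binary cross-entropy with label "chosen beats rejected":
    # -log(sigmoid(margin)), written in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Note that nothing in this function samples from the policy or calls a separate reward model; it only needs log-probabilities of responses that are already in the preference dataset.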
The DPO update increases the relative log probability of preferred responses over dispreferred responses, but it incorporates a dynamic, per-example importance weight that prevents the model from degenerating, which the researchers found happens with a naive probability-ratio objective.
To understand DPO mechanistically, it is useful to analyze the gradient of the loss function $\mathcal{L}_{\text{DPO}}$. The gradient with respect to the parameters θ can be written as:
$$\nabla_\theta \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big) \big[ \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big] \Big]$$
where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the reward implicitly defined by the language model $\pi_\theta$ and the reference model $\pi_{\text{ref}}$. Intuitively, the gradient of the loss function increases the likelihood of the preferred completion y_w and decreases the likelihood of the dispreferred completion y_l.
Importantly, the examples are weighted by how much higher the implicit reward model $\hat{r}_\theta$ rates the dispreferred completion, scaled by β; that is, by how incorrectly the implicit reward model orders the completions, accounting for the strength of the KL constraint. Experiments demonstrate the importance of this weighting: a naive version of the method without the weighting coefficient causes the language model to degenerate (Appendix Table 2).
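A small numeric sketch may help with the weighting term. In the DPO paper the per-example weight is σ(r̂(x, y_l) − r̂(x, y_w)), where r̂(x, y) = β·log(π_θ(y|x) / π_ref(y|x)) is the implicit reward. The code below computes that weight from sequence log-probabilities; the function names and the numbers are assumptions chosen only to show the behavior.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """beta * log(pi_theta(y|x) / pi_ref(y|x)) for one response."""
    return beta * (logp_policy - logp_ref)


def dpo_gradient_weight(logp_w: float, logp_l: float,
                        ref_logp_w: float, ref_logp_l: float,
                        beta: float) -> float:
    """Weight on one example's gradient: sigmoid(r_hat(y_l) - r_hat(y_w)).

    Close to 1 when the implicit reward ranks the pair the wrong way round,
    close to 0 when it already ranks the pair correctly by a wide margin.
    """
    r_w = implicit_reward(logp_w, ref_logp_w, beta)
    r_l = implicit_reward(logp_l, ref_logp_l, beta)
    return sigmoid(r_l - r_w)


if __name__ == "__main__":
    beta = 0.1
    # Implicit reward prefers the rejected response: larger weight.
    print(dpo_gradient_weight(-14.0, -8.0, -10.0, -12.0, beta))
    # Implicit reward already prefers the chosen response: smaller weight.
    print(dpo_gradient_weight(-8.0, -14.0, -10.0, -12.0, beta))
```

Examples the model already orders correctly contribute less to the update, which is the mechanism the paper credits with preventing the degeneration seen in the unweighted variant.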
In Section 5 of the paper, the researchers further interpret the DPO method, provide theoretical support for it, and relate its advantages to issues with the actor-critic algorithms used for RLHF (such as PPO). Specific details can be found in the original paper.
In the experiments, the researchers evaluated DPO's ability to train policies directly from preferences.
First, in a well-controlled text generation setting, they asked: compared with common preference learning algorithms such as PPO, how efficiently does DPO trade off maximizing reward against minimizing KL-divergence from the reference policy? They then evaluated DPO's performance on larger models and more difficult RLHF tasks, including summarization and dialogue.
They found that, with almost no hyperparameter tuning, DPO often performed as well as or better than strong baselines such as RLHF with PPO, and as well as or better than returning the best of N sampled trajectories under a learned reward function.
In terms of tasks, the researchers explored three different open-ended text generation tasks. In all experiments, the algorithms learn a policy from a preference dataset $\mathcal{D} = \{x^{(i)}, y_w^{(i)}, y_l^{(i)}\}_{i=1}^{N}$.
In controlled sentiment generation, x is the prefix of a movie review from the IMDb dataset, and the policy must generate a completion y with positive sentiment. For a controlled evaluation, the experiment uses a pre-trained sentiment classifier to generate preference pairs, where $p(\text{positive} \mid x, y_w) > p(\text{positive} \mid x, y_l)$.
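The labeling rule for the IMDb task can be sketched as follows: given two completions of the same prefix, the one the classifier scores as more positive becomes y_w and the other becomes y_l. The `toy_positive_prob` function below is a hypothetical keyword-counting stand-in for a real pre-trained sentiment classifier, and all names are illustrative.

```python
from typing import Callable, Tuple


def make_preference_pair(
    prefix: str,
    completion_a: str,
    completion_b: str,
    positive_prob: Callable[[str], float],
) -> Tuple[str, str, str]:
    """Return (x, y_w, y_l), ordering the completions by classifier score."""
    p_a = positive_prob(prefix + completion_a)
    p_b = positive_prob(prefix + completion_b)
    if p_a >= p_b:
        return prefix, completion_a, completion_b
    return prefix, completion_b, completion_a


def toy_positive_prob(text: str) -> float:
    """Hypothetical stand-in for a sentiment classifier: fraction of positive words."""
    positive_words = {"great", "wonderful", "loved"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = sum(w in positive_words for w in words)
    return hits / max(len(words), 1)


if __name__ == "__main__":
    x, y_w, y_l = make_preference_pair(
        "The movie was ",
        "dull and far too long.",
        "wonderful and I loved it.",
        toy_positive_prob,
    )
    print("preferred:", y_w)
    print("dispreferred:", y_l)
```

Because the same classifier also serves as the ground-truth reward in this task, labels produced this way are consistent with the reward used later for evaluation.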
For SFT in the IMDb task, the researchers fine-tuned GPT-2-large until convergence on reviews from the training split of the IMDb dataset.
In summarization, x is a forum post from Reddit, and the policy must generate a summary y of the post's main points. Following previous work, the experiments use the Reddit TL;DR summarization dataset together with human preferences collected by Stiennon et al. The experiments also use an SFT model fine-tuned on human-written forum post summaries with the TRLX framework for RLHF. The human preference dataset was gathered by Stiennon et al. on samples from a different but similarly trained SFT model.
Finally, in single-turn dialogue, x is a human query, which may be anything from a question about astrophysics to a request for relationship advice. The policy must produce an engaging and helpful response y to the user's query. The experiments use the Anthropic Helpful and Harmless dialogue dataset, which contains 170k dialogues between a human and an automated assistant. Each transcript ends with a pair of responses generated by a large (although unknown) language model, along with a preference label denoting the human-preferred response. In this setting, no pre-trained SFT model is available, so the experiments fine-tune an off-the-shelf language model on only the preferred completions to form the SFT model.
The researchers used two evaluation approaches. To analyze how effectively each algorithm optimizes the constrained reward maximization objective, the controlled sentiment generation experiments evaluate each algorithm by its frontier of achieved reward against KL-divergence from the reference policy. This frontier can be computed because the experiments have access to the ground-truth reward function (the sentiment classifier).
In real-world settings, however, the ground-truth reward function is unknown. The researchers therefore evaluate algorithms by their win rate against a baseline policy, using GPT-4 as a proxy for human evaluation of summary quality and response helpfulness in the summarization and single-turn dialogue settings. For summarization, the reference summaries in the test set serve as the baseline; for dialogue, the preferred response in the test dataset serves as the baseline. While existing research suggests that language models can be better automatic evaluators than existing metrics, the researchers also conducted a human study to justify the use of GPT-4 for evaluation. GPT-4 judgments correlated strongly with human judgments, and agreement between humans and GPT-4 was generally similar to or higher than agreement between human annotators.
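The win-rate metric itself is simple to state in code: it is the fraction of test prompts on which a judge prefers the policy's output to the baseline's. In the sketch below the judge is passed in as a function; `toy_judge` is a hypothetical placeholder that prefers the shorter text, not a description of how GPT-4 was prompted in the paper.

```python
from typing import Callable, Sequence


def win_rate(
    prompts: Sequence[str],
    policy_outputs: Sequence[str],
    baseline_outputs: Sequence[str],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Fraction of prompts where judge(prompt, policy_output, baseline_output) is True."""
    if not (len(prompts) == len(policy_outputs) == len(baseline_outputs)):
        raise ValueError("all inputs must have the same length")
    if not prompts:
        raise ValueError("need at least one prompt")
    wins = sum(
        judge(x, y_policy, y_base)
        for x, y_policy, y_base in zip(prompts, policy_outputs, baseline_outputs)
    )
    return wins / len(prompts)


def toy_judge(prompt: str, candidate: str, baseline: str) -> bool:
    """Hypothetical judge for illustration: prefers the shorter response."""
    return len(candidate) < len(baseline)


if __name__ == "__main__":
    rate = win_rate(
        ["post 1", "post 2"],
        ["short", "a much longer answer"],
        ["longer text", "brief"],
        toy_judge,
    )
    print(rate)
```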
In addition to DPO, the researchers evaluated several existing approaches to training language models to align with human preferences. At the simplest level, the experiments explore zero-shot prompting with GPT-J on the summarization task and 2-shot prompting with Pythia-2.8B on the dialogue task. The experiments also evaluate the SFT model and Preferred-FT, a model fine-tuned with supervised learning on the chosen completions y_w from either the SFT model (controlled sentiment and summarization) or a generic language model (single-turn dialogue). Another pseudo-supervised method is Unlikelihood, which simply optimizes the policy to maximize the probability assigned to y_w and minimize the probability assigned to y_l; the experiments use an optional coefficient α∈[0,1] on the unlikelihood term.
They also considered PPO, using a reward function learned from the preference data, and PPO-GT, an oracle that learns from the ground-truth reward function available in the controlled sentiment setting. In the sentiment experiments, the team used two implementations of PPO-GT: an off-the-shelf version and a modified version that normalizes rewards and further tunes hyperparameters to improve performance (these modifications were also used when running "normal" PPO with learned rewards). Finally, they considered the Best of N baseline, which samples N responses from the SFT model (or Preferred-FT in dialogue) and returns the highest-scoring response according to a reward function learned from the preference dataset. This high-performing method decouples reward model quality from PPO optimization, but it is computationally impractical even for moderate N, because it requires sampling N completions for every query at test time.
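The Best of N baseline can be sketched in a few lines, which also makes its test-time cost visible: the sampler is called N times for every query. Here `sample` and `reward` are hypothetical callables standing in for the SFT model and the learned reward function; the toy versions in the usage example exist only to make the sketch runnable.

```python
from typing import Callable


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> str:
    """Draw n responses for the prompt and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # N forward generations per query: this is the cost noted in the text.
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))


if __name__ == "__main__":
    # Toy stand-ins: a sampler that replays canned responses and a reward
    # that counts occurrences of the word "helpful".
    canned = iter(["not sure", "a helpful, helpful reply", "a helpful reply"])

    def toy_sample(prompt: str) -> str:
        return next(canned)

    def toy_reward(prompt: str, response: str) -> float:
        return float(response.count("helpful"))

    print(best_of_n("How do I learn Go?", toy_sample, toy_reward, n=3))
```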
Figure 2 shows the reward-KL frontier for the various algorithms in the sentiment setting.
Figure 3 shows that DPO converges to its optimal performance relatively quickly.
For more research details, please refer to the original paper.