


To align large language models (LLMs) with human values and intentions, it is critical to learn from human feedback so that models are helpful, honest, and harmless. An effective approach for aligning LLMs is reinforcement learning from human feedback (RLHF). Although RLHF yields excellent results, it brings optimization challenges: it involves training a reward model and then optimizing a policy model to maximize that reward.
Recently, researchers have explored simpler offline algorithms, one of which is direct preference optimization (DPO). DPO reparameterizes the reward function in RLHF so that a policy model can be learned directly from preference data, eliminating the need for an explicit reward model. The method is simple and stable and has been widely used in practice.
In DPO, the implicit reward is the log of the likelihood ratio of a response between the current policy model and the supervised fine-tuned (SFT) model. However, this reward construction does not align directly with the metric that guides generation, which is approximately the average log-likelihood of a response under the policy model. This discrepancy between training and inference can lead to suboptimal performance.
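To make this mismatch concrete, here is an illustrative sketch in plain Python (the per-token log-probabilities and the scaling constant β are made-up numbers, not from the paper): it computes DPO's implicit reward, β(log π_θ − log π_ref), and the average log-likelihood for a winning and a losing response, and shows a case where the two rankings disagree.

```python
def dpo_reward(logps_policy, logps_ref, beta=0.1):
    # Implicit DPO reward (up to the partition term): beta times the
    # difference of sequence log-likelihoods under policy and reference.
    return beta * (sum(logps_policy) - sum(logps_ref))

def avg_loglik(logps_policy):
    # Length-normalized log-likelihood used to guide generation.
    return sum(logps_policy) / len(logps_policy)

# Hypothetical per-token log-probabilities for a winning and a losing response.
# The winner is longer; each of its tokens is slightly less likely on average.
win_policy = [-0.9] * 20
win_ref = [-1.2] * 20
lose_policy = [-0.7] * 8
lose_ref = [-0.8] * 8

# DPO's reward ranks the winner higher ...
r_w = dpo_reward(win_policy, win_ref)    # 0.1 * (-18 - (-24)) = 0.6
r_l = dpo_reward(lose_policy, lose_ref)  # 0.1 * (-5.6 - (-6.4)) = 0.08
# ... but the average log-likelihood used at generation time ranks it lower.
p_w = avg_loglik(win_policy)   # -0.9 per token
p_l = avg_loglik(lose_policy)  # -0.7 per token
print(r_w > r_l, p_w > p_l)    # True False
```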
To address this, Yu Meng, an assistant professor at the University of Virginia, Mengzhou Xia, a doctoral candidate at Princeton University, and Danqi Chen, an assistant professor at Princeton, jointly proposed SimPO, a simple and effective offline preference optimization algorithm. The core of SimPO's design is to use the length-normalized average log-likelihood of a sequence as the implicit reward, so that the reward aligns directly with the metric that guides generation and no reference model is needed.
- Paper title: SimPO: Simple Preference Optimization with a Reference-Free Reward
- Paper address: https://arxiv.org/pdf/2405.14734
- Code & Model: https://github.com/princeton-nlp/SimPO
To sum up, SimPO has the following characteristics:
- Simple: SimPO does not require a reference model, making it lighter and easier to implement than DPO and other methods that depend on a reference model.
- Clear performance advantage: Despite its simplicity, SimPO performs significantly better than DPO and its latest variants (such as the recent reference-free objective ORPO), as shown in Figure 1. Moreover, SimPO's advantage is stable across different training settings and multiple instruction-following benchmarks, including AlpacaEval 2 and the difficult Arena-Hard benchmark.
- Minimal length exploitation: Compared with the SFT or DPO models, SimPO does not significantly increase response length (see Table 1), indicating minimal length exploitation.
SimPO: Simple Preference Optimization
For ease of understanding, the following first introduces the background of DPO, then explains the discrepancy between DPO's reward and the likelihood metric used in generation, and proposes a reference-free alternative reward formulation that alleviates this problem. Finally, the SimPO objective is derived by integrating a target reward margin term into the Bradley-Terry model.
Background: Direct Preference Optimization (DPO)

DPO is one of the most commonly used offline preference optimization methods. Rather than learning an explicit reward model, DPO uses a closed-form expression involving the optimal policy to reparameterize the reward function r:

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \tag{1}$$

where π_θ is the policy model, π_ref is the reference policy (usually the SFT model), and Z(x) is the partition function. By plugging this reward construction into the Bradley-Terry (BT) ranking objective, DPO can use the policy model rather than a reward model to represent the probability of the preference data, yielding the objective

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \tag{2}$$

where (x, y_w, y_l) is a preference triplet consisting of the prompt, the winning response, and the losing response from the preference dataset D.

A simple reference-free reward aligned with generation

The discrepancy between DPO's reward and generation. Using equation (1) as an implicit reward expression has the following disadvantages: (1) the training phase requires a reference model π_ref, which brings additional memory and computational cost; and (2) there is a discrepancy between the reward optimized during training and the metric used to guide generation at inference time. Specifically, during generation the policy model π_θ is used to produce a sequence that approximately maximizes the average log-likelihood, defined as

$$p_\theta(y \mid x) = \frac{1}{|y|} \log \pi_\theta(y \mid x) = \frac{1}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i}) \tag{3}$$

Directly maximizing this metric during decoding is intractable, and various decoding strategies can be used to approximate it, such as greedy decoding, beam search, nucleus sampling, and top-k sampling. In addition, this metric is often used to rank candidate options when language models perform multiple-choice tasks. In DPO, for a triplet (x, y_w, y_l), satisfying the reward ranking r(x, y_w) > r(x, y_l) does not necessarily mean satisfying the likelihood ranking p_θ(y_w | x) > p_θ(y_l | x).

Constructing a length-normalized reward. Naturally, one would consider using p_θ in equation (3) to replace the reward construction in DPO, so that the reward aligns with the likelihood metric that guides generation.
This yields a length-normalized reward:

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x) = \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i}) \tag{4}$$

where β is a constant that controls the scale of the reward difference. The team found that normalizing the reward by response length is critical: removing the length normalization term from the reward formula caused the model to tend to generate longer but lower-quality sequences. This reward construction eliminates the need for a reference model, making SimPO more memory- and compute-efficient than algorithms that rely on one.

The SimPO objective

Target reward margin. In addition, the team introduced a target reward margin term γ > 0 into the Bradley-Terry objective to ensure that the reward r(x, y_w) of the winning response exceeds the reward r(x, y_l) of the losing response by at least γ:

$$p(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l) - \gamma\right) \tag{5}$$

The margin between two classes is known to affect the generalization ability of a classifier. In standard training settings with random model initialization, increasing the target margin usually improves generalization. In preference optimization, the two classes are the winning and losing responses to a single input. In practice, the team observed that generation quality initially improves as the target margin increases, but degrades when the margin becomes too large. A variant of DPO, IPO, also formulates a target reward margin similar to SimPO's, but its overall objective is less effective than SimPO's.

Objective. Finally, substituting equation (4) into equation (5) gives the SimPO objective:

$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\right)\right] \tag{6}$$

To sum up, SimPO adopts an implicit reward whose form aligns directly with the metric that guides generation, eliminating the need for a reference model. In addition, it introduces a target reward margin γ to separate the winning and losing responses.
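The SimPO objective is straightforward to implement. Below is a minimal per-example sketch in plain Python (the per-token log-probabilities, β, and γ are hypothetical values; a real implementation would operate on batched tensors from the policy model):

```python
import math

def simpo_loss(logps_win, logps_lose, beta=2.0, gamma=0.5):
    """Per-example SimPO loss, equation (6).

    logps_win / logps_lose: per-token log-probs of the winning / losing
    response under the current policy (no reference model is needed).
    beta scales the length-normalized reward; gamma is the target margin.
    """
    # Equation (4): length-normalized implicit rewards.
    r_win = beta * sum(logps_win) / len(logps_win)
    r_lose = beta * sum(logps_lose) / len(logps_lose)
    # -log sigma(z) = log(1 + exp(-z)), written in a numerically stable form.
    z = r_win - r_lose - gamma
    return math.log1p(math.exp(-z))

loss = simpo_loss([-0.5] * 12, [-1.1] * 30)
print(round(loss, 4))
```

Because the rewards are averages over tokens, a response's length does not by itself change its reward, only its per-token likelihood does; a larger γ demands a wider reward gap and therefore increases the loss for the same pair.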
Experimental settings

Model and training settings. The team's experiments used two families of models, Llama3-8B and Mistral-7B, each under both Base and Instruct settings.

Evaluation benchmarks. The team used three of the most commonly used open instruction-following benchmarks: MT-Bench, AlpacaEval 2, and Arena-Hard v0.1. These benchmarks evaluate a model's conversational ability across a diverse range of queries and have been widely adopted by the community. Table 2 gives the details.

Baseline methods. Table 3 lists the other offline preference optimization methods compared against SimPO.

Experimental results

Main results and ablation studies

SimPO consistently and significantly outperforms existing preference optimization methods. As shown in Table 4, although all preference optimization algorithms improve over the SFT model, SimPO, despite its simplicity, achieves the best performance across all benchmarks and settings. Such a large across-the-board lead demonstrates SimPO's robustness and effectiveness.

Benchmark difficulty varies. The win rate on Arena-Hard is markedly lower than on AlpacaEval 2, indicating that Arena-Hard is the more difficult benchmark.

The Instruct setting brings significant performance gains. The Instruct setting outperforms the Base setting across all benchmarks, likely because these models are initialized from higher-quality SFT models and the preference data generated by these models is of higher quality.

Both key design choices of SimPO matter. Table 5 shows ablation results for each key design choice of SimPO: (1) removing the length normalization in equation (4) (w/o LN); and (2) setting the target reward margin in equation (6) to 0 (γ = 0). Removing length normalization has the greatest impact: the team found that it causes the model to generate long, repetitive patterns, severely reducing output quality. Setting γ to 0 also degrades SimPO's performance, indicating that 0 is not the optimal target reward margin. See the original paper for a deeper analysis of these two design choices.
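The length-normalization ablation can be made concrete with a small sketch (plain Python, hypothetical per-token log-probabilities): with the 1/|y| factor of equation (4), two responses of equal per-token quality receive the same reward regardless of length; without it, the reward depends strongly on length, which opens the door to length exploitation.

```python
def reward_with_ln(logps, beta=2.0):
    # Equation (4): length-normalized reward.
    return beta * sum(logps) / len(logps)

def reward_wo_ln(logps, beta=2.0):
    # Ablation (w/o LN): the 1/|y| factor removed.
    return beta * sum(logps)

# Two hypothetical responses with identical per-token quality, different lengths.
short_resp = [-0.5] * 10
long_resp = [-0.5] * 40

print(reward_with_ln(short_resp) == reward_with_ln(long_resp))  # True
print(reward_wo_ln(short_resp) == reward_wo_ln(long_resp))      # False
```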
In-depth comparison between DPO and SimPO

Finally, the team comprehensively compared DPO and SimPO along four axes: (1) likelihood-length correlation, (2) reward construction, (3) reward accuracy, and (4) algorithmic efficiency. The results show that SimPO outperforms DPO in both accuracy and efficiency.

DPO's reward implicitly promotes length normalization. Although the DPO reward expression (excluding the partition function) lacks an explicit term for length normalization, the log ratio between the policy model and the reference model can implicitly offset length bias. As shown in Table 6 and Figure 4a, compared with a method without any length normalization (denoted SimPO w/o LN), using DPO reduces the Spearman correlation coefficient between the average log-likelihood and response length. However, DPO still shows a stronger positive correlation than SimPO.

DPO's reward does not match the generation likelihood. There is a discrepancy between DPO's reward and the average log-likelihood metric, which directly affects generation. As shown in Figure 4b, on instances from the UltraFeedback holdout set, almost half of the data pairs satisfy the reward ranking r(x, y_w) > r(x, y_l) while violating the likelihood ranking: when training with DPO, only about 50% of the triplets in the holdout set satisfy p_θ(y_w | x) > p_θ(y_l | x). In contrast, SimPO directly uses the average log-likelihood (scaled by β) as the reward, completely eliminating the discrepancy.

SimPO achieves higher reward accuracy. Figure 4c compares the reward accuracy of SimPO and DPO, which measures how well their final learned reward aligns with the preference labels on the holdout set. SimPO's reward accuracy is higher than DPO's, indicating that SimPO's reward design supports more effective generalization and higher-quality generation.

SimPO is more memory- and compute-efficient than DPO. Another major advantage of SimPO is efficiency, since it uses no reference model. Figure 4d shows the overall runtime and peak per-GPU memory usage of SimPO and DPO in the Llama3-Base setting on 8×H100 GPUs. Thanks to eliminating forward passes through the reference model, SimPO reduces runtime by approximately 20% and GPU memory usage by approximately 10% compared with the original DPO implementation.

For more details, please read the original paper.
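Reward accuracy of the kind reported in Figure 4c can be computed in a few lines: score each holdout preference pair under the learned reward and measure how often the winner outranks the loser. A minimal sketch (hypothetical data; the SimPO reward of equation (4) stands in for the learned reward):

```python
def simpo_reward(logps, beta=2.0):
    # Equation (4): length-normalized reward, no reference model required.
    return beta * sum(logps) / len(logps)

def reward_accuracy(pairs, beta=2.0):
    """Fraction of holdout pairs whose winning response gets the higher reward.

    pairs: list of (win_logps, lose_logps) per-token log-prob lists.
    """
    hits = sum(
        simpo_reward(w, beta) > simpo_reward(l, beta) for w, l in pairs
    )
    return hits / len(pairs)

# Three hypothetical holdout pairs; in the third, the loser is more likely.
holdout = [
    ([-0.4] * 10, [-0.9] * 12),
    ([-0.6] * 25, [-0.8] * 9),
    ([-1.0] * 7, [-0.5] * 7),
]
print(reward_accuracy(holdout))  # 2 of 3 pairs ranked correctly
```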
(Original article: "Comprehensively surpassing DPO: Chen Danqi's team proposed simple preference optimization SimPO, and also refined the strongest 8B open source model", via the PHP Chinese website.)

