Comprehensively surpassing DPO: Chen Danqi's team proposed simple preference optimization SimPO, and also refined the strongest 8B open source model

To align large language models (LLMs) with human values and intentions, it is critical to learn from human feedback so that the models are helpful, honest, and harmless. An effective approach to aligning LLMs is reinforcement learning from human feedback (RLHF). RLHF involves training a reward model and then optimizing a policy model to maximize that reward. Although its results are excellent, this multi-stage procedure poses optimization challenges.

Recently, some researchers have explored simpler offline algorithms, one of which is direct preference optimization (DPO). DPO reparameterizes the reward function in RLHF so that a policy model can be learned directly from preference data, eliminating the need for an explicit reward model. This method is simple and stable and has been widely used in practice.

In DPO, the implicit reward is the logarithm of the response likelihood ratio between the current policy model and the supervised fine-tuning (SFT) model. However, this way of constructing the reward does not align directly with the metric that guides generation, which is approximately the average log-likelihood of the response under the policy model. This discrepancy between training and inference can lead to poor performance.

To this end, Meng Yu, an assistant professor at the University of Virginia, Xia Mengzhou, a doctoral candidate at Princeton University, and Chen Danqi, an assistant professor at Princeton University, jointly proposed SimPO, a simple and effective offline preference optimization algorithm.

  • Paper title: SimPO: Simple Preference Optimization with a Reference-Free Reward
  • Paper address: https://arxiv.org/pdf/2405.14734
  • Code & Model: https://github.com/princeton-nlp/SimPO

The core of the algorithm is to align the reward function in the preference optimization objective with the generation metric. SimPO consists of two main components: (1) a length-normalized reward, calculated as the average log probability of all tokens in the response under the policy model; (2) a target reward margin, which ensures that the reward difference between the winning and losing responses exceeds this margin.

To sum up, SimPO has the following characteristics:

  • Simple: SimPO does not require a reference model, so it is lighter and easier to implement than DPO and other methods that rely on a reference model.
  • Clear performance advantage: Despite its simplicity, SimPO performs significantly better than DPO and its latest variants (such as the recent reference-free objective ORPO), as shown in Figure 1. SimPO also holds a stable advantage across different training settings and multiple instruction-following benchmarks (including AlpacaEval 2 and the difficult Arena-Hard benchmark).
  • Minimal length exploitation: Compared with the SFT or DPO models, SimPO does not significantly increase the response length (see Table 1), which shows that its length exploitation is minimal.

The team conducted extensive analysis, and the results showed that SimPO uses preference data more effectively: on the validation set it ranks the likelihoods of high-quality and low-quality responses more accurately, which in turn leads to a better policy model.

As shown in Table 1, the team built a top-performing model based on Llama3-8B-Instruct. Its length-controlled win rate on AlpacaEval 2 is 44.7, surpassing Claude 3 Opus on the leaderboard; in addition, its win rate on Arena-Hard is 33.8, making it currently the most powerful 8B open source model.

SimPO: Simple Preference Optimization

For ease of understanding, the following first introduces the background of DPO, then explains the discrepancy between DPO's reward and the likelihood metric used in generation, and proposes a reference-free alternative reward formulation to alleviate this problem. Finally, the SimPO objective is derived by integrating a target reward margin term into the Bradley-Terry model.

Background: Direct Preference Optimization (DPO)

DPO is one of the most commonly used offline preference optimization methods. Instead of learning an explicit reward model, DPO reparameterizes the reward function r using a closed-form expression that involves the optimal policy:

$$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \tag{1}$$

where π_θ is the policy model, π_ref is the reference policy (usually the SFT model), and Z(x) is the partition function. By integrating this way of building rewards into the Bradley-Terry (BT) ranking objective, p(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l)), DPO can use a policy model instead of a reward model to represent the probability of the preference data, resulting in the following objective:

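$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right] \tag{2}$$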

where (x, y_w, y_l) is a preference pair consisting of the prompt, the winning response, and the losing response from the preference data set D.
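To make the DPO objective concrete, here is a minimal sketch for a single preference pair. It assumes the standard form of the DPO loss, in which the partition function cancels in the reward difference. The function names and the default β = 0.1 are illustrative assumptions, not taken from the authors' code.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (x, y_w, y_l) triplet.

    Each argument is the summed token log-probability of a whole response,
    under the policy (logp_*) or the reference model (ref_logp_*).
    beta=0.1 is a hypothetical default used only for illustration.
    """
    reward_w = beta * (logp_w - ref_logp_w)  # implicit reward of the winner
    reward_l = beta * (logp_l - ref_logp_l)  # implicit reward of the loser
    return -log_sigmoid(reward_w - reward_l)


# When the policy equals the reference model, both implicit rewards are 0
# and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

Note that the loss needs log-probabilities from both the policy and the reference model, which is the source of the extra memory and compute cost discussed next.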

A simple reference-free reward aligned with generation

The discrepancy between DPO's reward and generation. Using equation (1) as the implicit reward expression has the following disadvantages: (1) the training phase requires a reference model π_ref, which brings additional memory and computing costs; (2) there is a discrepancy between the reward optimized during training and the generation metric used at inference. Specifically, in the generation stage, the policy model π_θ is used to generate a sequence that approximately maximizes the average log-likelihood, defined as follows:

$$p_\theta(y \mid x) = \frac{1}{|y|} \log \pi_\theta(y \mid x) = \frac{1}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i}) \tag{3}$$

It is very difficult to maximize this metric directly during decoding, so various decoding strategies are used to approximate it, such as greedy decoding, beam search, nucleus sampling, and top-k sampling. This metric is also often used to rank options when language models perform multiple-choice tasks. In DPO, for any triplet (x, y_w, y_l), satisfying the reward ranking r(x, y_w) > r(x, y_l) does not necessarily mean satisfying the likelihood ranking p_θ(y_w | x) > p_θ(y_l | x). In fact, when training with DPO, only about 50% of the triplets in the holdout set meet this condition (see Figure 4b).
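A toy example can show how the two rankings disagree. The sketch below uses made-up log-probabilities (all numbers are hypothetical and chosen only for illustration) for a pair in which the DPO-style implicit reward prefers the winning response while the average log-likelihood prefers the losing one.

```python
def dpo_implicit_reward(logp_policy: float, logp_ref: float,
                        beta: float = 0.1) -> float:
    """DPO-style implicit reward without the partition term: beta * log-ratio."""
    return beta * (logp_policy - logp_ref)


def avg_log_likelihood(logp_policy: float, num_tokens: int) -> float:
    """Length-averaged log-likelihood of a response under the policy."""
    return logp_policy / num_tokens


# Hypothetical values: summed policy log-prob, summed reference log-prob,
# and token count for each response.
winner = {"logp": -30.0, "ref_logp": -40.0, "tokens": 20}
loser = {"logp": -10.0, "ref_logp": -9.0, "tokens": 10}

reward_w = dpo_implicit_reward(winner["logp"], winner["ref_logp"])
reward_l = dpo_implicit_reward(loser["logp"], loser["ref_logp"])
avg_w = avg_log_likelihood(winner["logp"], winner["tokens"])
avg_l = avg_log_likelihood(loser["logp"], loser["tokens"])

print("reward ranking prefers winner:", reward_w > reward_l)
print("likelihood ranking prefers winner:", avg_w > avg_l)
```

In this constructed case the policy improved a lot on the winner relative to the reference, so its implicit reward is high, yet its per-token likelihood is still lower than the loser's.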

Constructing a length-normalized reward. A natural idea is to replace the reward construction in DPO with p_θ from (3), so that the reward aligns with the likelihood metric that guides generation. This results in a length-normalized reward:

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x) = \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i}) \tag{4}$$

where β is a constant that controls the size of the reward difference. The team found that normalizing rewards by response length is critical; removing the length normalization term from the reward formula caused the model to tend to generate longer but lower-quality sequences. Because this reward does not involve π_ref, no reference model is needed to build it, resulting in greater memory and computational efficiency than algorithms that rely on reference models.
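The length-normalized reward can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names, the toy per-token log-probabilities, and the default β = 2.0 are assumptions.

```python
from typing import Sequence


def summed_log_likelihood(token_logprobs: Sequence[float]) -> float:
    """Unnormalized log-likelihood: grows more negative as the response gets longer."""
    return sum(token_logprobs)


def length_normalized_reward(token_logprobs: Sequence[float],
                             beta: float = 2.0) -> float:
    """beta times the average per-token log-probability of the response.

    beta=2.0 is a hypothetical value used only for illustration.
    """
    return beta * sum(token_logprobs) / len(token_logprobs)


# Two toy responses with the same per-token quality but different lengths.
short_response = [-0.5] * 4
long_response = [-0.5] * 40

print(summed_log_likelihood(short_response), summed_log_likelihood(long_response))
print(length_normalized_reward(short_response), length_normalized_reward(long_response))
```

The summed log-likelihood of the two toy responses differs by a factor of ten purely because of length, while the length-normalized reward is identical for both, so the reward only reflects per-token likelihood.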

The SimPO objective

Target reward margin. In addition, the team introduced a target reward margin term γ > 0 into the Bradley-Terry objective to ensure that the reward r(x, y_w) of the winning response exceeds the reward r(x, y_l) of the losing response by at least γ:

$$p(y_w \succ y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) - \gamma \right) \tag{5}$$

The margin between two classes is known to affect the generalization ability of a classifier. In standard training settings with random model initialization, increasing the target margin usually improves generalization performance. In preference optimization, the two classes are the winning and losing responses to a single input.

In practice, the team observed that as the target margin increases, generation quality initially improves, but when the margin becomes too large, generation quality degrades. IPO, a variant of DPO, also builds a target reward margin similar to SimPO's, but its overall objective is less effective than SimPO's.

Objective. Finally, by substituting equation (4) into equation (5), the SimPO objective can be obtained:

$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma \right) \right] \tag{6}$$

To sum up, SimPO adopts a form of implicit reward that aligns directly with the generation metric, thus eliminating the need for a reference model. Additionally, it introduces a target reward margin γ to separate winning and losing responses.
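Putting the two components together, the per-pair SimPO loss can be sketched as follows. This is a minimal sketch under stated assumptions, not the code from the official repository: the function names, the toy log-probabilities, and the default values of β and γ are hypothetical.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(token_logps_w: Sequence[float],
               token_logps_l: Sequence[float],
               beta: float = 2.0,
               gamma: float = 1.0) -> float:
    """SimPO loss for one preference pair.

    token_logps_w / token_logps_l hold the policy's per-token log-probabilities
    for the winning and losing responses. beta and gamma defaults are
    hypothetical illustration values.
    """
    reward_w = beta * sum(token_logps_w) / len(token_logps_w)
    reward_l = beta * sum(token_logps_l) / len(token_logps_l)
    # The winner's reward must beat the loser's by at least gamma
    # before the argument of the sigmoid becomes positive.
    return -log_sigmoid(reward_w - reward_l - gamma)


winner_logps = [-0.2, -0.4]        # average -0.3
loser_logps = [-1.0, -1.0, -1.0]   # average -1.0

for margin in (0.0, 1.0, 2.0):
    print(margin, simpo_loss(winner_logps, loser_logps, gamma=margin))
```

No reference-model log-probabilities appear anywhere in the function, and for a fixed pair the loss grows as the target margin γ grows, which pushes the policy to widen the reward gap.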

Experimental settings

Model and training settings. The team's experiments used two model families, Llama3-8B and Mistral-7B, each in both Base and Instruct settings.

Evaluation benchmarks. The team used three of the most commonly used open-ended instruction-following benchmarks: MT-Bench, AlpacaEval 2, and Arena-Hard v0.1. These benchmarks evaluate a model's diverse conversational capabilities on a variety of queries and have been widely adopted by the community. Table 2 gives some details.

Baseline method. Table 3 lists other offline preference optimization methods compared with SimPO.

Experimental results

Main results and ablation studies

SimPO consistently performs significantly better than previously existing preference optimization methods. As shown in Table 4, although all preference optimization algorithms perform better than the SFT model, SimPO, despite its simplicity, achieves the best performance on all benchmarks and settings. Such a large lead across the board demonstrates the robustness and effectiveness of SimPO.

Benchmark quality varies. It can be observed that the win rate on Arena-Hard is significantly lower than the win rate on AlpacaEval 2, indicating that Arena-Hard is a more difficult benchmark.

The Instruct setting brings significant performance gains. The Instruct setting outperforms the Base setting across the board on all benchmarks. This may be because these models are initialized from higher-quality SFT models and because the preference data generated by these models is of higher quality.

Both key design choices of SimPO are important. Table 5 shows the results of ablation experiments for each key design of SimPO: (1) removing the length normalization in (4) (i.e. w/o LN); (2) setting the target reward margin in (6) to 0 (i.e. γ = 0).

Removing length normalization has the greatest impact on the results. The team found that it caused the model to generate long and repetitive patterns, which severely reduced the overall quality of the output. Setting γ to 0 also degrades the performance of SimPO, indicating that 0 is not the optimal target reward margin.

See the original paper for a more in-depth analysis of these two design choices.

In-depth comparison between DPO and SimPO

Finally, the team comprehensively compared DPO and SimPO along four dimensions: (1) likelihood-length correlation, (2) reward construction, (3) reward accuracy, (4) algorithm efficiency. The results show that SimPO outperforms DPO in terms of accuracy and efficiency.

DPO rewards implicitly promote length normalization.

Although the DPO reward expression r(x, y) = β log (π_θ(y | x) / π_ref(y | x)) (which does not include the partition function) lacks an explicit length-normalization term, the log ratio between the policy model and the reference model can implicitly offset the length bias. As shown in Table 6 and Figure 4a, using DPO reduces the Spearman correlation coefficient between the average log-likelihood and response length compared to the method without any length normalization (denoted as SimPO w/o LN). However, it still shows a stronger positive correlation than SimPO.
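For readers unfamiliar with the statistic, the following sketch shows how a Spearman correlation between average log-likelihood and response length could be computed with the standard library only. The helper names and the toy data are hypothetical and are not the paper's measurements.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy data: a model whose average log-likelihood rises with response length.
lengths = [10, 20, 30, 40]
avg_logps = [-0.9, -0.7, -0.5, -0.3]
print(spearman(lengths, avg_logps))
```

A coefficient near 1 means longer responses are systematically assigned higher average log-likelihood, which is the kind of length bias the comparison measures; a value near 0 indicates little dependence on length.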

The DPO reward does not match the generation likelihood.

There is a discrepancy between DPO's reward and the average log-likelihood metric, which directly affects generation. As shown in Figure 4b, among the instances in the UltraFeedback training set where r(x, y_w) > r(x, y_l), almost half of the data pairs have p_θ(y_w | x) < p_θ(y_l | x). In contrast, SimPO directly uses the average log-likelihood (scaled by β) as the reward expression, thereby completely eliminating the discrepancy.

DPO is not as good as SimPO in terms of reward accuracy.

Figure 4c compares the reward accuracy of SimPO and DPO, which measures how well their final learned rewards align with the preference labels on the holdout set. It can be observed that the reward accuracy of SimPO is higher than that of DPO, which indicates that the reward design of SimPO helps achieve more effective generalization and higher quality generation.
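Reward accuracy can be read as the fraction of held-out preference pairs in which the learned reward ranks the winning response above the losing one. The sketch below illustrates that reading; the function name and the reward values are hypothetical and are not results from the paper.

```python
from typing import Sequence, Tuple


def reward_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (reward_winner, reward_loser) pairs with reward_winner > reward_loser."""
    correct = sum(1 for reward_w, reward_l in pairs if reward_w > reward_l)
    return correct / len(pairs)


# Toy rewards for four held-out pairs (winner reward, loser reward).
held_out_rewards = [(-0.6, -2.0), (-1.2, -0.9), (-0.4, -0.8), (-1.0, -1.5)]
print(reward_accuracy(held_out_rewards))
```

Here three of the four toy pairs are ranked correctly, giving an accuracy of 0.75.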

SimPO is both more memory efficient and computationally efficient than DPO.

Another big advantage of SimPO is efficiency, since it does not use a reference model. Figure 4d presents the overall runtime and peak per-GPU memory usage of SimPO and DPO in the Llama3-Base setting on 8×H100 GPUs. Compared with the original DPO implementation, SimPO reduces runtime by approximately 20% and GPU memory usage by approximately 10%, thanks to the elimination of forward passes through the reference model.

For more details, please read the original article.

