ICML 2024 Oral | Is DPO more suitable for LLMs than PPO? Latest findings from Wu Yi's team at Tsinghua

王林

Jul 22, 2024, 06:41 PM

The AIxiv column is where this site publishes academic and technical content. Over the past few years, the column has covered more than 2,000 pieces of work from top laboratories at major universities and companies around the world, helping to promote academic exchange and dissemination. If you have excellent work that you would like to share, please feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

Wu Yi is an assistant professor at the Institute for Interdisciplinary Information Sciences, Tsinghua University, and a former full-time researcher at OpenAI. His research areas include reinforcement learning, large model alignment, human-computer interaction, and robot learning. He received his PhD from the University of California, Berkeley, in 2019 under Professor Stuart Russell, and graduated from the Institute for Interdisciplinary Information Sciences at Tsinghua University (Yao Class) in 2014. His representative works include Value Iteration Networks, which won the NIPS 2016 best paper award; the MADDPG algorithm, the most cited paper in multi-agent deep reinforcement learning; and the OpenAI hide-and-seek project.

How can large models better follow human instructions and intentions? How can they reason better? How can hallucinations be avoided? Whether these problems can be solved is the key technical challenge for making large models truly widely usable, and even for achieving superintelligence. These hardest challenges are the long-term research focus of Wu Yi's team, and they are the problems that large model alignment technology sets out to solve.

In alignment technology, the most important algorithmic framework is reinforcement learning from human feedback (RLHF). RLHF first learns a reward function (reward model) from human preference feedback on the large model's outputs, then runs reinforcement learning on the large model against that reward, so that over repeated iterations the model learns to distinguish better responses from worse ones and its capabilities improve. The world's most powerful language models today, such as OpenAI's GPT models and Anthropic's Claude models, place great emphasis on RLHF training. OpenAI and Anthropic have also each developed in-house RLHF training systems based on large-scale PPO for large model alignment.
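The reward-learning step described above is commonly formulated as a pairwise (Bradley-Terry) objective over human preference pairs. The sketch below illustrates that standard formulation only; it is not taken from the team's training system, and the function name `pairwise_reward_loss` and the scalar inputs are assumptions made for illustration.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Computes -log(sigmoid(reward_chosen - reward_rejected)), i.e. the loss is
    small when the reward model scores the human-preferred response higher.
    Both inputs are hypothetical scalar reward-model outputs.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The preferred response already scores higher: low loss.
    print(pairwise_reward_loss(2.0, 0.0))
    # The preferred response scores lower: high loss.
    print(pairwise_reward_loss(0.0, 2.0))
```

In a real system the two rewards would come from a neural reward model and the loss would be averaged over a batch of preference pairs; the trained reward model then supplies the reward signal for the reinforcement learning stage.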

However, because the PPO pipeline is complex and compute-intensive, the large-scale RLHF training systems of American AI companies have never been open-sourced. As a result, although PPO is very powerful, alignment work in academia has rarely used the complex PPO algorithm for RLHF research. Instead it generally uses alignment algorithms such as SFT (supervised fine-tuning) or DPO (Direct Preference Optimization), which are simpler, more direct, and place lower demands on the training system.
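To make concrete why DPO places lower demands on the training system, here is a minimal sketch of the standard DPO loss for a single preference pair. It needs only log-probabilities from the policy and a frozen reference model, with no separate reward model and no sampling loop. The function name `dpo_loss`, its arguments, and the default `beta=0.1` are assumptions for illustration, not the team's implementation.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls deviation from the reference
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model (hypothetical inputs).
    """
    # Implicit rewards are the log-ratios between policy and reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

PPO-based RLHF, by contrast, has to generate responses from the current policy, score them with a reward model, and estimate advantages at every iteration, which is where most of the system complexity comes from.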

So, does a simple alignment algorithm necessarily work better? The paper "Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study", published by Wu Yi's team at ICML 2024, carefully examines the characteristics of the DPO and PPO algorithms and identifies the key factors for improving RLHF performance. In this work, building on its self-developed large-scale RLHF training system, Wu Yi's team used PPO with an open-source model of fewer parameters to surpass, for the first time, the closed-source large model AlphaCode 41B on CodeContest, widely regarded as the most difficult code generation challenge.