
OpenAI Superalignment Team's Final Work: Two Large Models Play a Game, and the Output Becomes Easier to Understand

王林 | Original | 2024-07-19

If you can't understand an AI model's answer at all, would you dare to use it?


As machine learning systems are deployed in increasingly important domains, it becomes ever more important to demonstrate why we can trust their outputs, and to make clear when we should not.

One viable way to gain trust in the outputs of a complex system is to require the system to produce an explanation of its output that is legible to humans or to another trusted system, that is, understandable enough that any possible errors can be spotted. For example, to build trust in the judicial system, we ask courts to provide clear, readable written opinions that explain and support their decisions.

We can take a similar approach with large language models.

However, for this approach to work, it is essential to ensure that the language model actually generates text that is easy to understand, especially when handling complex tasks such as mathematics and coding.

For example, suppose you ask an AI to write a quicksort algorithm. It writes one quickly, and the answer is concise. But if you can't write code yourself, how do you judge whether the AI got it right?
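As an illustration, the following is roughly the kind of concise answer such a request might return: a textbook quicksort in Python. This is a sketch written for this article, not the specific output shown in OpenAI's figure.

```python
def quicksort(arr):
    """Sort a list using quicksort: pick a pivot, partition, and recurse."""
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]      # elements smaller than the pivot
    middle = [x for x in arr if x == pivot]   # elements equal to the pivot
    right = [x for x in arr if x > pivot]     # elements larger than the pivot
    return quicksort(left) + middle + quicksort(right)

print(quicksort([3, 6, 8, 10, 1, 2, 1]))  # [1, 1, 2, 3, 6, 8, 10]
```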
OpenAI studies this problem in a recent paper.
  • Paper title: PROVER-VERIFIER GAMES IMPROVE LEGIBILITY OF LLM OUTPUTS
  • Paper link: https://cdn.openai.com/prover-verifier-games-improve-legibility-of-llm-outputs/legibility.pdf

They find that if large language models are asked to generate answers with "getting the correct answer" as the only objective, the answers they produce can be hard to understand, and human evaluators judging those answers are twice as likely to make mistakes.

Their goal, therefore, is to train advanced language models to produce text that weaker models can easily verify. They find that such text is not only friendlier to the weaker models, but also more readable for humans.

How is this achieved? The idea goes back to a 2021 paper, "Learning to Give Checkable Answers with Prover-Verifier Games", in which Cem Anil et al. proposed a game-theoretic framework called Prover-Verifier Games. The framework consists of two competing learners: a trusted verifier network that tries to choose the correct answer, and a more powerful but untrusted prover network that tries to convince the verifier to accept a particular answer, regardless of whether it is correct. The goal of the framework is for the game to yield a reliable proof protocol.

In this new study, OpenAI borrows that framework and has two large models, a stronger one as the prover and a weaker one as the verifier, play against each other until they reach an equilibrium. At that point, the prover's answers are both reasonably legible and reasonably correct.

OpenAI says that improving the legibility of model outputs can make AI applications more effective and trustworthy, which benefits every field that requires precise communication.

Notably, the paper comes from OpenAI's Superalignment team. Several of its authors left OpenAI earlier this year, and the team itself has since dissolved. One of the authors revealed that this is the team's final paper.


An intuitive example

To improve the legibility of model outputs, OpenAI optimized the chains of thought (CoT) for some grade-school math problems so that they can be understood by weaker models. Here is an example:

Question: Shawna's father is five times as old as Shawna. Shawna is currently three times as old as Aliya. If Aliya is 3 years old, how old is Shawna's father?
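For reference, the arithmetic behind the intended answer is short: Aliya is 3, so Shawna is 3 × 3 = 9, and Shawna's father is 5 × 9 = 45 years old.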

The purpose of this exercise is to check whether the output of a model trained in this way is more direct and easier to understand.

With optimization performed only on the training set, this method improved the model's performance by 50%. At the same time, the answers the model outputs became easier to understand, which helps human users who need to assess the correctness of those answers make more accurate judgments. OpenAI believes that further narrowing the performance gap between more legible models and more capable models is a key goal of future alignment research.

How do the "prover" and the "verifier" play against each other during training?

To motivate the strong model (the prover) to generate more legible text, the researchers adopted a training scheme that requires the answers generated by the strong model to be easily verifiable by the weaker model (the verifier). Specifically, they used large and small models from the GPT-4 family, differing by about three orders of magnitude in the amount of pre-training data. Both received some training on mathematical derivations based on human annotations, but no extensive post-training.

First, the research team asked the prover model to try to solve a series of math problems. Because the prover model can itself be wrong, some of the answers it generates are incorrect. The team then submitted these answers to the verifier model for checking, alternating over multiple rounds between the following two phases:

  • "Verifier" training: The research team trained the weaker " The "verifier" model verifies whether the answer given by the "prover" is correct.
  • "Prover" training: The research team instructs the more powerful "prover" model to play a "useful" or "cunning" role.
  • When the "Prover" is set to "Useful", the research team will reward the correct answer with a high score from the "Verifier".
  • When the "prover" is set to "cunning", the research team instead rewards the "verifier" with high-scoring wrong answers.

The research team found that, at first, the sneaky "prover" could easily generate incorrect answers that convinced the "verifier". However, these wrong answers then feed into the next round of "verifier" training, so the same exploit is no longer convincing in the following round, and the "prover" is forced to explore a different strategy in each round.

Training ends when the sneaky "prover" can no longer generate incorrect solutions that convince the "verifier".
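A minimal sketch of this alternating loop in Python is shown below. The class and function names (Prover, Verifier, is_correct, prover_verifier_training) are placeholders invented for this sketch, not code from the paper, and the reward rules are simplified versions of the helpful/sneaky objectives described above.

```python
import random

class Prover:
    """Stub for the strong model. A real prover would be a large LLM."""
    def sample(self, problem, role="helpful"):
        # Placeholder: generate a chain-of-thought solution in the given role.
        return f"[{role} solution to: {problem}]"

    def reinforce(self, problem, solution, reward):
        # Placeholder for an RL (policy-gradient) update on the prover.
        pass

class Verifier:
    """Stub for the weaker model that judges solutions."""
    def train(self, solutions, labels):
        # Placeholder for supervised training on (solution, correctness) pairs.
        pass

    def score(self, solution):
        # Placeholder: return how convincing the verifier finds the solution.
        return random.random()

def is_correct(problem, solution):
    # Placeholder for checking a solution against the ground-truth answer.
    return random.random() > 0.5

def prover_verifier_training(prover, verifier, problems, num_rounds=5):
    """Alternate verifier training and prover training, as described above (simplified)."""
    for _ in range(num_rounds):
        # Phase 1: train the weak verifier to separate correct from incorrect prover solutions.
        solutions = [prover.sample(p) for p in problems]
        labels = [is_correct(p, s) for p, s in zip(problems, solutions)]
        verifier.train(solutions, labels)

        # Phase 2: train the prover against the current verifier, in a helpful or sneaky role.
        for problem in problems:
            role = random.choice(["helpful", "sneaky"])
            solution = prover.sample(problem, role=role)
            convincingness = verifier.score(solution)
            correct = is_correct(problem, solution)
            if role == "helpful":
                reward = convincingness if correct else 0.0      # reward convincing correct answers
            else:
                reward = convincingness if not correct else 0.0  # reward convincing wrong answers
            prover.reinforce(problem, solution, reward)
    return prover, verifier

if __name__ == "__main__":
    prover_verifier_training(Prover(), Verifier(), ["Shawna's father's age problem"])
```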

What did the researchers find?
Performance vs. readability trade-off:

  • A model fine-tuned only with supervised learning on human-written derivations has high readability but poor accuracy.
  • A model optimized only for correctness, i.e., focused solely on generating correct answers, achieves high accuracy, but the solutions it generates are hard to read.
  • The checkability-game method proposed by OpenAI balances the two, maintaining high readability while keeping moderate accuracy.

Three useful models:

  • Robust verifier: effectively distinguishes correct from incorrect solutions, even when the solutions are designed to be misleading.
  • Helpful prover: generates solutions that humans can understand, thereby reducing the errors made by human evaluators.
  • Sneaky prover: produces subtle, incorrect solutions that initially confuse human evaluators, thereby highlighting areas where the models still need improvement.

The research team used the same test question to illustrate the sneaky prover's generated content and its training process.

Question: Shawna's father is five times as old as Shawna. Shawna is now three times as old as Aliya. If Aliya is 3 years old, how old is Shawna's father?


Conclusion

As large language models (LLMs) are used more and more in critical domains and complex applications, it becomes crucial to know whether the content they generate is trustworthy. Requiring models to produce clear, verifiable justifications for their outputs can enhance trust in what they produce.

A significant advantage of this approach is that it reduces reliance on human demonstrations and on human judgments of legibility. This autonomy is particularly important for aligning future superintelligent AI systems, where the ultimate goal is to reliably align AI systems with human values and expectations without direct human oversight.

Although this work was conducted on only one dataset and still requires ground-truth labels, the research team expects methods of this kind to play a key role in developing correct, transparent and verifiable AI systems, enhancing their trustworthiness and safety in real-world applications.

For more details, please refer to the original paper.

Reference link:
https://openai.com/index/prover-verifier-games-improve-legibility/

