750,000 rounds of one-on-one battles between large models: GPT-4 takes first place, Llama 3 ranks fifth

Another test result for Llama 3 has been released.

LMSYS, the large-model evaluation community, has published its latest leaderboard: Llama 3 ranks fifth overall and ties with GPT-4 for first place in the English-only category.

Unlike other benchmarks, this leaderboard is based on one-on-one battles between models, with evaluators from across the internet writing their own questions and scoring the answers.

In the end, Llama 3 ranked fifth overall, behind three different versions of GPT-4 and Claude 3 Opus, the largest Claude 3 model.

On the English-only leaderboard, Llama 3 overtook Claude and tied with GPT-4.

Meta's chief AI scientist Yann LeCun was clearly pleased with the result: he retweeted the post and added a "Nice".

Soumith Chintala, the father of PyTorch, was also excited, calling the results incredible and saying he is proud of Meta.

The 400B version of Llama 3 is not out yet, and it took fifth place with 70B parameters alone...
I still remember that when GPT-4 was released in March last year, reaching the same level of performance seemed almost impossible.
...
The popularization of AI today is truly incredible, and I am very proud of my colleagues at Meta AI for achieving such success.

So, what specific results does this list show?

Nearly 90 models competed in 750,000 rounds

As of the latest leaderboard release, LMSYS has collected nearly 750,000 one-on-one battle results involving 89 models.

Llama 3 has taken part in 12,700 of these battles. GPT-4 appears in several versions, the most active of which has taken part in 68,000.

A chart in the original article shows the number of battles and the win rates of some popular models. Neither metric counts draws.

LMSYS publishes an overall leaderboard and several category leaderboards. On the overall leaderboard, GPT-4-Turbo ranks first, tied with the earlier 1106 version and with Claude 3 Opus, the largest Claude 3 model.

Another version (0125) of GPT-4 ranks second, followed closely by Llama 3.

But what’s more interesting is that the newer version 0125 does not perform as well as the older version 1106.

On the English-only leaderboard, Llama 3 ties with two GPT-4 versions and even surpasses the 0125 version.

First place on the Chinese leaderboard is shared by Claude 3 Opus and GPT-4-1106, while Llama 3 falls outside the top 20.

Besides the language categories, the leaderboard also ranks long-text and coding ability, and Llama 3 is among the top performers in both.

So what exactly are LMSYS's "rules of the game"?

A large model test that everyone can participate in

This is a large model test that everyone can participate in. The questions and evaluation criteria are decided by the participants.

The specific "competition" process is divided into two modes: battle and side-by-side.

In battle mode, once the tester enters a question in the test interface, the system randomly picks two models from its pool. The tester does not know which models were drawn: the interface shows only "Model A" and "Model B".

After the models output their answers, the evaluator chooses which one is better or declares a tie. If neither answer meets expectations, there is a corresponding option for that as well.

The models' identities are revealed only after a choice has been made.

In side-by-side mode, the user chooses which models to pit against each other. The rest of the process is the same as in battle mode.

However, only votes cast in the anonymous battle mode are counted, and a vote is invalidated if a model accidentally reveals its identity during the conversation.

From each model's win rate against every other model, a pairwise win-rate chart can be drawn. (The original article shows a schematic based on an earlier version of the leaderboard.)
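
As a rough illustration of how such pairwise win rates could be tabulated, here is a minimal Python sketch. The record format (`model_a`, `model_b`, `winner`) and the function name `pairwise_win_rates` are assumptions made for this example, not LMSYS's actual data schema or code, and skipping draws is likewise an assumption.

```python
from collections import defaultdict


def pairwise_win_rates(battles):
    """Return {(a, b): share of decisive a-vs-b battles that a won}.

    `battles` is an iterable of (model_a, model_b, winner) tuples, where
    winner is "model_a", "model_b" or "tie" (hypothetical format).
    """
    wins = defaultdict(int)  # wins[(a, b)] = number of times a beat b
    for model_a, model_b, winner in battles:
        if winner == "model_a":
            wins[(model_a, model_b)] += 1
        elif winner == "model_b":
            wins[(model_b, model_a)] += 1
        # any other outcome (e.g. "tie") is skipped

    rates = {}
    for (a, b) in list(wins):
        a_wins = wins[(a, b)]
        b_wins = wins.get((b, a), 0)
        total = a_wins + b_wins
        rates[(a, b)] = a_wins / total
        rates[(b, a)] = b_wins / total
    return rates
```

Each entry of the returned dictionary corresponds to one cell of the win-rate chart.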

The final ranking is obtained by using Win Rate data and converting it into scores through the Elo evaluation system.

The Elo rating system is a method of calculating the relative skill level of players, designed by American physics professor Arpad Elo.

For LMSYS specifically, the ratings (R) of all models are initially set to 1000, and the expected win rate (E) of model A against model B is then calculated with the standard Elo formula:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$
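
The expected win rate in the standard Elo system can be sketched in a few lines of Python. The base 10 and the scale constant 400 are the conventional Elo values and are assumed here; the function name `expected_score` is made up for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win rate of A against B under the standard Elo model."""
    # Assumed constants: base 10 and scale 400, as in classical Elo.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Two models with equal ratings are expected to split their battles evenly.
print(expected_score(1000, 1000))  # 0.5
# A 200-point lead corresponds to an expected win rate of roughly 0.76.
print(round(expected_score(1200, 1000), 2))  # 0.76
```

The two expected scores in a pairing always sum to 1, so only one of them needs to be computed.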

As testing continues, the rating is corrected based on the actual score (S). S takes one of three values, 1, 0 or 0.5, corresponding to a win, a loss and a draw.

The correction follows the standard Elo update rule shown below, where K is a coefficient that the tester adjusts according to the actual situation:

$$R'_A = R_A + K \, (S_A - E_A)$$

After all valid data are finally included in the calculation, the Elo score of the model is obtained.
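
Put together, the rating process described above might look like the following sketch. It assumes the same hypothetical (`model_a`, `model_b`, `winner`) record format, the conventional Elo constants (base 10, scale 400), and an arbitrary illustrative default of K = 4; none of this is taken from LMSYS's code.

```python
from collections import defaultdict

# Actual score S of model A for each recorded outcome.
SCORES = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}


def compute_elo(battles, k=4.0, init_rating=1000.0):
    """Process battles in order and return {model: Elo rating}."""
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        r_a, r_b = ratings[model_a], ratings[model_b]
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # expected score of A
        s_a = SCORES[winner]                           # actual score of A
        ratings[model_a] = r_a + k * (s_a - e_a)
        # B's actual and expected scores are the complements of A's.
        ratings[model_b] = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)
```

Because each update depends on the ratings at that moment, the result depends on the order in which battles are processed, which is one reason a plain online Elo score can be unstable.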

In practice, however, the LMSYS team found that this algorithm was not stable enough, so they applied statistical methods to correct it.

They used the bootstrap method, repeatedly resampling the battle data, to obtain more stable results and to estimate confidence intervals.
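
The bootstrap idea can be illustrated with the sketch below: resample the battle records with replacement many times, recompute Elo on each resample, and summarize the spread per model. The record format, the number of rounds, K = 4 and the 95% interval are all assumptions chosen for this example rather than details given in the article.

```python
import random
import statistics
from collections import defaultdict

SCORES = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}


def elo_ratings(battles, k=4.0, init_rating=1000.0):
    """Plain online Elo over (model_a, model_b, winner) records."""
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        r_a, r_b = ratings[model_a], ratings[model_b]
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
        s_a = SCORES[winner]
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b + k * (e_a - s_a)
    return dict(ratings)


def bootstrap_elo(battles, rounds=200, seed=0):
    """Return {model: (median, lower, upper)} over bootstrap resamples."""
    rng = random.Random(seed)
    battles = list(battles)
    samples = defaultdict(list)
    for _ in range(rounds):
        # Draw len(battles) records with replacement, in random order.
        resampled = rng.choices(battles, k=len(battles))
        for model, rating in elo_ratings(resampled).items():
            samples[model].append(rating)

    summary = {}
    for model, values in samples.items():
        values.sort()
        lower = values[int(0.025 * (len(values) - 1))]
        upper = values[int(0.975 * (len(values) - 1))]
        summary[model] = (statistics.median(values), lower, upper)
    return summary
```

Taking the median over many resamples smooths out the order dependence of a single Elo pass, and the lower and upper values give an approximate 95% confidence interval for each model.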

The final revised Elo score became the basis for ranking in the list.

One More Thing

Llama 3 can already run on the large model inference platform Groq (not Musk’s Grok).

The platform's main selling point is speed. It previously ran the Mixtral model at nearly 500 tokens per second.

It is also quite fast with Llama 3: in actual tests, the 70B version runs at about 300 tokens per second, and the 8B version at close to 800.

Reference link:
[1]https://lmsys.org/blog/2023-05-03-arena/
[2]https://chat.lmsys.org/?leaderboard
[3]https://twitter.com/lmsysorg/status/1782483699449332144

Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.