
CMU conducted a detailed comparative study and found GPT-3.5 superior to Gemini Pro, ensuring a fair, transparent, and reproducible evaluation

PHPz | 2023-12-21


How strong is Google Gemini really? Carnegie Mellon University has carried out a professional, objective third-party comparison.

To ensure fairness, all models were run with the same prompts and generation parameters, and the study provides reproducible code and fully transparent results.


It also avoids the comparison made in Google's official announcement, which pitted Gemini's CoT@32 results against GPT-4's 5-shot results; here, every model is prompted the same way.
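
As a concrete illustration of this setup, below is a minimal Python sketch that sends one prompt with identical generation parameters to both a GPT and a Gemini model. This is not the CMU team's actual harness: the model names, prompt, and parameter values are illustrative assumptions, and you would substitute your own API keys and evaluation data.

```python
# Minimal sketch of a fairness-controlled comparison: the SAME prompt and the
# SAME generation parameters go to every model under test.
from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()                    # reads OPENAI_API_KEY from env
genai.configure(api_key="YOUR_GOOGLE_KEY")  # placeholder key

PROMPT = "Q: What is 17 * 24?\nA: Let's think step by step."
SHARED_PARAMS = {"temperature": 0.0, "max_tokens": 512}  # shared across models

def ask_gpt(model: str) -> str:
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=SHARED_PARAMS["temperature"],
        max_tokens=SHARED_PARAMS["max_tokens"],
    )
    return resp.choices[0].message.content

def ask_gemini(model: str) -> str:
    resp = genai.GenerativeModel(model).generate_content(
        PROMPT,
        generation_config={
            "temperature": SHARED_PARAMS["temperature"],
            "max_output_tokens": SHARED_PARAMS["max_tokens"],
        },
    )
    return resp.text  # may raise if the response was safety-blocked

for name, ask in [("gpt-3.5-turbo", ask_gpt), ("gemini-pro", ask_gemini)]:
    print(name, "->", ask(name))
```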

The result in one sentence: Gemini Pro comes close to, but falls slightly short of, GPT-3.5 Turbo, while GPT-4 remains far ahead.


The in-depth analysis also turned up some odd quirks of Gemini, such as a tendency to pick option D on multiple-choice questions...


Many researchers remarked that putting Gemini through such detailed testing just days after its release is a remarkable achievement.


In-depth testing on six tasks

The test compares the models on six different tasks, with corresponding datasets selected for each:

  • Question answering: MMLU
  • Reasoning: BIG-Bench Hard
  • Mathematics: GSM8K, SVAMP, ASDiv, MAWPS
  • Code: HumanEval, ODEX
  • Translation: FLORES
  • Web navigation: WebArena

Question answering: Likes to choose D

The results show that chain-of-thought prompting does not necessarily improve performance on this type of task (see the example below).
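
For reference, the difference between the two prompting styles amounts to a change in the prompt template. A hypothetical MMLU-style example (the question text here is made up, not from the benchmark):

```python
# Direct prompting vs. chain-of-thought prompting on a multiple-choice item.
QUESTION = (
    "Which gas makes up most of Earth's atmosphere?\n"
    "(A) Oxygen (B) Nitrogen (C) Carbon dioxide (D) Argon"
)

# Direct: ask for the answer letter immediately.
direct_prompt = QUESTION + "\nAnswer with a single letter."

# Chain of thought: ask the model to reason before answering.
cot_prompt = QUESTION + "\nLet's think step by step, then give the answer letter."
```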


In the MMLU dataset, all questions are multiple-choice. Further analysis of the results revealed a strange phenomenon: Gemini prefers option D, whereas the GPT series spreads its answers far more evenly across the four options. The team suggested this may be because Gemini has not undergone much instruction tuning on multiple-choice formats.
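
The kind of check behind this finding is straightforward to reproduce: parse each model's chosen letter and compare the distributions. A small sketch, using made-up placeholder answers rather than the paper's data:

```python
# Compare how often each model picks each multiple-choice option.
from collections import Counter

# Placeholder predictions; in practice these are parsed from model outputs.
gemini_answers = ["D", "B", "D", "D", "A", "D", "C", "D"]
gpt_answers    = ["B", "C", "A", "D", "B", "A", "C", "D"]

def option_distribution(answers: list[str]) -> dict[str, float]:
    """Fraction of responses that chose each of the options A-D."""
    counts = Counter(answers)
    return {opt: counts.get(opt, 0) / len(answers) for opt in "ABCD"}

print("Gemini:", option_distribution(gemini_answers))  # skewed toward D
print("GPT:   ", option_distribution(gpt_answers))     # roughly balanced
```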

In addition, Gemini's safety filtering is very strict: it answered only 85% of questions touching on moral scenarios, and only 28% of questions related to human sexuality.


Gemini Pro did outperform GPT-3.5 on the security studies and high school microeconomics subtasks, but the gap is small, and the team said it could not find anything distinctive about these cases.


Reasoning: Not good at long questions


The GPT series handles longer, more complex problems better; by comparison, Gemini Pro performs poorly on them.

On long questions in particular, GPT-4 Turbo shows almost no drop in performance, demonstrating a strong ability to understand complex problems. These problems involve people exchanging items, and the AI must ultimately determine which items each person ends up holding.


Tasks Gemini excels at include understanding the world's sports knowledge, manipulating symbol stacks, sorting words alphabetically, and parsing tables


Mathematics: Ahead on the most complex tasks


Very long questions cause the performance of Gemini Pro and GPT-3.5 to decline together; only GPT-4 maintains a consistent level.


On the examples requiring the longest chains of thought, however, Gemini surpasses GPT-3.5.


Code: Good at matplotlib

On code tasks, Gemini performs poorly on problems whose reference answers are longer.


The GPT series is stronger in most library categories, but it does poorly on matplotlib, where Gemini holds the advantage.


Translation: When it answers, the quality is high

On the translation task, Gemini refused to respond for 12 language pairs, but where it did answer, the translation quality was excellent, and its overall performance exceeded GPT-4.


The languages Gemini refuses to translate mainly involve Latin and Arabic.


Web navigation: Good at cross-site tasks

WebArena simulates an Internet environment for AI, including e-commerce, social forums, GitLab collaborative development, content management systems, and online maps. AI needs to find information in this environment or complete tasks across sites

Gemini performs worse overall than GPT-3.5 Turbo, but performs slightly better on tasks across multiple sites.


Netizens: But it's free

Finally, CMU associate professor Graham Neubig acknowledged some limitations of the study:

  • The behavior of API-based models can change at any time
  • Only a limited number of prompts were tried, and the best prompts may differ across models
  • There is no way to verify whether the test sets have leaked into training data


Denny Zhou (Zhou Dengyong), head of Google's large-model reasoning team, pointed out that setting Gemini's temperature to 0 can add 5-10 percentage points, which helps a great deal on reasoning tasks.


Besides the Gemini and GPT series, the test also covered the recently released open-source MoE model Mixtral.

However, reinforcement learning expert Noam Brown believes the Mixtral results can be disregarded, because the test used a third-party API rather than the official implementation.


The founder of Mistral AI has provided the team with access to the official version, which he believes will bring better results


Although Gemini Pro falls short of GPT-3.5, it has one advantage: it is free to use, as long as you stay under 60 calls per minute.

As a result, many individual developers have switched camps.
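
For developers making the switch, staying inside the free quota is mostly a matter of client-side throttling. Here is a minimal sketch of one common pattern; the 60-calls-per-minute figure comes from the article, while the `RateLimiter` class itself is a hypothetical helper, not part of any official SDK:

```python
# Space API calls out so at most `max_calls` happen per `period` seconds.
import time

class RateLimiter:
    def __init__(self, max_calls: int = 60, period: float = 60.0):
        self.min_interval = period / max_calls  # here: one call per second
        self.last_call = 0.0

    def wait(self) -> None:
        """Sleep just long enough to respect the rate limit, then record the call."""
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

limiter = RateLimiter()
for i in range(3):
    limiter.wait()                    # blocks if we are going too fast
    print(f"request {i} dispatched")  # replace with the real Gemini call
```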


Gemini's most capable version, Ultra, has not yet been released, and the CMU team plans to extend the study once it is. Do you think Gemini Ultra will reach the level of GPT-4?

The paper covered in this article: https://arxiv.org/abs/2312.11444

Reference link:


[1] https://twitter.com/gneubig/status/1737108977954251216
