
A new way to play crowdsourcing! A benchmark born in the LLM Arena strictly separates the poor students from the top students.

WBOY (Original)
2024-06-02 13:21:08

Which company is the best on the large-model leaderboards? You still have to check the LLM Arena.

As of now, a total of 90 LLMs have joined the battle, and the total number of user votes has exceeded 770,000.


However, while netizens were busy joking about new models rushing to the top and old models losing face,

LMSYS, the organization behind the Arena, quietly turned those results into something more: Arena-Hard, a convincing benchmark born from real-world battles.


Arena-Hard demonstrates four advantages that current LLM benchmarks need most:

- Separability (87.4%), significantly better than MT-Bench (22.6%);

- The closest agreement with Chatbot Arena rankings, at 89.1%;

- Fast and cheap to run (about $25);

- Frequently updated with real-time data.

In plain terms: first, an exam for large models has to be discriminative; it should not let even the weak students score 90 points.

Second, the exam questions should reflect real-world use, and the grading should align strictly with human preferences.

Finally, the questions must not leak, so the test data has to be refreshed frequently to keep the exam fair.

The last two requirements are tailor-made for the LLM Arena.

Let’s take a look at the effect of the new benchmark:

[Figure: Arena-Hard v0.1 vs. MT-Bench comparison]

The figure above compares Arena-Hard v0.1 with the previous state-of-the-art benchmark, MT-Bench.

Compared with MT-Bench, Arena-Hard v0.1 has much stronger separability (jumping from 22.6% to 87.4%) and narrower confidence intervals.

In addition, take a look at this ranking. It is basically consistent with the latest LLM arena ranking below:

[Figure: latest LLM Arena leaderboard]

This shows that Arena-Hard's evaluation closely matches human preference (89.1% agreement).

Arena-Hard can be seen as opening up a new way to do crowdsourcing:

Users get to try the models for free, while the platform gets a highly influential leaderboard plus fresh, high-quality data; nobody loses.


Asking questions for large models

Let’s take a look at how to build this benchmark test.

Simply put, the task is to pick out the best prompts from the 200,000 user prompts (questions) collected in the Arena.

This "good" is reflected in two aspects: diversity and complexity. The following figure shows Arena-Hard’s workflow:

[Figure: Arena-Hard workflow]

To summarize: first, cluster all prompts by topic (yielding more than 4,000 topic clusters); then score each prompt against a set of hand-defined criteria and average the scores of the prompts within each cluster.

Categories with high scores can be considered to be of high complexity (or quality) - which is the meaning of "Hard" in Arena-Hard.

Take the 250 highest-scoring clusters (250 ensures diversity) and randomly sample 2 prompts from each to form the final benchmark set (500 prompts).

In more detail:

Diversity

The researchers first embedded each prompt with OpenAI's text-embedding-3-small, reduced the dimensionality with UMAP, and used a hierarchy-based clustering algorithm (HDBSCAN) to identify clusters, which were then summarized with GPT-4-Turbo.
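
As a concrete illustration, here is a minimal Python sketch of this clustering step. The embedding model name comes from the article; the UMAP and HDBSCAN parameters are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch of the diversity pipeline: embed -> reduce -> cluster.
# Model name is from the article; all numeric parameters are assumptions.
import hdbscan
import umap
from openai import OpenAI

client = OpenAI()

def embed_prompts(prompts: list[str]) -> list[list[float]]:
    """Embed each prompt with OpenAI's text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=prompts)
    return [item.embedding for item in resp.data]

def cluster_prompts(prompts: list[str]) -> list[int]:
    """Reduce embeddings with UMAP, then find topic clusters with HDBSCAN."""
    embeddings = embed_prompts(prompts)
    reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)
    # HDBSCAN labels each prompt with a cluster id; -1 means "noise" (no cluster).
    return hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced).tolist()
```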

[Figure]

Complexity

High-quality user queries are selected using the seven key criteria listed below:


1. Does the prompt ask for specific output?

2. Does it cover one or more specific areas?

3. Are there multiple levels of reasoning, components, or variables?

4. Should AI directly demonstrate its ability to solve problems?

5. Is there a level of creativity involved?

6. Is technical accuracy of the response required?

7. Is it relevant to practical applications?

For each prompt, an LLM (GPT-3.5-Turbo or GPT-4-Turbo) annotates how many of the criteria it meets (a score from 0 to 7), and the average score is then computed for each group of prompts (each cluster).
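
A hedged sketch of this scoring step: an LLM annotator counts how many of the seven criteria a prompt satisfies, and the scores are averaged per cluster. The annotation prompt wording below is an assumption; only the model names and the 0-7 scale come from the article.

```python
# Sketch of the 0-7 criteria annotation and per-cluster averaging.
# The annotation prompt text is an assumption; the criteria are from the list above.
from collections import defaultdict
from openai import OpenAI

client = OpenAI()

CRITERIA_PROMPT = """Count how many of the following criteria the user query meets:
1) asks for specific output; 2) covers specific domains; 3) involves multiple levels
of reasoning, components, or variables; 4) requires the AI to demonstrate problem
solving; 5) involves creativity; 6) demands technical accuracy; 7) relates to
real-world applications. Answer with a single integer between 0 and 7.

User query: {query}"""

def score_prompt(query: str, model: str = "gpt-3.5-turbo") -> int:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": CRITERIA_PROMPT.format(query=query)}],
    )
    # A production version would parse the reply more defensively.
    return int(resp.choices[0].message.content.strip())

def mean_cluster_scores(prompts: list[str], labels: list[int]) -> dict[int, float]:
    """Average the 0-7 score over the prompts of each cluster."""
    totals, counts = defaultdict(float), defaultdict(int)
    for query, label in zip(prompts, labels):
        totals[label] += score_prompt(query)
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}
```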

The following figure shows the average score ranking of some clusters:

[Figure: average criteria scores of selected clusters]

We can observe that clusters with higher scores are usually more challenging topics (such as game development, mathematical proofs), while clusters with lower scores belong to trivial or ambiguous problems.

With this complexity, the gap between top students and poor students can be widened. Let’s look at the following experimental results:

[Figure: win rate vs. complexity score for three model pairs]

In the three comparisons above, assume that GPT-4 is stronger than Llama2-70b, that Claude's large model is stronger than its mid-size model, and that Mistral-Large is stronger than Mixtral.

We can see that as the (complexity) score increases, the win rate of the stronger model also increases: the top students get distinguished and the poor students get filtered out.

Since higher scores (more complex questions) give better discrimination, the 250 high-quality clusters with an average score >= 6 (out of 7) were ultimately selected.

Then 2 prompts were randomly sampled from each cluster to form this version of the benchmark, Arena-Hard-v0.1.
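
A minimal sketch of this selection step, reusing the cluster scores from the previous sketch; the threshold, cluster count, and sample size come from the article, while the function itself is illustrative.

```python
# Keep the highest-scoring clusters (mean >= 6) and sample 2 prompts from each.
import random

def build_benchmark(prompts: list[str], labels: list[int],
                    cluster_scores: dict[int, float],
                    threshold: float = 6.0, n_clusters: int = 250,
                    per_cluster: int = 2, seed: int = 0) -> list[str]:
    eligible = [c for c, s in cluster_scores.items() if s >= threshold and c != -1]
    top = sorted(eligible, key=cluster_scores.get, reverse=True)[:n_clusters]
    rng = random.Random(seed)
    benchmark = []
    for cluster in top:
        members = [p for p, l in zip(prompts, labels) if l == cluster]
        benchmark.extend(rng.sample(members, k=min(per_cluster, len(members))))
    return benchmark  # 250 clusters x 2 prompts = 500 benchmark prompts
```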

Is the teacher who judges the test papers reliable?

Once the exam papers are ready, the next question is who grades them.

Human grading is of course the most accurate, and because this is "Hard mode", many questions involve domain knowledge that only experts could evaluate, which is clearly not feasible.

The next best option is to choose GPT-4, widely regarded as the smartest model available, as the grader.

For example, in the charts above, all of the scoring was handled by GPT-4. In addition, the researchers used chain-of-thought (CoT) prompting to have the LLM judge generate its own answer before delivering a verdict.
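
The exact judging template is not shown here, so the sketch below is only an assumption of what such a CoT pairwise judgment could look like; the prompt wording and the verdict markers are invented for illustration, while the judge model name comes from the article.

```python
# Sketch of a pairwise, chain-of-thought judgment. Prompt wording and the
# "[[A]]"/"[[B]]"/"[[TIE]]" markers are assumptions, not the official template.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. First write your own brief answer to
the user question, then compare the two assistants' answers against it, reasoning
about correctness, helpfulness, and depth. Finish with one verdict token:
[[A]], [[B]], or [[TIE]].

Question: {question}

Assistant A: {answer_a}

Assistant B: {answer_b}"""

def judge_pair(question: str, answer_a: str, answer_b: str,
               judge: str = "gpt-4-1106-preview") -> str:
    resp = client.chat.completions.create(
        model=judge,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    text = resp.choices[0].message.content
    # The chain of thought comes first; the verdict marker is parsed afterwards.
    if "[[A]]" in text:
        return "A"
    if "[[B]]" in text:
        return "B"
    return "tie"
```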

GPT-4 judgment results

The following uses gpt-4-1106-preview as the judge model, with gpt-4-0314 as the comparison baseline.

[Table: Arena-Hard-v0.1 results with gpt-4-1106-preview as judge]

In the table above, Bradley-Terry coefficients are computed for each model and converted into a win rate against the baseline, which serves as the final score. The 95% confidence intervals are computed from 100 rounds of bootstrapping.
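
As a rough illustration of the confidence-interval step: the article fits Bradley-Terry coefficients and converts them into win rates against the baseline, while the simplified sketch below merely bootstraps raw per-battle outcomes against gpt-4-0314, which is an assumption rather than the authors' exact procedure.

```python
# Simplified bootstrap of a model's win rate against the baseline.
# outcomes: one value per judged battle (1.0 = win, 0.5 = tie, 0.0 = loss).
import numpy as np

def bootstrap_winrate(outcomes: list[float], n_rounds: int = 100,
                      alpha: float = 0.05, seed: int = 0):
    rng = np.random.default_rng(seed)
    arr = np.asarray(outcomes, dtype=float)
    means = [rng.choice(arr, size=arr.size, replace=True).mean()
             for _ in range(n_rounds)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return arr.mean(), (lo, hi)  # point estimate and 95% confidence interval
```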

Claude expressed dissatisfaction

"I, Claude-3 Opus, am also tied for first on the leaderboard; why should GPT get to be the grading teacher?"

So, the researchers compared the performance of GPT-4-1106-Preview and Claude-3 Opus as marking teachers.

Summary in one sentence: GPT-4 is a strict father, Claude-3 is a loving mother.

[Figure: model scores with GPT-4 vs. Claude-3 as judge]

Separability across models is higher when GPT-4 does the scoring (scores range from 23.0 to 78.0).

When Claude-3 does the grading, most models' scores rise considerably: it takes good care of its own models, it is also fond of open-source models (Mixtral, Yi, Starling), and it does concede that gpt-4-0125-preview is better than itself.

Claude-3 even rates gpt-3.5-0613 above gpt-4-0613.

The following table further compares GPT-4 and Claude-3 using separability and consistency metrics:

[Table: separability and consistency metrics for GPT-4 vs. Claude-3 as judges]

Judging from the results, GPT-4 is significantly better across all metrics.

Manually comparing cases where GPT-4 and Claude-3 disagree, we find that the disagreements usually fall into two broad categories:

Conservative scoring, and differing interpretations of the user prompt.

Claude-3-Opus is more lenient in its scoring and much less likely to hand out harsh verdicts; in particular, it is hesitant to declare one answer "much better" than another.

In contrast, GPT-4-Turbo identifies errors in model responses and penalizes the model with significantly lower scores.

On the other hand, Claude-3-Opus sometimes ignores smaller errors. Even when Claude-3-Opus does find these errors, it tends to treat them as minor issues and is very lenient during the scoring process.

Even in coding and math problems, where small mistakes can completely ruin the final answer, Claude-3-Opus remains lenient toward such mistakes; GPT-4-Turbo does not.


For another small set of prompts, Claude-3-Opus and GPT-4-Turbo judge from fundamentally different angles.

For example, given a coding problem, Claude-3-Opus favors a simple structure that does not rely on external libraries, which can provide the user with a response of maximum educational value.

GPT-4-Turbo, on the other hand, may prioritize the response that provides the most practical answer, regardless of its educational value to the user.

While both explanations are valid criteria for judging, GPT-4-Turbo's view may be closer to that of ordinary users.

See the image below for specific examples of different judgments, many of which exhibit this phenomenon.

[Figure: examples of differing judgments]

Limitations

Do LLM judges prefer longer answers?

The average token length and score of each model on MT-Bench and Arena-Hard-v0.1 are plotted below. Visually, there is no strong correlation between score and length.

[Figure: average token length vs. score on MT-Bench and Arena-Hard-v0.1]

To further examine potential verbosity bias, the researchers ran an ablation with GPT-3.5-Turbo under three different system prompts (original, chatty, detailed).

The results show that the judgments of both GPT-4-Turbo and Claude-3-Opus can be swayed by longer output, with Claude affected more (GPT-3.5-Turbo's win rate against GPT-4-0314 even exceeds 40%).

Interestingly, the "chatty" prompt had little impact on either judge's win rates, suggesting that output length is not the only factor; more detailed answers may also be favored by LLM judges.

[Figure: verbosity ablation results]

System prompts used in the experiment:

detailed: You are a helpful assistant who thoroughly explains things with as much detail as possible.

chatty: You are a helpful assistant who is chatty.
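
For completeness, here is a small sketch of how this ablation could be reproduced, feeding the generated answers back into the judge_pair sketch from earlier. The "detailed" and "chatty" prompt texts are quoted from the article, while the "original" system prompt is an assumption.

```python
# Generate GPT-3.5-Turbo answers under each system prompt; the answers can then
# be re-judged against the baseline with judge_pair() from the earlier sketch.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPTS = {
    "original": "You are a helpful assistant.",  # assumed default prompt
    "detailed": ("You are a helpful assistant who thoroughly explains things "
                 "with as much detail as possible."),
    "chatty": "You are a helpful assistant who is chatty.",
}

def answer_with_style(question: str, style: str,
                      model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPTS[style]},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```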

Variance of GPT-4 judgments

The researchers found that even if temperature = 0, GPT-4-Turbo may still produce slightly different judgments.

Below, the judgment of gpt-3.5-turbo-0125 is repeated three times and the variance is calculated.

[Figure: variance across three repeated judgments of gpt-3.5-turbo-0125]
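
As a rough illustration of this check (not the authors' script), here is a tiny sketch that re-judges the same battles several times with the judge_pair sketch from earlier and reports the variance of the resulting win rates.

```python
# Repeat the judging pass n times at temperature 0 and report the variance
# of the resulting win rates. battles: list of (question, answer_a, answer_b).
import statistics

def judgment_variance(battles, judge_fn, n_repeats: int = 3) -> float:
    winrates = []
    for _ in range(n_repeats):
        total = 0.0
        for question, answer_a, answer_b in battles:
            verdict = judge_fn(question, answer_a, answer_b)
            total += 1.0 if verdict == "A" else 0.5 if verdict == "tie" else 0.0
        winrates.append(total / len(battles))
    return statistics.pvariance(winrates)
```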

Due to limited budget, only one evaluation of all models is performed here. However, the authors recommend using confidence intervals to determine model separation.

Reference: https://www.php.cn/link/6e361e90ca5f9bee5b36f3d413c51842

