One trick to spot large models that cheat: a PhD student's open-source "demon-revealing mirror" for AI math

Many large models now claim to be good at mathematics. Which ones have real ability, and which ones "cheated" by memorizing the test questions?

Someone has now run a comprehensive test using the questions from this year's Hungarian national high school mathematics final exam, which were only recently published, and many models were immediately "revealed in their true form".

Look at the green part first: these large models score similarly on the classic mathematics test set GSM8k and on the new exam, and together they form the reference standard.

Now look at the red part: these models score significantly higher on GSM8k than other large models of the same parameter scale, but on the new exam their scores drop sharply, to about the level of other models of the same size. The researcher classified them as "suspected or known to have been trained on GSM8k".

After seeing this test, some people said it is time to start evaluating large models on questions they have never seen before.

Others think that this kind of test, combined with everyone's hands-on experience of using large models, is currently the only reliable evaluation method.
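
The comparison described above (a reference group that scores consistently on both tests, and a suspect group whose GSM8k score is far above what its new-exam score would suggest) can be sketched in a few lines of Python. This is only an illustrative sketch, not the tester's actual script: the `Score` class, the `fit_line` and `flag_suspects` helpers, the 10-point `margin`, and all the numbers are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    name: str
    gsm8k: float  # accuracy on GSM8k, in percent
    exam: float   # score on the unseen exam, in percent


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_suspects(reference, candidates, margin=10.0):
    """Return names of candidates whose GSM8k score exceeds, by more than
    `margin` points, what the reference trend predicts from their exam score."""
    slope, intercept = fit_line([(s.exam, s.gsm8k) for s in reference])
    suspects = []
    for s in candidates:
        expected_gsm8k = slope * s.exam + intercept
        if s.gsm8k - expected_gsm8k > margin:
            suspects.append(s.name)
    return suspects


# Made-up numbers for illustration only; these are NOT results from the test.
reference = [
    Score("ref-a", gsm8k=40.0, exam=20.0),
    Score("ref-b", gsm8k=60.0, exam=40.0),
    Score("ref-c", gsm8k=80.0, exam=60.0),
]
candidates = [
    Score("model-x", gsm8k=78.0, exam=25.0),  # high GSM8k, low exam score
    Score("model-y", gsm8k=55.0, exam=33.0),  # roughly on the reference trend
]

print(flag_suspects(reference, candidates))  # prints ['model-x']
```

A large gap of this kind is only a hint that a model may have seen GSM8k during training, not proof. The margin is an arbitrary choice, and a real analysis would also need to compare models of similar parameter scale, as the article does.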

Musk's Grok is second only to GPT-4, and the open-source Llemma does very well

The tester, Keiran Paster, is a PhD student at the University of Toronto, a student researcher at Google, and one of the authors of Llemma, one of the large models in the test.

Having large models sit the Hungarian national high school mathematics final exam is a trick that comes from Musk's xAI.

To rule out the possibility that its Grok large model had accidentally seen test questions in web data, xAI ran this exam in addition to several common test sets.

This year's exam was only held at the end of May, so current large models have had essentially no opportunity to see this set of questions.

When it released Grok, xAI also published the results of GPT-3.5, GPT-4, and Claude 2 for comparison.

Building on this data, Paster ran further tests on several open-source models with strong mathematical capabilities.

The test questions, the test scripts, and each model's answers are open-sourced on Huggingface, so that anyone can check them and test other models.

The results show that GPT-4 and Claude 2 form the top tier, with very high scores on both GSM8k and the new exam.

This does not mean that no GSM8k questions leaked into the training data of GPT-4 and Claude 2, but at least they generalize well and can solve new questions correctly, so the point is not pressed here.

Next, Grok-0 (33B) and Grok-1 (parameter scale not published) from Musk's xAI performed well.

Grok-1 has the highest score in the "non-cheating group", and its score on the new exam is even higher than Claude 2's.

Grok-0 comes close to GPT-3.5 Turbo on GSM8k and is slightly worse on the new exam.

Apart from the closed models above, all other models in the test are open source.

The Code Llama series was fine-tuned by Meta itself on top of Llama 2 and focuses on generating code from natural language. Its mathematical ability now appears to be slightly worse than that of other models of the same scale.

Building on Code Llama, several universities and research institutions jointly released the Llemma series, which was open-sourced by EleutherAI.

The team assembled the Proof-Pile-2 dataset from scientific papers, web data containing mathematics, and mathematical code. After training on it, Llemma can use tools and carry out formal theorem proving without any further fine-tuning.

On the new exam, Llemma 34B performs close to the level of GPT-3.5 Turbo.

The Mistral series is trained by the French AI unicorn Mistral AI. Its Apache 2.0 open-source license is more permissive than Llama's, and it has become the most popular base model in the open-source community after the Llama family.

OpenChat 3.5 and MetaMath Mistral are both fine-tuned within the Mistral ecosystem.

MetaMath and MAmmoTH Code are based on the Code Llama ecosystem. Anyone choosing an open-source large model for real business use should be careful to avoid this group: these models may well score highly only to climb the leaderboards, while their actual capability may be weaker than that of other models of the same scale.

Many commenters thanked Paster for this experiment, saying it is exactly what is needed to understand how models actually perform.

Some people have voiced concerns:

From this day on, everyone who trains large models will add Hungarian math exam questions from previous years.

The same commenter suggests that the solution may be a specialized large-model evaluation company with its own proprietary tests.

Another proposal is to establish a test benchmark that is updated every year, to alleviate the overfitting problem.

Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.