
One trick to spot large models that cheat on benchmarks: a PhD student open-sources an AI math "demon-revealing mirror"

WBOY
2023-11-17 12:38:44

Nowadays, many large models claim to be good at mathematics. Which ones have real ability, and which ones "cheated" by memorizing the test questions?

Someone has now run a comprehensive test using the newly released questions from this year's Hungarian national high school mathematics final exam, and many models were suddenly exposed for what they really are.


Look at the green part first: these large models score similarly on the classic mathematics test set GSM8k and on the new exam, and together they form the reference baseline.

Now look at the red part: on GSM8k these models score significantly higher than other large models of the same parameter scale, but on the new exam their scores drop sharply, to roughly the level of other models of the same size. The researcher classified them as "suspected or known to have been trained on GSM8k".

After seeing this test, some people said it is time to start evaluating models on questions they have never seen before.
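The green and red grouping described above can be read as a simple comparison: a model whose GSM8k score sits far above its score on the unseen exam, relative to its peers, is a candidate for contamination. The following is a minimal sketch of that idea only. The model names, the scores, and the `margin` threshold are all hypothetical and invented for illustration; they are not the published results, and this is not necessarily how the tester drew the line.

```python
from statistics import median

# Hypothetical scores (percent correct), invented purely for illustration.
# They are NOT the results reported in the test described in this article.
scores = {
    "model_a": {"gsm8k": 50.0, "new_exam": 40.0},
    "model_b": {"gsm8k": 60.0, "new_exam": 48.0},
    "model_c": {"gsm8k": 55.0, "new_exam": 45.0},
    "model_d": {"gsm8k": 80.0, "new_exam": 42.0},
}


def flag_suspected_contamination(scores, margin=15.0):
    """Return models whose GSM8k-minus-new-exam gap exceeds the
    median gap across all models by more than `margin` points.

    `margin` is an arbitrary, assumed threshold.
    """
    gaps = {name: s["gsm8k"] - s["new_exam"] for name, s in scores.items()}
    typical_gap = median(gaps.values())
    return sorted(
        name for name, gap in gaps.items() if gap - typical_gap > margin
    )


flagged = flag_suspected_contamination(scores)
print(flagged)
```

With these made-up numbers only `model_d` is flagged, because its GSM8k score is far higher than its result on the unseen exam would suggest. A real analysis would also need to compare models of similar parameter scale, as the article does.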

Others think that tests like this, together with everyone's hands-on experience of using large models, are currently the only reliable evaluation method.

Musk's Grok is second only to GPT-4, and the open-source Llemma achieves excellent results

The tester, Keiran Paster, is a PhD student at the University of Toronto, a Google student researcher, and one of the authors of the Llemma large model included in the test.

The idea of having large models sit the Hungarian national high school mathematics final exam comes from Musk's xAI. To rule out the possibility that its Grok model had accidentally seen test questions in web data, xAI ran this exam in addition to several common test sets.

This year's exam was only held at the end of May, so current large models have had essentially no opportunity to see this set of questions.

For comparison, xAI also published the results of GPT-3.5, GPT-4, and Claude 2 when it released Grok.

Building on this data, Paster ran further tests on multiple open-source models with strong mathematical capabilities.

The test questions, the test scripts, and each model's answers are open-sourced on Hugging Face, so that anyone can check them and test other models.

The results show that GPT-4 and Claude 2 form the first tier, with very high scores on both GSM8k and the new exam.

This does not prove that the training data of GPT-4 and Claude 2 contains no leaked GSM8k questions, but at least they generalize well and can solve new questions correctly, so it matters little.

Next, xAI's Grok-0 (33B) and Grok-1 (parameter count undisclosed) both performed well.

Grok-1 has the highest score in the "non-cheating group", and its score on the new exam is even higher than Claude 2's.

Grok-0's performance on GSM8k is close to GPT-3.5 Turbo, and it is slightly worse on the new exam.

Apart from the closed models mentioned above, all other models in the test are open source.

The Code Llama series is Meta's own fine-tune built on Llama 2, focused on generating code from natural language. Its mathematical ability now appears slightly weaker than that of other models of the same scale.

Building on Code Llama, several universities and research institutions jointly released the Llemma series, which was open-sourced by EleutherAI. The team assembled the Proof-Pile-2 dataset from scientific papers, web data containing mathematics, and mathematical code. After training, Llemma can use tools and perform formal theorem proving without any further fine-tuning.

On the new exam, Llemma 34B performs close to the GPT-3.5 Turbo level.

The Mistral series is trained by the French AI unicorn Mistral AI. Its Apache 2.0 open-source license is more permissive than Llama's, which has made it the most popular base model in the open-source community after the Llama family.


OpenChat 3.5 and MetaMath Mistral are both fine-tuned from models in the Mistral ecosystem.

MetaMath and MAmmoTH Code are based on the Code Llama ecosystem.

Anyone planning to adopt open-source large models in real business needs to be careful to avoid this group: they may score well only because they were tuned to climb the leaderboards, while their actual capabilities may be weaker than those of other models of the same scale.

Many netizens thanked Paster for this experiment, saying it is exactly what is needed to understand how the models actually perform.

Some people have expressed concerns:

From today on, everyone who trains large models will add Hungarian math exam questions from previous years to their training data.

The same commenter suggests that the solution may be a specialized large-model evaluation company with its own proprietary tests.

Another proposal is to establish a test benchmark that is updated every year, to alleviate the overfitting problem.


Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.