One trick to spot large models that cheat: a PhD student open-sources an AI math "demon-revealing mirror"
Many large models now claim to be good at mathematics. Which of them have real ability, and which have "cheated" by memorizing the test questions?
This year, someone ran a comprehensive test on the freshly released questions of the Hungarian national high school mathematics final exam, and many models immediately showed their true colors.
Look at the green group first: these models score similarly on the classic math test set GSM8k and on the new exam, so together they form the reference standard.

Now look at the red group: their GSM8k scores are significantly higher than those of models with the same parameter scale, yet their scores drop sharply on the new exam, falling back to roughly the level of same-size models. The researchers classified them as "suspected or known to have been trained on GSM8k".

After seeing this test, some said that evaluation should move to questions the models have never seen before. Others argued that tests like this, combined with everyone's hands-on experience with the models, are currently the only reliable way to evaluate them.
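The heuristic behind the chart is simple to reproduce. Below is a minimal Python sketch of the idea: compare each model's GSM8k score with its score on the unseen exam, and flag models whose gap is far larger than the reference group's. All model names, scores, and the threshold are made up for illustration; they are not the experiment's actual numbers.

```python
# Heuristic from the chart: a model that beats same-size peers on GSM8k
# but falls back to peer level on an unseen exam is a contamination suspect.
# All scores below are placeholders, not the experiment's real results.

scores = {
    # model: (gsm8k_accuracy, new_exam_accuracy) -- hypothetical values
    "model_a": (0.62, 0.55),
    "model_b": (0.81, 0.32),
    "model_c": (0.58, 0.51),
}

# Reference group: models whose two scores roughly agree (the "green" set).
reference = [(g, e) for g, e in scores.values() if abs(g - e) < 0.15]
baseline_gap = sum(g - e for g, e in reference) / len(reference)

SUSPECT_MARGIN = 0.25  # arbitrary threshold chosen for this sketch

for name, (gsm8k, exam) in scores.items():
    excess_gap = (gsm8k - exam) - baseline_gap
    if excess_gap > SUSPECT_MARGIN:
        print(f"{name}: GSM8k {gsm8k:.0%} vs exam {exam:.0%} "
              f"-- suspected GSM8k contamination")
    else:
        print(f"{name}: consistent across benchmarks")
```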
Musk's Grok is second only to GPT-4, and the open-source Llemma performs well
The tester, Keiran Paster, is a PhD student at the University of Toronto, a student researcher at Google, and one of the authors of the Llemma model included in the test.
The trick of making large models sit the Hungarian national high school mathematics final exam comes from Musk's xAI. To rule out the possibility that its Grok model had accidentally seen the test questions in web data, xAI evaluated it on this exam in addition to several common test sets.

This year's exam was only administered at the end of May, so current large models have had essentially no chance to see this set of questions. At Grok's release, xAI also published results for GPT-3.5, GPT-4, and Claude 2 for comparison. Building on that data, Paster ran further tests on several open-source models known for strong math ability.
The test questions, evaluation scripts, and each model's answers are open-sourced on Hugging Face, so anyone can check them or use them to test other models.
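For readers who want to re-run the evaluation, a sketch along these lines should work with the Hugging Face `datasets` library. The dataset ID and split name below are assumptions based on Paster's username; verify the actual repository on his Hugging Face profile before running.

```python
# Minimal sketch: pull the exam questions for inspection or re-evaluation.
# The dataset ID and split are assumptions -- check Hugging Face first.
from datasets import load_dataset

exam = load_dataset("keirp/hungarian_national_hs_finals_exam", split="test")

for row in exam.select(range(3)):  # peek at the first few problems
    print(row)
```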
The results show that GPT-4 and Claude 2 form the first tier, with very high scores on both GSM8k and the new exam.

This does not prove that no GSM8k questions leaked into the training data of GPT-4 and Claude 2, but at the very least they generalize well and solve new problems correctly, so the question matters little. Next come Musk's xAI models, Grok-0 (33B) and Grok-1 (parameter count undisclosed), which both performed well.

Grok-1 has the highest score in the "non-cheating group", and its result on the new exam even exceeds Claude 2's. Grok-0's performance on GSM8k is close to GPT-3.5 Turbo, and slightly worse on the new exam.
Apart from the closed models above, the other models in the test are all open source. The Code Llama series is Meta's own fine-tune of Llama 2, focused on generating code from natural language; judging from this test, its math ability is slightly weaker than that of models of the same scale.
Building on Code Llama, several universities and research institutions jointly launched the Llemma series, which was open-sourced by EleutherAI. The team collected the Proof-Pile-2 dataset from scientific papers, math-heavy web data, and mathematical code. After training on it, Llemma can use tools and do formal theorem proving without any further fine-tuning.
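As a rough idea of how one might query Llemma locally, here is a sketch using Hugging Face `transformers`. The checkpoint name `EleutherAI/llemma_7b` follows EleutherAI's release naming but should be verified, and the prompt format is an illustrative guess, not the evaluation script's actual template.

```python
# Sketch: greedy decoding from a Llemma checkpoint on one math problem.
# Checkpoint ID and prompt style are assumptions, not the official eval setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/llemma_7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Problem: Solve for x: 2x + 3 = 11.\nSolution:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```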
On the new exam, Llemma 34B performs close to the GPT-3.5 Turbo level.
The Mistral series is trained by the French AI unicorn Mistral AI. Its Apache 2.0 license is more permissive than Llama's, making it the most popular base model in the open-source community after the Llama family. OpenChat 3.5 and MetaMath Mistral are fine-tuned from the Mistral ecosystem, while MetaMath and MAmmoTH Coder are based on the Code Llama ecosystem.

Anyone adopting open-source large models for real business should be careful to avoid the red group: their strong benchmark showings may exist mainly to climb the leaderboards, while their actual ability may be weaker than that of other models of the same scale.

Many netizens thanked Paster for the experiment, saying it is exactly what is needed to understand how the models really perform. Some also raised concerns that public benchmarks will keep being overfit over time. One suggested the solution may be a specialized large-model evaluation company running proprietary tests; another proposal is to establish a test benchmark that is updated year by year to alleviate the overfitting problem.