Which large model performs best?
Next, let's look at a concrete example from the MR-Ben work to see why generating a final answer step by step, and evaluations based only on that answer, are unreliable.
As the evaluated "grading" model, GPT4-Turbo failed to find the calculation error hidden in the second step:
Model Solution Steps:
Step 5: Option D (37) meets all conditions and is the correct answer.
Model answer correctness: Incorrect
First wrong step: Step 2
Model answer error reason:
In step 2, the model incorrectly calculated that the remainder of 43 divided by 4 is 1 instead of 3. The model assumes that when 43 is divided by 4 the remainder is 1, but the remainder should be 3. There is a calculation error here.
When 43 is divided by 4, the remainder is 1. When 43 is divided by 3, the remainder is 1. When 43 is divided by 5, the remainder is 3, not 2. Therefore, 43 is not the correct answer.
Model reply:
-Step 1: Check the remainders of 67 divided by 4 and 3, and find that it meets the condition of leaving a remainder of 3 when divided by 4, which is a correct verification.
-Step 2: Check the remainders of dividing 43 by 4, 3, and 5, and find that the remainders 1, 3, and 5 correctly match the expectations, except that the remainder for 5 does not meet the conditions.
-Step 3: Calculate the remainders of dividing 57 by 4 and 3; the remainders are 1 and 0 respectively. The remainder when 57 is divided by 3 does not meet the criteria.
-Step 4: Check and verify the remainders of dividing 37 by 4, 3, and 5. It is found that all expected remainders are accurate, and 37 meets all the conditions in the question.
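The arithmetic in this example is easy to verify mechanically. The following is a minimal sketch that recomputes the remainders for the four candidate values mentioned in the steps. The target remainders (1, 1, 2) are an assumption inferred from the quoted steps, since the original question is not reproduced here; the function names are illustrative only.

```python
# Candidate values and divisors as they appear in the quoted steps.
CANDIDATES = [67, 43, 57, 37]
DIVISORS = (4, 3, 5)

# Assumption: target remainders for (4, 3, 5), inferred from the quoted
# steps rather than taken from the original question text.
ASSUMED_TARGET = (1, 1, 2)


def remainders(n, divisors=DIVISORS):
    """Return the remainder of n for each divisor, in order."""
    return tuple(n % d for d in divisors)


def matching_candidates(candidates=CANDIDATES, target=ASSUMED_TARGET):
    """Return the candidates whose remainders equal the assumed target."""
    return [n for n in candidates if remainders(n) == target]


if __name__ == "__main__":
    for n in CANDIDATES:
        print(n, remainders(n))
    print("matches assumed target:", matching_candidates())
```

Running it confirms the annotator's point: 43 divided by 4 leaves a remainder of 3, not 1, so Step 2 of the solution contains a calculation error even though the final choice of 37 is consistent with the assumed conditions.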
It can be seen that once the question and its solution process are fixed, evaluating a language model turns into having the model "grade" the solution process: judge whether it is correct, and point out where the error is and why. The accuracy of the correctness judgment and of the identified error location can be computed by comparing against the annotated results. Evaluating the error reason given by the model can be handed over to GPT4, which compares the annotator's explanation of the error with the model's explanation to decide whether the model is right.
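The comparison against annotations described above can be sketched as follows. This is an illustrative simplification using plain accuracy, not the official MR-Ben scoring script; the `Grade` class, its field names, and both functions are hypothetical. The GPT4-based comparison of error reasons is not covered here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Grade:
    """One grading of a solution (hypothetical structure)."""
    is_correct: bool
    first_error_step: Optional[int] = None  # None when judged correct


def correctness_accuracy(annotated: Sequence[Grade],
                         predicted: Sequence[Grade]) -> float:
    """Fraction of solutions where the model's verdict matches the annotation."""
    if len(annotated) != len(predicted):
        raise ValueError("annotated and predicted must have the same length")
    if not annotated:
        return 0.0
    hits = sum(a.is_correct == p.is_correct
               for a, p in zip(annotated, predicted))
    return hits / len(annotated)


def first_error_accuracy(annotated: Sequence[Grade],
                         predicted: Sequence[Grade]) -> float:
    """Among solutions annotated as incorrect, the fraction where the model
    also says incorrect and names the same first wrong step."""
    if len(annotated) != len(predicted):
        raise ValueError("annotated and predicted must have the same length")
    pairs = [(a, p) for a, p in zip(annotated, predicted) if not a.is_correct]
    if not pairs:
        return 0.0
    hits = sum((not p.is_correct) and p.first_error_step == a.first_error_step
               for a, p in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from MR-Ben.
    annotated = [Grade(False, 2), Grade(True), Grade(False, 3), Grade(True)]
    predicted = [Grade(True), Grade(True), Grade(False, 3), Grade(False, 1)]
    print(correctness_accuracy(annotated, predicted))
    print(first_error_accuracy(annotated, predicted))
```

A grader that misses the Step 2 error in the example above would lose credit on both measures for that item: its verdict disagrees with the annotation, and it names no first wrong step.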
In terms of evaluation method, MR-Ben requires the model to analyze in detail the premises, assumptions, and logic of each step in the problem-solving process, and to look ahead in the reasoning to determine whether the current step can lead toward the correct answer. This "grading" style of evaluation is far more difficult than simply answering questions, but it effectively avoids the inflated scores caused by a model memorizing questions. A student who can only memorize questions will struggle to become a qualified grader.
The open source models released by Qwen and DeepSeek can hold their own against closed-source models, even among the global top tier.
The pricing strategies and actual performance of different closed-source models are intriguing. Readers who care about reasoning ability in their use cases can pick a model based on price and capability.
In low-resource scenarios, small models also have many highlights. In the MR-Ben evaluation, Phi-3-mini stood out among the small models, scoring as high as or higher than large models with tens of billions of parameters, which shows the importance of fine-tuning data.
MR-Ben scenarios involve complex logical analysis and step-by-step inference. An overly long context in few-shot mode can confuse the model and cause a decline in performance.
MR-Ben ran many generation-reflection-regeneration ablation experiments to compare different prompting strategies. It found that this approach has no effect on low-level models, and that the effect on high-level models such as GPT4-Turbo is not obvious. For intermediate-level models, by contrast, the effect is a slight improvement, because they keep changing wrong answers into right ones as well as right answers into wrong ones.
After roughly dividing the subjects evaluated by MR-Ben into knowledge-based, logical, computational, and algorithmic types, different models have their own advantages and disadvantages in different reasoning types.
Jia Jiaya's team has uploaded a one-click evaluation method on GitHub. Anyone who cares about complex reasoning is welcome to evaluate and submit their own models. The team will update the corresponding leaderboard in a timely manner.
By the way, a one-click evaluation using the official script costs only about 12M tokens. The process is very smooth, so give it a try!
References
Training Verifiers to Solve Math Word Problems (https://arxiv.org/abs/2110.14168)
Measuring Massive Multitask Language Understanding (https://arxiv.org/abs/2009.03300)
LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning (https://arxiv.org/abs/2007.08124)
MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation (https://arxiv.org/abs/2405.11430)
Sparks of Artificial General Intelligence: Early experiments with GPT-4 (https://arxiv.org/abs/2303.12712)
Qwen Technical Report (https://arxiv.org/abs/2309.16609)
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (https://arxiv.org/abs/2405.04434)
Textbooks Are All You Need (https://arxiv.org/abs/2306.11644)
Large Language Models Cannot Self-Correct Reasoning Yet (https://arxiv.org/abs/2310.01798)