The upper limit of LLaMA-2-7B math ability has reached 97.7%? Xwin-Math unlocks potential with synthetic data

PHPz · 2024-03-15
Synthetic data continues to unlock the mathematical reasoning potential of large models!

Mathematical problem-solving ability has long been regarded as an important indicator of a language model's intelligence. Conventionally, only very large models, or models that have undergone extensive mathematical pre-training, have had a chance to perform well on math problems.

Recently, Xwin, a research project initiated by the Swin-Transformer team and carried out with scholars from Xi'an Jiaotong University, the University of Science and Technology of China, Tsinghua University, and Microsoft Research Asia, overturned this perception. The work reveals that LLaMA-2-7B, a generally pre-trained language model at the 7B (7 billion parameter) scale, already shows strong potential for solving mathematical problems, and that supervised fine-tuning on synthetic data can elicit this mathematical ability ever more stably.

The study was published on arXiv, titled "Common 7B Language Models Already Possess Strong Math Capabilities."

  • Paper link: https://arxiv.org/pdf/2403.04706.pdf
  • Code link: https://github.com/Xwin-LM/Xwin-LM

The research team first instruction-tuned the LLaMA-2-7B model on only 7.5K data and then evaluated it on GSM8K and MATH. Experimental results show that when the best answer is selected from 256 generated answers per test question, test accuracy reaches as high as 97.7% and 72.0%, respectively. This shows that even under general pre-training, a model as small as 7B can generate high-quality answers, a finding that challenges the previous view that strong mathematical reasoning potential is limited to large-scale, math-focused pre-trained models.
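
As a rough illustration of this best-of-N protocol, here is a minimal sketch in Python. The `sample_fn` callable and the raw string comparison are stand-in assumptions; the article does not describe the exact model call or the answer extraction and matching used in the actual evaluation.

```python
import random
from typing import Callable, List

def evaluate_best_of_n(
    sample_fn: Callable[[str], str],  # stand-in for one sampled model answer
    questions: List[str],
    references: List[str],
    n: int = 256,
) -> float:
    """Best-of-N accuracy: a question counts as solved if any of the
    n sampled answers matches the reference final answer."""
    solved = 0
    for question, reference in zip(questions, references):
        # any() short-circuits, so sampling stops at the first correct answer.
        if any(sample_fn(question).strip() == reference.strip() for _ in range(n)):
            solved += 1
    return solved / len(questions)

# Toy usage with a dummy sampler; a real run would query the model at a
# nonzero temperature and normalize the extracted final answer.
dummy = lambda q: random.choice(["42", "7"])
print(evaluate_best_of_n(dummy, ["What is 6 * 7?"], ["42"], n=8))
```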

However, the research also points out that, despite this strong potential, the main problem with current language models is that it is hard to consistently elicit their inherent mathematical ability. For example, when only one generated answer per question was considered in the previous experiment, accuracy on the GSM8K and MATH benchmarks dropped to 49.5% and 7.9%, respectively, reflecting the instability of the model's mathematical ability. To address this, the research team scaled up the supervised fine-tuning (SFT) dataset and found that as the SFT data grew, the model generated correct answers much more reliably.

The study also shows that the SFT dataset can be effectively enlarged with synthetic data, and that this method is almost as effective as real data. The research team used the GPT-4 Turbo API to generate synthetic math questions and step-by-step solutions, ensuring question quality through simple prompt-based verification. With this method, the team expanded the SFT dataset from 7.5K to about one million samples and observed a near-perfect scaling law. The resulting Xwin-Math-7B model reached 82.6% accuracy on GSM8K and 40.6% on MATH, significantly surpassing previous SOTA models and even some 70B models, a leapfrog improvement. The Xwin-Math-70B model reached 52.8% on the MATH evaluation set, clearly surpassing an early version of GPT-4. This is the first time that work based on the LLaMA series of base models has surpassed GPT-4 on MATH.
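
As a minimal sketch of what this kind of seed-based synthesis could look like, assuming the OpenAI Python SDK: the prompt wording and the choice of seed-based rewriting here are illustrative guesses, not the paper's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize_example(seed_problem: str) -> str:
    """Ask GPT-4 Turbo for a new problem in the style of a seed problem,
    together with a step-by-step solution (illustrative prompt only)."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=1.0,  # encourage diverse new problems
        messages=[
            {"role": "system", "content": "You are a careful math teacher."},
            {
                "role": "user",
                "content": (
                    "Write a new math word problem similar in style to the "
                    "seed problem below, then solve it step by step and put "
                    f"the final answer on its own line.\n\nSeed:\n{seed_problem}"
                ),
            },
        ],
    )
    return response.choices[0].message.content
```

In a full pipeline, each generated sample would also pass the simple quality verification the article mentions before joining the SFT set.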

The researchers also defined two evaluation metrics: Pass@N, which measures whether the model outputs a correct answer at least once in N attempts (indicating the model's mathematical potential), and PassRatio@N, which measures the proportion of correct answers among the N outputs (indicating the stability of the model's mathematical ability). When the SFT data is small, the model's Pass@256 is already very high; as the SFT data scales further, Pass@256 rises only slightly while PassRatio@256 increases significantly. This shows that supervised fine-tuning on synthetic data is an effective way to improve the stability of the model's mathematical ability.
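
Read directly off these definitions, both metrics can be computed from a per-question correctness matrix, as in the sketch below; the function names and data layout are my own, not from the paper.

```python
from typing import List

def pass_at_n(correct: List[List[bool]]) -> float:
    """Pass@N: fraction of questions with at least one correct answer
    among the N samples (the model's potential)."""
    return sum(any(samples) for samples in correct) / len(correct)

def pass_ratio_at_n(correct: List[List[bool]]) -> float:
    """PassRatio@N: mean proportion of correct answers among the N
    samples per question (the model's stability)."""
    return sum(sum(samples) / len(samples) for samples in correct) / len(correct)

# Hypothetical correctness results: 3 questions, N = 4 samples each.
results = [
    [True, False, False, True],    # 2 of 4 correct
    [False, False, False, False],  # 0 of 4 correct
    [True, True, True, True],      # 4 of 4 correct
]
print(pass_at_n(results))        # 2/3 ≈ 0.667
print(pass_ratio_at_n(results))  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```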

Additionally, the study provides insights into scaling behavior across different reasoning complexities and error types. For example, as the SFT dataset grows, the model's accuracy on math problems follows a power-law relationship tied to the number of reasoning steps, and increasing the proportion of long reasoning chains in the training samples significantly improves accuracy on hard problems. The study also found that calculation errors are easier to mitigate than reasoning errors.
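
The article does not give the exact functional form, but a power law between two quantities is typically checked with a linear fit in log-log space. A minimal sketch with entirely invented numbers, here framed as error rate versus SFT dataset size:

```python
import numpy as np

# Invented example values purely for illustration; the paper's actual
# measurements are not reproduced in the article.
sizes = np.array([7.5e3, 3.0e4, 1.2e5, 4.8e5, 9.6e5])  # SFT dataset sizes
errors = np.array([0.52, 0.41, 0.31, 0.24, 0.20])      # 1 - accuracy

# A power law err = a * size^b becomes linear in log-log space:
#   log(err) = log(a) + b * log(size)
b, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
print(f"fitted exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3f}")
```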

On the Hungarian high school mathematics exam, which reflects a model's generalization in mathematical reasoning, Xwin-Math scored 65%, second only to GPT-4. This indicates that the study's data synthesis method did not significantly overfit the evaluation sets and generalizes well.

This research not only demonstrates the effectiveness of synthetic data for scaling SFT datasets, but also offers a new perspective on the mathematical reasoning abilities of large language models. The team states that their work lays a foundation for future exploration and progress in this field, and they look forward to pushing AI toward greater breakthroughs in mathematical problem solving. As AI technology continues to advance, we can reasonably expect it to perform even better in mathematics and give humans more help with complex mathematical problems.

The article also covers ablation experiments on the data synthesis method and additional evaluation metrics; please refer to the full paper for details.

Statement:
This article is reproduced from jiqizhixin.com. If there is any infringement, please contact admin@php.cn for removal.