The upper limit of LLaMA-2-7B math ability has reached 97.7%? Xwin-Math unlocks potential with synthetic data

Synthetic data continues to unlock the mathematical reasoning potential of large models!

Mathematical problem-solving ability has long been regarded as an important indicator of a language model's intelligence. Typically, only very large models, or models that have undergone extensive mathematical pre-training, perform well on mathematical problems.

Recently, Xwin, a research effort from the Swin-Transformer team together with scholars from Xi'an Jiaotong University, the University of Science and Technology of China, Tsinghua University, and Microsoft Research Asia, overturned this perception. It shows that a general-purpose pre-trained language model at the 7B scale (LLaMA-2-7B, i.e., 7 billion parameters) already has strong potential for solving mathematical problems, and that supervised fine-tuning on synthetic data can elicit this mathematical capability increasingly reliably.

The study was published on arXiv, titled "Common 7B Language Models Already Possess Strong Math Capabilities."


  • Paper link: https://arxiv.org/pdf/2403.04706.pdf
  • Code link: https://github.com/Xwin-LM/Xwin-LM

The research team first instruction-tuned the LLaMA-2-7B model on only 7.5K examples, then evaluated its performance on GSM8K and MATH. Experimental results show that when the best answer is selected from 256 generated answers for each test-set question, accuracy reaches 97.7% on GSM8K and 72.0% on MATH. This shows that even a small 7B-scale model with only general pre-training has the potential to generate high-quality answers, challenging the previous view that strong mathematical reasoning potential is limited to very large or mathematically pre-trained models.


However, the research also points out that although current language models have strong mathematical reasoning potential, the main problem is that this inherent capability is hard to elicit consistently. For example, if only one generated answer per question is considered in the previous experiment, accuracy on the GSM8K and MATH benchmarks drops to 49.5% and 7.9%, respectively, reflecting the instability of the model's mathematical capabilities. To address this, the research team expanded the supervised fine-tuning (SFT) dataset and found that as the amount of SFT data grew, the reliability with which the model generated correct answers improved significantly.

The study also shows that synthetic data can effectively scale up the SFT dataset, and that this approach is almost as effective as real data. The research team used the GPT-4 Turbo API to generate synthetic mathematical questions and solution processes, ensuring question quality through simple prompt-based verification. With this method, the team expanded the SFT dataset from 7.5K to about one million samples and observed a near-perfect scaling law. The resulting Xwin-Math-7B model achieved accuracies of 82.6% on GSM8K and 40.6% on MATH, significantly surpassing previous SOTA models and even some 70B models. The Xwin-Math-70B model reached 52.8% on the MATH evaluation set, clearly surpassing the early version of GPT-4. This is the first time that work based on the LLaMA family of base models has surpassed GPT-4 on MATH.
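The shape of such a synthesis pipeline can be sketched as below. The prompt template and the quality filter are both hypothetical stand-ins (the paper's actual prompts and verification rules are not reproduced here); in the real pipeline each prompt would be sent to the GPT-4 Turbo API and only samples passing the filter would be added to the SFT set:

```python
def build_synthesis_prompt(seed_problem: str) -> str:
    """Ask the generator model for a new problem plus a worked solution,
    using an existing problem as a style/difficulty seed.
    (Illustrative template, not the paper's actual prompt.)"""
    return (
        "Here is a math word problem:\n"
        f"{seed_problem}\n\n"
        "Write a NEW problem of similar difficulty, then solve it "
        "step by step and end with the final answer on its own line."
    )

def passes_simple_check(question: str, solution: str) -> bool:
    """Cheap well-formedness filter: a hypothetical stand-in for the
    paper's prompt-based verification of generated samples."""
    return (
        len(question) > 20                        # non-trivial question
        and "?" in question                       # actually asks something
        and any(ch.isdigit() for ch in solution)  # solution does arithmetic
    )
```

Because generation and filtering are independent per seed problem, this loop parallelizes trivially, which is what makes scaling from 7.5K to roughly one million samples practical.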


The researchers also defined the Pass@N and PassRatio@N evaluation metrics: Pass@N measures whether the model outputs at least one correct answer among N samples (indicating the model's potential mathematical ability), while PassRatio@N measures the proportion of correct answers among the N samples (indicating the stability of that ability). When the amount of SFT data is small, the model's Pass@256 is already very high; as the SFT data is scaled up further, Pass@256 increases only slightly, while PassRatio@256 increases significantly. This shows that supervised fine-tuning on synthetic data is an effective way to improve the stability of the model's mathematical capabilities.
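A direct reading of these two metric descriptions can be sketched as follows (this is not the paper's evaluation code; `correct_flags` holds, for each question, a boolean per sampled answer):

```python
def pass_at_n(correct_flags: list[list[bool]]) -> float:
    """Pass@N over a benchmark: fraction of questions where at least one
    of the N sampled answers is correct (the model's *potential*)."""
    return sum(any(flags) for flags in correct_flags) / len(correct_flags)

def pass_ratio_at_n(correct_flags: list[list[bool]]) -> float:
    """PassRatio@N: mean fraction of correct answers among the N samples
    per question (the model's *stability*)."""
    return sum(sum(flags) / len(flags) for flags in correct_flags) / len(correct_flags)

# Two questions, N = 4 samples each.
flags = [
    [True, False, False, True],    # solved sometimes
    [False, False, False, False],  # never solved
]
print(pass_at_n(flags))        # 0.5
print(pass_ratio_at_n(flags))  # 0.25
```

The gap between the two numbers is exactly the instability the paper describes: a high Pass@N with a low PassRatio@N means the capability exists but is elicited unreliably.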


Additionally, the study provides insights into scaling behavior under different reasoning complexities and error types. For example, as the size of the SFT dataset increases, the model's accuracy in solving mathematical problems follows a power-law relationship with the number of inference steps. By increasing the proportion of long inference steps in the training samples, the accuracy of the model in solving difficult problems can be significantly improved. At the same time, the study also found that calculation errors are easier to mitigate than reasoning errors.
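A power-law relation of this kind is typically checked by fitting a straight line in log-log space. The sketch below does this in pure Python; the accuracy numbers are synthetic, chosen only to illustrate the fitting procedure, and are not the paper's measurements:

```python
import math

def fit_power_law(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = c * x**k by least squares in log-log space:
    log y = log c + k * log x."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    k = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
         / sum((a - mx) ** 2 for a in lx))
    c = math.exp(my - k * mx)
    return c, k

# Synthetic illustration: accuracy decaying with the number of reasoning steps.
steps = [1, 2, 4, 8]
acc = [0.9, 0.45, 0.225, 0.1125]   # exactly 0.9 * steps**-1
c, k = fit_power_law(steps, acc)
print(round(c, 3), round(k, 3))    # 0.9 -1.0
```

On real data the exponent `k` would be estimated per SFT-dataset size, so the paper's observation amounts to `k` shifting favorably as training data grows.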


On the Hungarian high school mathematics exam, which reflects the generalization of a model's mathematical reasoning, Xwin-Math also scored 65%, second only to GPT-4. This shows that the study's data synthesis method did not significantly overfit to the evaluation sets and exhibits good generalization.


This research not only demonstrates the effectiveness of synthetic data for scaling SFT datasets, but also offers a new perspective on research into the mathematical reasoning capabilities of large language models. The research team stated that their work lays a foundation for future exploration in this area, and they look forward to AI achieving greater breakthroughs in solving mathematical problems. As AI technology continues to advance, we can reasonably expect AI to perform even better in mathematics and provide more help to humans with complex mathematical problems.

The paper also reports ablation experiments and additional evaluation metrics for the data synthesis method; please refer to the full text for details.


Statement: This article is reproduced from 机器之心 (Machine Heart).