


ACL 2024 | In the mathematical evaluation of 25 open and closed source models, GPT-3.5-Turbo barely passed

The AIxiv column is where this site publishes academic and technical content. Over the past few years, it has carried more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com
- Paper title: GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers
- Paper address: https://arxiv.org/pdf/2402.19255
- Paper homepage: https://qtli.github.io/GSM-Plus/
- Numeric substitution: replace a numeric value with another of the same digit count and type, e.g. replace "16" with "20" in the question.
- Digit expansion: increase the number of digits in a value, e.g. replace "16" with "1600".
- Integer-decimal-fraction conversion: replace an integer with a decimal or fraction, e.g. convert "2" to "2.5".
- Operation expansion: add a constraint to the original problem, e.g. the new condition "She also uses two eggs every day to make homemade hair masks."
- Operation reversal: turn a known condition of the original problem into the variable to be solved in the GSM-Plus variant. For example, the statement "$2 per duck egg" in the original question in Figure 2 becomes the query of the new question, "What is the price of each duck egg?", while the original query "How many dollars does she earn at the farmers' market every day?" becomes a known condition of the new problem: "She earns $18 a day at the farmers' market."
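As a rough illustration of the first perturbation type (not the paper's actual construction pipeline, which relies on careful human annotation and quality control), a numeric substitution can be sketched as a digit-count-preserving rewrite of the numbers in a question; the function name is our own:

```python
import random
import re

def numeric_substitution(question: str, seed: int = 0) -> str:
    """Replace each integer in the question with a random integer that
    has the same number of digits (a sketch; GSM-Plus itself uses
    carefully controlled, human-checked substitutions)."""
    rng = random.Random(seed)

    def swap(match: re.Match) -> str:
        n_digits = len(match.group(0))
        low = 10 ** (n_digits - 1) if n_digits > 1 else 0
        high = 10 ** n_digits - 1
        return str(rng.randint(low, high))

    return re.sub(r"\d+", swap, question)

print(numeric_substitution("Janet's ducks lay 16 eggs per day."))
```

A real perturbation pipeline would additionally verify that the new numbers keep the problem solvable and update the gold answer accordingly.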
- GSM-Plus features

Fine-grained evaluation: Compared with GSM8K, the problem variants of GSM-Plus are more challenging, and the performance of every LLM evaluated drops significantly. The following analysis examines the problem-solving robustness of LLMs under each type of perturbation.
Table 1: Different colors represent different perturbation types: numeric substitution, digit expansion, integer-decimal-fraction conversion, operation expansion, operation reversal, problem understanding, distractor insertion, critical thinking.
As the table above shows, previous studies used various perturbations to test the robustness of mathematical reasoning, but their evaluation settings cover only some perturbation types, and most introduce perturbations through automatic construction, making quality hard to guarantee. In contrast, GSM-Plus perturbs each problem along eight distinct mathematical reasoning skills, with more comprehensive coverage and strict quality control.

Experimental analysis

Evaluation metrics:

- Performance drop rate (PDR): the degree to which LLM performance on the perturbed problems falls relative to the original problems.
- Percentage of simultaneously solved pairs (ASP): the proportion of (original problem, variant) pairs in which the LLM answers both correctly.
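Both metrics follow directly from per-question correctness; a minimal sketch under the definitions above (function and variable names are illustrative, not from the paper's code):

```python
def pdr(acc_original: float, acc_perturbed: float) -> float:
    """Performance drop rate: relative accuracy drop from the
    original benchmark to its perturbed counterpart."""
    return (acc_original - acc_perturbed) / acc_original

def asp(pairs: list[tuple[bool, bool]]) -> float:
    """Percentage of simultaneously solved pairs: fraction of
    (original, variant) pairs where BOTH answers are correct."""
    return sum(orig and var for orig, var in pairs) / len(pairs)

# Illustrative numbers only, not results from the paper:
print(pdr(0.93, 0.85))  # ~0.086, i.e. an 8.6% relative drop
print(asp([(True, True), (True, False), (False, False)]))  # 1/3
```

Note that PDR is a relative drop, so a model with lower baseline accuracy can show a larger PDR even for the same absolute drop.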
Overall performance

As the table below shows, the performance of most LLMs on GSM-Plus drops significantly compared with GSM8K. GPT-4 is the most robust, with the smallest PDR of only 8.23%. CodeLlama has the largest PDR: the 7B, 13B, and 34B models reach 40.56%, 39.71%, and 34.27% respectively, exceeding both its base model LLaMA-2-7B (39.49%) and mathematical SFT models fine-tuned on it, such as SEGO-7B (34.91%). This suggests that reasoning purely in a programmatic language is vulnerable to perturbations.

In the face of mathematical perturbations, larger models are more stable. Although supervised fine-tuning can improve accuracy on downstream tasks, it does not significantly enhance robustness to perturbations (i.e., it does not lower PDR). The fine-tuning data matters for robustness: models that are all fine-tuned from LLaMA-2 but on different data show large differences in both accuracy and robustness.

Table 2: Overall performance

Performance of LLMs under perturbation

This paper further evaluates the stability of LLMs under the eight types of problem variants. Compared with the human baseline, LLM performance drops significantly under the critical thinking (purple), operation expansion and operation reversal (blue), distractor insertion (pink), and integer-decimal-fraction conversion (orange) perturbations. Under "numeric substitution" and "problem understanding", performance is stable or even slightly improves.

The preceding analysis is based on the full datasets. Next, this article splits the two datasets according to whether the math questions are answered correctly, and analyzes whether an LLM that successfully solves a GSM8K problem is also more likely to answer its GSM-Plus variants correctly (i.e., a high ASP value), and vice versa.
If this holds, the LLM can be considered to perform stably on that specific subset of math problems, even if it does not on the whole dataset. In the experimental setup, each GSM8K problem and its variants in GSM-Plus form 8 problem pairs, and the results are shown in Figure 4.

Figure 4: Inference transferability of LLMs between GSM8K and GSM-Plus problem pairs. Purple (both correct) and blue (both incorrect) bars indicate consistent model behavior, while red (GSM8K correct & GSM-Plus incorrect) and yellow (GSM8K incorrect & GSM-Plus correct) bars indicate inconsistent behavior. The sum of the purple and red bar heights is the number of GSM8K problems the LLM solves correctly.
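The four bar categories in Figure 4 correspond to a simple partition of (GSM8K correct, GSM-Plus correct) outcomes; a sketch of the bookkeeping (our own helper, not the paper's code):

```python
from collections import Counter

def transfer_buckets(pairs: list[tuple[bool, bool]]) -> Counter:
    """Partition (original_correct, variant_correct) outcomes into
    the four categories plotted in Figure 4."""
    labels = {
        (True, True): "both correct",       # purple: consistent
        (False, False): "both incorrect",   # blue: consistent
        (True, False): "only original",     # red: no transfer
        (False, True): "only variant",      # yellow
    }
    return Counter(labels[p] for p in pairs)

counts = transfer_buckets(
    [(True, True), (True, False), (True, False), (False, False)]
)
# Purple + red = problems solved on GSM8K:
print(counts["both correct"] + counts["only original"])  # 3
```

The "only original" (red) count is the quantity of interest: it measures how often GSM8K success fails to transfer to the variant.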
The presence of red bars (the LLM answers the original question correctly but fails the variant) indicates that most models have limited performance transferability. Although LLMs differ in GSM8K performance (the combined height of the purple and red bars), their transferability is similar (the height of the red bars). This means existing benchmarks cannot accurately assess a model's true mathematical reasoning ability: high accuracy does not equate to strong reasoning robustness.

Do prompts help the performance robustness of LLMs?

Previous work has shown that good prompt instructions are important for eliciting the mathematical ability of language models. This article selects 4 representative models and tests their problem-solving performance under different prompt instructions. As shown in the figure below, when faced with perturbations, LLMs are most stable when using complex examples as in-context demonstrations (complexity-based CoT); in contrast, when intermediate reasoning is expressed only in a programming language (Program-of-Thought), LLMs are more susceptible to perturbation. Overall, these prompting tricks are not enough for LLMs to maintain their GSM8K-level performance on GSM-Plus.

Figure 5: The impact of prompts on the performance robustness of LLMs

Is a combined prompt effective?
Based on existing prompting methods, how can the robustness of LLMs be enhanced? This article finds that LLMs often ignore important conditions or make calculation errors during problem solving. To address this, the paper explores Comp, a combined prompting method. Comp first prompts the LLM to extract the numerically relevant necessary conditions in the problem (Prompt 1). Next, based on the problem and the key conditions, the LLM is instructed to iteratively generate reasoning goals (Prompt 2) and calculation goals (Prompt 3), and to give feedback on the problem-solving steps generated so far to judge whether the final answer has been reached (Prompt 4). The specific implementation is shown in Figure 6.
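The four-stage loop can be sketched as follows, where `llm` stands in for any text-completion model and the prompt texts are paraphrases of the stages described above, not the paper's exact wording:

```python
def comp_solve(question: str, llm, max_iters: int = 10) -> str:
    """Combined prompting (Comp), roughly as described: extract key
    conditions once, then alternate reasoning and calculation steps
    with self-verification until a final answer is signaled."""
    # Prompt 1: extract the numerically relevant conditions.
    conditions = llm(f"List the key numerical conditions in: {question}")
    steps: list[str] = []
    for _ in range(max_iters):
        history = "\n".join(steps)
        # Prompt 2: propose the next reasoning goal.
        goal = llm(f"Question: {question}\nConditions: {conditions}\n"
                   f"Steps so far:\n{history}\nNext reasoning goal:")
        # Prompt 3: carry out the calculation for that goal.
        calc = llm(f"Compute this step and state the result: {goal}")
        steps.append(f"{goal} => {calc}")
        # Prompt 4: self-verify whether the final answer is reached.
        done = llm(f"Question: {question}\nSteps:\n{history}\n{steps[-1]}\n"
                   f"Is the final answer reached? Answer yes or no:")
        if done.strip().lower().startswith("yes"):
            break
    return steps[-1] if steps else ""
```

The iteration cap and the yes/no verification format are implementation choices of this sketch; the paper's Figure 6 gives the actual prompt templates.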
As shown, Comp improves the performance of LLMs under the various problem-variant types through iterative generation and self-verification, but it still cannot close the performance gap between the standard and adversarial test sets. The authors look forward to future methods that further improve model robustness and advance LLMs in mathematical reasoning.

Table 3: Iterative performance of Comp prompting. On a GSM-Plus rewritten question, the performance of GPT-3.5-Turbo under different prompting techniques: although every prompt leads Turbo to answer the GSM8K question accurately, only Comp helps Turbo generate the correct answer on the GSM-Plus variant question.

This article introduces GSM-Plus, an adversarial evaluation set of grade-school math word problems designed to systematically probe the mathematical problem-solving robustness of LLMs. The experimental analysis finds that, when faced with perturbations, the performance of most LLMs drops significantly compared with their performance on standard benchmarks, far below the human level. The researchers hope this work will promote more future research, including but not limited to: (1) systematic evaluation of the mathematical skills of LLMs; (2) building models that can perform mathematical reasoning flexibly.

Reference links:
[1] Cobbe, Karl, et al. com/sota/arithmetic-reasoning-on-gsm8k
[2] George Polya. 2004. How to Solve It: A New Aspect of Mathematical Method, volume 85. Princeton University Press.


