The accuracy of GPT-3 in solving math problems has increased to 92.5%! Microsoft proposes MathPrompter to create 'science' language models without fine-tuning

Apart from confidently making things up, the most criticized shortcoming of large language models is probably their inability to do math.


For example, on a complex mathematical problem that requires multi-step reasoning, a language model usually cannot give the correct answer; even with the help of chain-of-thought (CoT) prompting, errors often occur in the intermediate steps.

Unlike open-ended natural language understanding tasks, mathematical problems usually have only one correct answer, and the space of acceptable answers is much narrower, which makes generating accurate solutions more challenging for large language models.

Moreover, for mathematical problems, existing language models usually do not provide a confidence estimate with their answers, leaving users unable to judge how trustworthy the generated answers are.

To address this, Microsoft Research proposed MathPrompter, a technique that improves LLM performance on arithmetic problems while also increasing confidence in its predictions.

Paper link: https://arxiv.org/abs/2303.05398

MathPrompter uses zero-shot chain-of-thought prompting to generate multiple algebraic expressions or Python functions that solve the same mathematical problem in different ways, thereby increasing confidence in the output.

Compared with other prompt-based CoT methods, MathPrompter also checks the validity of the intermediate steps.

On the 175B-parameter GPT-3, the MathPrompter method raises accuracy on the MultiArith dataset from 78.7% to 92.5%!

A prompt specialized for mathematics

In recent years, progress in natural language processing has been driven largely by the continued scaling of large language models (LLMs), which exhibit remarkable zero-shot and few-shot capabilities and have also spurred the development of prompting techniques: users only need to include a few simple examples of a new task in the prompt for the LLM to handle it.

Prompting works quite well for single-step tasks, but for tasks that require multi-step reasoning its performance is still insufficient.

When humans solve a complex problem, they break it down and try to solve the pieces step by step. Chain-of-thought (CoT) prompting extends this intuition to LLMs and has yielded performance improvements across a range of NLP tasks that require reasoning.

This paper focuses on the zero-shot-CoT method for solving mathematical reasoning tasks. Previous work achieved a significant accuracy improvement on the MultiArith dataset, from 17.7% to 78.7%, but two key shortcomings remain:

1. Although the chain of thought followed by the model improves the results, the validity of each step in the chain is not checked;

2. No confidence estimate is provided for the LLM's predictions.

MathPrompter

To address these gaps, the researchers took inspiration from the way humans solve math problems: break the complex problem down into a simpler multi-step procedure, and validate each step using multiple methods.


Since LLMs are generative models, it is tricky to ensure that the generated answers are accurate, especially for mathematical reasoning tasks.

The researchers observed how students solve arithmetic problems and summarized several ways in which students verify their solutions:

Checking against known results: comparing the solution with known results lets you evaluate its accuracy and make adjustments; this is especially useful for standard problems with well-established solutions.

Multiple verification: approaching the problem from several angles and comparing the results helps confirm the validity of the solution and ensures that it is both reasonable and accurate.

Cross-checking: the process of solving the problem is as important as the final answer; verifying the correctness of the intermediate steps gives a clear view of the reasoning behind the solution.

Compute verification: using a calculator or computer to perform the arithmetic helps verify the accuracy of the final answer.

Specifically, given a question Q:

Q: In a restaurant, each adult meal costs $5 and children eat free. If 15 people come in and 8 of them are children, how much does it cost for this group to eat?

1. Generating Algebraic template

First, the problem is translated into algebraic form by replacing the numeric terms with variables via a key-value map, giving the modified question Qt.

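To make this templating step concrete, here is a minimal sketch of how the numbers in Q could be swapped for variables via a key-value map; the function name, variable-naming scheme, and regex are illustrative assumptions, not the paper's exact implementation.

```python
import re

def to_algebraic_template(question):
    """Replace numeric terms in a question with variables (A, B, C, ...)
    and return the templated question Qt plus the key-value mapping."""
    mapping = {}
    counter = 0

    def repl(match):
        nonlocal counter
        var = chr(ord("A") + counter)   # A, B, C, ...
        counter += 1
        mapping[var] = int(match.group())
        return var

    templated = re.sub(r"\d+", repl, question)
    return templated, mapping

q = ("In a restaurant, each adult meal costs $5 and children eat free. "
     "If 15 people come in and 8 of them are children, "
     "how much does it cost for this group to eat?")
qt, key_values = to_algebraic_template(q)
print(qt)          # "... each adult meal costs $A ... If B people come in and C of them are children ..."
print(key_values)  # {'A': 5, 'B': 15, 'C': 8}
```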

2. Math-prompts
Based on the intuition of multiple verification and cross-checking described above, two different ways of solving Qt analytically are generated, one algebraic and one Pythonic; the LLM is given the following prompts to generate additional context for Qt.

The prompt can be "Derive an algebraic expression" or "Write a Python function".

In response to these prompts, the LLM can output expressions such as the following.
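For the restaurant question Qt, the two generated solutions could look roughly like this (a hypothetical rendering; the exact text GPT-3 returns will vary):

```python
# Variables from Qt: A = price per adult meal, B = total people, C = number of children.

# Algebraic expression the LLM might derive:
#   Answer = A * (B - C)

# Pythonic solution the LLM might write:
def restaurant_cost(A, B, C):
    """Total cost when adults pay A each, B people come in, and C of them are children."""
    return A * (B - C)
```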

The analytical solutions generated above give the user insight into the LLM's "intermediate thought process". Adding these extra prompts improves the accuracy and consistency of the results, which in turn improves MathPrompter's ability to generate more precise and reliable solutions.

3. Compute verification
The expressions generated in the previous step are evaluated on multiple random key-value assignments of the input variables in Qt, using Python's eval() method.

The outputs are then compared to see whether the answers agree; consensus provides a higher degree of confidence that the answer is correct and reliable.
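A minimal sketch of this verification step, reusing the hypothetical solutions from the previous example; the helper name and its arguments are illustrative assumptions, and the paper only specifies that eval() is applied to random key-value assignments:

```python
import random

def agree_on_random_inputs(algebraic_expr, python_fn, variables, trials=5):
    """Evaluate the algebraic expression (via eval) and the Python function on
    random variable assignments and report whether their outputs always match."""
    for _ in range(trials):
        assignment = {v: random.randint(1, 100) for v in variables}
        algebraic_result = eval(algebraic_expr, {}, assignment)   # e.g. "A * (B - C)"
        pythonic_result = python_fn(**assignment)
        if algebraic_result != pythonic_result:
            return False
    return True

# Check the two hypothetical solutions against each other:
print(agree_on_random_inputs("A * (B - C)", restaurant_cost, ["A", "B", "C"]))  # True
```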

The accuracy of GPT-3 in solving math problems has increased to 92.5%! Microsoft proposes MathPrompter to create science language models without fine-tuning

Once the expressions agree on their outputs, the variable values from the original question Q are used to compute the final answer.
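Continuing the same hypothetical example, the original values from Q ($5 per adult meal, 15 people, 8 children) are substituted back into the verified solution:

```python
# Plug the actual values from Q into the verified solution to get the final answer.
print(restaurant_cost(A=5, B=15, C=8))  # 35, i.e. 7 adults x $5
```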

4. Statistical significance

To ensure consensus among the outputs of the different expressions, steps 2 and 3 are repeated about five times in the experiments, and the most frequently observed answer is reported.

In the absence of clear consensus, repeat steps 2, 3, and 4.
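A rough sketch of this self-consistency loop; generate_solutions and evaluate are placeholders standing in for the LLM prompting and verification steps above, and are assumptions about how the pieces might be wired together rather than the paper's code:

```python
from collections import Counter

def math_prompter_answer(generate_solutions, evaluate, n_runs=5):
    """Repeat the generate-and-verify loop n_runs times and return the most
    frequently observed answer (a simple majority vote over the runs)."""
    answers = []
    for _ in range(n_runs):
        solutions = generate_solutions()   # step 2: algebraic + Pythonic prompts
        answer = evaluate(solutions)       # step 3: verify on random inputs, then compute
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None                        # no clear consensus: repeat steps 2-4
    return Counter(answers).most_common(1)[0][0]
```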

Experimental results

MathPrompter was evaluated on the MultiArith dataset, whose math questions are specifically designed to test a model's ability to perform complex arithmetic operations and reasoning; solving them successfully requires applying multiple arithmetic operations and logical reasoning steps.


The accuracy results on MultiArith show that MathPrompter outperforms all zero-shot and zero-shot-CoT baselines, raising accuracy from 78.7% to 92.5%.

MathPrompter on the 175B-parameter GPT-3 DaVinci performs comparably to 540B-parameter models and to state-of-the-art few-shot-CoT methods.


As the table above shows, MathPrompter's design mitigates issues such as generated answers that are "off by one step", which can be avoided by running the model multiple times and reporting the consensus result.

The problem of overly lengthy reasoning steps is also addressed by the Pythonic and algebraic methods, which usually require fewer tokens.

In addition, the reasoning steps may be correct while the final calculation is wrong; MathPrompter addresses this by using Python's eval() function to carry out the arithmetic.

In most cases MathPrompter generates correct intermediate and final answers, but there are a few cases, such as the last question in the table, where the algebraic and Pythonic outputs agree with each other and yet are both wrong.
