GPT-4's math ability gets a big boost! OpenAI's headline-grabbing "process supervision" research solves 78.2% of problems and curbs hallucinations
ChatGPT has been criticized for its mathematical abilities since its release.
Even "mathematical genius" Terence Tao once said that GPT-4 did not add much value in his own field of mathematics expertise.
So what can be done? Should ChatGPT simply be written off as hopeless at math?
OpenAI is not giving up. To improve GPT-4's mathematical reasoning, the OpenAI team used "process supervision" to train a process reward model (PRM).
Let's Verify Step by Step!
Paper address: https://cdn.openai.com/improving-mathematical-reasoning-with-process-supervision/Lets_Verify_Step_by_Step.pdf
In the paper, the researchers trained the model by rewarding each correct reasoning step ("process supervision") rather than only rewarding a correct final answer ("outcome supervision"), achieving a new state of the art in mathematical problem solving.
Specifically, the PRM-based approach solved 78.2% of the problems in a representative subset of the MATH test set.
In addition, OpenAI found that "process supervision" also has great value for alignment: it trains the model to produce chains of thought endorsed by humans.
Naturally, Sam Altman retweeted the research: "Our Mathgen team achieved very exciting results with process supervision. It is a positive sign for alignment."
In practice, "process supervision" requires human feedback, which is extremely costly for large models across a variety of tasks. That makes this work significant; it arguably points to OpenAI's future research direction.
In the experiments, the researchers used problems from the MATH dataset to evaluate both the "process supervision" and "outcome supervision" reward models.
They had the model generate many solutions for each problem, then had each reward model pick the solution it ranked highest.
The figure shows the percentage of selected solutions that resulted in a correct final answer as a function of the number of solutions considered.
The "process supervision" reward model not only performed better overall, but the performance gap widened as more solutions to each problem were considered.
This shows that the "process supervision" reward model is more reliable.
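This selection procedure is simply best-of-N reranking. Below is a minimal sketch of the idea; `generate_solutions` and `score_steps` are hypothetical stand-ins for the generator model and the process reward model, not OpenAI's actual API:

```python
import math

def prm_score(step_probs: list[float]) -> float:
    """Aggregate per-step correctness probabilities into one solution score.
    The paper scores a solution by the product of its step probabilities;
    summing logs is the numerically stable equivalent (assumes p > 0)."""
    return sum(math.log(p) for p in step_probs)

def best_of_n(problem: str, n: int, generate_solutions, score_steps) -> str:
    """Sample n candidate solutions and return the one the PRM ranks highest."""
    candidates = generate_solutions(problem, n)  # n chain-of-thought solutions
    return max(candidates, key=lambda s: prm_score(score_steps(s)))
```

As the number of sampled solutions grows, a more reliable reranker pulls further ahead, which is exactly the widening gap shown in the figure.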
Below, OpenAI presents ten mathematical problems with model-generated solutions, along with commentary on where the reward model succeeds and where it fails.
The examples are grouped into three categories: true positives (TP), true negatives (TN), and false positives (FP).
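Here the labels compare the reward model's verdict with the ground truth: a true positive is a correct solution the reward model accepts, a true negative is an incorrect solution it rejects, and a false positive is an incorrect solution that fools it. A minimal sketch of that bookkeeping (helper names are illustrative):

```python
def classify(reward_model_accepts: bool, solution_is_correct: bool) -> str:
    """Label one reward-model verdict against ground truth."""
    if reward_model_accepts and solution_is_correct:
        return "TP"  # correct solution, accepted
    if not reward_model_accepts and not solution_is_correct:
        return "TN"  # incorrect solution, caught
    if reward_model_accepts and not solution_is_correct:
        return "FP"  # incorrect solution, fooled the reward model
    return "FN"      # correct solution, wrongly rejected (not shown in the examples)
```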
First up: simplifying a trigonometric expression.
This challenging trigonometry problem requires applying several identities in a non-obvious order.
Most solution attempts fail, because it is hard to tell which identities will actually be useful.
While GPT-4 usually fails on this problem, with only 0.1% of solution attempts reaching the correct answer, the reward model correctly identifies this solution as valid.
Here, GPT-4 successfully performs a series of complex polynomial factorizations.
Using the Sophie Germain identity in step 5 is a key move, and a genuinely insightful one.
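For reference, the Sophie Germain identity factors a sum of the form $a^4 + 4b^4$, which has no obvious factorization at first glance:

$$a^4 + 4b^4 = (a^2 + 2b^2 - 2ab)(a^2 + 2b^2 + 2ab)$$

It follows from completing the square, $a^4 + 4b^4 = (a^2 + 2b^2)^2 - (2ab)^2$, and then factoring the difference of squares.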
In steps 7 and 8, GPT-4 begins guess-and-check.
This is a common place where the model can "hallucinate" and claim that a particular guess was successful. In this case, the reward model validates each step and determines that the chain of thought is correct.
The model successfully applies several trigonometric identities to simplify the expression.
In step 7, GPT-4 attempts to simplify an expression, but the attempt fails; the reward model catches this mistake.
In step 11, GPT-4 makes a simple calculation error, which the reward model also catches.
GPT-4 attempted to apply the difference-of-squares factorization in step 12, but the expression is not actually a difference of squares.
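As a reminder, that factorization only applies to expressions matching the pattern

$$a^2 - b^2 = (a - b)(a + b),$$

which the expression in step 12 does not.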
The rationale for step 8 is odd, but the reward model lets it pass. In step 9, however, the model incorrectly factors the expression.
The reward model catches this error.
In step 4, GPT-4 incorrectly claims that the sequence "repeats every 12 items," when it actually repeats every 10. This kind of counting error occasionally fools the reward model.
In step 13, GPT-4 attempts to simplify the equation by combining like terms. It correctly moves the linear terms to the left and combines them, but incorrectly leaves the right side unchanged. The reward model is fooled by this error.
GPT-4 tries to do long division, but in step 16 it forgets to include the leading zero in the repeating part of the decimal. The reward model is fooled by this error.
GPT-4 made a subtle counting error in step 9.
On the surface, the claim that there are 5 ways to exchange balls of the same color (since there are 5 colors) seems reasonable.
However, this count is too low by a factor of 2, because Bob has 2 choices of which ball to give to Alice. The reward model is fooled by this error.
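In other words, under the setup described:

$$\underbrace{5}_{\text{colors}} \times \underbrace{2}_{\text{Bob's choice of ball}} = 10 \text{ ways}$$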
Although large language models have improved greatly at complex reasoning, even the most advanced models still produce logical errors and nonsense, often called "hallucinations."
Amid the generative-AI craze, the hallucinations of large language models have been a persistent headache.
As Musk has said: what we need is TruthGPT.
For example, an American lawyer recently cited cases fabricated by ChatGPT in a New York federal court filing and may face sanctions.
OpenAI researchers note in the paper: "These hallucinations are particularly problematic in fields that require multi-step reasoning, because a single logic error can wreck an entire solution."
Moreover, mitigating hallucinations is also a key step toward building aligned AGI.
How can hallucinations in large models be reduced? There are generally two approaches: process supervision and outcome supervision.
"Result supervision", as the name suggests, is to give feedback to the large model based on the final results, while "process supervision" can provide feedback for each step in the thinking chain.
In process supervision, large models are rewarded for each correct reasoning step, not just a correct final conclusion. This encourages the model to follow more human-like chains of thought, making it more likely to yield explainable AI.
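The contrast is easiest to see in how training labels attach to a sampled solution. Here is a minimal sketch with made-up label values; this is illustrative, not OpenAI's actual training code:

```python
solution_steps = [
    "Step 1: Let x = 3y, so the equation becomes ...",  # correct step
    "Step 2: Dividing both sides by y gives ...",       # correct step
    "Step 3: Therefore the answer is 7.",               # wrong final step
]
final_answer_correct = False

# Outcome supervision: one sparse label for the whole solution.
outcome_label = 1.0 if final_answer_correct else 0.0

# Process supervision: one human-assigned label per reasoning step,
# so the model learns exactly where the chain of thought went wrong.
process_labels = [1.0, 1.0, 0.0]

assert len(process_labels) == len(solution_steps)
```

A dense per-step signal both pinpoints errors and rewards the human-endorsed path, which is the alignment benefit described below.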
OpenAI's researchers acknowledge that process supervision was not their invention, but OpenAI is working hard to push it forward.
In the latest research, OpenAI tried both "outcome supervision" and "process supervision," using the MATH dataset as a testbed for a detailed comparison of the two methods.
The results showed that "process supervision" significantly improves model performance.
For mathematical tasks, process supervision produced significantly better results for both large and small models: the models were not only correct more often, but also exhibited a more human-like thought process.
This helps reduce the hallucinations and logical errors that even the most powerful models struggle to avoid.
The researchers found that "process supervision" has several alignment advantages over "outcome supervision":
· It directly rewards the model for following an aligned chain of thought, since every step in the process receives precise supervision.
· It is more likely to produce interpretable reasoning, because "process supervision" encourages the model to follow a human-endorsed process. Outcome supervision, by contrast, may reward an unaligned process and is generally harder to audit.
It is also worth mentioning that, in some cases, making AI systems safer can degrade their performance. This cost is known as the "alignment tax."
In general, any alignment tax can hinder the adoption of alignment methods, given the pressure to deploy the most capable model.
However, the researchers' results below show that "process supervision" actually incurs a "negative alignment tax" when tested in the mathematics domain.
In other words, alignment here comes with no significant performance penalty.
It is worth noting, though, that the PRM requires more human annotation; deep RLHF remains indispensable.
How well does process supervision apply in fields other than mathematics? That remains to be explored.
OpenAI has open-sourced the human feedback dataset behind the PRM, PRM800K, which contains 800,000 step-level labels covering 75K solutions to 12K math problems.
Below is an example annotation. OpenAI is releasing the raw annotations, along with the instructions given to annotators during Phases 1 and 2 of the project.
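For readers who want to poke at the data, here is a minimal sketch of iterating over step-level ratings in a PRM800K-style JSONL file. The field names are assumptions for illustration; the exact schema is documented in the released dataset:

```python
import json

def iter_step_ratings(path: str):
    """Yield (problem, step_text, rating) triples from a JSONL dump.
    Field names are assumed; check the dataset's README for the real schema.
    Ratings are step-level labels such as 1 (correct), 0 (neutral), -1 (incorrect)."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            problem = record["question"]["problem"]
            for step in record["label"]["steps"]:
                for completion in step["completions"]:
                    yield problem, completion["text"], completion["rating"]
```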
NVIDIA scientist Jim Fan summarized OpenAI's latest research:
"For challenging step-by-step problems, give a reward at each step instead of a single reward at the end. Basically, dense reward signal > sparse reward signal. The process reward model (PRM) selects solutions on the hard MATH benchmark better than the outcome reward model (ORM). The obvious next step is to fine-tune GPT-4 with the PRM, which this paper has not yet done. Note that the PRM requires more human annotation. OpenAI released the human feedback dataset: 800K step-level annotations on 75K solutions to 12K math problems."
This is like the old saying from school: learn how to think. Training the model to think, rather than just output the correct answer, will be a game changer for solving complex problems.
ChatGPT is super weak at math. Today I tried to have it solve a problem from a 4th-grade math book, and it gave the wrong answer. I checked ChatGPT's answer against Perplexity AI, Google, and my fourth-grade teacher, and every source confirmed that ChatGPT's answer was wrong.
References:
https://www.php.cn/link/daf642455364613e2120c636b5a1f9c7