Terence Tao called it expert work after seeing it! Google and others used an LLM to automatically prove theorems and won an outstanding paper award at a top conference. The more complete the context, the better the proof.

Feb 04, 2024 am 09:30 AM

Transformer’s skill tree is getting more and more powerful.

Researchers from the University of Massachusetts, Google, and the University of Illinois at Urbana-Champaign (UIUC) recently published a paper in which they achieved the goal of automatically generating complete theorem proofs.

Paper address: https://arxiv.org/pdf/2303.04910.pdf

This work, named Baldur after the brother of Thor in Norse mythology, demonstrated for the first time that a Transformer can generate whole proofs, and also showed that the model's previous proof attempts can be improved when the model is given additional context.

The paper was published at ESEC/FSE (ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering) in December 2023, and won the Outstanding Paper Award.

As we all know, bugs are inevitable in software, which may not cause too much of a problem for an average application or website. However, for the software behind critical systems, such as encryption protocols, medical devices, and space shuttles, we must ensure there are no bugs.

General code review and testing cannot give this guarantee, which requires formal verification.

For formal verification, ScienceDirect’s explanation is:

the process of mathematically checking that the behavior of a system, described using a formal model, satisfies a given property, also described using a formal model

Put simply, formal verification uses mathematical analysis, driven by an algorithmic engine, to build a model of the design under test and to exhaustively analyze and verify its state space.

Formal software verification is one of the most challenging tasks for software engineers. For example, CompCert, a C compiler verified with the Coq interactive theorem prover, was the only compiler on a list including the ubiquitous GCC and LLVM in which a comprehensive study found no bugs.

However, the cost of manual formal verification (writing proofs) is enormous: the proofs for the C compiler are more than three times as long as the compiler code itself.

Therefore, formal verification itself is a "labor-intensive" task, and researchers are also exploring automated methods.

Existing automation tools for proof assistants such as Coq and Isabelle train a model to predict one proof step at a time and use that model to search the space of possible proofs.

Baldur, the tool presented in this paper, brings large language models to this field for the first time: the models are trained on natural language text and code and then fine-tuned on proofs. Baldur can generate the complete proof of a theorem in one go, rather than one step at a time.

As shown in the figure above, only the theorem statement is used as input to the proof generation model; a proof attempt is then sampled from the model, and Isabelle is used to check the proof.

If Isabelle accepts the proof attempt without errors, the proof is successful; otherwise, another proof attempt is extracted from the proof generation model.
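The sample-and-check loop described above can be sketched roughly as follows. This is a minimal illustration, not Baldur's actual code: `generate` and `check` are hypothetical callables standing in for the proof generation model and the Isabelle proof checker, and `max_attempts` is an assumed budget.

```python
from typing import Callable, Optional


def search_proof(
    theorem: str,
    generate: Callable[[str], str],
    check: Callable[[str, str], Optional[str]],
    max_attempts: int = 16,
) -> Optional[str]:
    """Sample whole-proof attempts until the checker accepts one.

    generate(theorem) is a hypothetical stand-in for the model: it returns
    one complete proof attempt.
    check(theorem, proof) is a hypothetical stand-in for the proof assistant:
    it returns None if the proof is accepted, otherwise an error message.
    """
    for _ in range(max_attempts):
        attempt = generate(theorem)
        if check(theorem, attempt) is None:
            return attempt  # the checker accepted the proof without errors
    return None  # no accepted proof within the attempt budget
```

Because the checker has the final say, an incorrect attempt can never be reported as a proof; a weak generator only lowers the success rate.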

Baldur is evaluated on a benchmark of 6336 Isabelle/HOL theorems and their proofs, empirically demonstrating the effectiveness of whole-proof generation, proof repair, and added context.

The tool may be called Baldur because the best existing automatic proof generation tool is called Thor.

Thor has a higher proof rate (57%). It uses a smaller language model to predict the next proof step, combined with a search over the space of possible proofs, whereas Baldur's advantage is its ability to generate complete proofs.

The brothers Thor and Baldur can also work together, which raises the proof rate to close to 66%.

Automatically generate complete proofs

Baldur is powered by Minerva, Google's large language model, which was trained on scientific papers and web pages containing mathematical expressions and was then fine-tuned on data about proofs and theorems.

Baldur works with the proof assistant Isabelle, which checks the generated proofs. When given only a theorem statement, Baldur was able to generate a complete proof almost 41% of the time.

To further improve Baldur's performance, the researchers provided the model with additional context (such as other definitions or theorem statements from the same theory file), which increases the proof rate to 47.5%.

This means that Baldur is able to take the context and use it to predict new correct proofs, much as programmers are more likely to fix a bug in a program when they understand the relevant methods and code.

The following is an example (fun_sum_commute theorem):

This theorem comes from a project called Polynomials in the Archive of Formal Proofs.

The manually written proof distinguishes two cases, depending on whether the set is finite or not:

For the model, the input is the theorem statement, and the target output is this manually written proof.

Baldur recognized the need for induction here and applied a special induction rule called infinite_finite_induct. Its proof follows the same general approach as the human-written proof but is more concise.

Because induction is required, Sledgehammer, the automation tool Isabelle uses by default, cannot prove this theorem.

Training

To train the proof generation model, the researchers constructed a new proof generation dataset.

The existing dataset contains examples of single proof steps: each training example consists of the proof state (input) and the next proof step to apply (target).

Starting from this single-step dataset, a new dataset has to be created so that the model can be trained to predict the entire proof at once.

The researchers extracted the proof steps for each theorem from the dataset and concatenated them to reconstruct the original proof.
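As a rough sketch of that reconstruction, the snippet below groups step-level records by theorem and joins the steps back into one proof. The record layout `(theorem_statement, step_index, proof_step)` and the newline separator are assumptions made for illustration; the article does not describe the real data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (theorem_statement, step_index, proof_step)
StepRecord = Tuple[str, int, str]


def build_whole_proof_examples(step_records: Iterable[StepRecord]) -> List[Dict[str, str]]:
    """Turn per-step records into one (theorem -> whole proof) example each."""
    steps_by_theorem = defaultdict(list)
    for theorem, index, step in step_records:
        steps_by_theorem[theorem].append((index, step))

    examples = []
    for theorem, steps in steps_by_theorem.items():
        # Restore the original step order, then concatenate the steps.
        ordered_steps = [step for _, step in sorted(steps)]
        examples.append({"input": theorem, "target": "\n".join(ordered_steps)})
    return examples
```

Each resulting example pairs a theorem statement with its full proof, which is the shape a whole-proof model needs for fine-tuning.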

Proof repair

Take fun_sum_commute again as an example: Baldur's first generated proof attempt failed in the proof checker.

Baldur tried to apply induction but failed to first break down the proof into two cases (finite vs. infinite sets). Isabelle returns the following error message:

To derive a proof-repair training example from these strings, the theorem statement, the failed proof attempt, and the error message are concatenated as the input, and the correct human-written proof is used as the target.

The above figure details how the training data is created.

The proof generation model is used to sample a proof with temperature 0 for each theorem in the original training set.

The proof assistant is used to record all failed proofs and their error messages, which are then used to build a new proof-repair training set.

For each original training example, the theorem statement, the (incorrect) candidate proof generated by the proof generation model, and the corresponding error message are concatenated to form the input sequence of the new training example.
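The construction of the proof-repair set can be sketched as below. The separator tags and the `generate` and `check` callables are hypothetical; the article states only that the three strings are concatenated, not how they are delimited.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical separators; the actual input format is not given in the article.
ATTEMPT_TAG = "<ATTEMPT>"
ERROR_TAG = "<ERROR>"


def format_repair_input(theorem: str, failed_proof: str, error_message: str) -> str:
    """Concatenate theorem statement, failed attempt, and error message."""
    return f"{theorem}\n{ATTEMPT_TAG}\n{failed_proof}\n{ERROR_TAG}\n{error_message}"


def build_repair_examples(
    training_set: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
    check: Callable[[str, str], Optional[str]],
) -> List[Dict[str, str]]:
    """Build repair examples from (theorem, human_proof) pairs.

    generate stands in for the proof generation model sampled at temperature 0;
    check stands in for the proof assistant and returns None on success or an
    error message on failure.
    """
    examples = []
    for theorem, human_proof in training_set:
        attempt = generate(theorem)
        error = check(theorem, attempt)
        if error is None:
            continue  # the attempt already checks, so there is nothing to repair
        examples.append(
            {"input": format_repair_input(theorem, attempt, error), "target": human_proof}
        )
    return examples
```

Only failed attempts contribute examples, so the repair model sees exactly the cases where the generation model went wrong.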

Add context

The lines that precede the theorem statement in the theory file are added as additional context. For example, the figure below shows such a context:

Baldur's proof generation model with context can make use of this additional information. Strings that appear in the fun_sum_commute theorem statement appear again in this context, so the additional information surrounding them can help the model make better predictions.

Context can be a statement (theorem, definition, proof) or a natural language annotation.

To take advantage of the LLM's available input length, the researchers first added up to 50 statements from the same theory file.

During training, all these statements are first tokenized and then the left side of the sequence is truncated to fit the input length.
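A minimal sketch of this context assembly is shown below. It assumes that the statements closest to the theorem are the ones kept, and it uses whitespace splitting as a stand-in tokenizer; the real model uses its own tokenizer and input limit.

```python
from typing import Callable, List, Sequence


def build_context_input(
    context_statements: Sequence[str],
    theorem: str,
    max_tokens: int,
    max_statements: int = 50,
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in tokenizer (assumption)
) -> List[str]:
    """Prepend up to max_statements preceding statements, then truncate on the left."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    # Keep the statements nearest to the theorem (assumed selection rule).
    recent = list(context_statements)[-max_statements:] if max_statements > 0 else []

    tokens: List[str] = []
    for statement in recent:
        tokens.extend(tokenize(statement))
    tokens.extend(tokenize(theorem))

    # Left truncation: drop the oldest tokens so the theorem statement survives.
    return tokens[-max_tokens:]
```

Truncating from the left means that, when the input is too long, the earliest context is discarded first and the theorem statement itself is the last thing to be cut.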

The above figure shows the relationship between the proof success rate and the number of proof attempts for the proof generation model with and without context. The model with context consistently outperforms the plain generation model.

The graph above plots the proportion of verified theorems against inference cost for models of different sizes and temperatures.

It shows the proof success rate against the number of proof attempts for the plain generation model and for the 8B and 62B models with context.

The 62B proof generation model with context outperforms the 8B model with context.

However, the authors emphasize that, because these experiments are so expensive, they could not tune the hyperparameters, and the 62B model might perform better if it were tuned.
