Mathematical reasoning is a core ability of human intelligence, but abstract thinking and logical reasoning are still a big challenge for machines. Large-scale pre-trained language models, such as GPT-3 and GPT-4, have made significant progress in text-based mathematical reasoning (such as mathematical word problems). However, it is currently unclear whether these models can handle more complex problems involving heterogeneous information such as tabular data. To fill this gap, researchers from UCLA and the Allen Institute for Artificial Intelligence (AI2) launched Tabular Math Word Problems (TabMWP), a dataset of 38,431 open-domain problems that require both text and Perform mathematical reasoning on tabular data to get the correct answer. Each question in TabMWP is associated with a context that contains an image, text, or table in a structured format.
Researchers evaluated different pre-trained models on TabMWP including Few-shot GPT-3. As existing research has found, Few-shot GPT-3 relies heavily on the selection of in-context examples, which results in its performance being quite unstable when examples are randomly selected. This instability is even more severe when dealing with complex inference problems like TabMWP. In order to solve this problem, the author proposed the PromptPG method, which converts the selection of examples into the contextual bandit problem in reinforcement learning, and uses Policy Gradient to train a policy network to learn to select the optimal in from a small amount of training data. -context example. Experimental results show that their proposed PromptPG method exceeds the optimal baseline (Few-shot CoT GPT-3) by 5.31% in answering questions, and their method significantly reduces the problem compared to randomly selected in-context examples. The variance of predictions improves the stability of this type of method.
- Paper link: https://arxiv.org/abs/2209.14610
- Code link: https://github.com/lupantech/PromptPG
- ## Project homepage: https://promptpg.github.io
- Data visualization: https://promptpg.github.io/explore
The following are two examples from the TabMWP data set. One is a free-text question with numerical answers, and the other is a multi-choice question with text answers. As you can see, each question provides a solution that includes step-by-step reasoning. To solve problems in TabMWP, the system must be capable of both table lookup and multi-step mathematical reasoning. Take the example in the picture below, to answer "how much will she spend (if Tracy buys three kinds of breads)", we need to first find the corresponding prices of the three kinds of bread in the table, and then calculate the cost of buying each kind of bread. costs and sum them to get the final cost.
As shown in the statistics in the table below, the TabMWP data set contains 38,431 tabular math problems. 74.7% of the questions were free-text questions and 25.3% were multiple-choice questions. TabMWP has a total of 28,876 unique questions, 6,153 unique answers, and 35,442 unique solutions, indicating its rich diversity in question distribution. The average length of the questions was 22.1 words and the average length of the answers was 49.5 words, indicating the lexical richness of TabMWP. A distinguishing feature of TabMWP is that each problem is accompanied by a table context, without which the problem cannot be solved. TabMWP has a total of 37,644 different tables, with an average table size of 5.9 rows and 2.2 columns, 12.9 cells, and a maximum of 54 cells. These statistics show that the tables in TabMWP are also rich in diversity.
The TabMWP dataset has two different question types and five different answer types:
Every question in TabMWP has a tabular context, which is represented in three formats: image, semi-structured text and structured. This opens the possibility to develop different types of inference models.
Compared with existing data sets, TabMWP requires both table understanding and mathematical reasoning abilities to answer questions. In addition, TabMWP has a detailed multi-step reasoning process for each question, which has obvious advantages in data set size, table type, question type and answer type. To the best of the knowledge of this paper, TabMWP is the first mathematical reasoning dataset in the open-domain tabular scenario.
2. PromptPG method
Considering the achievements of large-scale pre-trained models such as GPT-3 in solving mathematical application problems Successfully, the authors first established a benchmark on TabMWP using few-shot GPT-3. They randomly select some contextual examples from the training set as well as test examples to form prompts that prompt GPT-3 to predict answers. However, recent research shows that this kind of few-shot learning based on random selection may perform very unstable on different contextual example selections. Random selection may be even less effective when dealing with complex inference problems like TabMWP, which involve tables of different types and formats.
In order to solve this problem, the author proposed an improved method: Prompt learning through Policy Gradient, learning to select contextual examples from a small amount of training data, called for PromptPG. As shown in Figure 2, the policy network learns to find the best in-context example from the candidate pool (candidate examples), and its optimization goal is to maximize the prediction of a given training example (training example) when interacting with the GPT-3 environment award. The policy network for selecting examples is a BERT language model based on fixed parameters and a single-layer neural network with learnable parameters. After completing optimization learning, PromptPG can dynamically select different optimal examples from candidate examples for different test questions, thereby maximizing the inference performance of GPT-3.
The following is the learning algorithm of PromptPG.
3. Experiment and analysis
Pre-training and fine-tuning
Table 3 compares the results of PromptPG and different benchmarks on the TabMWP data set. It can be seen that TAPEX performs better than UnifiedQA due to pre-training on tabular data with similar parameter amounts. For both TAPEX and UnifiedQA, increasing the number of parameters in the model can improve the accuracy of predictions. In addition, fine-tuning the model on TabMWP can also greatly improve the accuracy of predictions.
Large-scale language model
GPT-3 without any fine-tuning (Zero-shot GPT- 3), it can achieve accuracy similar to the fine-tuned UnifiedQA and TAPEX models. If the Few-shot GPT-3 model randomly selects two in-context examples as GPT-3 hints, it can further improve by 0.17% compared to Zero-shot GPT-3. By having Few-shot GPT-3 generate multiple intermediate steps before generating the final answer (Few-shot-CoT GPT-3), the researchers were able to obtain an optimal baseline model with an accuracy of 62.92%.
PromptPG
Different from randomly selecting in-context examples, the PromptPG proposed in this article trains a policy network through Policy Gradient to select more appropriate in-context examples, and achieved the highest prediction result (68.23%) on TabMWP. Its average prediction accuracy exceeds the best baseline model (Few-shot-CoT GPT-3) by 5.31%. Notably, PromptPG demonstrates its superiority in prediction accuracy for almost all question types, answer types, and question difficulties. Despite this, PromptPG still has a lot of room for improvement from the human performance of 90.22%.
Ablation experiment
Table 4 shows that all input elements of TabMWP (question text, form information, option information) are all critical to answering the question correctly. Only with all problem elements as input information, Zero-shot GPT-3 achieved its relatively highest average prediction accuracy (59.50%).
Different sample selection
As a comparative experiment, the researchers also Other methods with different sample selections were compared. As shown in Table 5, choosing the same question type or answer type as the test question can help the model find more relevant examples and improve the accuracy of the answer. Choosing the most complex examples does not consistently improve answer accuracy. Fixed selection of the two best examples among the candidate examples can slightly improve accuracy and reduce variance. Selecting the example that is semantically closest to the test problem achieves the closest accuracy to the PromptPG method. Overall, PromptPG fully demonstrated its advantages in improving prediction accuracy and reducing prediction variance.
The following figure shows an example of PromptPG selection and the final prediction result. It can be seen that the PromptPG method can improve the inference performance of Few-shot GPT-3 by selecting examples with similar mathematical abilities to the test questions.
Example of successful prediction
The following shows PromptPG for a free Correct answers to text questions. This question requires adding and dividing eight numbers in a table to find the average.
In the following example, the model is asked to understand a tax report and calculate the salary after tax deductions.
The following shows PromptPG’s correct predictions for multiple-choice questions. The given table has a total of 9 rows and 6 columns. The model successfully locates the target cell in the table and performs multi-step inference to predict the correct answer.
In the following example, the model needs to compare the budget and total costs to verify whether Ariana has enough money.
Example of prediction failure
The following shows PromptPG for free text Misprediction of the problem. The model retrieved the wrong price for rose quartz, thereby miscalculating the total cost of the three items.
In the following example, the question provides an abstract stem-and-leaf table. The model was unable to understand this domain-specific table and lacked advanced logical reasoning capabilities to get the wrong answers.
#The following examples show that existing models do not seem to have the ability to sort numbers.
In the following example, the time exactly consistent with the current time mentioned in the question does not appear in the table, so the model cannot accurately locate the next time. Departure time for one stop.
#In the following example, it is difficult for the model to accurately complete arithmetic operations on a long series of numbers.
#4. Conclusion and outlook
The author proposed TabMWP, which is the first mathematical problem solving in tabular context. large-scale data sets. TabMWP contains 38,431 open-domain questions, including two question types and five answer types, with each question marked with a multi-step solution process. The authors used state-of-the-art QA and TableQA methods, conducted comprehensive experiments on TabMWP in pre-trained and fine-tuned settings, and evaluated using the large pre-trained language model GPT-3. The author further proposes a new reinforcement learning method, PromptPG, which uses Policy Gradient learning to select optimal instances from the training data for prompting the GPT-3 model. Experimental results show that PromptPG significantly outperforms existing baselines and reduces performance instability in predictions compared to random selection.
The above is the detailed content of PromptPG: When reinforcement learning meets large-scale language models. For more information, please follow other related articles on the PHP Chinese website!

1 前言在发布DALL·E的15个月后,OpenAI在今年春天带了续作DALL·E 2,以其更加惊艳的效果和丰富的可玩性迅速占领了各大AI社区的头条。近年来,随着生成对抗网络(GAN)、变分自编码器(VAE)、扩散模型(Diffusion models)的出现,深度学习已向世人展现其强大的图像生成能力;加上GPT-3、BERT等NLP模型的成功,人类正逐步打破文本和图像的信息界限。在DALL·E 2中,只需输入简单的文本(prompt),它就可以生成多张1024*1024的高清图像。这些图像甚至

“Making large models smaller”这是很多语言模型研究人员的学术追求,针对大模型昂贵的环境和训练成本,陈丹琦在智源大会青源学术年会上做了题为“Making large models smaller”的特邀报告。报告中重点提及了基于记忆增强的TRIME算法和基于粗细粒度联合剪枝和逐层蒸馏的CofiPruning算法。前者能够在不改变模型结构的基础上兼顾语言模型困惑度和检索速度方面的优势;而后者可以在保证下游任务准确度的同时实现更快的处理速度,具有更小的模型结构。陈丹琦 普

Wav2vec 2.0 [1],HuBERT [2] 和 WavLM [3] 等语音预训练模型,通过在多达上万小时的无标注语音数据(如 Libri-light )上的自监督学习,显著提升了自动语音识别(Automatic Speech Recognition, ASR),语音合成(Text-to-speech, TTS)和语音转换(Voice Conversation,VC)等语音下游任务的性能。然而这些模型都没有公开的中文版本,不便于应用在中文语音研究场景。 WenetSpeech [4] 是

由于复杂的注意力机制和模型设计,大多数现有的视觉 Transformer(ViT)在现实的工业部署场景中不能像卷积神经网络(CNN)那样高效地执行。这就带来了一个问题:视觉神经网络能否像 CNN 一样快速推断并像 ViT 一样强大?近期一些工作试图设计 CNN-Transformer 混合架构来解决这个问题,但这些工作的整体性能远不能令人满意。基于此,来自字节跳动的研究者提出了一种能在现实工业场景中有效部署的下一代视觉 Transformer——Next-ViT。从延迟 / 准确性权衡的角度看,

3月27号,Stability AI的创始人兼首席执行官Emad Mostaque在一条推文中宣布,Stable Diffusion XL 现已可用于公开测试。以下是一些事项:“XL”不是这个新的AI模型的官方名称。一旦发布稳定性AI公司的官方公告,名称将会更改。与先前版本相比,图像质量有所提高与先前版本相比,图像生成速度大大加快。示例图像让我们看看新旧AI模型在结果上的差异。Prompt: Luxury sports car with aerodynamic curves, shot in a

人工智能就是一个「拼财力」的行业,如果没有高性能计算设备,别说开发基础模型,就连微调模型都做不到。但如果只靠拼硬件,单靠当前计算性能的发展速度,迟早有一天无法满足日益膨胀的需求,所以还需要配套的软件来协调统筹计算能力,这时候就需要用到「智能计算」技术。最近,来自之江实验室、中国工程院、国防科技大学、浙江大学等多达十二个国内外研究机构共同发表了一篇论文,首次对智能计算领域进行了全面的调研,涵盖了理论基础、智能与计算的技术融合、重要应用、挑战和未来前景。论文链接:https://spj.scien

译者 | 李睿审校 | 孙淑娟近年来, Transformer 机器学习模型已经成为深度学习和深度神经网络技术进步的主要亮点之一。它主要用于自然语言处理中的高级应用。谷歌正在使用它来增强其搜索引擎结果。OpenAI 使用 Transformer 创建了著名的 GPT-2和 GPT-3模型。自从2017年首次亮相以来,Transformer 架构不断发展并扩展到多种不同的变体,从语言任务扩展到其他领域。它们已被用于时间序列预测。它们是 DeepMind 的蛋白质结构预测模型 AlphaFold

说起2010年南非世界杯的最大网红,一定非「章鱼保罗」莫属!这只位于德国海洋生物中心的神奇章鱼,不仅成功预测了德国队全部七场比赛的结果,还顺利地选出了最终的总冠军西班牙队。不幸的是,保罗已经永远地离开了我们,但它的「遗产」却在人们预测足球比赛结果的尝试中持续存在。在艾伦图灵研究所(The Alan Turing Institute),随着2022年卡塔尔世界杯的持续进行,三位研究员Nick Barlow、Jack Roberts和Ryan Chan决定用一种AI算法预测今年的冠军归属。预测模型图


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Dreamweaver CS6
Visual web development tools

ZendStudio 13.5.1 Mac
Powerful PHP integrated development environment

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

PhpStorm Mac version
The latest (2018.2.1) professional PHP integrated development tool
