Latest research, GPT-4 exposes shortcomings! Can't quite understand the language ambiguity!

Natural Language Inference (NLI) is an important task in natural language processing. Its goal is to determine whether a hypothesis can be inferred from a given premise. Ambiguity is an intrinsic feature of natural language, so handling it is an important part of human language understanding, and the diversity of human expression makes it one of the main difficulties in natural language reasoning. Natural language processing algorithms are now applied in scenarios such as question answering, speech recognition, machine translation, and natural language generation, but even with these technologies, fully resolving ambiguity remains an extremely challenging task.

On NLI tasks, large language models such as GPT-4 do face challenges. One problem is that ambiguity makes it difficult for a model to pin down the intended meaning of a sentence. In addition, because natural language is flexible and diverse, many kinds of relationships can hold between texts, which makes NLI datasets extremely complex and poses a significant challenge to the generality and generalization ability of natural language processing models. Large models are already widely used in areas such as conversational interfaces and writing aids, so handling ambiguous language will be crucial to their future success. Handling ambiguity well helps a model adapt to different contexts, communicate more clearly, and identify misleading or deceptive speech.

The title of the paper, which discusses ambiguity in large models, uses a pun: "We're Afraid...". It expresses the current concern that language models struggle to model ambiguity accurately, and it also alludes to the kind of linguistic structure the paper describes. The paper also shows that researchers are working to develop new benchmarks that genuinely challenge powerful new large models, with the aim of understanding and generating natural language more accurately and achieving new breakthroughs.

Paper title: We're Afraid Language Models Aren't Modeling Ambiguity

Paper link: https://arxiv.org/abs/2304.14399

Code and data: https://github.com/alisawuffles/ambient

The authors set out to study whether pre-trained large models can recognize and distinguish sentences with multiple possible interpretations, and to evaluate how the models tell the different readings apart. However, existing benchmark data usually does not contain ambiguous examples, so the authors had to build their own experiments to explore the question.

The traditional NLI three-way annotation scheme is a labeling method for NLI tasks. It requires the annotator to choose exactly one of three labels to describe the relationship between the premise and the hypothesis. The three labels are usually "entailment", "neutral" and "contradiction".

The authors ran their experiments in the NLI format, taking a functional approach that characterizes ambiguity through its effect on entailment relations: an ambiguity in the premise or hypothesis matters when it changes the label. They propose a benchmark called AMBIENT (Ambiguity in Entailment) that covers a variety of lexical, syntactic, and pragmatic ambiguities, and more broadly covers sentences that may convey multiple different messages.

As shown in Figure 1, ambiguity can cause an unintended misunderstanding (top of Figure 1) or can be used deliberately to mislead an audience (bottom of Figure 1). For example, a cat that is "lost" after leaving home may be lost in the sense that it cannot find its way home (the entailment reading), or lost in the sense that others cannot find it, for instance because it has not returned for several days (the neutral reading).

▲Figure 1 An example of ambiguity, illustrated with a lost cat

AMBIENT Dataset Introduction

Selected Examples

The authors provide 1645 examples covering multiple types of ambiguity, including handwritten examples and examples drawn from existing NLI datasets and linguistics textbooks. Each example in AMBIENT carries a set of labels, one for each possible reading, together with a disambiguating rewrite for each reading, as shown in Table 1.

▲Table 1 Premises and hypotheses in selected examples
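The structure described above (one ambiguous premise-hypothesis pair, a set of labels, and one disambiguating rewrite per reading) can be sketched as a small data structure. This is only an illustration: the class and field names are hypothetical and are not the actual AMBIENT file schema, and the sentences are our paraphrase of the lost-cat example from Figure 1.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The three labels of the traditional NLI annotation scheme.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass
class Disambiguation:
    """One unambiguous rewrite of the example, with exactly one label."""
    premise: str
    hypothesis: str
    label: str


@dataclass
class AmbiguousExample:
    """An NLI pair plus one disambiguating rewrite per reading (hypothetical schema)."""
    premise: str
    hypothesis: str
    disambiguations: list[Disambiguation] = field(default_factory=list)

    @property
    def labels(self) -> set[str]:
        # The set label is the union of the labels of all readings.
        return {d.label for d in self.disambiguations}

    @property
    def is_ambiguous(self) -> bool:
        # Ambiguity matters here only if the readings lead to different labels.
        return len(self.labels) > 1


# Illustrative example, paraphrasing the lost-cat case.
example = AmbiguousExample(
    premise="The cat was lost after leaving the house.",
    hypothesis="The cat could not find its way home.",
    disambiguations=[
        Disambiguation(
            premise="The cat was unable to find its way after leaving the house.",
            hypothesis="The cat could not find its way home.",
            label="entailment",
        ),
        Disambiguation(
            premise="The cat could not be found after leaving the house.",
            hypothesis="The cat could not find its way home.",
            label="neutral",
        ),
    ],
)

print(sorted(example.labels), example.is_ambiguous)
```

Deriving the set label from the per-reading labels, rather than storing it separately, keeps the two from drifting apart in this sketch.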

Generated Examples

The researchers also used an overgenerate-and-filter approach to build a large corpus of unlabeled NLI examples that covers different kinds of ambiguity more comprehensively. Inspired by previous work, they automatically identified groups of premise-hypothesis pairs that share a reasoning pattern and prompted the model to create new examples with the same pattern, which improves the quality of the corpus.

Annotation and Verification

The examples obtained in the previous steps then need to be annotated. This process involved annotation by two experts, verification and consolidation by one expert, and verification by some of the authors. In addition, 37 linguistics students selected a set of labels for each example and provided disambiguating rewrites. All annotated examples were filtered and verified, resulting in 1503 final examples.

The specific process is shown in Figure 2. First, InstructGPT is used to create unlabeled examples, which two linguists then annotate independently. Finally, one author consolidates their work to produce the final annotations.

▲Figure 2 Annotation process for the generated examples in AMBIENT

The paper also discusses the consistency of annotations across annotators and the types of ambiguity present in AMBIENT. The authors randomly selected 100 examples from the dataset as the development set and used the remaining examples as the test set. Figure 3 shows the distribution of set labels; each example has a corresponding set of inference labels. The study shows that, even in the presence of ambiguity, the annotations of multiple annotators are consistent, and that combining the results of multiple annotators improves annotation accuracy.

▲Figure 3 Distribution of set labels in AMBIENT
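A minimal sketch of the split and the set-label count described above, assuming the examples are held in a plain Python list. The function names, the seed, and the toy label sets are our assumptions; the article does not say how the authors drew the random sample.

```python
import random
from collections import Counter


def split_dev_test(examples, dev_size=100, seed=0):
    """Randomly hold out `dev_size` examples as a development set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    dev_idx = set(indices[:dev_size])
    dev = [ex for i, ex in enumerate(examples) if i in dev_idx]
    test = [ex for i, ex in enumerate(examples) if i not in dev_idx]
    return dev, test


def label_set_distribution(label_sets):
    """Count how often each set label (e.g. {entailment, neutral}) occurs."""
    return Counter(frozenset(s) for s in label_sets)


# Toy stand-in for the 1645 examples: only the size matches the article.
toy_examples = list(range(1645))
dev, test = split_dev_test(toy_examples)
print(len(dev), len(test))

# Toy set labels, invented for illustration only.
toy_label_sets = [{"entailment"}, {"entailment", "neutral"}, {"neutral"},
                  {"neutral", "entailment"}]
for labels, count in label_set_distribution(toy_label_sets).items():
    print(sorted(labels), count)
```

Using `frozenset` as the counter key makes `{"entailment", "neutral"}` and `{"neutral", "entailment"}` count as the same set label.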

Does ambiguity explain "disagreement"?

This study analyzes how annotators behave when they label ambiguous input under the traditional NLI three-way annotation scheme. It finds that annotators can be aware of ambiguity and that ambiguity is a major cause of label disagreement, thus challenging the popular assumption that the uncertainty of an example should be modeled as "disagreement" among annotators.

The study used the AMBIENT dataset and hired 9 crowdworkers to annotate each ambiguous example.

The task is divided into three steps:

  1. Annotate ambiguous examples
  2. Identify possible different interpretations
  3. Annotate disambiguated examples

In step 2, the three candidate interpretations consist of the two possible meanings plus a similar but not identical sentence. Finally, each candidate interpretation is substituted into the original example, yielding three new NLI examples, and the annotator chooses a label for each.

The results support the hypothesis. Under a single-label scheme, the original ambiguous examples produce highly inconsistent results: when labeling ambiguous sentences, different people make different judgments. However, when a disambiguation step was added to the task, annotators were generally able to identify and verify multiple readings of a sentence, and the inconsistencies were largely resolved. Disambiguation is therefore an effective way to reduce the impact of annotator subjectivity on the results.
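The effect described above can be made concrete with a simple agreement measure. The sketch below uses the share of annotators who chose the most common label; this measure is our choice for illustration and is not necessarily the statistic used in the paper, and the 9 votes per item are invented.

```python
from collections import Counter


def majority_agreement(labels):
    """Fraction of annotators who chose the most common label."""
    counts = Counter(labels)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(labels)


# Hypothetical votes from 9 crowdworkers (not real data).
ambiguous_votes = ["entailment"] * 5 + ["neutral"] * 4      # original ambiguous example
reading_a_votes = ["entailment"] * 9                        # after disambiguation, reading A
reading_b_votes = ["neutral"] * 8 + ["entailment"] * 1      # after disambiguation, reading B

for name, votes in [("ambiguous", ambiguous_votes),
                    ("reading A", reading_a_votes),
                    ("reading B", reading_b_votes)]:
    print(f"{name}: {majority_agreement(votes):.2f}")
```

In this toy case the split vote on the ambiguous example turns into near-unanimous votes once each reading is labeled separately, which is the pattern the experiment reports.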

Evaluating the performance of large models

This part tests whether language models can learn in context to generate disambiguations directly, together with the corresponding labels. To this end, the authors built a natural prompt and validated model performance with automatic and human evaluation, as shown in Table 2.

▲Table 2 Few-shot template for the disambiguation generation task when the premise is ambiguous

In the test, 4 other test examples serve as in-context examples for each example, and performance is scored with the EDIT-F1 metric and with human evaluation. The results in Table 3 show that GPT-4 performed best, achieving an EDIT-F1 score of 18.0% and a human-evaluated accuracy of 32.0%. The authors also observed that large models often disambiguate by adding extra context that directly confirms or denies the hypothesis. Note, however, that human evaluation may overestimate a model's ability to accurately report the source of ambiguity.

▲Table 3 Performance of large models on AMBIENT
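The article does not define EDIT-F1. The sketch below shows one edit-based F1 in that spirit, assuming a disambiguation is represented by the unigrams it adds to and deletes from the ambiguous sentence, and that the score is the F1 between the reference edits and the predicted edits. The whitespace tokenization, lowercasing, and function names are our assumptions; the paper's implementation may differ.

```python
from collections import Counter


def edits(source, rewrite):
    """Unigrams added to and deleted from `source` to obtain `rewrite`."""
    src = Counter(source.lower().split())
    new = Counter(rewrite.lower().split())
    result = Counter()
    for token, n in (new - src).items():   # tokens the rewrite adds
        result[("+", token)] = n
    for token, n in (src - new).items():   # tokens the rewrite deletes
        result[("-", token)] = n
    return result


def edit_f1(source, reference, prediction):
    """F1 between the reference edits and the predicted edits."""
    ref = edits(source, reference)
    pred = edits(source, prediction)
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences based on the lost-cat example (not taken from the dataset).
source = "the cat was lost after leaving the house"
reference = "the cat was unable to find its way after leaving the house"
prediction = "the cat could not find its way after leaving the house"

print(round(edit_f1(source, reference, prediction), 3))
```

Scoring edits rather than whole sentences means a model gets no credit for simply copying the ambiguous sentence: a prediction identical to the source has no edits and scores 0.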

Q2. Can the validity of plausible interpretations be recognized?

This part studies how well large models recognize the possible interpretations of ambiguous sentences. The researchers created a set of true and false statement templates and tested the models zero-shot, evaluating how well each model chooses between "true" and "false". GPT-4 is the best model; however, when performance is measured across all four templates for an ambiguous sentence's interpretations, GPT-4 does worse than random guessing. Large models also have consistency problems: across the different interpretations of the same ambiguous sentence, a model's answers may contradict one another.

These findings suggest that we need further research on how to improve the understanding of ambiguous sentences by large models and better evaluate the performance of large models.
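To see why a model can look acceptable on individual true/false prompts yet fall below random guessing when all four templates are considered together, here is a small sketch. The template wording is illustrative and should not be read as the paper's exact prompts, `answer_fn` is a hypothetical stand-in for a model call, and "joint accuracy" (all four templates answered correctly for one interpretation) is our reading of the evaluation.

```python
# (template, expected answer) pairs; wording is illustrative only.
TEMPLATES = [
    ("This may mean: {d}", True),
    ("This does not necessarily mean: {d}", True),
    ("This cannot mean: {d}", False),
    ("This can only mean: {d}", False),
]


def evaluate(answer_fn, items):
    """Return (per-prompt accuracy, joint accuracy over all four templates).

    `items` is a list of (ambiguous_sentence, one_of_its_interpretations).
    `answer_fn(prompt)` must return True or False.
    """
    per_prompt, joint = [], []
    for sentence, interpretation in items:
        results = []
        for template, expected in TEMPLATES:
            prompt = f"{sentence} {template.format(d=interpretation)} True or False?"
            results.append(answer_fn(prompt) == expected)
        per_prompt.extend(results)
        joint.append(all(results))  # correct only if all four templates are right
    return sum(per_prompt) / len(per_prompt), sum(joint) / len(joint)


def always_true(prompt):
    """Degenerate stand-in model that answers True to everything."""
    return True


items = [
    ("The cat was lost after leaving the house.",
     "The cat could not find its way after leaving the house."),
    ("The cat was lost after leaving the house.",
     "The cat could not be found after leaving the house."),
]

accuracy, joint_accuracy = evaluate(always_true, items)
print(accuracy, joint_accuracy)
```

The stand-in model reaches chance level on individual prompts but never gets all four templates right, which shows how the joint measure exposes inconsistent answers.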

Q3. Modeling open-ended continuation under different interpretations

This part studies ambiguity understanding through language modeling. Given a context, the models are tested by comparing their predictions for how the text continues under the different possible interpretations. To measure a model's ability to handle ambiguity, the researchers use KL divergence to quantify the model's "surprise": they compare the continuation probabilities the model produces given the ambiguous sentence with those it produces given a disambiguated context. They also introduce "distractor sentences", in which a noun is randomly replaced, to test the models further.

The experimental results show that FLAN-T5 has the highest accuracy, but results are inconsistent across test suites (LS involves synonym replacement, PC involves correction of spelling errors, and SSD involves correction of grammatical structures) and across models, indicating that ambiguity remains a serious challenge for models.
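The KL-based comparison can be sketched on toy distributions. The idea, as we read it, is that continuations of a true interpretation should look less surprising relative to the ambiguous sentence than continuations of a distractor. The direction of the divergence, the three-outcome distributions, and the pass criterion below are our assumptions for illustration; a real test would estimate these quantities from model probabilities over sampled continuations.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def prefers_interpretations(p_ambiguous, p_interpretations, p_distractor):
    """True if every interpretation is closer to the ambiguous sentence than the distractor is."""
    distractor_kl = kl_divergence(p_distractor, p_ambiguous)
    return all(kl_divergence(p, p_ambiguous) < distractor_kl
               for p in p_interpretations)


# Toy continuation distributions over three possible next phrases (invented numbers).
p_ambiguous = [0.5, 0.4, 0.1]
p_reading_a = [0.6, 0.3, 0.1]
p_reading_b = [0.3, 0.6, 0.1]
p_distractor = [0.1, 0.1, 0.8]   # e.g. a sentence with a noun randomly replaced

print(round(kl_divergence(p_reading_a, p_ambiguous), 4))
print(round(kl_divergence(p_distractor, p_ambiguous), 4))
print(prefers_interpretations(p_ambiguous, [p_reading_a, p_reading_b], p_distractor))
```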

Multi-label NLI model experiment

As shown in Table 4, NLI models fine-tuned on existing data with label variation still leave much room for improvement, especially on the multi-label NLI task.

▲Table 4 Performance of multi-label NLI model on AMBIENT
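In a multi-label setting, a model outputs a set of labels rather than a single one. A common way to do this, sketched below, is to score each label independently and keep every label whose probability clears a threshold. The 0.5 threshold, the fallback to the top label, and the example probabilities are our assumptions and are not taken from the paper.

```python
def predict_label_set(probs, threshold=0.5):
    """Turn independent per-label probabilities into a set of labels."""
    labels = {label for label, p in probs.items() if p >= threshold}
    if not labels:
        # Assumed fallback: always predict at least the most likely label.
        labels = {max(probs, key=probs.get)}
    return labels


# Hypothetical per-label probabilities for one premise-hypothesis pair.
ambiguous_probs = {"entailment": 0.8, "neutral": 0.6, "contradiction": 0.05}
unsure_probs = {"entailment": 0.3, "neutral": 0.4, "contradiction": 0.2}

print(sorted(predict_label_set(ambiguous_probs)))
print(sorted(predict_label_set(unsure_probs)))
```

An example is treated as ambiguous by such a model when the predicted set contains more than one label.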

Detecting misleading political speech

This experiment studies how political speech can be read in different ways and shows that models sensitive to multiple readings can be put to effective use. The results are shown in Table 5. For an ambiguous sentence, some paraphrases naturally disambiguate it, because a paraphrase must either preserve the ambiguity or clearly express one specific meaning.

▲Table 5 Political claims flagged as ambiguous by the paper's detection method

In addition, the interpretations behind a prediction can reveal the source of the ambiguity. By further analyzing the false positives, the authors also found many ambiguities that were not mentioned in fact checks, illustrating the great potential of these tools for preventing misunderstandings.
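One way to picture this kind of detector: compare a claim with each of its paraphrases using a multi-label NLI model, and flag the claim when some paraphrase receives more than one label, since that paraphrase holds under only one reading. The sketch below follows that idea under our own assumptions; the flagging rule, the function names, the example claim, and the hand-written stub that stands in for a model are all hypothetical.

```python
def flag_ambiguous(claim, paraphrases, predict_labels):
    """Return (is_ambiguous, paraphrases that received more than one label).

    `predict_labels(claim, paraphrase)` must return a set of NLI labels.
    """
    revealing = [p for p in paraphrases if len(predict_labels(claim, p)) > 1]
    return bool(revealing), revealing


# Hypothetical claim with a scope ambiguity ("none did" vs. "not all did").
CLAIM = "Every district did not receive funding."

# Hand-written stand-in for model output, for illustration only.
STUB_PREDICTIONS = {
    "No district received funding.": {"entailment", "neutral"},
    "It is not the case that every district received funding.": {"entailment"},
}


def stub_predict_labels(claim, paraphrase):
    return STUB_PREDICTIONS[paraphrase]


flagged, revealing = flag_ambiguous(CLAIM, list(STUB_PREDICTIONS), stub_predict_labels)
print(flagged)
print(revealing)
```

The paraphrases returned alongside the flag are what make the result inspectable: they point at the reading that the ambiguous claim may or may not support.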

Summary

As this article points out, the ambiguity of natural language will be a key challenge in model optimization. We expect that, as the technology develops, natural language understanding models will identify the context and key points of a text more accurately and be more sensitive when processing ambiguous text. Although a benchmark now exists for evaluating how well natural language processing models identify ambiguity, and it gives a better understanding of the models' limitations in this area, this remains a very challenging task.

Xi Xiaoyao Technology Talk Original

Author | IQ dropped all over the place, Python

Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to request deletion.