Latest research, GPT-4 exposes shortcomings! Can't quite understand the language ambiguity!

PHPz
2023-05-11 21:52:04

Natural Language Inference (NLI) is an important task in natural language processing. Given a premise and a hypothesis, the goal is to determine whether the hypothesis can be inferred from the premise. Because ambiguity is an intrinsic feature of natural language, handling it is also an important part of human language understanding, and the diversity of human expression makes it one of the main difficulties in natural language inference. Natural language processing algorithms are now applied in question answering, speech recognition, machine translation, and natural language generation, but even with these technologies, fully resolving ambiguity remains extremely challenging.

Large language models such as GPT-4 do face challenges on NLI tasks. One problem is that ambiguity makes it hard for a model to pin down what a sentence actually means. In addition, because natural language is flexible and diverse, many kinds of relationships can hold between texts, which makes NLI datasets extremely complex and poses a significant challenge to the generality and generalization ability of the models. Handling ambiguous language will therefore be crucial to the future success of large models, which are already widely used in conversational interfaces and writing aids. Handling ambiguity well helps a model adapt to different contexts, communicate more clearly, and identify misleading or deceptive speech.

The title of the paper uses a pun, "We're Afraid...", which expresses the authors' concern that language models struggle to model ambiguity accurately and is itself an instance of the kind of ambiguous construction the paper describes. The paper also shows that researchers are working to build new benchmarks that genuinely challenge today's powerful large models, with the aim of understanding and generating natural language more accurately.

Paper title: We're Afraid Language Models Aren't Modeling Ambiguity

Paper link: https://arxiv.org/abs/2304.14399

Code and data: https://github.com/alisawuffles/ambient

The authors set out to study whether pre-trained large models can recognize and distinguish sentences that have multiple possible interpretations, and to evaluate how well the models tell the different readings apart. However, existing benchmarks usually do not contain ambiguous examples, so the authors had to build their own experiments to explore the question.

The traditional three-way NLI annotation scheme asks the annotator to choose exactly one of three labels to describe the relationship between the premise and the hypothesis. The three labels are usually "entailment", "neutral" and "contradiction".

The authors run their experiments in the NLI format and take a functional approach: ambiguity is characterized by the effect that an ambiguous premise or hypothesis has on the entailment relation. They propose a benchmark called AMBIENT (Ambiguity in Entailment) that covers a variety of lexical, syntactic, and pragmatic ambiguities and, more broadly, sentences that may convey several different messages.

As shown in Figure 1, ambiguity can cause an unintended misunderstanding (top of Figure 1) or be used deliberately to mislead an audience (bottom of Figure 1). For example, if a cat is lost after leaving home, it may be lost in the sense that it cannot find its way home (the entailment reading), or, if it has not come home for several days, lost in the sense that others cannot find it (the neutral reading).

▲Figure 1 Ambiguity illustrated with the lost-cat example

AMBIENT Dataset Introduction

Selected Examples

AMBIENT provides 1645 examples covering multiple types of ambiguity, including hand-written examples and examples drawn from existing NLI datasets and linguistics textbooks. Each example in AMBIENT carries a set of labels, one for each possible reading, together with a disambiguating rewrite for each reading, as shown in Table 1.

▲Table 1 Premises and hypotheses in selected examples
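
The structure of an AMBIENT example (one premise-hypothesis pair, a set of labels, and one disambiguating rewrite per reading) can be sketched as a small data structure. This is a minimal illustration, not the schema used in the authors' repository: the class and field names are hypothetical, and the sentences are paraphrased from the lost-cat example rather than copied from the dataset.

```python
from dataclasses import dataclass, field

LABELS = ("entailment", "neutral", "contradiction")


@dataclass
class Disambiguation:
    """One unambiguous rewrite of the example and its single NLI label."""
    premise: str
    hypothesis: str
    label: str  # exactly one of LABELS


@dataclass
class AmbiguousExample:
    """A premise-hypothesis pair plus one rewrite per possible reading."""
    premise: str
    hypothesis: str
    disambiguations: list = field(default_factory=list)

    @property
    def labels(self):
        # The label set is the union of the labels of all readings.
        return {d.label for d in self.disambiguations}

    @property
    def is_ambiguous(self):
        # More than one label means the readings change the inference.
        return len(self.labels) > 1


# Hypothetical wording, paraphrasing the lost-cat example.
hypothesis = "The cat could not find its way home."
example = AmbiguousExample(
    premise="The cat was lost after leaving the house.",
    hypothesis=hypothesis,
    disambiguations=[
        Disambiguation(
            "The cat was unable to find its way after leaving the house.",
            hypothesis,
            "entailment",
        ),
        Disambiguation(
            "The cat could not be found by others after leaving the house.",
            hypothesis,
            "neutral",
        ),
    ],
)

print(sorted(example.labels), example.is_ambiguous)
```

Deriving the label set from the rewrites, rather than storing it separately, keeps the two from drifting apart in this sketch.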

Generated Examples

The researchers also use an overgenerate-and-filter approach to build a large corpus of unlabeled NLI examples that covers different kinds of ambiguity more comprehensively. Inspired by previous work, they automatically identify groups of premise-hypothesis pairs that share a reasoning pattern and improve the corpus by encouraging the creation of new examples with the same pattern.

Annotation and Validation

The examples obtained in the previous steps still need to be annotated. This process involved annotation by two experts, validation and consolidation by one expert, and validation by some of the authors. In all, 37 linguistics students took part, selecting a set of labels for each example and providing disambiguating rewrites. After filtering and validation, 1503 examples remained.

The process is shown in Figure 2: InstructGPT first creates unlabeled examples, two linguists then annotate each one independently, and finally an author consolidates their work into the final annotations.

▲Figure 2 Annotation process for generated examples in AMBIENT

The paper also discusses agreement between annotators and the types of ambiguity present in AMBIENT. The authors randomly selected 100 examples as the development set and used the rest as the test set. Figure 3 shows the distribution of label sets, where each example has a corresponding set of inference labels. The study shows that annotators largely agree on ambiguous examples, and that combining the judgments of several annotators improves annotation accuracy.

▲Figure 3 Distribution of label sets in AMBIENT

Does ambiguity explain "disagreement"?

This part of the study analyzes how annotators behave when labeling ambiguous input under the traditional three-way NLI scheme. It finds that annotators can recognize ambiguity and that ambiguity is a major cause of label disagreement, which challenges the popular assumption that "disagreement" among annotators is the way to model uncertainty about an example.

The study uses the AMBIENT dataset and hires 9 crowdworkers to annotate each ambiguous example.

The task is divided into three steps:

  1. Annotate ambiguous examples
  2. Identify possible different interpretations
  3. Annotate disambiguated examples

In step 2, the annotator sees three candidate interpretations: two possible meanings and one sentence that is similar but not identical in meaning. In step 3, each candidate interpretation is substituted into the original example to obtain three new NLI examples, and the annotator chooses a label for each.

The results support the hypothesis: under a single-label scheme, the original ambiguous examples produce highly inconsistent results, because different people read an ambiguous sentence differently and so assign different labels. When a disambiguation step is added to the task, however, annotators are generally able to recognize and verify the multiple readings of a sentence, and most of the inconsistency disappears. Disambiguation is therefore an effective way to reduce the effect of annotator subjectivity on the results.
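
The effect described above can be made concrete by measuring how strongly annotators agree before and after disambiguation. The sketch below uses a simple majority-agreement rate for 9 annotators. The vote counts are invented purely for illustration and are not results from the paper, and the paper may use a different agreement measure.

```python
from collections import Counter


def majority_agreement(votes):
    """Fraction of annotators who chose the most common label."""
    counts = Counter(votes)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(votes)


# Invented votes from 9 annotators on one ambiguous example (illustration only).
ambiguous_votes = ["entailment"] * 5 + ["neutral"] * 4

# Invented votes on the same example after each reading is spelled out.
disambiguated_votes = {
    "reading A": ["entailment"] * 9,
    "reading B": ["neutral"] * 8 + ["entailment"],
}

print("ambiguous:", round(majority_agreement(ambiguous_votes), 2))
for reading, votes in disambiguated_votes.items():
    print(reading + ":", round(majority_agreement(votes), 2))
```

In this toy data the split on the ambiguous example comes from annotators silently picking different readings, so agreement rises once each reading is labeled on its own.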

Evaluating the performance of large models

Q1. Can content related to disambiguation be directly generated?

This part tests whether a language model can directly generate disambiguations and their corresponding labels through in-context learning. To this end, the authors construct a natural prompt and evaluate the models with both automatic and human evaluation, as shown in Table 2.

▲Table 2 Few-shot template for the disambiguation generation task when the premise is ambiguous

In the test, 4 other test examples serve as in-context demonstrations for each example, and performance is measured with the EDIT-F1 metric and human evaluation. As Table 3 shows, GPT-4 performs best, with an EDIT-F1 score of 18.0% and a human-evaluated accuracy of 32.0%. The authors also observe that large models often disambiguate by adding extra context that directly confirms or denies the hypothesis. Note, however, that human evaluation may overestimate how accurately a model reports the source of ambiguity.

▲Table 3 Performance of large models on AMBIENT
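
The article does not spell out how EDIT-F1 is computed. The sketch below assumes that it represents each disambiguation by the unigrams it adds to and deletes from the ambiguous sentence, and then takes the F1 score between the reference edits and the predicted edits. That reading, the whitespace tokenization, and the function names are all assumptions made for illustration, so this should not be taken as the paper's exact implementation.

```python
from collections import Counter


def edits(source, target):
    """Multiset of unigram edits turning source into target.

    Each edit is tagged "+" (added word) or "-" (deleted word).
    Whitespace tokenization and lowercasing are assumptions of this sketch.
    """
    src = Counter(source.lower().split())
    tgt = Counter(target.lower().split())
    added = Counter({("+", w): c for w, c in (tgt - src).items()})
    deleted = Counter({("-", w): c for w, c in (src - tgt).items()})
    return added + deleted


def edit_f1(ambiguous, reference, prediction):
    """F1 between the reference edits and the predicted edits."""
    ref = edits(ambiguous, reference)
    pred = edits(ambiguous, prediction)
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences for illustration.
ambiguous = "the cat was lost after leaving the house"
reference = "the cat was unable to find its way after leaving the house"
prediction = "the cat was missing after leaving the house"

print(round(edit_f1(ambiguous, reference, prediction), 2))  # 0.25
```

Here the prediction shares only the deletion of "lost" with the reference, so precision is 1/2, recall is 1/6, and F1 is 0.25. Scoring edits rather than whole sentences avoids rewarding a model for simply copying the ambiguous input.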

Q2. Can the validity of reasonable explanations be identified?

This part studies how well large models recognize the possible readings of an ambiguous sentence. The researchers create a series of true and false statement templates and test the models zero-shot on choosing between true and false. GPT-4 is the best model, but once ambiguity is taken into account, it performs worse than random guessing at answering correctly about the interpretations across all four templates. Large models also have consistency problems: across the questions about different interpretations of the same ambiguous sentence, a model may contradict itself.

These findings suggest that we need further research on how to improve the understanding of ambiguous sentences by large models and better evaluate the performance of large models.

Q3. Can open-ended continuations be modeled under different interpretations?

This part studies how well the language models themselves capture ambiguity. Given a context, the models' predictions for the text continuation are compared under the different possible interpretations. To measure the ability to handle ambiguity, the researchers use KL divergence as a measure of the model's "surprise", comparing the continuation probabilities the model produces given the ambiguous sentence with those it produces given a correct disambiguation as context. They also introduce "distractor sentences", in which a noun is randomly replaced, to further test the models.
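
The comparison can be illustrated with a small KL divergence calculation. The sketch assumes the model passes the test when the continuation distribution given a correct disambiguation is closer to the distribution given the ambiguous sentence than the distribution given a distractor is. The toy probabilities, the three-continuation setup, and the direction of the divergence are all assumptions for illustration and are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes.

    Assumes q is nonzero wherever p is nonzero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy probabilities of three possible continuations (illustration only).
given_ambiguous = [0.5, 0.3, 0.2]       # context: the ambiguous sentence
given_disambiguation = [0.7, 0.2, 0.1]  # context: a correct disambiguation
given_distractor = [0.1, 0.2, 0.7]      # context: sentence with a swapped noun

kl_disambiguation = kl_divergence(given_disambiguation, given_ambiguous)
kl_distractor = kl_divergence(given_distractor, given_ambiguous)

# The model "passes" if the true reading surprises it less than the distractor.
print(kl_disambiguation < kl_distractor)
```

A model that keeps every reading in play should find continuations of a correct disambiguation unsurprising, while a distractor changes the meaning and should produce a larger divergence.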

The experimental results show that FLAN-T5 has the highest accuracy, but results are inconsistent across test suites (LS involves synonym replacement, PC involves correction of spelling errors, and SSD involves correction of grammatical structures) and across models, indicating that ambiguity remains a serious challenge for models.

Multi-label NLI model experiment

As shown in Table 4, NLI models fine-tuned on existing data with label variation still leave much room for improvement, especially on the multi-label NLI task.

▲Table 4 Performance of multi-label NLI model on AMBIENT
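
In the multi-label setting, a model predicts a set of labels rather than a single one, and an example counts as ambiguous when the set contains more than one label. One common way to get such a set, shown below, is to threshold a per-label probability. The threshold of 0.5, the fallback to the most likely label, and the exact-match score are assumptions of this sketch and may differ from the models evaluated in the paper.

```python
LABELS = ("entailment", "neutral", "contradiction")


def predict_label_set(probs, threshold=0.5):
    """Turn per-label probabilities into a set of predicted labels."""
    chosen = {label for label, p in zip(LABELS, probs) if p >= threshold}
    if not chosen:
        # Fall back to the single most likely label so the set is never empty.
        best = max(range(len(LABELS)), key=lambda i: probs[i])
        chosen = {LABELS[best]}
    return chosen


def exact_match(predictions, gold):
    """Share of examples whose predicted label set equals the gold set."""
    hits = sum(p == g for p, g in zip(predictions, gold))
    return hits / len(gold)


# Hypothetical model outputs and gold label sets (illustration only).
outputs = [[0.8, 0.6, 0.1], [0.2, 0.4, 0.1]]
gold = [{"entailment", "neutral"}, {"contradiction"}]

predictions = [predict_label_set(p) for p in outputs]
print(predictions, exact_match(predictions, gold))
```

Exact match is a strict score: a prediction that finds only one of two gold labels gets no credit, which is one reason multi-label results can look low.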

Detecting misleading political speech

This experiment shows that a model sensitive to different readings can be put to practical use, here by detecting political statements that can be understood in more than one way. The results are shown in Table 5. For ambiguous sentences, some paraphrases are naturally disambiguating, because a paraphrase must either preserve the ambiguity or clearly express one specific meaning.

▲Table 5 Political statements flagged as ambiguous by the paper's detection method

In addition, the paraphrases behind a prediction can reveal the source of the ambiguity. By further analyzing the false positives, the authors also found many ambiguities that were not mentioned in fact checks, which illustrates the potential of such tools for preventing misunderstandings.

Summary

As the paper points out, the ambiguity of natural language will be a key challenge in model optimization. We expect that future natural language understanding models will identify context and key points in text more accurately and be more sensitive when processing ambiguous text. Although the paper establishes a benchmark for evaluating how well natural language processing models recognize ambiguity, and so helps us better understand their limitations in this area, the task remains very challenging.

Xi Xiaoyao Technology Talk Original

Author | IQ dropped all over the place, Python

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.