Can’t the NLP model read human language? Microsoft AdaTest makes fault-finding five times more efficient
Natural language processing (NLP) models often misread human language, sometimes interpreting a text as the opposite of its intended meaning; this is a chronic problem in the industry. Now Microsoft says it has developed a solution.
Progress in large-scale platform models, which serve as the foundation for a wide range of applications, has greatly improved AI's natural language ability. But natural language processing (NLP) models are still far from perfect, and their flaws are sometimes exposed in embarrassing ways.
For example, a top commercial model translated the Portuguese sentence "I do not recommend this dish" into English as "I highly recommend this dish."
These failures persist in part because finding and fixing bugs in NLP models is so difficult that serious bugs affect nearly all major open-source and commercial NLP models. There are currently two approaches to finding and fixing NLP model errors: user-driven and automatic.
The user-driven approach is flexible and can test any aspect of an NLP model's behavior. But it relies on humans' highly variable imagination and ability to spot errors, and it is so labor-intensive that in practice only a small amount of input data can be tested.
Automatic methods, on the other hand, are fast and can therefore cover a large portion of the input data. Lacking human guidance, however, they can only check for errors in very limited circumstances, such as whether a model's predictions become inconsistent when the wording of its input changes slightly.
Microsoft researchers believe that modern large language models (LLMs) such as GPT-3 give the industry an opportunity to combine the advantages of user-driven and automatic methods: the user defines what the model under test should do, while the generative capabilities of modern large language models produce large-scale tests within specific categories of model behavior.
Microsoft researchers call this human-machine approach "adaptive testing and debugging," abbreviated AdaTest. With AdaTest, a large language model shoulders the heavy burden of generating a large number of tests aimed at errors in the model under test.
Human intervention guides the language model's generation by selecting valid tests and organizing them into semantically related topics. This human guidance greatly improves the language model's generation quality and steers it toward the target domain.
Because these tests are actually a form of labeled data, they can not only identify errors in NLP models, but can also be used to fix NLP model errors in an iterative debugging cycle similar to traditional software development.
AdaTest offers significant efficiency gains for professional users while remaining simple enough for people without a programming background to use effectively. This means both professional and ordinary users can better understand and control the behavior of NLP models across a range of scenarios, which not only makes AI systems perform better but also makes them more responsive to user needs.
AdaTest consists of an inner test loop and an outer debugging loop: the former finds errors, the latter fixes them.
Take text sentiment analysis as an example. Although this task seems simple, even SOTA models on the market often make mistakes: some will classify the double-negative sentence "I don't think I have had a better time in my life" as negative, or classify the neutral statement "I am a minority" as negative.
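As a rough illustration of how such failures can be caught, the following minimal sketch runs a few tricky sentences through an off-the-shelf sentiment classifier from the Hugging Face hub. The model name is an arbitrary public example (an assumption), not one of the commercial models discussed in the article, and the "acceptable" labels simply reflect human judgment.

```python
# Minimal sketch: check a sentiment classifier on sentences that trip up models.
# The model name below is an arbitrary public example (assumption), not the
# commercial models discussed in the article.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

cases = [
    # (text, labels a human would accept)
    ("I don't think I have had a better time in my life", {"POSITIVE"}),
    ("I am a minority", {"POSITIVE", "NEUTRAL"}),  # anything but NEGATIVE
]

for text, acceptable in cases:
    pred = classifier(text)[0]
    flag = "" if pred["label"] in acceptable else "  <-- possible bug"
    print(f"{text!r} -> {pred['label']} ({pred['score']:.2f}){flag}")
```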
Both of these mistakes have actually occurred in commercial models on the market. To show that AdaTest can find and fix bugs, Microsoft's research team demonstrated how to test for and repair text fairness failures in NLP models.
A text fairness error occurs when a neutral description of a particular identity group in a piece of text causes the model's sentiment analysis to go wrong and unduly lowers the text's sentiment; in other words, the model treats descriptions of certain groups more negatively.
In the test loop, Microsoft researchers started with a set of text unit tests for various identities and marked this set of tests as "sensitive." These initial examples did not reveal any errors in the model.
The AdaTest method, however, uses GPT-3 to generate a large number of similar suggested tests designed to surface the hidden bugs in the model under test.
Although hundreds of tests are generated, the human in the loop only needs to review the first few that fail or come close to failing. The reviewer then discards results that are not actually errors, adds the valid ones to the current topic, and occasionally organizes them into subtopics. These manually filtered tests are included in the language model prompt for the next round, pushing the next batch of suggestions toward the intersection of user concerns and model errors.
Repeating this inner test loop starts from tests that reveal no errors and gradually surfaces more and more glaring bugs. So even if users cannot find faults in a model on their own, they can start from a small set of passing tests and quickly iterate with the language model to produce a large batch of tests that expose errors in the model under test.
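The loop just described might be sketched roughly as follows. This is not AdaTest's actual API: `generate_suggestions`, `model_under_test`, and `human_review` are hypothetical placeholders for, respectively, an LLM prompted with the topic's current tests, the model being tested, and the manual review step.

```python
# A simplified sketch of the inner test loop. All callables passed in are
# hypothetical placeholders, not part of a real AdaTest API.
def inner_test_loop(seed_tests, model_under_test, generate_suggestions,
                    human_review, rounds=5):
    """seed_tests: list of (text, expected_label) pairs for one topic."""
    topic_tests = list(seed_tests)
    confirmed_failures = []
    for _ in range(rounds):
        # 1. The LLM is prompted with the topic's current tests and proposes
        #    many similar new ones.
        candidates = generate_suggestions(topic_tests)
        # 2. Run the model under test and keep only the suggestions it gets wrong.
        predictions = [(text, expected, model_under_test(text))
                       for text, expected in candidates]
        failing = [p for p in predictions if p[2] != p[1]]
        # 3. A person reviews only the top few failures, discards the ones that
        #    are not real errors, and the rest join the topic -- steering the
        #    next round's prompt toward genuine bugs.
        accepted = human_review(failing)
        topic_tests.extend((text, expected) for text, expected, _ in accepted)
        confirmed_failures.extend(accepted)
    return confirmed_failures
```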
Inner test loop example
If the tester focuses on a different topic, such as the handling of negation and double negation, different faults are found.
For example, a simple statement like "I have never been happier than I am now" is correctly classified as positive by the commercial model, but with AdaTest one quickly finds that a more complex statement like "I don't think I have ever seen a better city" is incorrectly marked as negative.
Once seen, these errors are obvious and egregious, yet they are hard for humans to find directly because they occur only with very specific wordings. Microsoft's research team conducted a user study to quantitatively evaluate whether AdaTest helps professional and non-professional users write better tests and find more errors in NLP models. The researchers asked professional users to test topic-specific behaviors in two models: a commercial text sentiment classifier and GPT-2 used for next-word autocompletion.
The latter is used in applications such as predicting the next word of an email as it is typed. For each topic and model, participants were randomly assigned to use either CheckList (the state of the art in user-driven testing) or AdaTest. The researchers observed a fivefold improvement with AdaTest across models and professional participants.
Non-professional users were asked to test an NLP model's moderation of toxic content: participants had to find non-toxic content, i.e., content they personally considered appropriate, that the model judged toxic. Participants could use either an improved version of the Dynabench crowdsourcing interface or AdaTest; AdaTest delivered up to a tenfold improvement.
Test results from participants with different viewpoints
Once enough errors have been found, the model's tester runs the outer debugging loop (shown below) to fix the errors found in the test loop and then retests the model. The "retest" part of the debugging loop (i.e., running the test loop again) is crucial, because once tests are used to fix the model they are no longer test data but training data. Bug fixing often overcompensates, introducing shortcuts or new bugs in the first few rounds of the debugging cycle that can only be discovered with a set of tests adapted to the new "fixed" model.
Test-loop process on an open-source RoBERTa-Large sentiment model. The researchers began with the tests on the "/sensitive/immigration" topic in Figure 2, which the RoBERTa model incorrectly labels as negative. Fine-tuning the model on these tests (mixed with the original training data to maintain task performance) produces a new model that no longer fails them. When the test loop is re-run, however, it turns out that almost all immigration statements are now marked "neutral", even ones that are genuinely negative given the application and test scenario.
Fine-tuning again with these new tests yields a model that correctly fixes the original errors without adding the "every immigration statement is neutral" shortcut. Of course, this does not guarantee that no other shortcut remains in the model, but in the researchers' experience, after several debugging cycles the number of unexpected errors introduced while fixing the original ones drops sharply.
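The overall fix-and-retest cycle described above could be summarized with a sketch like the one below. `run_inner_test_loop` and `fine_tune` are hypothetical helpers (the inner loop sketched earlier and a routine that fine-tunes on the failing tests mixed with the original training data); this is an outline of the workflow, not Microsoft's implementation.

```python
# High-level sketch of the outer debugging loop. Both helper callables are
# hypothetical placeholders standing in for the steps described in the text.
def debug_loop(model, original_train_data, topics,
               run_inner_test_loop, fine_tune, max_rounds=5):
    for _ in range(max_rounds):
        # Find errors: run the inner test loop on every topic of interest.
        failing_tests = [test
                         for topic in topics
                         for test in run_inner_test_loop(model, topic)]
        if not failing_tests:
            break  # no known bugs remain
        # Fix errors: the failing tests are now *training* data, mixed with the
        # original training set so overall task performance is preserved.
        model = fine_tune(model, original_train_data + failing_tests)
        # The next iteration re-runs the test loop against the "fixed" model,
        # which is what surfaces any shortcuts the fix introduced.
    return model
```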
Testers do not need to identify every possible error in detail in advance; in the next round of testing and debugging, AdaTest adaptively surfaces and fixes the errors that were introduced.
In this way, the debugging loop keeps pushing the boundary of the current testing specification until a satisfactory model is produced. In effect, AdaTest applies the test-fix-retest cycle of software engineering to NLP.
Shortcuts added during one iteration of the debugging loop are discovered and fixed by later iterations
To evaluate the effectiveness of the debugging loop, the researchers fine-tuned RoBERTa-Large on the Quora Question Pairs (QQP) dataset to detect whether two questions are duplicates, and also fine-tuned it on the Stanford Sentiment Treebank (SST) dataset for positive/neutral/negative sentiment analysis.
The baseline model failed on 22 of the 53 QQP topics and 11 of the 39 sentiment topics. For each failing topic, the researchers then created repair data: they extracted 50 examples from the topic's data and ran the AdaTest debugging loop, which produced an average of 41.6 tests per topic on QQP and 55.8 on the sentiment dataset.
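For concreteness, a fine-tuning step of this kind might look roughly like the sketch below, using the Hugging Face Trainer. The tiny in-line datasets, hyperparameters, and mixing strategy are illustrative assumptions rather than the researchers' exact configuration.

```python
# Illustrative sketch only: fine-tune RoBERTa-Large as a 3-way sentiment
# classifier with repair examples concatenated onto the original training data.
# Datasets and hyperparameters here are placeholder assumptions.
from datasets import Dataset, concatenate_datasets
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=3)

# Stand-ins: `original_train` for the SST-style training split, `repairs` for
# the failing tests (now labeled training data) produced by the debug loop.
original_train = Dataset.from_dict({"text": ["A wonderful film."], "label": [2]})
repairs = Dataset.from_dict({"text": ["I don't think I've ever seen a better city."],
                             "label": [2]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train = concatenate_datasets([original_train, repairs]).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adatest-repair",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
)
trainer.train()
```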
The results show that in the vast majority of cases AdaTest repairs the topics used for training, as well as some unseen held-out topics, without breaking any others, whereas the original CheckList data often introduces new errors and thereby breaks other test topics. The researchers also evaluated AdaTest's effectiveness in a standard development setting: after three months of development, CheckList testing, and ad hoc GPT-3-based data augmentation, a model reached an F1 score of 0.66 (out of 1.00) on unseen data collected in the wild.
The same team, running the AdaTest debugging loop themselves for four hours, achieved an F1 score of 0.77 on the same unseen data. These scores were later replicated on a second unseen dataset, showing that AdaTest can fix bugs and achieve better results where traditional methods fall short.
People supply the problem specification that language models lack, while language models provide high-quality tests at far greater scale and scope; connecting model testing and debugging in this way fixes errors effectively and moves model development a step closer to the iterative nature of traditional software development.
Cooperation between humans and AI is a promising direction for machine learning, and this collaboration is expected to keep improving as the capabilities of large language models continue to grow.