Blowing past GPT-3 and Google PaLM! Retrieval-augmented model Atlas sets a new SOTA on knowledge-intensive few-shot tasks
Almost without anyone noticing, large models have become the mainstream approach to few-shot learning. On many tasks, the standard recipe is to label a small set of examples and then adapt a large pre-trained model to them. Large models have achieved impressive results across a wide range of few-shot tasks, but that success has also placed some of their inherent shortcomings under the spotlight of few-shot learning.
Few-shot learning expects a model to reason independently from a handful of examples. In other words, the ideal model masters problem-solving approaches by solving problems, so that it can generalize to new ones as they appear. In practice, however, the few-shot ability of large models seems to rest on the vast amount of information memorized during training: the model recalls how a problem was solved rather than working it out. It may look formidable on benchmark after benchmark, but it leaves one wondering: is a student who studies this way really a promising student?
The Meta AI paper introduced today takes a different route and applies retrieval augmentation to few-shot learning. With only 64 examples, the resulting model reaches 42% accuracy on the Natural Questions dataset. Compared with the large model PaLM, it cuts the parameter count by a factor of roughly 50 (540B -> 11B), while offering clear advantages in interpretability, controllability, and updateability that other large models lack.
Paper title: Few-shot Learning with Retrieval Augmented Language Models
Paper link: https://arxiv.org/pdf/2208.03299.pdf
The paper opens by posing a question to the field: in few-shot learning, is an enormous parameter count really necessary for storing information? Looking at how large models have developed, one reason each successive model keeps pushing the SOTA is precisely that its huge parameter store holds the information the task requires. Since the birth of the Transformer, large models have been the mainstream paradigm in NLP, and as they have grown, the problems of "big" have been steadily exposed, which makes questioning the necessity of "big" well worth doing. Starting from this question, the authors arrive at a negative answer, and their instrument is the retrieval-augmented model.
The roots of retrieval augmentation. Although the technique is today used mainly in tasks such as open-domain question answering, machine reading, and text generation, the idea of retrieval augmentation can be traced back to NLP's RNN era. The inability of RNN models to capture long-range dependencies in data pushed researchers to explore remedies widely. The Transformer we know so well used the attention mechanism to effectively solve the model's inability to remember, opening the era of large pre-trained models.
At the time there was another route: the Cached LM. Its core idea was that if the RNN forgets everything the moment it enters the exam room, then simply let it take an open-book exam: a cache mechanism stores the words predicted during training, and at prediction time the information from the query and the cache index is combined to complete the task, working around the RNN's weakness.
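To make the cache idea concrete, here is a minimal NumPy sketch in the spirit of the continuous-cache language model (Grave et al.); the shapes and the `lam` and `theta` values are illustrative assumptions, not the exact mechanism of any particular system:

```python
import numpy as np

def cache_lm_probs(p_vocab, h_t, cache_states, cache_words,
                   vocab_size, lam=0.2, theta=0.3):
    """Blend the base LM distribution with a cache distribution.

    p_vocab      : base model's next-word distribution, shape (vocab_size,)
    h_t          : current hidden state, shape (d,)
    cache_states : hidden states stored at previous steps, shape (n, d)
    cache_words  : word id that followed each cached state, shape (n,)
    lam, theta   : interpolation weight and similarity temperature
    """
    # Similarity between the current state and every cached state.
    scores = np.exp(theta * cache_states @ h_t)      # shape (n,)
    p_cache = np.zeros(vocab_size)
    np.add.at(p_cache, cache_words, scores)          # sum scores per word
    p_cache /= p_cache.sum() + 1e-12
    # Linear interpolation of the two distributions.
    return (1 - lam) * p_vocab + lam * p_cache

vocab_size, d = 8, 4
rng = np.random.default_rng(0)
p = cache_lm_probs(np.full(vocab_size, 1 / vocab_size),
                   rng.normal(size=d),
                   rng.normal(size=(5, d)),
                   np.array([1, 2, 2, 3, 1]),
                   vocab_size)
print(p.sum())  # ~1.0
```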
Retrieval augmentation thus set off down a path completely different from large models that rely on parameters to memorize information. A retrieval-augmented model can bring in external knowledge from different sources, including the training corpus, external data, unsupervised data, and more. Such models generally consist of a retriever and a generator: the retriever fetches relevant knowledge from the external sources given the query, and the generator combines the query with the retrieved knowledge to make the prediction, as the sketch below illustrates.
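A minimal sketch of this retriever/generator split, assuming a hash-based stand-in for the embedding model and a stub generator (a real system would use a dense encoder and a seq2seq language model):

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash tokens into a normalized bag-of-words vector."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-12)

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Retriever: score every document by dot product and keep the top-k."""
    q = embed(query)
    scores = [float(embed(d) @ q) for d in corpus]
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

def generate(query: str, docs: list[str]) -> str:
    """Generator stub: a real model conditions on query + retrieved docs."""
    return f"answer conditioned on {query!r} and {len(docs)} retrieved docs"

corpus = ["Paris is the capital of France.",
          "The Transformer uses attention.",
          "Atlas couples a retriever with a language model."]
docs = retrieve("What is the capital of France?", corpus, k=2)
print(generate("What is the capital of France?", docs))
```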
Ultimately, the goal of a retrieval-augmented model is that the model should not only learn to remember data but also learn to find data on its own. This property is a great advantage in many knowledge-intensive tasks, and retrieval-augmented models have indeed had great success in those areas; whether retrieval augmentation suits few-shot learning, however, was unknown. Returning to this Meta AI paper: it puts retrieval augmentation to the test in few-shot learning, and Atlas is the result.
Atlas has two sub-models: a retriever and a language model. Given a task, Atlas uses the retriever to select the top-k most relevant documents from a large corpus based on the input question, then feeds those documents together with the question into the language model to generate the required output.
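Concretely, Atlas reads the retrieved documents in the Fusion-in-Decoder style: each of the top-k documents is paired with the question, each pair is encoded independently, and the decoder attends over all the encodings at once. A sketch of the input-building step (the exact field template here is an illustrative assumption):

```python
def build_fid_inputs(query: str, docs: list[dict], k: int) -> list[str]:
    """Pair the query with each retrieved passage. A Fusion-in-Decoder
    style model encodes each pair independently, then the decoder
    attends over the concatenated encodings."""
    return [f"question: {query} title: {d['title']} context: {d['text']}"
            for d in docs[:k]]

passages = [{"title": "France", "text": "Paris is the capital of France."},
            {"title": "Attention", "text": "The Transformer uses attention."}]
for s in build_fid_inputs("What is the capital of France?", passages, k=2):
    print(s)
```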
Atlas' basic training strategy is to jointly train the retriever and the language model with the same loss function. Both are built on pre-trained Transformer networks: the retriever is based on Contriever, a dense retriever pre-trained with contrastive learning, and the language model is based on T5, reading the retrieved passages in the Fusion-in-Decoder style.
It is worth noting that the authors compared four loss functions, as well as the setting in which the retriever and the language model are not jointly trained.
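For reference, the four retriever objectives compared in the paper are Attention Distillation (ADist), EMDR², Perplexity Distillation (PDist), and leave-one-out perplexity distillation (LOOP). As a representative example, PDist trains the retriever to match the posterior over the top-K documents implied by the language model; schematically (a sketch, with temperature and normalization details as in the paper):

```latex
% Schematic form of perplexity distillation (PDist).
% s(d_k, q): retriever score for document d_k given query q;
% p_LM(a | d_k, q): language-model likelihood of answer a.
\mathcal{L}_{\mathrm{PDist}}
  = \mathrm{KL}\!\left(
      p_{\mathrm{post}}(d_k \mid q, a)
      \,\middle\|\,
      p_{\mathrm{retr}}(d_k \mid q)
    \right),
\qquad
p_{\mathrm{post}}(d_k \mid q, a)
  = \frac{p_{\mathrm{LM}}(a \mid d_k, q)}
         {\sum_{k'=1}^{K} p_{\mathrm{LM}}(a \mid d_{k'}, q)},
\qquad
p_{\mathrm{retr}}(d_k \mid q)
  = \frac{\exp\!\big(s(d_k, q)/\theta\big)}
         {\sum_{k'=1}^{K} \exp\!\big(s(d_{k'}, q)/\theta\big)}.
```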
In the few-shot setting, the accuracy obtained with joint training is markedly higher than without it, and the authors conclude that this joint training of retriever and language model is the key to Atlas' few-shot learning ability.
On the massive multi-task language understanding benchmark (MMLU), Atlas, with only 11B parameters, achieves better accuracy than GPT-3, which has 15 times as many parameters; after introducing multi-task training, its 5-shot accuracy even approaches that of Gopher, which has 25 times as many parameters.
On the two open-domain question answering datasets, NaturalQuestions and TriviaQA, Atlas and other models were compared both with 64 examples and on the full training set, as shown in the figure below. Atlas set a new 64-shot SOTA, reaching 84.7% accuracy on TriviaQA with only 64 examples.
On the fact-checking task (FEVER), Atlas again performed significantly better in the few-shot setting than Gopher and ProoFVer, models with dozens of times as many parameters, beating Gopher by 5.1% on the 15-shot task.
On KILT, Meta's own benchmark of knowledge-intensive natural language processing tasks, Atlas trained with only 64 examples comes close, on some tasks, to the accuracy other models obtain with the full training data; trained on the full data, Atlas set a new SOTA on five of the datasets.
This paper shows that the retrieval-augmented model is not just smaller and better; it also has clear interpretability advantages that other large models lack. The black-box nature of large models makes it hard for researchers to analyze how a model operates, but a retrieval-augmented model exposes the documents it retrieved, so analyzing what the retriever fetched gives a better picture of how Atlas works. For example, the paper found that in abstract algebra the model relied on Wikipedia for 73% of its retrieved documents, while in ethics-related fields only 3% of the retrieved documents came from Wikipedia, which matches human intuition. As the chart on the left of the figure below shows, although the model prefers CCNet data overall, in STEM fields that lean more heavily on formulas and reasoning, the share of Wikipedia articles rises markedly.
From the chart on the right of the same figure, the authors found that accuracy rises as the retrieved articles contain the correct answer more often: when the retrieved articles do not contain the answer at all, accuracy is only 55%; when the answer is mentioned more than 15 times, accuracy reaches 77%. In addition, manually inspecting 50 retrieved documents showed that 44% of them contained useful background information, material that can also give researchers valuable pointers for further reading.
Large models are generally thought to carry a risk of training-data "leakage": sometimes a large model answers a test question not through learned ability but through memory, because the answer appeared somewhere in the vast corpus it was trained on. In this paper, after the authors manually removed the corpus passages that might have leaked answers, the model's accuracy dropped from 56.4% to 55.8%, a fall of only 0.6%. Retrieval augmentation, then, is an effective hedge against this kind of model "cheating".
Finally, updateability is a further advantage unique to retrieval-augmented models. A retrieval-augmented model can be kept current without retraining, simply by updating or replacing the corpus it draws on. By constructing a time-stamped dataset, as shown in the figure below, the authors reached 53.1% accuracy with Atlas just by switching to the 2020 corpus, without updating any Atlas parameters. Interestingly, even T5 fine-tuned on 2020 data did not perform well here; the authors attribute this largely to the fact that T5's pre-training data predates 2020. The sketch below shows why such an update is cheap.
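A minimal sketch of why updating is cheap: the knowledge lives in a rebuildable index, not in the model weights. Here `toy_encode` stands in for the frozen retriever encoder, and the corpora are placeholder documents:

```python
import numpy as np

def toy_encode(text: str, dim: int = 32) -> np.ndarray:
    """Stand-in for the frozen retriever encoder."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-12)

class RetrievalIndex:
    """Document index rebuilt from a corpus; no model weights involved."""
    def __init__(self, corpus, encode):
        self.corpus = corpus
        self.vectors = np.stack([encode(d) for d in corpus])

    def search(self, query, encode, k=2):
        scores = self.vectors @ encode(query)
        return [self.corpus[i] for i in np.argsort(scores)[::-1][:k]]

corpus_2017 = ["In 2017 the title holder was X."]   # placeholder documents
corpus_2020 = ["In 2020 the title holder was Y."]
index = RetrievalIndex(corpus_2017, toy_encode)
index = RetrievalIndex(corpus_2020, toy_encode)     # "update": rebuild only the index
print(index.search("who was the title holder", toy_encode, k=1))
```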
Imagine three students. The first solves problems purely by rote memorization: he can recite the answer to a math problem flawlessly. The second relies on looking things up: faced with a problem, he first searches the reference material for the most relevant passages and then answers point by point. The last is gifted: with only a little textbook knowledge, he can walk into the exam room full of confidence.
Clearly, the ideal of few-shot learning is to become the third student, yet the reality is likely to remain stuck at the first. Large models are convenient, but "big" is by no means the ultimate goal of a model. If we return to the original aim of few-shot learning, giving models something like human reasoning and the ability to generalize from one case to another, then this paper takes a welcome step forward from a different angle: at the very least it lets the student stop cramming so much potentially redundant knowledge into his head and travel light with a textbook in hand. Perhaps an open-book exam, with the textbook ready for constant consultation, is closer to intelligence than rote memorization!