Breaking the Impossible Triangle: the IDEA Fengshenbang team achieves zero-shot SOTA with a 200-million-parameter model, rivaling a 540-billion-parameter model
Since the advent of GPT-3, which demonstrated the power of models with hundreds of billions of parameters, NLP tasks have faced an impossible triangle of model scale, few/zero-shot performance, and fine-tuning performance. How can a language model with fewer than 1 billion parameters achieve SOTA few-shot (or even zero-shot) and fine-tuning performance? Must we rely on hundreds of billions of parameters and put up with unstable prompts to solve the zero-shot scenario? In this article, the IDEA Research Institute Fengshenbang team introduces UniMC, a new model "phenotype" that achieves zero-shot SOTA with only 200 million parameters. The work has been accepted by EMNLP 2022.
An article published this year [1] pointed out that, since pre-training was introduced, the NLP world has faced an impossible triangle (Figure 1 below): a model cannot simultaneously have a moderate size, SOTA few/zero-shot performance, and SOTA fine-tuning performance.
Figure 1
The impossible triangle exists because current pre-trained models show strong few/zero-shot performance only when their parameter count reaches a certain order of magnitude and prompt learning is used.
Our Fengshenbang team's paper, "Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective", recently accepted to EMNLP 2022, breaks this "curse" and provides a flexible and efficient solution. The UniMC model proposed in the paper has a very small number of parameters (only a few hundred million) and SOTA fine-tuning capability, while also delivering SOTA few/zero-shot performance comparable to the 540-billion-parameter PaLM.
The introduction of BERT in 2018 marked the NLP field's entry into the pre-training era and a major step forward. Existing pre-trained masked language models (PMLMs) such as DeBERTa can already achieve fine-tuning SOTA with fewer than 1 billion parameters, but they are weak on NLU tasks in zero-shot scenarios.
The reason is that using a PMLM for a specific task requires adding an MLP layer on top, as shown in Figure 2(c). This MLP layer introduces additional parameters, which can only be randomly initialized in zero-shot scenarios, so the model cannot produce reasonable output. In fine-tuning scenarios, the added MLP layer also prevents transfer between different tasks (for example, between 2-class and 3-class classification tasks).
For zero-shot scenarios, the mainstream approach in recent years has been to take pre-trained language models (PLMs) with tens or even hundreds of billions of parameters and uniformly convert NLU tasks into text generation tasks, so that large models can be applied to zero-shot tasks through manually constructed prompts or manually designed verbalizers, as shown in Figure 2(a). Furthermore, the FLAN paper uses a large number of manually constructed templates to unify different tasks, so that knowledge from other tasks can be transferred to a specific task, as shown in Figure 2(b). However, such generative models have the following shortcomings:
We propose the UniMC method shown in Figure 2(d), which avoids the above problems and achieves SOTA, or performance comparable to state-of-the-art models, on several Chinese and English tasks.
Figure 2
Model Ideas
Most NLU tasks are label-based, yet a generative model has to generate the label text, which undoubtedly increases both the difficulty of the task and the learning cost of the model. For many label-based tasks, it is usually enough to take the input text and output the probability that it belongs to each label. Based on this idea, we transform NLU tasks into multiple-choice (MC) tasks: given a text, a question, and options, the model outputs the probability of each option without generating the options.
Based on this, we propose a new concept: the phenotype of a model. Existing phenotypes typically add some layer on top of the model, such as a classification layer. Alternatively, the phenotype of generative models such as GPT is to mine the model's knowledge through prompts. The UniMC approach we propose introduces no additional layers into the PMLM and thus explores another phenotype of PMLMs.
In this paper, we choose ALBERT as our backbone PMLM network.
Unified multiple-choice format
As shown in Figure 3, we hope to convert all label-based NLU tasks into a unified MC (Multiple-Choice) format. Our philosophy is to add as little human information as possible.
Figure 3
Specifically, we took the following two steps:
Advantage: only one option prompt needs to be designed, along with at most one question prompt.
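As a rough illustration of the unified format, the sketch below turns a labeled classification sample into a multiple-choice sample made of options, a question, and a passage. The names `MCSample` and `to_multiple_choice`, the field names, and the example prompt strings are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MCSample:
    """A label-based NLU sample expressed in multiple-choice form (hypothetical type)."""
    options: List[str]
    question: str
    passage: str
    answer_index: Optional[int]  # None when the gold label is unknown (zero-shot)


def to_multiple_choice(text, labels, option_prompt="{}", question="", gold_label=None):
    # One option prompt is shared by every label; the question prompt may be empty.
    options = [option_prompt.format(label) for label in labels]
    answer_index = labels.index(gold_label) if gold_label is not None else None
    return MCSample(options, question, text, answer_index)


sample = to_multiple_choice(
    text="The movie was a delight from start to finish.",
    labels=["positive", "negative"],
    option_prompt="It is {}",
    question="What is the sentiment of this review?",
    gold_label="positive",
)
print(sample.options)  # ['It is positive', 'It is negative']
```

Because the labels become options rather than output classes, the same function can serve tasks with any number of labels.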
Model structure
The structure of UniMC is shown in Figure 4 below. It uses a BERT-like autoencoding structure. In the main pipeline, we first unify the inputs of different tasks and restrict the flow of information within the input. After the PMLM, we use O-MLM, OP, and MLM for MC training, and finally use O-MLM and OP for zero-shot prediction. Next, we break down our solution step by step.
Figure 4
Input
The input format corresponds to the red solid-line box in Figure 5. Before being fed to UniMC, each sample must be converted into UniMC's own token format. To improve computational efficiency, we directly concatenate all options with the question and the passage, that is, [Options, Question, Passage]. We also insert a special token, [O-MASK], in front of each option to indicate yes or no (whether or not this option is selected). (Note that, to improve reusability, we reuse the [MASK] token.)
The control of information flow corresponds to the green dotted-line box in Figure 5. The input draws on several sources of information: option information, question information, and passage information. These can interfere with one another, so we want to isolate them. For example, if the model can see the other options while encoding one option, the question becomes easier and the model becomes lazy.
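The splicing step might look like the following sketch, which places an option-mask token before every option and then appends the question and the passage. Whitespace tokenization, the helper name `build_input`, and the returned list of mask positions are simplifications assumed here; a real implementation would use the backbone's tokenizer and its own special tokens.

```python
O_MASK = "[MASK]"  # the article states that [O-MASK] reuses the [MASK] token


def build_input(options, question, passage):
    """Concatenate [Options, Question, Passage], with an O-MASK before each option."""
    tokens = []
    o_mask_positions = []  # where the yes/no prediction for each option is read
    for option in options:
        o_mask_positions.append(len(tokens))
        tokens.append(O_MASK)
        tokens.extend(option.split())  # naive whitespace tokenization (assumption)
    tokens.extend(question.split())
    tokens.extend(passage.split())
    return tokens, o_mask_positions


tokens, positions = build_input(["good", "bad"], "Sentiment ?", "I loved it")
print(tokens)     # ['[MASK]', 'good', '[MASK]', 'bad', 'Sentiment', '?', 'I', 'loved', 'it']
print(positions)  # [0, 2]
```

Recording the mask positions up front keeps the later option-selection step independent of how many tokens each option contains.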
So we made the following considerations:
Figure 5
How does the model do multiple choice questions? (O-MLM and OP)
As shown in Figure 6, we use the O-MLM and OP tasks to let the model "select" the answer. O-MASK is fully inherited from the MASK token: to avoid adding parameters and to make full use of the knowledge learned during unsupervised pre-training, we reuse the parameters of the MaskLM head. The only difference is that O-MASK is masked 100% of the time. The goal of the O-MLM task is to decode each O-MASK into 'yes' or 'no', which predicts whether the option is selected.
The role of the OP task is to predict the answer from the 'yes' outputs of the options. Specifically, we take the 'yes' logit output at each [O-MASK], apply softmax over them to obtain the probability of each option, and choose the option with the highest probability as the predicted answer.
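The option-prediction step can be sketched in a few lines of plain Python. The sketch assumes the 'yes' logits at the [O-MASK] positions have already been extracted from the MaskLM head; the function name `predict_option` and the example logits are hypothetical.

```python
import math


def predict_option(yes_logits):
    """Softmax over the per-option 'yes' logits, then pick the most likely option."""
    peak = max(yes_logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in yes_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# One 'yes' logit per option, in option order (made-up values).
best, probs = predict_option([2.0, 0.5, -1.0])
print(best)  # 0
```

Since the softmax runs only over the options of one sample, the probabilities always sum to 1 regardless of how many options the task has.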
Figure 6
Processing multiple MC tasks in one Batch
As shown in Figure 7, we want to process multiple MC tasks in one batch. Mixing multiple MC datasets in a batch can strengthen the model and make it more unified. When building batches, we ran into a problem: what if samples in the same batch have different numbers of options?
So we designed a logit mask that is applied just before the output. By assigning a prediction value of negative infinity to irrelevant tokens and adding it to the logits, we eliminate the influence of other tokens on the O-MASK positions when computing softmax. This also lets multiple-choice questions with different numbers of options be processed uniformly in one batch.
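A minimal sketch of the masking idea, under a simplifying assumption: each sample contributes a row of 'yes' logits, rows are padded to the largest option count in the batch, and the padded slots play the role of the "irrelevant" positions. The helper names `masked_softmax` and `batch_option_probs` are hypothetical, and the actual implementation may mask at the token level instead.

```python
import math

NEG_INF = float("-inf")


def masked_softmax(logits, mask):
    """Add an additive mask (0 to keep, -inf to drop) to the logits, then softmax."""
    masked = [l + m for l, m in zip(logits, mask)]
    peak = max(masked)  # finite as long as at least one slot is kept
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def batch_option_probs(batch_yes_logits):
    """Pad every row to the widest option count and mask out the padded slots."""
    width = max(len(row) for row in batch_yes_logits)
    result = []
    for row in batch_yes_logits:
        pad = width - len(row)
        padded = row + [0.0] * pad
        mask = [0.0] * len(row) + [NEG_INF] * pad
        result.append(masked_softmax(padded, mask))
    return result


# A 2-option sample and a 4-option sample in the same batch (made-up logits).
probs = batch_option_probs([[1.0, 1.0], [0.0, 1.0, 2.0, 3.0]])
print(probs[0])  # [0.5, 0.5, 0.0, 0.0]
```

Adding negative infinity before the softmax gives the masked slots exactly zero probability, so the padded positions cannot affect the real options.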
Figure 7
Model training and prediction
MC Training
Unlike FLAN's instruction tuning, we train only on MC datasets. The main purpose is to let the model learn how to answer multiple-choice questions, and MC datasets have a certain generality; for example, different datasets may have different numbers of labels.
Figure 8
Zero-shot Inference
Interestingly, the two tasks remain consistent across the training and zero-shot inference stages, because we use the same two tasks, O-MLM and OP, to let the model answer multiple-choice questions. And since we abandon the classification layer, all parameters can be reused, which activates the zero-shot capability of the PMLM.
Figure 9
UniMC Performance
English scenario
Figure 10
On classification tasks, UniMC outperforms networks that use GPT-2 and GPT-3 as their backbones. On the very difficult DBpedia task, which has up to 13 categories, it even reaches an accuracy of 88.9%.
Figure 11
Figure 12
Chinese scenario
In the Chinese scenario, we collected 40 supervised datasets, unified them into the MC task format to pre-train the UniMC model, and then tested it on the 9 tasks of FewCLUE and ZeroCLUE. As of August 30, 2022, UniMC ranked first on both the FewCLUE and ZeroCLUE leaderboards (Erlangshen-UnifiedMC in the figure is UniMC).
Figure 13
Figure 14
Summary
We proposed a novel solution to NLU tasks in the zero-shot scenario. Using only a few hundred million parameters, it beats complex large models with a thousand times as many parameters. In addition, it introduces almost no manually crafted information. It also overcomes the inconsistency between pre-training and fine-tuning in BERT-type models, and our training and prediction are consistent. We can even train once and run zero-shot prediction many times, which greatly reduces computing costs. The IDEA Fengshenbang team has so far released more than 70 pre-trained large models.
Reference
[1] Impossible Triangle: What's Next for Pre-trained Language Models? https://readpaper.com/paper/4612531641570566145