Breaking the Impossible Triangle: rivaling a 540-billion-parameter model, the IDEA Fengshenbang team reaches zero-shot SOTA with only 200 million parameters

王林
2023-04-09 13:31:06

Since GPT-3 demonstrated the power of models with hundreds of billions of parameters, NLP has faced an impossible triangle of model scale, few/zero-shot performance, and fine-tuning performance. How can a language model with fewer than 1 billion parameters achieve SOTA few-shot (or even zero-shot) and fine-tuning performance? Must we rely on hundreds of billions of parameters and put up with unstable prompts to handle zero-shot scenarios? In this article, the IDEA Research Institute Fengshenbang team introduces UniMC, a new "phenotype" of pre-trained models that achieves zero-shot SOTA with only 200 million parameters. The work has been accepted by EMNLP 2022.

An article [1] published this year pointed out that, since pre-training was introduced, the NLP field has faced an impossible triangle (Figure 1 below): a model cannot simultaneously satisfy:

  1. Moderate model size (under 1 billion parameters);
  2. SOTA few-shot (or even zero-shot) performance;
  3. SOTA fine-tuning performance.

Figure 1

The impossible triangle exists because current pre-trained models show strong few/zero-shot performance only when their parameter count reaches a certain order of magnitude and prompt learning is used.

Our Fengshenbang team's recent paper, accepted by EMNLP 2022, "Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective", breaks this "curse" and provides a flexible and efficient solution. The UniMC model proposed in the paper has a small number of parameters (only hundreds of millions) and SOTA fine-tuning capability, while also delivering SOTA few/zero-shot performance comparable to the 540-billion-parameter PaLM.

  • Paper address: https://arxiv.org/abs/2210.08590
  • Model open source address: https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/unimc/

Technical Background

The introduction of BERT in 2018 marked the start of the pre-training era for the entire NLP field. Existing pre-trained masked language models (PMLMs) such as DeBERTa can already achieve fine-tuning SOTA with fewer than 1 billion parameters, but they are weak on NLU tasks in zero-shot scenarios.

The reason is that using a PMLM for a specific task requires adding an MLP layer on top, as shown in Figure 2(c). This MLP layer introduces additional parameters that can only be randomly initialized in a zero-shot scenario, so the model cannot produce reasonable output. In the fine-tuning scenario, the added MLP layer also prevents transfer between different tasks (for example, between a 2-class and a 3-class classification task).

For zero-shot scenarios, the mainstream approach in recent years has been to use pre-trained language models (PLMs) with tens or even hundreds of billions of parameters and to convert NLU tasks uniformly into text generation tasks. Large models can then be applied to zero-shot tasks through manually constructed prompts or manually designed verbalizers, as shown in Figure 2(a). Furthermore, the FLAN paper uses a large number of manually constructed templates to unify different tasks, so that knowledge from other tasks can be transferred to a specific task, as shown in Figure 2(b). However, such generative models have the following shortcomings:

  • A generative model needs a verbalizer (label description), which is usually written by hand, and different verbalizers lead to large performance differences;
  • Prompts also need to be designed by hand, and different prompts greatly affect downstream task performance;
  • At inference time, a generative model produces answers autoregressively, which is slow. It is also generally unidirectional and cannot use bidirectional information as BERT does;
  • To ensure few/zero-shot performance, generative models often have very large parameter counts, reaching 175 billion for GPT-3 or 540 billion for PaLM;
  • Although FLAN's instruction tuning can transfer knowledge from other tasks to a specific task, each evaluation task requires a new round of training. For example, evaluating on A requires training on B, C, D, and E; evaluating on B requires training on A, C, D, and E.

We propose the UniMC method shown in Figure 2(d), which avoids the above problems and achieves SOTA, or performance comparable to state-of-the-art models, on several Chinese and English tasks.

Figure 2

UniMC (a new model phenotype)

Model Ideas

Most NLU tasks are label-based, and a generative model has to generate the labels, which undoubtedly increases both the difficulty of the task and the learning cost of the model. For many label-based tasks, it is usually enough to take the input text and output the probability that the text belongs to each label. Based on this idea, we transform NLU tasks into multiple-choice (MC) tasks: given a passage, a question, and options, the model outputs the probability of each option without generating the options.

Building on this, we propose a new concept: the phenotype of a model. Existing phenotypes typically add an extra layer on top, such as a classification layer. Alternatively, the phenotype of a generative model such as GPT is to mine the model's knowledge through prompts. The UniMC solution we propose does not introduce any additional layers on top of the PMLM and explores another phenotype of the PMLM.

In this paper, we choose ALBERT as our backbone PMLM network.

Unified multiple-choice format

As shown in Figure 3, we hope to convert all label-based NLU tasks into a unified MC (Multiple-Choice) format. Our philosophy is to add as little human information as possible.

Figure 3

Specifically, we take the following two steps:

  • Convert each label into an option;
  • Optionally add a question prompt (the question mostly comes from the dataset description).

Advantages: only one option prompt needs to be designed, and at most one question prompt.
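
The two steps above can be illustrated with a minimal sketch. It is not the team's implementation: the `MCExample` container, the `to_multiple_choice` helper, the option template, and the sample review are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MCExample:
    """One sample in the unified format: options, an optional question, a passage."""
    options: List[str]
    question: str  # empty string when no question prompt is used
    passage: str


def to_multiple_choice(text: str, labels: List[str],
                       option_template: str = "{}",
                       question: str = "") -> MCExample:
    # Step 1: turn every label into an option via a single option prompt.
    options = [option_template.format(label) for label in labels]
    # Step 2: optionally attach a question prompt; the text becomes the passage.
    return MCExample(options=options, question=question, passage=text)


# Hypothetical sentiment example (not taken from the paper).
example = to_multiple_choice(
    text="The battery died after two days.",
    labels=["good", "bad"],
    option_template="It's {}",
    question="What is the sentiment of the review?",
)
print(example.options)
```

Because the label set is just a list of options, a 2-class and a 3-class task share exactly the same format, which is what makes transfer between them possible.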

Model structure

The structure of UniMC is shown in Figure 4 below. It uses an autoencoding structure similar to BERT. The main flow is as follows: we first unify the inputs of different tasks and restrict the flow of information between parts of the input. On top of the PMLM, we use O-MLM, OP, and MLM for MC training, and finally use O-MLM and OP for zero-shot prediction. The following sections break the solution down step by step.

Figure 4

Unified input

The red solid-line box in Figure 5 shows the input. Before being fed to UniMC, the input is converted into UniMC's own token format. To improve computational efficiency, we directly concatenate all options with the question and the passage, that is, [Options, Question, Passage]. We also insert a special token, [O-MASK], in front of each option to indicate "yes" or "no" (whether or not this option is selected). (Note: to improve reusability, we reuse the [MASK] token.)

The green dotted box in Figure 5 shows how information flow is controlled. The input draws on several sources: option information, question information, and passage information. These would influence one another, so we want to isolate them. For example, if the model could see the other options while encoding one option, the question would become easier and the model would become lazy.

So we made the following design choices:

  • Use segment IDs to tell the model that option information and context (question, passage) information are different;
  • Modify the position IDs so that the model treats the position information of different options equally;
  • Modify the attention mask matrix so that options cannot see one another, which would otherwise make the model lazy.
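
The three choices above can be sketched in plain Python. This is a simplified illustration under several assumptions that the article does not specify: whitespace tokenization stands in for the real tokenizer, each option restarts its position IDs at 0, the context positions continue after the longest option, and only option-to-option attention is blocked. The helper name `build_inputs` is hypothetical.

```python
from typing import List, Tuple

O_MASK = "[O-MASK]"  # in UniMC this reuses the [MASK] token


def build_inputs(options: List[str], question: str, passage: str
                 ) -> Tuple[List[str], List[int], List[int], List[List[int]]]:
    tokens, segment_ids, position_ids, group = [], [], [], []
    for k, option in enumerate(options):
        opt_tokens = [O_MASK] + option.split()
        tokens += opt_tokens
        segment_ids += [0] * len(opt_tokens)           # segment 0: options
        position_ids += list(range(len(opt_tokens)))   # every option starts at 0
        group += [k] * len(opt_tokens)                 # which option a token belongs to
    context = question.split() + passage.split()
    start = max(len(o.split()) + 1 for o in options)   # assumed offset for the context
    tokens += context
    segment_ids += [1] * len(context)                  # segment 1: question + passage
    position_ids += list(range(start, start + len(context)))
    group += [-1] * len(context)                       # -1 marks context tokens
    n = len(tokens)
    # Token i may attend to token j unless both sit in *different* options.
    attention = [[1 if group[i] == group[j] or -1 in (group[i], group[j]) else 0
                  for j in range(n)] for i in range(n)]
    return tokens, segment_ids, position_ids, attention


tokens, segment_ids, position_ids, attention = build_inputs(
    ["good", "bad"], "sentiment ?", "great movie")
print(position_ids)
```

In this toy input the token "good" cannot attend to the second [O-MASK], but both options can still attend to the question and passage.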

Figure 5

How does the model do multiple choice questions? (O-MLM and OP)

As shown in Figure 6, we use the O-MLM and OP tasks to let the model "select" the answer. O-MASK is fully inherited from the MASK token (specifically, to avoid adding parameters and to make full use of the knowledge learned during unsupervised pre-training, we reuse the parameters of the MaskLM head). The only difference is that O-MASK is masked 100% of the time. The goal of the O-MLM task is to decode each O-MASK into "yes" or "no", which predicts whether the option is selected.

The role of the OP task is to predict the answer from the "yes" predictions of the options. Specifically, we take the "yes" logit output at each [O-MASK], apply softmax across the options to obtain the probability of each option, and choose the option with the highest probability as the predicted answer.
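
The OP step described above amounts to a softmax over one number per option. The sketch below assumes the "yes" logit at each [O-MASK] position has already been extracted from the MaskLM head; the function names and the logit values are made up for illustration.

```python
import math
from typing import List, Tuple


def option_probabilities(yes_logits: List[float]) -> List[float]:
    # Numerically stable softmax over the per-option "yes" logits.
    peak = max(yes_logits)
    exps = [math.exp(x - peak) for x in yes_logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(yes_logits: List[float]) -> Tuple[int, List[float]]:
    probs = option_probabilities(yes_logits)
    # The predicted answer is the option with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical "yes" logits for a three-option question.
best, probs = predict([2.0, 0.5, -1.0])
print(best, [round(p, 3) for p in probs])
```

Since no option text is generated, prediction needs a single forward pass regardless of how long the options are.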

Figure 6

Processing multiple MC tasks in one batch

As shown in Figure 7, we want to process multiple MC datasets in one batch, which strengthens the model and makes it more unified. While building batches, we ran into a problem: what if a batch contains samples with different numbers of options?

So we designed a logit mask applied before the output. By adding a negative-infinity value to the predictions of irrelevant tokens, we eliminate their influence on O-MASK when computing the softmax. This also allows multiple-choice questions with different numbers of options to be processed uniformly in one batch.
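
One way to read the logit mask is sketched below. It is an interpretation, not the paper's code: each sample's "yes" logits are padded to the widest sample in the batch, and an additive mask of negative infinity is placed on the padded slots so they receive zero probability. The helper names and the logit values are hypothetical.

```python
import math
from typing import List

NEG_INF = float("-inf")


def masked_softmax(logits: List[float], mask: List[float]) -> List[float]:
    # Adding -inf to a slot drives its softmax weight to exactly 0.
    masked = [x + m for x, m in zip(logits, mask)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


def batch_option_probs(batch_yes_logits: List[List[float]]) -> List[List[float]]:
    width = max(len(row) for row in batch_yes_logits)
    result = []
    for row in batch_yes_logits:
        pad = width - len(row)
        logits = row + [0.0] * pad                    # padded slots hold dummy values
        mask = [0.0] * len(row) + [NEG_INF] * pad     # ...which the mask removes
        result.append(masked_softmax(logits, mask))
    return result


# A 2-option sample and a 4-option sample in the same batch.
probs = batch_option_probs([[1.0, 2.0], [0.5, 0.1, -0.3, 2.0]])
print([[round(p, 3) for p in row] for row in probs])
```

With the mask in place, every row of the batch has the same width, so samples from datasets with different label counts can share one tensor.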

Figure 7

Model training and prediction

MC Training

Unlike FLAN's instruction tuning, we train only on MC datasets. The main purpose is to let the model learn how to answer multiple-choice questions, and MC datasets are fairly general; for example, different datasets may have different numbers of labels.

Figure 8

Zero-shot Inference

Interestingly, the two tasks are consistent across the training and zero-shot inference stages, because we use the same two tasks, O-MLM and OP, to let the model answer multiple-choice questions. And since we dropped the classification layer, all parameters can be reused, which activates the zero-shot capability of the PMLM.

Figure 9

UniMC Performance

English scenario

We collected 14 multiple-choice tasks for pre-training and then tested zero-shot performance on other NLU tasks. On 4 NLI tasks, UniMC achieves SOTA and surpasses the 540-billion-parameter PaLM model.

Figure 10

On classification tasks, we also beat networks that use GPT-2 and GPT-3 as their backbone. On the very difficult Dbpedia task, with as many as 13 categories, UniMC even reaches a very high accuracy of 88.9%.

Figure 11

To explore the generalization of UniMC, we compared it with FLAN. As can be seen, UniMC surpasses or comes close to FLAN on almost all tasks.

Figure 12

Chinese scenario

In the Chinese scenario, we collected 40 supervised datasets, unified them into the MC task format to pre-train the UniMC model, and then tested on the 9 tasks of FewCLUE and ZeroCLUE. As of August 30, 2022, UniMC ranked first on both the FewCLUE and ZeroCLUE leaderboards (Erlangshen-UnifiedMC in the figures is UniMC).

Figure 13

Figure 14

Summary

We proposed a novel solution to NLU tasks in the zero-shot scenario. Using only hundreds of millions of parameters, it outperforms complex large models with a thousand times as many parameters.

In addition, we introduce almost no manually crafted information. The approach also overcomes the inconsistency between pre-training and fine-tuning in BERT-type models, since our training and prediction are consistent. We can even train once and run zero-shot prediction many times, which greatly reduces compute costs. The IDEA Fengshenbang team has so far released more than 70 pre-trained large models.

  • Model: https://huggingface.co/IDEA-CCNL
  • Fengshenbang overview paper (bilingual, Chinese and English): https://arxiv.org/abs/2209.02970
  • Fengshenbang homepage: https://github.com/IDEA-CCNL/Fengshenbang-LM

References

[1] Impossible Triangle: What's Next for Pre-trained Language Models? https://readpaper.com/paper/4612531641570566145

Source: this article is reproduced from 51cto.com.