Home  >  Article  >  Technology peripherals  >  Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

WBOY
WBOYforward
2023-04-08 12:31:051459browse

Recently, there has been great interest in the powerful capabilities demonstrated by large language models (such as thought chains[2], scratch pads[3]), and a lot of work has been carried out. We collectively refer to these as the emergent capabilities of large models [4]. These capabilities may [5] only exist in large models but not in smaller models, so they are called “emergent”. Many of these capabilities are very impressive, such as complex reasoning, knowledge reasoning, and out-of-distribution robustness, which we will discuss in detail later.

Notably, these capabilities are close to what the NLP community has been seeking for decades, and thus represent a potential research paradigm shift away from fine-tuning small models. to using large models for contextual learning. For first movers, the paradigm shift may be obvious. However, for the sake of scientific rigor, we do need very clear reasons why one should move to large language models, even if these models are expensive[6] and difficult to use[7 ], and the effect may be average[8]. In this article, we will take a closer look at what these capabilities are, what large language models can provide, and what their potential advantages are in a wider range of NLP/ML tasks.

##Original link: yaofu.notion.site/A-Closer-Look-at-Large-Language-Models-Emergent-Abilities-493876b55df5479d80686f68a1abd72f

Contents

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

Prerequisite: We assume that the reader has the following knowledge:

  • Pre-training, fine-tuning, prompts (natural language processing/deep learning capabilities that ordinary practitioners should have)
  • Thinking chain prompts, scratch pads (ordinary practitioners may not Too understanding, but it does not affect reading)
1. The emergent ability that exists in large models rather than small models

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

Image from Wei. et. al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. The X-axis is the model size. GSM8K is a collection of elementary school-level mathematics problems.

In the above renderings, we can observe the performance of the model:

  • When the size When the model is relatively small, the improvement is not big
  • When the model becomes larger, there is a significant improvement

This fundamentally shows that , some capabilities may not exist in the small model but are acquired in the large model.

There are many kinds of emergent capabilities, such as those sorted out by Wei et al. in 2022[9]. Some abilities are interesting, but we will not discuss them in this article, such as spelling the last letters of a string of words. We think this is a task for Python rather than a language model; or 3-digit addition, we think it is a calculation. This is what the processor does instead of the language model.

In this article, we are mainly interested in the following capabilities:

1. The NLP community has paid attention to it in recent years, but the previous NLP Capabilities that are difficult for models to achieve

2. Capabilities derived from the deepest essence of human language (depth of capabilities)

3. Ability that may reach the highest level of human intelligence (the upper limit of ability)

2. Three typical examples of emergent abilities

Many interesting abilities can be classified as above Among the categories mentioned in the article, among them, we mainly discuss the following three typical abilities:

  • Complex reasoning
  • Knowledge reasoning
  • Out-of-distribution robustness

Let us discuss each in detail next.

Complex Reasoning

The following is an example in the GSM8K data set where using prompt words significantly exceeds fine tuning:

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

While this question is easy for a 10-year-old, it is difficult for a language model, mainly due to the mix of math and language.

GSM8K was originally proposed by OpenAI in October 2021 [10]. At that time, they used the first version of [11]GPT3 to fine-tune the entire training set, with an accuracy of about 35%. This result makes the authors quite pessimistic, because their results show the scaling law of language models: as the model size increases exponentially, the performance increases linearly (I will discuss this later). Therefore, they ponder in Section 4.1:

"The 175B model appears to require at least an additional two orders of magnitude of training data to achieve an 80% solution rate."

Three months later, in January 2022, Wei et al. [12] Based on the 540BPaLM model, only used 8 thought chain prompts The example improves the accuracy to 56.6% (without increasing the training set by two orders of magnitude). Later in March 2022, Wang et al.[13] based on the same 540B PaLM model and improved the accuracy to 74.4% through the majority voting method. The current SOTA comes from my own work on AI2 (Fu et. al. Nov 2022[14]), where we achieved 82.9% accuracy on 175B Codex by using complex thought chains. As can be seen from the above progress, technological progress is indeed growing exponentially.

Thinking chain prompt is a typical example showing the emergent capabilities of a model as it scales:

  • From emergent capabilities Let’s look at : Only when the model is larger than 100B can the effect of the thinking chain be greater than the only answer prompt. So this ability only exists in large models.
  • From the effect point of view: The performance of the thought chain prompt is significantly better than its previous fine-tuning[15]method.
  • From the perspective of annotation efficiency: Thought chain prompts only require annotations of 8 examples, while fine-tuning requires a complete training set.

Some students may think that models that can do primary school mathematics mean nothing (in a sense, they are really not that cool). But GSM8K is just the beginning, and recent work has pushed cutting-edge problems to high school[16], universities[17], and even International Mathematical Olympiad problems[18] . Is it cooler now?

Knowledge Reasoning

The next example is reasoning skills that require knowledge (such as question and answer and common sense reasoning). In this case, prompting a large model is not necessarily better than fine-tuning a small model (which model is better remains to be seen). But the annotation efficiency in this case is amplified because:

  • #In many datasets, in order to obtain the required background/common sense knowledge, the (previously small) model An external corpus/knowledge graph is needed to retrieve[19], or training on enhanced[20] data is required through multi-task learning
  • For large language models, you can directly remove the retriever[21] and rely only on the internal knowledge of the model[22] without the need for fine-tuning

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

Image from Yu et. al. 2022. Previous SOTA models needed to be retrieved from external knowledge sources. GPT-3 performs equally well/better than previous models without retrieval.

#As shown in the table, unlike the math problem example, GPT-3 does not significantly outperform the previous fine-tuned model. But it does not need to be retrieved from external documents, it itself contains knowledge[23].

To understand the significance of these results, we can look back at history: the NLP community has faced the challenge of how to effectively encode knowledge from the beginning. People are constantly exploring ways to store knowledge outside or inside the model. Since the 1990s, people have been trying to record the rules of language and the world in a giant library, storing knowledge outside the model. But this is very difficult, after all, we cannot exhaust all the rules. Therefore, researchers began to build domain-specific knowledge bases to store knowledge in the form of unstructured text, semi-structured (such as Wikipedia) or fully structured (such as knowledge graphs). Generally, structured knowledge is difficult to construct (because the structural system of knowledge needs to be designed), but easy to reason (because of the architecture), unstructured knowledge is easy to construct (just save it directly), but it is difficult to use for reasoning (no architecture). However, language models provide a new way to easily extract knowledge from unstructured text and reason based on the knowledge efficiently without the need for predefined patterns. The following table compares the advantages and disadvantages:

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

Out-of-distribution robustness

# # The third capability we discuss is out-of-distribution robustness. Between 2018 and 2022, there was a lot of research on distribution shift/adversarial robustness/combination generation in the fields of NLP, CV and general machine learning. It was found that when the test set distribution is different from the training distribution, the behavioral performance of the model may be will drop significantly. However, this does not seem to be the case in context learning of large language models. The research by Si et al[24] in 2022 shows:

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

##Data comes from Si et. al. 2022. Although GPT-3 is worse than RoBERTa in the identically distributed setting, it is better than RoBERTa in the non-identically distributed setting, and the performance drop is significantly smaller.

#Similarly, in this experiment, the effect of GPT-3 based on prompt words under the same distribution is not as good as that of fine-tuned RoBERTa. But it outperforms RoBERTa in three other distributions (domain switching, noise, and adversarial perturbations), which means GPT3 is more robust.

In addition, even if there is a distribution shift, the generalization performance brought by good prompt words will still be maintained. For example:

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

The picture comes from Fu et. al. 2022. Even if the test distribution is different from the training distribution, complex cues are always better than simple ones. Hints perform better.

Fu et al.’s 2022 study

[25] showed that the more complex the input prompts, the better the performance of the model. This trend also continued in the case of distribution shifts: complex cues always outperformed simple cues, whether the test distribution was different from the original distribution, came from a noise distribution, or was transferred from another distribution.

Summary so far

In the above, I discussed three types that are only available for large models Emergent ability. They are:

  • Complex reasoning, large models significantly outperform previous small models without using all training data.
  • Knowledge reasoning, large models may not be as effective as small models, but large models do not require additional sources of knowledge (knowledge may be expensive or difficult to extract from unstructured data) .
  • Out-of-distribution robustness, this is a problem that needs to be solved when fine-tuning the model. Although large models perform less well than previous methods in the identically distributed case, they generalize much better in the non-identically distributed case.

3. Emergent ability overturns the law of proportion

In view of the advantages listed above, you may start to think that large language models are indeed very good. Before discussing further, let us look back at previous work and we will find a very strange question: GPT-3 was released in 2020, but why did we not discover and start thinking about the paradigm shift until now?

The answer to this question lies in two kinds of curves: logarithmic linear curve and phase change curve. As shown below:

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

Left picture: Law of proportion. When model size grows exponentially, the corresponding model performance grows linearly. Right: When the model size reaches a certain scale, emergent capabilities will appear, allowing performance to increase dramatically.

Initially, (OpenAI) researchers believed that the relationship between language model performance and model size could be predicted by a log-linear curve, that is, the model size increases exponentially , performance will increase linearly. This phenomenon is known as the scaling law of language models, as discussed by Kaplan et al. in their original 2020 GPT3 article. Importantly, at that stage, even the largest GPT-3 could not outperform small model fine-tuning with hints. So there was no need to use expensive large models at that time (even though the labeling of prompt words was very efficient). Until 2021, Cobbe et al[28] found that the scaling law also applies to fine tuning. This is a somewhat pessimistic finding because it means that we may be locked in model size - although model architecture optimization may improve model performance to a certain extent, the effect will still be Locked within a range (corresponding to the model size), it is difficult to have a more significant breakthrough. Under the control of the scaling law (2020 to 2021), since GPT-3 cannot outperform fine-tuning T5-11B, and fine-tuning T5-11B is already very troublesome, so NLP The community's focus is more on studying smaller models or efficient parameter adaptation. Prefix tuning[29]

is an example of the crossover between prompts and adaptation, later unified by He et al.

[30] in 2021. The logic at that time was very simple: If the fine-tuning effect is better, we should work more on efficient parameter adaptation; if the prompt word method is better, we should invest more energy in training large language models. Later in January 2022, the work of Thought Chain was released. As the authors show, thought chain cues exhibit a clear phase transition

in the performance-proportion curve. When the model size is large enough, performance improves significantly and clearly exceeds the scaling curve.

When using thought chains for prompts, the large model performs significantly better than fine-tuning on complex reasoning, performs competitively on knowledge reasoning, and is distributed robust There is also some potential. It only takes about 8 examples to achieve such an effect, which is why the paradigm may shift (Note: This article was completed a month before ChatGPT went online; after ChatGPT went online, the entire field was shocked and realized that the paradigm had shifted ).

4. What does paradigm shift mean?

What does a paradigm shift actually mean? Below we give a comparison of fine tuning and prompt word methods:

The benefits of prompt words are obvious: we no longer need tedious data annotation and fine-tuning on the full amount of data. We only need to write prompt words and obtain results that meet the requirements, which is much faster than fine-tuning.

Two other points to note are:

Is contextual learning supervised learning?

  • Frankly, I’m not sure.
  • The similarity is that context learning also requires examples like training data
  • The difference is that the generalization behavior of context learning And unlike supervised learning, this makes previous generalization theories (such as Rademancher Complexity or Neural Tangent Kernel) inapplicable.

Is contextual learning really better than supervised learning?

  • The answer is still unknown.
  • Most comparisons between prompt words and fine-tuning only compare the large prompt word model vs the fine-tuned small model, but a fair comparison should be the large prompt word model vs the fine-tuned large model , and the base model when comparing should be the same. So in the original ThoughtChain article, if Wei et al. wanted to show that cue words are better than fine-tuning, they should compare fine-tuned PaLM, not GPT3.
  • My hypothesis is: Fine-tuning can improve performance within the distribution, but hurts robustness outside the distribution. Prompt words perform better in distribution transformation scenarios, but are not as good as fine tuning in the same distribution scenario.
  • If the hypothesis is true, then a question worth investigating is how to fine-tune without sacrificing its contextual learning capabilities
  • Note that the effect of out-of-distribution fine tuning will also change with the model size. For example, in the work of Yang et al. in 2022, the fourth table shows that the out-of-distribution generalization ability of Bart-based will decrease, but Bart-large will improve. For large models, when the distribution of the test set is not much different from the training set, the fine-tuning effect within the distribution should also be improved.

Let’s review the logic mentioned above: If fine tuning is better, we should work hard to study how to optimize parameters efficiently; if prompt words are better, we should Efforts to train better large language models.

So, although we believe that large language models have great potential, There is still no conclusive evidence that whether fine-tuning or cue words is better, so we do not Determine if the paradigm really should shift, or to what extent it should shift. It is very meaningful to carefully compare these two paradigms to give us a clear understanding of the future. We leave more discussion to the next article.

5. How big should the model be?

Two numbers: 62B and 175B.

  • The model requires at least 62B so that the effect of the thought chain can be greater than the standard prompt word method.
  • The model needs to be at least 175B (the size of GPT3), so that the effect of the thinking chain can be greater than the effect of a fine-tuned small model (T5 11B).

62B This number comes from the fifth table of Chung et al.’s 2022 [31] work:

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

#For all models smaller than 62B, using prompt words directly is better than thinking chain. The first model that is better with the thought chain is the result of Flan-cont-PaLM 62B on BBH. The 540B model using thinking chain will get good results on more tasks, but not all tasks are better than fine tuning. In addition, the ideal size can be less than 540B. In the work of Suzgun et al. in 2022[32], the author showed that the use of thought chains in 175B InstructGPT and 175B Codex is better than directly using prompt words. Combining the above results, we get two numbers: 63B and 175B. So, if you want to participate in this game, you must first have a larger than average size model.

However, there are other large models that perform much worse under the thinking chain and cannot even learn the thinking chain, such as the first version of OPT, BLOOM and GPT-3. They are both size 175B. This brings us to our next question.

6. Is scale the only factor?

no.

Size is a necessary but not sufficient factor. Some models are large enough (such as OPT and BLOOM, both 175B), but they cannot do thought chains.

There are two models[33] You can do thinking chain:

  • GPT3 series model, Includes text-davinci-002 and code-davinci-002 (Codex). These are the only two models with strong emergent capabilities that are publicly accessible.
  • In addition to the above two models, other GPT3 models, including the original GPT3, text-davinci-001, and other smaller GPT-3 models, cannot do thought chains.
  • When we say "can do thought chain", we mean that the effect of using the thought chain method is better than directly using prompt words and fine-tuning T5-11B.
  • Also note that code-davinci-002 consistently outperforms [34] text-davinci on language tasks -002. This observation is very interesting and intriguing. This shows that a language model trained on code data can outperform a language model trained on language. So far we don't know why.
  • PaLM series of models, including PaLM, U-PaLM, Flan-PaLM and Minerva. These models are not yet publicly accessible (@Google here, open source them soon).

It is still unclear why there are emergent abilities, but we have found out the factors that may produce emergent abilities:

  • Instruction fine-tuning: GPT-3 text-davinci-002 is the product of fine-tuning using instruction reinforcement learning[35]. Before this, the effect of text-davinci-001 on thinking chain was not good. At the same time, the effect of PaLM[36][37] has also been improved after fine-tuning the instructions.
  • Fine-tuning on the code: Codex code-davinci-002 is fine-tuning on the code, and its effect is consistently better than text-davinci-002. PaLM has also been tweaked in code. On the surface, code has little to do with language, but it seems to play a big role, which we will discuss in a later article.
  • Fine-tuning with thinking chain: When text-davinci-002 was released, Google had released PaLM for 3 months. So OpenAI should have seen work related to the thinking chain. There are also some works[38] that show that directly using the thinking chain data for fine-tuning can stimulate the model's thinking chain ability.

However, all of these factors are speculative at this stage. It is very meaningful to reveal how to train the model to produce emergent capabilities. We will leave more discussion to next article.

7. Conclusion Conclusion

In this article, we carefully studied the emergent ability of language models. We highlight the importance of and opportunities for complex reasoning, knowledge reasoning, and out-of-distribution robustness. Emergent capabilities are very exciting because they can transcend scaling laws and exhibit phase transitions in scaling curves. We discussed in detail whether the research paradigm will actually shift from fine-tuning to contextual learning, but we do not yet have a definite answer because the effects of fine-tuning and contextual learning in in-distribution and out-of-distribution scenarios still need to be compared. Finally, we discuss three potential factors that produce emergent capabilities: instruction fine-tuning, code fine-tuning, and thought-chain fine-tuning. Suggestions and discussions are very welcome.

In addition we also mentioned two interesting issues that have not yet been discussed:

  • Can we fairly compare the effects of fine-tuning and contextual learning?
  • How do we train a large model so that the model can have emergent capabilities and thinking chain capabilities?

For these two questions, we will follow in the following articles​ discussion in.

Chinese-English comparison table

Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT

The above is the detailed content of Interpretation of hot topics: The emergent ability of large models and the paradigm shift triggered by ChatGPT. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:51cto.com. If there is any infringement, please contact admin@php.cn delete