Why do all GPT-3 replications fail? What you should know about using ChatGPT
This tweet was written on February 12, 2023. It is my personal opinion and is for reference only.
Why do all public reproductions of GPT-3 fail? For which tasks should we use GPT-3.5 or ChatGPT?
This tweet includes my summary after carefully re-examining the details of a series of papers, as well as my personal thoughts on the two questions above. These papers include but are not limited to: GPT-3, PaLM, BLOOM, OPT, FLAN-T5/PaLM, HELM, etc. If you have more reliable reference materials or more practical experience, please correct me.
For those who want to reproduce a GPT-3 or ChatGPT of their own, the first question is critical. The second question is important for those who want to use them (whenever GPT-3 is mentioned below, it mainly refers to GPT-3.5 or the latest version of InstructGPT, except for some cases that refer to the original GPT-3 paper).
Here, "failure" means that the trained model has a parameter count close to or larger than that of GPT-3, but still cannot match the performance reported in the original GPT-3 paper. By this criterion, GPT-3 and PaLM are "successful", but neither model is public, while all public models (for example, OPT-175B and BLOOM-176B) have "failed" to some extent. We can still learn some lessons from these "failures".
Note that if we could try various training settings multiple times, the open source community might eventually be able to reproduce GPT-3. But as of now, the cost of training another version of OPT-175B is still too high: for a model of this scale, a single training run requires at least 2 months on about 1000 80G A100 GPUs (data from the original OPT paper).
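To get a feel for why repeated runs are out of reach, the figures above can be turned into rough GPU-hour arithmetic. This is only a sketch: the GPU count and duration come from the text, "2 months" is taken as 60 days, and the hourly price is a hypothetical placeholder rather than a quoted rate.

```python
# Rough GPU-hour arithmetic for one OPT-175B-scale training run.
NUM_GPUS = 1000                   # "about 1000 80G A100 GPUs" (from the text)
TRAINING_DAYS = 60                # "at least 2 months", taken here as 60 days
ASSUMED_USD_PER_GPU_HOUR = 2.0    # hypothetical price, for illustration only


def gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU-hours consumed by a run of the given size and length."""
    return num_gpus * days * 24


def run_cost_usd(num_gpus: int, days: int, usd_per_gpu_hour: float) -> float:
    """Cost of a single run under a flat hourly price per GPU."""
    return gpu_hours(num_gpus, days) * usd_per_gpu_hour


if __name__ == "__main__":
    hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
    cost = run_cost_usd(NUM_GPUS, TRAINING_DAYS, ASSUMED_USD_PER_GPU_HOUR)
    print(f"GPU-hours per run: {hours:,}")
    print(f"Cost per run at the assumed price: ${cost:,.0f}")
```

Under these assumptions a single run is on the order of a million GPU-hours, so every extra attempt at a different training setting multiplies an already very large bill.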
Although some papers (such as OPT-175B and GLM-130B) claim to match or even exceed the performance of the original GPT-3 on some tasks, the claim remains questionable on the many more tasks that GPT-3 has been tested on. At the same time, according to the experience of most users on more diverse tasks and the HELM evaluation, the recent OpenAI GPT-3 API still performs better than these open source models.
Although the model behind the API may use instruction tuning (just like InstructGPT), the similarly instruction-tuned versions of OPT (OPT-IML) and BLOOM (BLOOMZ) are still much worse than InstructGPT and FLAN-PaLM (an instruction-tuned version of PaLM).
According to the details in these papers, there are multiple possible reasons for the failure of OPT-175B and BLOOM-176B compared to the success of GPT-3 and PaLM. I divide them into two parts: pre-training data and training strategy.
Pre-training data
Let us first look at how GPT-3 prepares and uses its pre-training data. GPT-3 is trained on a total of 300B tokens, 60% of which come from filtered Common Crawl; the rest come from WebText2 (an expanded version of the corpus used to train GPT-2), Books1, Books2 and Wikipedia.
Updated versions of GPT-3 are also trained on code datasets (such as GitHub code). The proportion of each part is not proportional to the size of the original dataset; instead, datasets with higher quality are sampled more frequently. The failure of OPT-175B and BLOOM-176B may be due to the following three difficulties, which make it hard for the open source community to collect similar data:
1. The first point is a high-performance classifier for filtering out low-quality data. Such a classifier was used to build the pre-training datasets of GPT-3 and PaLM, but not in the training of OPT and BLOOM. Some papers have shown that a model pre-trained on a smaller but higher-quality dataset can outperform one trained on a larger mixed-quality dataset. Of course, data diversity is still very important, as discussed in point 3. Therefore, the trade-off between data diversity and quality should be handled very carefully.
2. The second point is the deduplication of the pre-training dataset. Deduplication helps prevent pre-trained models from memorizing or overfitting the same data after seeing it multiple times, and thus helps improve generalization. GPT-3 and PaLM adopt document-level deduplication, which is also adopted by OPT. However, the deduplicated Pile corpus used to pre-train OPT still contains many duplicates, which may also contribute to its poorer performance (Note: some recent work suggests that deduplication may matter less for pre-training language models than previously thought).
3. The third point is the diversity of the pre-training datasets, including domain diversity, format diversity (for example: text, code and tables) and language diversity. The Pile corpus used by OPT-175B claims to have better diversity, but the ROOTS corpus used by BLOOM contains too many existing academic datasets and lacks the diversity found in Common Crawl data, which may result in BLOOM's worse performance. In comparison, a much higher proportion of GPT-3's data comes from Common Crawl, which is diverse and covers a wide range of domains. This may also be one of the reasons why GPT-3 could serve as the base model of ChatGPT, the first general-purpose chatbot.
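The quality-weighted sampling described above for GPT-3 (sources are drawn according to a fixed mixture, not in proportion to their raw size) can be sketched as follows. Only the 60% Common Crawl share is stated in the text; the other weights are filled in from the GPT-3 paper's mixture table as I recall it and should be treated as approximate assumptions. The function names are made up for this sketch.

```python
import random
from collections import Counter

# Relative sampling weights per source. Only the Common Crawl share is given
# in the text; the remaining values are assumptions for illustration.
MIXTURE_WEIGHTS = {
    "common_crawl": 60,
    "webtext2": 22,
    "books1": 8,
    "books2": 8,
    "wikipedia": 3,
}


def sample_sources(weights, num_draws, seed=0):
    """Pick which source each training document is drawn from."""
    rng = random.Random(seed)
    names = list(weights)
    # random.choices treats the weights as relative, so they need not sum to 100.
    return rng.choices(names, weights=[weights[n] for n in names], k=num_draws)


def empirical_shares(draws):
    """Fraction of draws that landed on each source."""
    counts = Counter(draws)
    return {name: counts[name] / len(draws) for name in counts}


if __name__ == "__main__":
    shares = empirical_shares(sample_sources(MIXTURE_WEIGHTS, 20_000))
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:>13}: {share:.3f}")
```

Because the weights are decoupled from dataset size, a small high-quality source can be seen several times during training while a huge low-quality one is only partially consumed.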
Please note: although diverse data is generally important for training a general-purpose LLM (Large Language Model), the specific pre-training data distribution has a huge impact on LLM performance on specific downstream tasks. For example, BLOOM and PaLM have a higher proportion of multilingual data, which leads to their higher performance on some multilingual and machine translation tasks.
OPT uses a lot of conversation data (such as Reddit), which may be one of the reasons why it performs well in conversations. A large proportion of PaLM's data is social media conversations, which may explain its excellent performance on a variety of question answering tasks and datasets. Likewise, PaLM and newer versions of GPT-3 have a large proportion of code data, which enhances their capabilities on coding tasks and possibly their CoT (Chain-of-Thought) capabilities.
An interesting phenomenon is that BLOOM's performance on code and CoT is still poor, although it uses code data in pre-training. This may imply that code data alone does not guarantee a model's code and CoT capabilities.
In short, some papers have shown the importance of the above three points, namely: avoiding memorization and overfitting through data deduplication, obtaining high-quality data through data filtering, and ensuring data diversity so that the LLM generalizes. Unfortunately, the details of how PaLM and GPT-3 preprocess their data, and the pre-training data themselves, are still unpublished, making it difficult for the public community to reproduce them.
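As a minimal illustration of the deduplication and quality-filtering steps summarized above, here is a toy pipeline. It is a sketch under stated assumptions, not what any of the papers actually did: exact hash matching stands in for document-level deduplication, and `toy_quality_score` is a hypothetical heuristic standing in for a learned quality classifier.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(documents):
    """Keep the first occurrence of each document (exact match after normalization)."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a quality classifier:
    the fraction of whitespace-separated tokens that are purely alphabetic."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(token.isalpha() for token in tokens) / len(tokens)


def filter_by_quality(documents, threshold=0.5):
    """Drop documents whose toy score falls below the threshold."""
    return [doc for doc in documents if toy_quality_score(doc) >= threshold]


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat",
        "the  cat sat on THE mat",          # duplicate after normalization
        "$$$ 1234 !!! ???",                 # low-quality noise
        "Language models learn from text",
    ]
    cleaned = filter_by_quality(deduplicate(corpus))
    print(cleaned)
```

The threshold is where the diversity versus quality trade-off shows up: raising it removes more noise but also discards legitimate documents in unusual formats, such as tables or code.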
Training strategy
The training strategy here includes the training framework, training duration, model architecture/training settings, and modifications during training. These are used to obtain better stability and convergence when training very large models. Generally speaking, loss spikes and failure to converge are widely observed during pre-training for unknown reasons. Therefore, numerous modifications to training settings and model architectures have been proposed to avoid these problems. However, some of the modifications adopted in OPT and BLOOM are not optimal, which may lead to their poor performance. The GPT-3 paper does not explicitly mention how they solved this problem.
1. Training framework. A model with more than 175B parameters often requires ZeRO-style data parallelism (a distributed optimizer) and model parallelism (including tensor parallelism, pipeline parallelism, and sometimes sequence parallelism). OPT uses the FSDP implementation of ZeRO and the Megatron-LM implementation of model parallelism. BLOOM uses the DeepSpeed implementation of ZeRO and the Megatron-LM implementation of model parallelism.
PaLM uses Pathways, a TPU-based system for model parallelism and data parallelism. The details of GPT-3's training system are still unknown, but it uses model parallelism at least to some extent (some say it uses Ray). Different training systems and hardware may lead to different training behavior. Obviously, some of the settings presented in the PaLM paper for TPU training may not be applicable to the GPU training used by all the other models.
An important consequence of the hardware and training framework is whether one can use bfloat16 to store model weights, intermediate activations, etc. This has proven to be an important factor in stable training, as bfloat16 can represent a wider range of floating point numbers and is able to handle the large values that occur during loss spikes. On TPUs bfloat16 is the default setting, which may be a secret of PaLM's success. But on GPUs, people used to rely mainly on float16, which was the only option for mixed precision training on the V100.
OPT uses float16, which may be one source of its instability. BLOOM noticed this problem and ended up using bfloat16 on A100 GPUs, but it did not realize the importance of this setting and therefore introduced an additional layer normalization after the first embedding layer to resolve instabilities in its preliminary experiments with float16. However, this layer normalization has been shown to lead to worse zero-shot generalization, which may be a factor in BLOOM's failure.
2. Modifications during training. OPT made many mid-flight adjustments and restarted training from the most recent checkpoint, including changing the gradient clipping norm and learning rate, switching to a simple SGD optimizer and then back to Adam, resetting the dynamic loss scalar, switching to a newer version of Megatron, etc.
These mid-flight adjustments may be one of the reasons for the failure of OPT. In contrast, PaLM made almost no mid-flight adjustments. When a loss spike occurs, it simply restarts training from a checkpoint about 100 steps before the spike and skips about 200-500 batches of data. Relying on this simple restart alone, PaLM achieved magical success. This is because the sampling was completed when the pre-training data was constructed, so training is bitwise deterministic, and because PaLM made many modifications to the model architecture and training settings for better stability. These modifications are described in the next point.
3. Model architecture/training settings: To make training more stable, PaLM made a number of adjustments to the model architecture and training settings, including using a modified version of Adafactor as the optimizer, scaling the output logits before the softmax, using an auxiliary loss to encourage the log of the softmax normalizer to be close to 0, using different initializations for the embeddings and the other layer weights, not using bias terms in the feedforward layers and layer normalization, and not using dropout during pre-training.
Please note that GLM-130B contains more valuable details on how to stably train very large models, e.g., using DeepNorm-based post-layer normalization instead of pre-layer normalization, and shrinking the gradient of the embedding layer. Most of the above model modifications are not adopted by OPT and BLOOM, which may lead to their instability and failure.
4. Training process: According to the comparison table in the original article, the number of tokens seen during the original GPT-3 pre-training is close to that of OPT and BLOOM, while PaLM far exceeds them. Likewise, the pre-training corpora of both PaLM and GPT-3 are larger than those of BLOOM and OPT. Therefore, pre-training on more tokens and with a larger high-quality corpus may be an important factor in the success of GPT-3 and PaLM.
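The float16 versus bfloat16 point from item 1 above can be made concrete with a small standard-library experiment. This is a sketch: float16 is handled through the `struct` module's half-precision format, while bfloat16 is emulated by truncating a float32 to its top 16 bits, which is an assumption of this example (real hardware typically rounds rather than truncates).

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip x through IEEE half precision.
    Raises OverflowError when x is too large for float16."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32
    encoding (sign, 8 exponent bits, 7 mantissa bits); this truncates."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_in_float16(x: float) -> bool:
    """True if x can be stored in float16 without overflowing."""
    try:
        to_float16(x)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    spike = 70000.0  # a large value of the kind that might appear during a loss spike
    print("fits in float16:", fits_in_float16(spike))
    print("as bfloat16    :", to_bfloat16(spike))
    # The price of the wider range is precision near 1.0:
    print("1.001 as float16 :", to_float16(1.001))
    print("1.001 as bfloat16:", to_bfloat16(1.001))
```

The trade-off is visible in both directions: float16 overflows just above 65504 but resolves small differences near 1.0, while bfloat16 keeps the float32 exponent range at the cost of mantissa bits.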
In addition to the four points listed above, there are some other factors that may be less critical for stable training but may still affect final performance.
First, both PaLM and GPT-3 use a batch size that gradually increases from small to large during training, which has been shown to be effective in training a better LLM; however, both OPT and BLOOM use a constant batch size.
Second, OPT uses the ReLU activation function, while PaLM uses SwiGLU and GPT-3 and BLOOM use GeLU; the latter activations usually make the trained LLM perform better.
Third, in order to better model longer sequences, PaLM uses RoPE positional embeddings, BLOOM uses ALiBi positional embeddings, and the original GPT-3 and OPT use learned positional embeddings, which may affect performance on long sequences.
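The first of these points, a batch size that grows during training, can be sketched as a simple schedule. A linear ramp is only one possible shape and is an assumption here, as are the concrete numbers in the usage example; the text says only that the batch size increases gradually from small to large.

```python
def batch_size_at(tokens_seen: int, start_batch: int, full_batch: int, ramp_tokens: int) -> int:
    """Batch size (in tokens) after `tokens_seen` training tokens.

    Grows linearly from `start_batch` to `full_batch` over the first
    `ramp_tokens` tokens, then stays constant.
    """
    if tokens_seen >= ramp_tokens:
        return full_batch
    fraction = tokens_seen / ramp_tokens
    return int(start_batch + fraction * (full_batch - start_batch))


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the text).
    START, FULL, RAMP = 32_000, 3_200_000, 4_000_000_000
    for seen in (0, 1_000_000_000, 2_000_000_000, 4_000_000_000, 100_000_000_000):
        print(f"{seen:>15,} tokens seen -> batch of {batch_size_at(seen, START, FULL, RAMP):,} tokens")
```

A constant-batch run, as used by OPT and BLOOM, corresponds to setting `start_batch` equal to `full_batch`.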
I will try to explain which tasks and applications we should use GPT-3 for and which we should not. To show whether GPT-3 is suitable for a specific task, I mainly compare prompted GPT-3 with fine-tuned smaller models, some of which have other special designs. This question is even more important given the good performance of the recently released smaller, fine-tunable FLAN-T5 models.
In an ideal world, if the burden of fine-tuning GPT-3 were affordable, it could lead to further improvements. However, the improvements achieved by fine-tuning PaLM-540B on some tasks are so limited that one wonders whether fine-tuning GPT-3 on some tasks is worthwhile. From a scientific perspective, a fairer comparison would be between fine-tuned GPT-3 and prompted GPT-3. However, in order to use GPT-3, one might be more interested in comparing prompted GPT-3 with fine-tuning a smaller model.
Note that I mainly use task accuracy as the measure, but there are still many other important dimensions, such as toxicity and fairness, which should also be taken into consideration when deciding whether to use GPT-3, as presented in the HELM paper. The diagram below shows a rough decision-making process, and hopefully it will serve as a useful practical guide, whether for an existing task or a completely new one.
Note 1: ChatGPT excels as a chatbot due to its good alignment in conversational scenarios. But we usually use GPT-3, InstructGPT (GPT-3.5), and Codex, the models behind ChatGPT, as general models in more tasks and usage scenarios.
Note 2: The conclusions in this section are based on findings about the current versions of the models, which may not apply to future stronger models. This is because using more pre-training data that is close to the target dataset, instruction tuning on academic datasets (for example, prompting a FLAN-PaLM, which is not yet public, may bring stronger performance), or aligning the model better with the target task through RLHF may make the model perform better on the target task, even if this sometimes sacrifices ability in other scenarios (for example, InstructGPT's "alignment tax").
In this case, it is difficult to judge whether GPT is truly generalizing across tasks, whether it has merely memorized some test samples during pre-training, or whether it has already seen those so-called "unseen" tasks during pre-training. However, it remains questionable whether memorization is actually a serious problem in practice. Unlike researchers, users who find that GPT already performs well on their test data may not care whether GPT saw the same or similar data during pre-training.
In any case, in order to maximize the current practical value of this section, I try my best to compare the best performance of fine-tuned public smaller models (T5, FLAN-T5, some specially designed fine-tuned SOTA models, etc.) with that of the recent GPT-3 (GPT-3.5, InstructGPT) and PaLM (or FLAN-PaLM), whenever evaluation data for these models is available.
Tasks suitable for using GPT-3
Generally speaking, the following situations are more suitable for using GPT-3. Surprisingly, if we look back at the introduction of the GPT-3 paper, many of the initial design goals there cover these tasks. This means that those originally ambitious goals have been partially achieved.
1. Creative and complex tasks: including code (code completion, generating code from natural language instructions, code translation, bug fixing), text summarization, translation, and creative writing (such as writing stories, articles, emails, reports, and improving writing). As shown in the original GPT-3 paper, GPT-3 is designed for tasks that are difficult or "impossible to annotate". To some extent, previous fine-tuned models could not be applied to these tasks in real-world applications; GPT-3 makes them possible. For example, recent work shows that human-annotated text summaries have been surpassed by LLM-generated summaries.
Prompted PaLM-540B can even outperform fine-tuned models on certain machine translation tasks that translate from low- and medium-resource languages into English.
A similar trend was observed in BLOOM-176B. This is because English data usually accounts for a large proportion of the pre-training corpus, so LLMs are good at generating English sentences. Note that on coding tasks, although Codex and PaLM perform better overall than previous models, we still need to let the LLM sample multiple times (k times) to pass the test cases (using pass@k as the metric).
2. Tasks with only a few labeled examples or no labeled data. As the original GPT-3 paper states, GPT-3 is designed for "expensive to annotate" tasks. In this case, fine-tuning a smaller model on a very small amount of labeled data usually cannot match the performance of GPT-3 in the zero-shot, one-shot, or few-shot setting.
3. Out-of-distribution (OOD) generalization. Given some training data, traditional fine-tuning may overfit the training set and generalize poorly out of distribution, while few-shot in-context learning can generalize better out of distribution. For example, prompted PaLM can outperform a fine-tuned SOTA model on the Adversarial Natural Language Inference (ANLI) task, while it may still be inferior to the fine-tuned SOTA on normal language inference tasks.
Another example is that prompted LLMs show better compositional generalization than fine-tuned models. The better out-of-distribution generalization may be because parameters are not updated during in-context learning, which avoids overfitting, or because those formerly out-of-distribution examples are in-distribution for the LLM. This use case was explained as one of the original design goals of GPT-3: "Fine-tuned models can achieve so-called human-level performance on a dataset for a specific task, which may actually overstate the performance on that task in the real world. This is because the model only learned spurious correlations that existed in the training set, and the model overfitted the narrow distribution of this training set."
4. Scenarios that require the ability to handle multiple tasks rather than excellence on one specific task. Chatbots are one such scenario, where users expect correct responses across a variety of tasks. This is probably why ChatGPT is one of the most successful use cases of GPT-3.
5. Knowledge-intensive tasks for which retrieval is not feasible. The knowledge stored in LLMs can significantly improve performance on knowledge-intensive tasks, such as closed-book question answering and MMLU (a benchmark of multiple-choice questions from 57 subjects across STEM, humanities, social sciences, etc., used to test the world knowledge and problem-solving ability of LLMs). However, if a retrieval step can be added beforehand for retrieval-augmented generation, a fine-tuned smaller model (such as Atlas) can perform even better (on the closed-book NaturalQuestions and TriviaQA datasets, Atlas is better than PaLM and the latest InstructGPT).
Retrieval or traditional search is also a necessary step for integrating GPT-3 or ChatGPT into a search engine, as it can improve the accuracy of generation and provide more reference links to make answers more convincing. But we should admit that there are cases where retrieval is not allowed or not easy, such as taking the USMLE (United States Medical Licensing Examination), where Google has shown that FLAN-PaLM-based models can do well.
Similarly, on the MMLU benchmark, PaLM-540B performs better than other fine-tuned models, even those combined with retrieval, although the latest version of InstructGPT is still worse than these fine-tuned SOTA models with retrieval. Note also that instruction tuning of a smaller model can achieve results close to those of a larger LLM, as shown by FLAN-T5.
6. Some difficult tasks that require the emergent abilities of LLMs, such as reasoning with CoT and the complex tasks in BIG-Bench (including logical reasoning, translation, question answering, mathematical tasks, etc.). For example, PaLM has shown that on 7 multi-step reasoning tasks including mathematical and commonsense reasoning, 8-shot CoT is better than the fine-tuned SOTA on 4 of the tasks and roughly on par on the other 3.
Such successful performance can be attributed to both the larger model and CoT. PaLM also shows discontinuous performance improvements on BIG-Bench tasks going from the 8B to the 62B to the 540B model, which exceed what scaling laws predict and are known as the emergent abilities of LLMs. Additionally, 5-shot PaLM-540B outperforms the previous (few-shot) SOTA on 44 out of 58 common tasks in BIG-Bench. The overall performance of PaLM-540B on BIG-Bench is also better than average human performance.
7. Some scenarios that require imitating humans, or whose goal is to build artificial general intelligence with human-level performance. Likewise, ChatGPT is one such case: it achieved phenomenal success by making itself more human-like. This was also explained as one of the original design goals of GPT-3: "Humans do not need large-scale supervised datasets to learn most language tasks. With only a few examples at most, humans can seamlessly mix various tasks and skills together or switch between them. Traditional fine-tuned models thus lead to unfair comparisons with humans, despite their claims of human-level performance on many benchmark datasets."
8. On some traditional NLP tasks that are close to language modeling, few-shot PaLM-540B can roughly match or even exceed the fine-tuned SOTA, such as cloze tasks on the last sentence or last word of a passage, and anaphora resolution. It should be noted that in this case a zero-shot LLM is sufficient, and one-shot or few-shot examples are usually of little help.
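The pass@k metric mentioned under item 1 above can be computed with the unbiased estimator commonly used for code-generation benchmarks: generate n samples per problem, count the c samples that pass the tests, and estimate the chance that at least one of k randomly chosen samples passes as 1 - C(n-c, k) / C(n, k). The function name and the example numbers are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated
    c: number of those samples that pass the tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    n_samples, n_correct = 10, 3
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(n_samples, n_correct, k):.4f}")
```

The metric rises quickly with k, which is why a model can look much stronger under pass@100 than under pass@1 and why the text stresses that multiple samples are still needed.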
Tasks not suitable for using GPT-3
On the other hand, some tasks do not require prompting a model of the size of GPT-3:
1. Calling the OpenAI GPT-3 API exceeds the budget (for example, for a startup without much money).
2. Calling the OpenAI GPT-3 API raises security concerns (such as data leakage to OpenAI, or potentially harmful generated content).
3. There are not enough engineering or hardware resources to deploy a model of similar size and eliminate the inference latency problem. For example, without state-of-the-art 80G A100s or engineering resources to optimize inference speed, simply using Alpa to deploy OPT-175B on 16 40G A100s takes 10 seconds to complete inference on a single sample, which is unacceptable latency for most real-world online applications.
4. If you want to use GPT-3 to replace a fine-tuned model with good performance and high accuracy, or you want to deploy an NLU (Natural Language Understanding) or NLG (Natural Language Generation) model for some specific single task and usage scenario, please think twice about whether it is worth it.
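The latency figure in item 3 can be translated into a rough capacity estimate. This sketch assumes each 16-GPU replica handles one request at a time with no batching, which is a simplification made for this example; the 10-second latency and the 16-GPU deployment come from the text, and the target request rate is hypothetical.

```python
import math

SECONDS_PER_SAMPLE = 10.0   # from the text: OPT-175B served with Alpa on 16 40G A100s
GPUS_PER_REPLICA = 16       # from the text


def replicas_needed(target_qps: float, seconds_per_sample: float = SECONDS_PER_SAMPLE) -> int:
    """Replicas required to sustain `target_qps` requests per second,
    assuming each replica serves one request at a time."""
    return math.ceil(target_qps * seconds_per_sample)


def gpus_needed(target_qps: float,
                seconds_per_sample: float = SECONDS_PER_SAMPLE,
                gpus_per_replica: int = GPUS_PER_REPLICA) -> int:
    """Total GPUs across all replicas."""
    return replicas_needed(target_qps, seconds_per_sample) * gpus_per_replica


if __name__ == "__main__":
    target = 5  # hypothetical load: 5 requests per second
    print(f"replicas: {replicas_needed(target)}, GPUs: {gpus_needed(target)}")
```

Even a modest request rate turns into hundreds of GPUs under these assumptions, and every user still waits the full 10 seconds, which is the point the item makes about online applications.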
To summarize, the above tasks can be classified into one of the following categories:
1. Some NLU tasks that require neither additional knowledge nor the generation capabilities of an LLM. This means that the test data mostly follows the same distribution as the training data at hand. On these tasks, previously fine-tuned smaller models already perform well.
2. Some tasks that do not require additional knowledge from an LLM, because each example already contains enough knowledge in its context or prompt, such as machine reading comprehension.
3. Some tasks that require additional knowledge which is unlikely to be obtained from an LLM, or for which the LLM is unlikely to have seen similarly distributed data, such as tasks in some low-resource languages where the LLM has only limited pre-training samples.
4. Some tasks that require knowledge inconsistent with the knowledge contained in the LLM, or knowledge not grounded in real-world language data. Because an LLM is trained on real-world language data, it is difficult for it to override its original knowledge with counterfactual knowledge in new tasks. In addition to the "redefine mathematical notation" problem in the Inverse Scaling Prize challenge, another such task is retelling a slightly altered quote, in which the LLM is asked to repeat a modified quote that appears in the prompt. In this case, the LLM tends to repeat the original version of the quote rather than the modified version.
5. Some tasks that require knowledge from the LLM but also rely heavily on manipulating this knowledge, which the LLM's "predict the next token" objective cannot easily achieve. An example is some commonsense reasoning tasks. The reason CoT and least-to-most prompting help LLM reasoning may be that they better elicit those continuous pre-training texts that happen to mimic the process of planning and decomposing/combining knowledge. Thus, CoT and least-to-most prompting perform well on some mathematical reasoning, code, and other simple natural language reasoning tasks, but still perform poorly on many commonsense reasoning tasks (e.g., the deductive reasoning tasks demonstrated in the Inverse Scaling Prize competition) and custom symbolic reasoning tasks. These tasks are usually not covered by most real-world continuous sequences in natural language data, and instead require manipulating scattered knowledge to complete.
6. Some tasks that are susceptible to spurious correlations in in-context learning examples or real-world data. One example is question answering involving negation from the Inverse Scaling Prize competition. If an LLM is asked: "If a cat's body temperature is below average, it is not in...", it tends to answer "danger" rather than "safe range". This is because the LLM is dominated by the common association between "below average body temperature" and "danger", which is a spurious correlation in the negated case.
7. Some tasks whose goals are significantly different from processing language data, such as regression problems, where fine-tuned models are hard to replace with an LLM. As for multimodal tasks, they cannot be solved by an LLM, but may benefit from large-scale pre-trained multimodal models.
8. Some tasks that do not require the emergent abilities of an LLM. To better identify more such tasks, we need to better understand where emergent abilities arise during LLM training.
Note that in real-world usage scenarios, even if an LLM cannot be used online because it cannot meet latency requirements, it can still be used to generate or label data offline. Such automatically annotated labels can be looked up online and served to users, or used to fine-tune smaller models. Using such data to fine-tune smaller models reduces the manually annotated data required to train them and injects some of the LLM's emergent abilities (such as CoT) into the smaller models.
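The offline-labeling idea can be sketched as a small pipeline: label a batch of inputs ahead of time, then either serve the cached labels online or export them as training pairs for a smaller model. Everything here is illustrative; in particular `toy_llm_labeler` is a hypothetical keyword rule standing in for a real LLM call so that the sketch runs without network access.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def toy_llm_labeler(text: str) -> str:
    """Hypothetical stand-in for an LLM labeling call (keyword rule only)."""
    return "positive" if "good" in text.lower() else "negative"


def build_label_cache(texts: Iterable[str], labeler: Callable[[str], str]) -> Dict[str, str]:
    """Label each distinct input once, offline, where latency does not matter."""
    cache: Dict[str, str] = {}
    for text in texts:
        if text not in cache:
            cache[text] = labeler(text)
    return cache


def serve(text: str, cache: Dict[str, str], fallback: str = "unknown") -> str:
    """Online path: a dictionary lookup instead of a slow model call."""
    return cache.get(text, fallback)


def to_finetuning_pairs(cache: Dict[str, str]) -> List[Tuple[str, str]]:
    """Export (input, label) pairs for fine-tuning a smaller model."""
    return sorted(cache.items())


if __name__ == "__main__":
    cache = build_label_cache(["Good movie", "Bad plot", "Good movie"], toy_llm_labeler)
    print(serve("Bad plot", cache))
    print(serve("Never seen before", cache))
    print(to_finetuning_pairs(cache))
```

The lookup path only covers inputs seen during the offline pass, which is why the second use, fine-tuning a smaller model on the exported pairs, is the one that extends to new inputs.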
In summary, given the amazing performance of the open source FLAN-T5 on many tasks when there is enough labeled data, I recommend that those with limited resources for calling the OpenAI API first try fine-tuning FLAN-T5-11B on their target tasks. Furthermore, based on the recent performance of FLAN-PaLM-540B on the MMLU dataset, which is surprisingly good compared to the performance of the latest version of InstructGPT (according to HELM), Google may have a more powerful base model than OpenAI, assuming OpenAI has already released through its API the strongest LLM it has obtained.
The only remaining step for Google is to align this LLM with conversational scenarios through human feedback. I wouldn't be surprised if they release a ChatGPT-like or better chatbot soon, despite their recent "failure" to show off a version of Bard that might be based on LaMDA.
About the author
Original English author: Yang Jingfeng, currently a scientist at Amazon, who received his bachelor's degree from Peking University and his master's degree from the Georgia Institute of Technology, under the supervision of Stanford Professor Yang Diyi.
Translated by Yang Haotong and revised by Wang Xiao.
Thanks to Jin Hongye for his suggestions on the first version of the manuscript, and to Chen Sanxing and Fu Yao for their discussions and suggestions.
English original version: https://jingfengyang.github.io/gpt
Original tweet: https://twitter.com/JingfengY/status/1625003999387881472