The evolutionary tree of large language models: a super-detailed usage guide to ChatGPT

王林
2023-05-04 16:07:06

In practice, practitioners may struggle to find an AI model suited to their application: should they choose an LLM or fine-tune a model? And if they use an LLM, which one should they choose?

Recently, scholars from Amazon, Texas A&M University, Rice University, and other institutions surveyed the development of language models such as ChatGPT, and their article was retweeted with praise by Yann LeCun.

Paper: https://arxiv.org/abs/2304.13712

Related resources: https://github.com/Mooler0410/LLMsPracticalGuide

This article starts from the perspective of practical application and discusses which tasks LLMs are suited to, as well as the practical issues of models, data, and tasks to consider when selecting a model.

1 Introduction

In recent years, the rapid development of large language models (LLM) has triggered a revolution in the field of natural language processing (NLP). These models are extremely powerful and promise to solve many different kinds of NLP tasks – from natural language understanding (NLU) to generation tasks, and even pave the way to artificial general intelligence (AGI). However, in order to use these models effectively and efficiently, we need to have a practical understanding of their capabilities and limitations, as well as an understanding of the data and tasks involved in NLP.

This paper focuses on various aspects of applying LLMs to downstream NLP tasks in practice, in order to provide guidance to practitioners and end users. The goal of this guide is to give readers practical advice on whether to use an LLM for a given task and how to choose the most suitable one. This takes many factors into account, such as model size, computational requirements, and whether a pre-trained model exists for the specific domain. The article also introduces and explains LLMs from a practical perspective, which can help practitioners and end users leverage the power of LLMs to solve their own NLP tasks.

This article is structured as follows. It first briefly introduces LLMs, discussing the most important GPT-style and BERT-style architectures. It then gives an in-depth account of the key data-related factors that affect model performance, including pre-training data, training/tuning data, and test data. The last and most important part examines specific NLP tasks, discussing whether LLMs are suitable for knowledge-intensive tasks, traditional NLU tasks, and generation tasks, and describes the new abilities these models keep acquiring as well as the challenges of real-world application scenarios. We provide detailed examples to highlight the usefulness and limitations of LLMs in practice.

To analyze the capabilities of large language models, this article compares them with fine-tuned models. There is not yet a widely accepted standard definition of LLMs and fine-tuned models. To draw a practical distinction, this article uses the following definitions: an LLM is a large language model pre-trained on a large-scale dataset without being tuned on data for specific tasks; fine-tuned models are usually smaller, and after pre-training they are further tuned on a smaller task-specific dataset to optimize their performance on that task.

This article summarizes practical guidelines for using LLM in:

  • Natural language understanding. When the actual data is not within the distribution range of the training data or there is very little training data, the excellent generalization ability of LLM can be used.
  • Natural language generation. Use the power of LLM to create coherent, contextual, and high-quality text for a variety of applications.
  • Knowledge-intensive tasks. Leverage the vast knowledge stored in LLM to handle tasks that require specific expertise or general world knowledge.
  • Reasoning ability. Understand and utilize the reasoning capabilities of LLM to improve decision-making and problem-solving in a variety of situations.

2 Practical Guide to Models

Figure 1: This evolutionary tree of modern LLMs traces the development of language models in recent years and highlights some of the best-known models. Models on the same branch are more closely related. Transformer-based models are shown in colors other than gray: decoder-only models are the blue branch, encoder-only models are the pink branch, and encoder-decoder models are the green branch. A model's vertical position on the timeline indicates when it was released. Solid squares represent open source models, and empty squares represent closed source models. The stacked bar chart in the lower right corner shows the number of models from each company and institution.

This section briefly introduces the current best-performing LLMs. These models differ in training strategy, model architecture, and use case. To understand the overall picture of LLMs more clearly, we can divide them into two broad categories: encoder-decoder or encoder-only language models, and decoder-only language models. Figure 1 shows the evolution of language models in detail. Based on this evolutionary tree, we can make some interesting observations:

a) Decoder-only models are gradually becoming dominant in LLM development. In the early stages of LLM development, decoder-only models were not as popular as encoder-only and encoder-decoder models. But after 2021, the emergence of GPT-3 changed the picture, and decoder-only models experienced explosive growth. BERT likewise brought an initial burst of growth to encoder-only models, but after that, encoder-only models gradually faded from view.

b) OpenAI continues to maintain its leading position in the direction of LLM, now and likely in the future. Other companies and institutions are playing catch-up to develop models that are comparable to GPT-3 and GPT-4. OpenAI's leading position may be attributed to its continued investment in technology, even if the technology was not widely recognized in its early days.

c) Meta has made outstanding contributions to open source LLM and promoting LLM research. Meta stands out as one of the most generous commercial companies when it comes to its contributions to the open source community, especially related to LLMs, as it open sourced all LLMs it developed.

d) There is a trend towards closed source development in LLM. In the early stages of LLM development (before 2020), the vast majority of models were open source. However, with the launch of GPT-3, companies are increasingly choosing to close-source their models, such as PaLM, LaMDA, and GPT-4. Therefore, it is increasingly difficult for academic researchers to conduct LLM training experiments. This has the consequence that API-based research may become the dominant approach in academia.

e) Encoder-decoder models still have development prospects, because companies and institutions are still actively exploring this type of architecture, and most of these models are open source. Google has made significant contributions to open source encoder-decoder models. However, given the flexibility and versatility of decoder-only models, Google's chances of success by persisting in this direction seem slimmer.

Table 1 briefly summarizes the characteristics of various representative LLMs.

Table 1: Characteristics of large language models

2.1 BERT-style language models: encoder-decoder or encoder-only

Unsupervised learning of natural language has made great progress recently, because natural language data is easy to obtain and unsupervised training paradigms can make better use of extremely large datasets. A common approach is to predict masked words in a sentence from their context. This training paradigm is called masked language modeling. It allows the model to gain a deeper understanding of the relationship between words and their context. These models are trained on large text corpora, using techniques such as the Transformer architecture, and have achieved state-of-the-art performance on many NLP tasks, such as sentiment analysis and named entity recognition. Well-known masked language models include BERT, RoBERTa, and T5. Because of their strong performance on a variety of tasks, masked language models have become an important tool in natural language processing.
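The masking step described above can be sketched in a few lines. This is a simplified illustration rather than the procedure of any particular model: the `mask_tokens` function, the `[MASK]` placeholder string, and the whitespace tokenization are assumptions made for the example, and the 15% default follows the masking rate used in the original BERT setup.

```python
import random

MASK = "[MASK]"  # placeholder symbol, chosen for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for masked-language-model training.

    labels[i] holds the original token wherever a mask was applied and
    None elsewhere, so the loss is computed only on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the word from the model
            labels.append(tok)    # the model must recover it from context
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


sentence = "the cat sat on the mat".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The model sees `masked` as input and is trained to predict the non-`None` entries of `labels`, which is what forces it to use the surrounding context on both sides of each hidden word.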

2.2 GPT-style language models: decoder-only

Although language model architectures are generally task-agnostic, these methods still require fine-tuning on datasets for specific downstream tasks. Researchers have found that increasing the size of a language model can significantly improve its few-shot and zero-shot performance. The most successful models for few-shot and zero-shot use are autoregressive language models, which are trained to generate the next word given the preceding words in a sequence. These models have been widely used in downstream tasks such as text generation and question answering. Autoregressive language models include GPT-3, OPT, PaLM, and BLOOM. The revolutionary GPT-3 showed for the first time that prompting and in-context learning can give reasonable results in few-shot and zero-shot settings, and thus demonstrated the strength of autoregressive language models.

There are also models optimized for specific tasks, such as Codex for code generation and BloombergGPT for the financial domain. A major recent breakthrough is ChatGPT, a GPT-3 variant optimized for conversational tasks that generates more interactive, coherent, and context-aware conversations for a variety of real-world applications.
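Two ideas in this paragraph lend themselves to a small sketch: the next-word training objective, and few-shot prompting, where labeled examples are placed in the input text instead of being used for fine-tuning. The function names and the `Input:`/`Output:` prompt layout below are illustrative assumptions, not a format prescribed by the paper or by any model.

```python
def next_token_pairs(tokens):
    """Build (context, target) pairs: each prefix is used to predict the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def build_few_shot_prompt(examples, query, instruction=""):
    """Format labeled examples plus a new query as one text prompt.

    With an empty `examples` list this becomes a zero-shot prompt.
    """
    parts = [instruction] if instruction else []
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")  # the model continues from here
    return "\n\n".join(parts)


for context, target in next_token_pairs("the cat sat".split()):
    print(context, "->", target)

prompt = build_few_shot_prompt(
    examples=[("I loved this film", "positive"), ("Dull and slow", "negative")],
    query="A wonderful surprise",
    instruction="Classify the sentiment of each input.",
)
print(prompt)
```

The resulting prompt string would be sent to an autoregressive model as-is; no model weights change, which is the practical difference from fine-tuning.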

3 A Practical Guide to Data

This section explains the critical role of data in choosing the right model for downstream tasks. The impact of data on model effectiveness begins in the pre-training phase and continues through the training and inference phases.

Key Point 1

(1) When a downstream task involves out-of-distribution data, such as adversarial examples or a shift in data domain, LLMs generalize better than fine-tuned models.

(2) When labeled data is limited, an LLM is the better choice than a fine-tuned model; when labeled data is abundant, both are reasonable choices, depending on the requirements of the specific task.

(3) It is recommended to choose a model whose data domain used for pre-training is similar to the data domain of the downstream task.
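Key Point 1 can be read as a simple rule of thumb, sketched below. This is only one possible encoding of the three points, not a procedure from the paper: the function names, the `few_label_threshold` cutoff of 1000 examples, and the model names in the usage example are all hypothetical, and a real decision would weigh the task itself as well.

```python
def recommend_by_data(out_of_distribution, num_labeled, few_label_threshold=1000):
    """Apply points (1) and (2): prefer an LLM for shifted or scarce data."""
    if out_of_distribution:
        return "LLM"            # point (1): better generalization under shift
    if num_labeled < few_label_threshold:
        return "LLM"            # point (2): little labeled data
    return "either"             # point (2): abundant labels, depends on the task


def pick_by_domain(candidates, task_domain):
    """Apply point (3): keep models whose pre-training domains cover the task domain."""
    return [name for name, domains in candidates.items() if task_domain in domains]


# Hypothetical candidates, for illustration only.
candidates = {
    "general-model": {"web", "books"},
    "finance-model": {"finance", "news"},
}
print(recommend_by_data(out_of_distribution=False, num_labeled=200))
print(pick_by_domain(candidates, "finance"))
```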

4 Practical Guide to NLP Tasks

This section will discuss in detail whether LLM is useful on various downstream NLP tasks and the corresponding model capabilities. Figure 2 is a decision flow diagram summarizing all discussions. When faced with a certain task, quick decisions can be made based on this process.

Figure 2: The decision-making process when a user chooses an LLM or a fine-tuned model for an NLP application. This decision flow chart helps users evaluate whether the downstream NLP task at hand meets specific criteria and determine whether an LLM or a fine-tuned model is best suited for their application based on the evaluation results. In the decision-making process in the figure, Y indicates that the conditions are met and N indicates that the conditions are not met. The yellow circle next to Y for the last condition indicates that there is currently no model that is well suited for this type of application.

4.1 Traditional NLU tasks

Traditional NLU tasks are basic tasks in NLP, including text classification, named entity recognition (NER), entailment prediction, and so on. Many of these tasks can serve as intermediate steps in larger AI systems, such as using NER for knowledge graph construction.

Not applicable to LLM: For most natural language understanding tasks, such as those in GLUE and SuperGLUE, if the task already has plenty of well-annotated data and the test set contains very little out-of-distribution data, fine-tuned models still perform better. The size of the gap between small fine-tuned models and LLMs varies with the task and dataset.

Suitable for LLM: However, some NLU tasks are better handled by LLMs. Two representative examples are complex text classification problems and adversarial natural language inference.

Key Point 2

For traditional natural language understanding tasks, fine-tuned models are usually a better choice than LLMs, but if the task requires strong generalization, LLMs can help.

4.2 Generation Tasks

The goal of natural language generation is to create coherent, meaningful, and contextually appropriate sequences of symbols. It roughly covers two broad categories of tasks. The first category focuses on converting input text into new sequences of symbols. Examples include paragraph summarization and machine translation. The second category is "open-ended generation," where the goal is to generate text or symbols from scratch so that they accurately match an input description, such as writing an email, writing a news article, creating a fictional story, or writing code.

Applicable to LLM: The generation task requires the model to fully understand the input content or requirements and also requires a certain degree of creativity. This is what LLM excels at.

Not applicable to LLM: On most resource-rich translation tasks and on very low-resource translation tasks, fine-tuned models such as DeltaLM+Zcode perform better. For resource-rich machine translation, fine-tuned models slightly outperform LLMs. For machine translation with very few resources, such as English-Kazakh translation, fine-tuned models significantly outperform LLMs.

Key Point 3

Thanks to its strong generation ability and creativity, LLM has advantages in most generation tasks.

4.3 Knowledge-intensive tasks

Knowledge-intensive NLP tasks are the category of tasks that rely heavily on background knowledge, domain-specific expertise, or general real-world knowledge. These tasks require more than pattern recognition or syntactic analysis. They rely heavily on memorizing and appropriately using knowledge about specific entities, events, and common sense in the real world.

Suitable for LLM: Generally speaking, if there are billions of training tokens and parameters, the amount of real-world knowledge contained in LLM can far exceed that of a fine-tuned model.

Not applicable to LLM: Some other tasks require knowledge that differs from what an LLM has learned, that is, knowledge other than what the LLM has absorbed about the real world. On such tasks, LLMs have no clear advantage.

Key Point 4

(1) Thanks to their vast real-world knowledge, LLMs are good at knowledge-intensive tasks. (2) When the knowledge a task requires does not match what the LLM has learned, the LLM will struggle; and when the task only requires contextual knowledge, a fine-tuned model can match the LLM's performance.

4.4 Capabilities that come with scale

Scaling up LLMs (for example in parameters or training compute) greatly helps pre-trained language models. As model size increases, the model's ability to handle a range of tasks usually improves. On certain metrics, model performance follows a power-law relationship with model size. For example, the cross-entropy loss used to measure language modeling performance decreases linearly as model size grows exponentially, which is known as the "scaling law." For some key capabilities, such as reasoning, scaling up the model can gradually lift them from a very low level to a usable level, even close to human level. This subsection discusses the use of LLMs in terms of how scale affects their capabilities and behavior.
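A power law of the form L(N) = a · N^(-b) becomes a straight line when both the loss L and the model size N are plotted on logarithmic axes, so its constants can be estimated with an ordinary least-squares line fit. The sketch below assumes that functional form and uses synthetic points generated from made-up constants (a = 10, b = 0.1), not measurements from any real model; `fit_power_law` and `predict_loss` are names chosen for this example.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, size):
    """Evaluate the fitted power law at a given model size."""
    return a * size ** (-b)


# Synthetic data from L = 10 * N^-0.1 (illustrative constants only).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10 * n ** -0.1 for n in sizes]

a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"predicted loss at 1e10 parameters: {predict_loss(a, b, 1e10):.3f}")
```

A fit like this only describes smoothly scaling metrics such as loss; it says nothing about abilities that appear abruptly at a certain size.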

LLM use cases in reasoning: Reasoning involves understanding information, making inferences and making decisions, and is a core ability of human intelligence. For NLP, reasoning is extremely challenging. Many existing reasoning tasks can be divided into two categories: commonsense reasoning and arithmetic reasoning. Model enlargement can greatly improve the arithmetic reasoning ability of LLM. Common sense reasoning requires the LLM not only to remember factual knowledge but also to perform some reasoning steps about the facts. Common sense reasoning capabilities gradually improve as the size of the model increases. Compared to fine-tuned models, LLM performs better on most datasets.

LLM use cases in emergent abilities: Increasing the size of the model can also give it unprecedented and remarkable capabilities that go beyond the power-law rule. These are called "emergent abilities." As defined in the paper "Emergent Abilities of Large Language Models," an emergent ability of an LLM is one that small-scale models do not have but that appears in large-scale models. (For more on that paper, see "The new work of Jeff Dean and others: Looking at language models from another angle, unable to be discovered if the scale is not large enough.") This means we cannot infer or predict such an ability from the performance improvements of smaller models; on some tasks, once the model exceeds a certain size, it may suddenly achieve excellent performance. Emergent abilities are often unpredictable and unexpected, so a model may turn out to handle tasks that arise by chance or were never anticipated.

Not applicable to LLM, and understanding emergence: Although in most cases larger models perform better, there are exceptions.

On some tasks, model performance begins to decline as the LLM grows larger. This is known as the inverse scaling phenomenon. Researchers have also observed another interesting scale-related phenomenon, the U-shaped phenomenon. As the name suggests, this means that as the LLM grows larger, its performance on a specific task initially improves, then starts to decline, and then improves again.
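Given a task metric measured on models of increasing size, these trends can be told apart by looking at how the direction of change flips. The sketch below is an illustrative heuristic, not a method from the paper: `classify_scaling_trend` and its `tol` parameter are assumptions, the scores are made up, and a plain decline followed by recovery is also labeled U-shaped because that pattern is commonly described the same way.

```python
def classify_scaling_trend(scores, tol=0.0):
    """Label how a metric changes across models ordered from smallest to largest."""
    directions = []
    for prev, cur in zip(scores, scores[1:]):
        delta = cur - prev
        if abs(delta) <= tol:
            continue  # ignore changes too small to count
        step = "up" if delta > 0 else "down"
        if not directions or directions[-1] != step:
            directions.append(step)  # record only changes of direction

    if directions == ["up"]:
        return "improves with scale"
    if directions == ["down"]:
        return "inverse scaling"
    if directions in (["up", "down", "up"], ["down", "up"]):
        return "U-shaped"
    return "other"


# Made-up accuracy values for four model sizes, smallest first.
print(classify_scaling_trend([0.30, 0.45, 0.60, 0.72]))
print(classify_scaling_trend([0.60, 0.52, 0.41, 0.35]))
print(classify_scaling_trend([0.30, 0.50, 0.40, 0.70]))
```

With noisy evaluations, a positive `tol` keeps small fluctuations from being counted as a change of direction.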

To advance research in this area, we need a deeper understanding of emergent abilities, inverse scaling, and the U-shaped phenomenon.

Key Point 5

(1) As model size grows exponentially, the arithmetic and commonsense reasoning capabilities of LLMs also improve. (2) As LLMs scale up, emergent abilities bring serendipitous new uses, such as word manipulation and logical abilities. (3) Model capabilities do not always increase with scale, and our understanding of the relationship between the capabilities of large language models and scale is still limited.

4.5 Miscellaneous Tasks

To better understand the strengths and weaknesses of LLMs, we now turn to other tasks not covered above.

Not applicable to LLM: LLMs often struggle on tasks whose objectives and data differ from those they were trained on.

Suitable for LLM: LLM is especially suitable for certain specific tasks. To give some examples, LLM is very good at imitating humans. LLM can also be used to evaluate the quality of certain NLG tasks such as summarization and translation. Some capabilities of LLM can also bring benefits other than performance improvements, such as interpretability.

Key Point 6

(1) For tasks that are far from the pre-training objectives and data of LLMs, fine-tuned models and domain-specific models still have a place. (2) LLMs are good at imitating humans, data annotation, and data generation. They can also be used for quality assessment of NLP tasks and bring benefits such as interpretability.

4.6 Real-world "Tasks"

Finally, this section discusses the use of LLMs and fine-tuned models on real-world "tasks." The term "task" is used loosely here because, unlike academic settings, real-world settings often lack well-formed definitions. Many requests made of models cannot even be considered NLP tasks. The real-world challenges faced by models come from the following three aspects:

  • Noisy/unstructured input. Real-world input comes from real-world people, most of whom are not experts. They don’t understand how to interact appropriately with models and may not even be able to use text fluently. Therefore, real-world input data can be messy, with spelling errors, colloquial text, and multi-lingual jumbles, unlike the well-defined formatted data used for pre-training or fine-tuning.
  • Tasks that have not been formalized by academia. Tasks in real-world scenarios are often not well defined by academia, and the diversity extends well beyond the definition of academic research scenarios. Users often make queries or requests that don't fit neatly into predefined categories, and sometimes a single query encompasses multiple tasks.
  • Follow user instructions. The user's request may contain multiple implicit intentions (such as specific requirements for the output format), or it may not be clear what output the user expects without follow-up questions. The model needs to understand the user's intentions and provide output consistent with those intentions.

Essentially, these real-world difficulties arise because user requests deviate from the distribution of any NLP dataset designed for a specific task. Public NLP datasets do not reflect how these models are actually used.

Key Point 7

Compared to fine-tuned models, LLMs are better suited to handling real-world scenarios. However, assessing the effectiveness of models in the real world remains an open question.

5 Other aspects

Although LLMs are suitable for a variety of downstream tasks, there are other factors to consider, such as efficiency and trustworthiness. Efficiency issues include the training cost of LLMs, inference latency, and parameter-efficient tuning strategies. In terms of trustworthiness, the robustness and calibration of LLMs, fairness and bias, potential spurious correlations, and safety challenges need to be considered.

Key Point 8

(1) If the task is cost-sensitive or has strict latency requirements, lightweight local fine-tuned models should be prioritized. When deploying and delivering your model, consider parameter-efficient tuning. (2) The zero-shot approach of LLMs prevents them from learning shortcuts from task-specific datasets, which is a common problem for fine-tuned models. Nonetheless, LLMs still exhibit certain shortcut learning problems. (3) Since potentially harmful or biased output and hallucinations from LLMs may lead to serious consequences, safety issues related to LLMs should receive the greatest attention. Methods such as human feedback promise to alleviate these problems.
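To give a sense of what parameter-efficient tuning can save, the sketch below counts trainable parameters for one weight matrix under full fine-tuning and under a low-rank update. Low-rank adaptation (LoRA) is used here only as one well-known example of the idea; the article does not say which method it has in mind, and the layer size and rank are illustrative numbers. In that scheme the pre-trained matrix stays frozen and only two small factors, of shapes d_out × r and r × d_in, are trained.

```python
def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def low_rank_update_params(d_in, d_out, rank):
    """Trainable weights when the update is factored as (d_out x rank) @ (rank x d_in)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096   # illustrative layer width
rank = 8              # illustrative rank of the update

full = full_finetune_params(d_in, d_out)
low_rank = low_rank_update_params(d_in, d_out, rank)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"rank-{rank} update:    {low_rank:,} trainable parameters")
print(f"fraction trained:  {low_rank / full:.4%}")
```

The count covers a single matrix; savings for a whole model depend on which layers receive such updates.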

6 Summary and Future Challenges

This practical guide provides insights into LLM and best practices for using LLM on a variety of NLP tasks. Hopefully this will help researchers and practitioners harness the potential of LLM and drive innovation in language technology.

Of course, LLM also has some challenges that need to be solved:

  • Evaluate the model on real-world data sets. Although existing deep learning models are mainly evaluated on standard academic datasets such as ImageNet, standard academic datasets are limited and do not accurately reflect the performance of the model in the real world. As models advance, it will be necessary to evaluate them on more diverse, complex, and realistic data that reflects real needs. Evaluating models on both academic and real-world datasets allows the models to be more rigorously tested and allows us to better understand their effectiveness in real-world applications. This ensures that the model has the ability to solve real-world problems and deliver practical, usable solutions.
  • Model alignment. It is important to ensure that increasingly powerful and autonomous models are aligned with human values and priorities. We have to figure out how to make sure a model behaves as expected and is not optimized for results we do not want. It is important to integrate alignment techniques from the beginning of the model development process. Model transparency and interpretability are also important for assessing and ensuring alignment. Looking further ahead, an even more difficult challenge is emerging: aligning superhuman systems. Although this task currently goes beyond our needs, it is important to consider and prepare for aligning such advanced systems, as they may pose unique complexities and ethical issues.
  • Safety alignment. While it is important to discuss the existential risks posed by AI, we need practical research to ensure that advanced AI can be developed safely. This includes techniques for interpretability, scalable oversight and governance, and formal verification of model properties. In building a model, safety should be viewed not as an add-on but as an integral part of the whole.
  • Predict model performance as its size changes. When model size and complexity increase significantly, it is difficult to predict how the model will perform. Techniques should be developed to better predict how models will perform as they scale up or adopt new architectures, which would allow us to use resources more efficiently and speed up development. There are several possibilities: training a smaller "seed" model and predicting its growth by extrapolation, simulating the effects of scaling up or adjusting the model, and iterating on a test bench of models at different sizes to build a scaling law. This gives us an idea of how the model will perform before we build it.

Statement:
This article is reproduced from 51cto.com. In case of infringement, please contact admin@php.cn to have it removed.