Talk about the data-centric AI behind the GPT model
Artificial intelligence (AI) is making huge strides in changing the way we live, work, and interact with technology. One area of significant recent progress is the development of large language models (LLMs) such as GPT-3, ChatGPT, and GPT-4. These models can accurately perform tasks such as language translation, text summarization, and question answering.
While it is difficult to ignore the ever-increasing model sizes of LLMs, it is equally important to recognize that their success is largely due to the large amount of high-quality data used to train them.
In this article, we provide an overview of recent advances in LLMs from the perspective of data-centric AI, a growing concept in the data science community. We reveal the data-centric AI concepts behind the GPT models by discussing three data-centric AI goals: training data development, inference data development, and data maintenance.
An LLM is a natural language processing model trained to infer words in context. For example, the most basic function of an LLM is to predict a missing token given its context. To do this, LLMs are trained on massive amounts of data to predict the probability of each candidate token. The figure below is an illustrative example of using an LLM to predict the probabilities of a missing token in context.
The GPT models are a series of LLMs created by OpenAI, such as GPT-1, GPT-2, GPT-3, InstructGPT, and ChatGPT/GPT-4. Like other LLMs, the GPT architecture is mainly based on the Transformer, which takes token and positional embeddings as input and uses attention layers to model the relationships between tokens.
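The idea of assigning a probability to each candidate token can be sketched with a softmax over raw scores. This is a minimal illustration, not how any GPT model is implemented: the example sentence, the candidate tokens, and their scores are all invented here.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores a model might assign to candidates for the blank in
# "The cat sat on the ___". The numbers are made up for illustration.
candidate_logits = {"mat": 3.0, "sofa": 2.0, "moon": -1.0}

probabilities = softmax(candidate_logits)
best_token = max(probabilities, key=probabilities.get)

print(probabilities)
print(best_token)
```

In a real LLM the scores come from the network and cover the entire vocabulary rather than three hand-picked words.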
GPT-1 model architecture
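To make the role of the embeddings and the attention layer more concrete, here is a heavily simplified sketch of single-head causal self-attention. It assumes that queries, keys, and values are the inputs themselves (real Transformers use learned projection matrices, multiple heads, and many layers), and the two-dimensional embedding values are invented.

```python
import math

def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(embeddings):
    """Single-head attention where queries, keys, and values are the inputs."""
    dim = len(embeddings[0])
    outputs = []
    for i, query in enumerate(embeddings):
        # Causal mask: a token attends only to itself and earlier tokens.
        visible = embeddings[: i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the visible vectors.
        mixed = [
            sum(w * vec[d] for w, vec in zip(weights, visible))
            for d in range(dim)
        ]
        outputs.append(mixed)
    return outputs

# Made-up token and positional embeddings for a three-token sequence.
token_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pos_emb = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1]]
inputs = [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(token_emb, pos_emb)]

attended = causal_self_attention(inputs)
print(attended)
```

The first output equals the first input, since the first token has nothing earlier to attend to; later outputs blend information from the preceding tokens.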
Later GPT models use an architecture similar to GPT-1, except with more model parameters, more layers, a larger context length, a larger hidden layer size, and so on.
Data-centric AI is an emerging way of thinking about how to build AI systems. It is the discipline of systematically engineering the data used to build artificial intelligence systems.
In the past, we mainly focused on creating better models (model-centric AI) while keeping the data largely unchanged. However, this approach can cause problems in the real world because it does not take into account the different issues that can arise in the data, such as label inaccuracies, duplications, and biases. Therefore, "overfitting" a data set does not necessarily lead to better model behavior.
In contrast, data-centric AI focuses on improving the quality and quantity of the data used to build AI systems. This means that the attention is on the data itself, while the model is kept relatively fixed. Using a data-centric approach to develop AI systems has greater potential in real-world scenarios, as the data used for training ultimately determines the model's maximum capabilities.
It should be noted that there is a fundamental difference between "data-centric" and "data-driven". The latter only emphasizes using data to guide the development of artificial intelligence, and usually still focuses on developing models rather than data.
Comparison between data-centric artificial intelligence and model-centric artificial intelligence
The data-centric AI framework contains three goals: training data development, inference data development, and data maintenance.
Data-centric AI framework
A few months ago, Yann LeCun tweeted that ChatGPT was nothing new. In fact, none of the techniques used in ChatGPT and GPT-4 (Transformers, reinforcement learning from human feedback, etc.) are new. However, they did achieve results that were not possible with previous models. So, what is the reason for their success?
Training data development. The quantity and quality of the data used to train GPT models have improved significantly through better data collection, data labeling, and data preparation strategies.
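As a rough illustration of what a data preparation step can look like, the sketch below removes duplicates and very short documents from a small corpus. It is an assumption-laden toy, not OpenAI's pipeline: the function names, the `min_words` threshold, and the sample documents are all hypothetical.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def prepare(documents, min_words=5):
    """Drop duplicates (after normalization) and very short documents."""
    seen = set()
    kept = []
    for doc in documents:
        canonical = normalize(doc)
        if len(canonical.split()) < min_words:
            continue  # too short to be useful training text
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an equivalent document was already kept
        seen.add(digest)
        kept.append(doc)
    return kept

raw = [
    "Data quality matters as much as model size.",
    "data quality   matters as much as MODEL size.",
    "Click here",
    "Language models learn to predict the next token.",
]
cleaned = prepare(raw)
print(cleaned)
```

Real pipelines typically add near-duplicate detection and learned quality filters on top of simple rules like these.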
Inference data development. Since recent GPT models have become powerful enough, we can achieve various goals by tuning the prompts (that is, the inference data) while keeping the model fixed. For example, we can perform text summarization by providing the text to be summarized together with an instruction such as "summarize it" or "TL;DR" to guide the inference process.
Designing the right prompts for inference is a challenging task that relies heavily on heuristics. A good survey summarizes the different prompting methods. Sometimes, even semantically similar prompts can produce very different outputs. In this case, soft prompt-based calibration may be needed to reduce the variance.
Research on LLM inference data development is still in its early stages. In the near future, more inference data development techniques that have been used for other tasks may be applied to LLMs.
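The summarization example can be sketched as plain string construction: the model stays fixed and only the prompt changes. The helper name `build_prompt` and the sample text are hypothetical, and no model is actually called here.

```python
def build_prompt(text, instruction):
    """Combine the input text with an instruction; the model is untouched."""
    return f"{text.strip()}\n\n{instruction}"

article = "Data-centric AI focuses on improving the data used to build AI systems."

# Two semantically similar prompts for the same summarization goal.
prompt_a = build_prompt(article, "Summarize it:")
prompt_b = build_prompt(article, "TL;DR:")

print(prompt_a)
print(prompt_b)
```

Either string would then be sent to the model as-is, so any difference in the resulting summaries comes from the prompt wording alone.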
Data maintenance. As a commercial product, ChatGPT/GPT-4 is not trained just once but is continuously updated and maintained. Obviously, from outside OpenAI we have no way of knowing how this data maintenance is done. Therefore, we discuss some general data-centric AI strategies that have been or will likely be used for GPT models:
- Continuous data collection: When we use ChatGPT/GPT-4, our prompts and feedback may in turn be used by OpenAI to further improve their models. Quality metrics and assurance strategies may have been designed and implemented to collect high-quality data during the process.
- Data Understanding Tools: Various tools can be developed to visualize and understand user data, promoting a better understanding of user needs and guiding the direction of future improvements.
- Efficient data processing: With the rapid growth of the number of ChatGPT/GPT-4 users, an efficient data management system is needed to achieve rapid data collection.
The picture above is an example of ChatGPT/GPT-4 collecting user feedback through "likes" and "dislikes".
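One way to picture how "like" and "dislike" feedback could feed a quality metric is the sketch below. Everything in it is hypothetical (the `Feedback` record, the `like_rates` helper, the 0.75 review threshold, and the sample data); how OpenAI actually handles feedback is not public.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Feedback:
    prompt: str
    response: str
    liked: bool  # True for a "like", False for a "dislike"

def like_rates(records):
    """Return the fraction of liked responses for each prompt."""
    counts = defaultdict(lambda: [0, 0])  # prompt -> [likes, total]
    for record in records:
        counts[record.prompt][0] += int(record.liked)
        counts[record.prompt][1] += 1
    return {prompt: likes / total for prompt, (likes, total) in counts.items()}

records = [
    Feedback("Summarize this report", "Short summary A", True),
    Feedback("Summarize this report", "Short summary B", False),
    Feedback("Translate to French", "Bonjour", True),
]

rates = like_rates(records)
# Prompts whose responses are often disliked are flagged for closer review.
needs_review = [prompt for prompt, rate in rates.items() if rate < 0.75]

print(rates)
print(needs_review)
```

A simple aggregate like this could serve as one input to the data understanding tools mentioned above, for example to highlight which kinds of requests users are least satisfied with.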
The success of LLMs has revolutionized artificial intelligence. Going forward, LLMs can further revolutionize the data science lifecycle. We make two predictions:
Many tedious data science tasks can be carried out more effectively with the help of LLMs. For example, ChatGPT/GPT-4 already makes it possible to write working code to process and clean data. Additionally, LLMs can even be used to create training data. For example, using LLMs to generate synthetic data can improve model performance in text mining.