
Starting from GPT-3, continue to write Transformer's huge family tree


Recently, the large language model arms race has taken over most of my social feed, and plenty of articles already discuss what these models can do and what their commercial value is. However, as a young researcher who has been immersed in artificial intelligence for years, I care more about the technical principles behind this arms race and how these models are engineered to benefit people. Rather than asking how these models can make money and be productized for more people, what I want to explore is why all of this is happening, and what we researchers can do before AI replaces humans, so that we can "be replaced by AI and then retire with honor."

Three years ago, when GPT-3 caused an uproar in the technology world, I tried to trace the huge family behind GPT in a historical way: I sorted out the technical lineage behind GPT in chronological order (Figure 1) and tried to explain the technical principles underlying GPT's success. This year, ChatGPT, the younger son of GPT-3, seems even smarter and can communicate with people through chat, which has made many more people aware of the latest progress in natural language processing. At this historic moment, as AI historians, we should perhaps take a moment to look back at what has happened in recent years. The first article used GPT-3 as its starting point, so this series is really a chronicle of the post-GPT era. While exploring the changes in the GPT family, I realized that most of the stories revolve around the Transformer, which is why this article is named after the Transformer family.


Figure 1. The old GPT family tree

Previous review

Before we officially start introducing the Transformer family, let's review what happened in the past, following Figure 1. Starting from Word Embedding [1,2], a vector (a string of numbers) captures the semantics of text in a strange but effective way. Figure 2 illustrates this representation: relationships between words can be expressed with numbers (King - Man + Woman = Queen). Upon this foundation, the huge NLP (natural language processing) family was built.


Figure 2. Word2Vec illustration (King - Man + Woman = Queen)

After this, the eldest son, ELMo [3], discovered the importance of context. Consider the following two sentences:

"Oh! You bought my favorite Pizza, I love you so much!"

"Ah, I love you so much! Did you rub my favorite pizza on the ground?"

The meaning of "I love you so much" is obviously different. ELMo successfully solved this problem by "giving a model a string of words, and then asking the model to predict the next word and the previous word (context)."

At the same time, a distant cousin of Word Embedding noticed another problem: when people understand a sentence, they focus on certain words. One obvious symptom is that when we read in our native language, many typos slip right past us, because we are simply not attending to them while understanding the passage. This cousin therefore proposed the Attention mechanism [4]. At that early stage, however, Attention could not work on its own and had to be attached to sequence models such as RNNs and LSTMs. Figure 3 shows how the attention mechanism is combined with an RNN, and it also explains why Attention alone was not enough. Let's briefly walk through how such an NLP model works. First we have a sentence, say "I love you, China" (five characters in the original Chinese); these become x_1 through x_5 in Figure 3. Each character is then turned into the word embedding (a string of numbers) we just mentioned, h_1 through h_5 in Figure 3, and finally into the output, for example "I love China" in a translation task, which is x_1' through x_3' in Figure 3. The remaining part of Figure 3, labeled A, is the attention mechanism: it assigns a weight to each h, so that we know which words matter most when producing the current output word. For details, please refer to my earlier article (Starting from word2vec: GPT's huge family tree). As you can see, the numeric representation is the foundation of the whole task, which is why the Attention mechanism could not work alone.


Figure 3. An early photo: Attention and RNN, a powerful combination (source: Attention for RNN Seq2Seq Models (1.25x speed recommended) - YouTube)

At this point, Transformer, a proud direct descendant of the royal family, refused to accept this dependence on others. In the paper "Attention is All You Need" [5], it proposed its own way of standing alone: adding one word to "attention mechanism" to make it the "self-attention mechanism", so that the attention mechanism by itself could produce that string of numbers. Let's explain the change with a traditional Chinese medicine analogy. The original Attention mechanism essentially specified the dosage of each ingredient, but when you went to fill the prescription, the medicine itself was in the hands of a dispenser such as an RNN or LSTM, and of course the prescription also had to be written according to what that pharmacy (the RNN or LSTM) had in stock. What Transformer did was take back the right to dispense the medicine (by adding a value matrix) and change the way prescriptions are written (by adding key and query matrices). The Source can be seen as the cabinet of drawers in a Chinese pharmacy, where each drawer has an address, the Key (the drug name), and content, the Value (the drug itself). Given a query (the prescription), the goal is to retrieve the corresponding Value (the medicine) from the cabinet, and that retrieved value is the Attention value. Addressing is done by comparing the similarity between the Query and the Keys in the cabinet. It is called soft addressing because we do not take medicine from only one drawer; we may take a little from every drawer, and how much we take from each is determined by the similarity between the Query and that drawer's Key. The Values are then weighted and summed to obtain the final Value (a full dose of medicine), i.e., the Attention value. This is why many researchers quite reasonably regard the Attention mechanism as a special case of soft addressing [6].
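To make the pharmacy metaphor a bit more concrete, here is a minimal sketch of single-head self-attention in PyTorch. The variable names and sizes are my own illustration, not taken from the paper or any library.

```python
# A minimal sketch of (single-head) self-attention.
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) projection matrices."""
    Q = x @ W_q              # queries  ("the prescription we write")
    K = x @ W_k              # keys     ("the drawer labels in the pharmacy")
    V = x @ W_v              # values   ("the medicine stored in each drawer")
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of every query to every key
    weights = F.softmax(scores, dim=-1)             # "soft addressing": how much to take from each drawer
    return weights @ V                              # weighted sum of values = attention output

seq_len, d_model, d_k = 5, 16, 16
x = torch.randn(seq_len, d_model)                   # e.g. embeddings of a five-character sentence
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)              # (5, 16): one context-aware vector per token
```

Notice that nothing here depends on an RNN or LSTM: the queries, keys and values all come from the same input sequence, which is exactly the "self" in self-attention.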

From then on, Transformer officially began to lead the family to prosperity.

Transformer Succession

In fact, it can be seen from Figure 1 that Transformer is the most prosperous branch of the grandfather's family, which also confirms that the title "Attention is all you need" really was well founded. Although I have just described the self-attention mechanism it proposed, my previous article (Starting from word2vec: GPT's huge family tree) already covered the evolution of the Transformer in detail. Here is a quick review for new readers of what the Transformer architecture actually is.

Simply put, we can think of the Transformer as an "actor". For this actor, the encoder is like the actor's memory, responsible for converting the lines into an intermediate representation (an abstraction in the mind that we cannot quite name, i.e., the actor's understanding), while the decoder is like the actor's performance, responsible for turning that understanding into something shown on screen. The all-important self-attention mechanism acts as the actor's concentration: it automatically adjusts how much attention is paid to different positions, allowing a better understanding of all the lines and a more natural, fluent performance in different situations.

More concretely, we can think of the Transformer as a large "language processing factory". In this factory, each worker (an encoder) is responsible for processing one position in the input sequence (say, a word), transforming it and passing it on to the next worker (encoder). Each worker has a detailed job manual (the self-attention mechanism) describing how to process the input at the current position and how to relate it to the other positions. Every worker in this factory can handle their own task simultaneously, so the whole factory can process large amounts of input data efficiently.
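As a small illustration of this "factory" view, here is a sketch that stacks PyTorch's built-in encoder layers; the sizes below are arbitrary choices for illustration, not those of any particular model.

```python
# A sketch of the "factory": several encoder "workers" in a row, all positions processed in parallel.
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)   # six stacked "workers"

tokens = torch.randn(2, 10, d_model)    # a batch of 2 sentences, 10 token embeddings each
memory = encoder(tokens)                # every position is processed simultaneously
print(memory.shape)                     # torch.Size([2, 10, 512])
```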

When Transformer appeared on the scene, it seized the throne without any suspense thanks to its raw strength and its two ambitious sons (BERT and GPT). BERT (Bidirectional Encoder Representations from Transformers) [1] inherited the Encoder part of the Transformer and won the first half of the race, but its limited versatility eventually cost it against GPT. The steadfast GPT (Generative Pre-trained Transformer) [7-10] inherited the Decoder part, honestly learned from scratch, learned how humans communicate, and finally overtook in the second half.

Of course, Transformer's ambitions clearly did not stop there: "Attention is all you need" was never meant to apply only to NLP. Before getting into the grudges between GPT and BERT, let's first look at what their father has been up to.

New Genealogy - Many Princes

"Father, times have changed. Our family will achieve true glory because of my efforts."

——Transformer

Having understood the mechanism of the Transformer, we can now look at how far the Transformer family has grown on the back of its strong development (the new family tree). As the earlier "actor" example suggests, Transformer represents a way of learning that matches human logic, so it can process not only text but also images. Figure 4 summarizes the family's deep bench. Besides letting GPT and BERT keep breaking new ground in the original NLP (natural language processing) field, Transformer has also moved into computer vision, where its younger sons (such as ViT, proposed by Google) shine as well. In 2021, the Vision Transformer exploded, and a wave of Vision-Transformer-based work swept across computer vision tasks. Naturally, members of a family keep in touch, and CLIP, which connects text and images (the basis of AI painting), was born; at the end of 2022, Stable Diffusion was all the rage even before ChatGPT. CLIP also opened the door to multimodality for the Transformer family: beyond words and images, can words also make music, can they also draw pictures? Multimodal and multi-task Transformers duly emerged. In short, every field now has its own prince; a Transformer that started from nothing in NLP has become a "King of Zhou" able to enfeoff princes after years of hard growth.

It is a prosperous time when there are many princes.


Figure 4. The increasingly prosperous family tree of the Transformer family

A little test - Vision Transformer [12]

Before talking about GPT, we still have to mention Transformer's first bold attempt: sending its youngest son into the CV field. Let's first look at this younger son's life:

  • His father, Transformer, was born in 2017 in a paper called "Attention is All You Need".
  • In 2020, Google proposed the Vision Transformer (ViT), an architecture that can process images directly without any convolutional layers (CNN). The paper title is as blunt as ever: "An image is worth 16x16 words". As shown in Figure 5, its basic idea is to split the input image into a series of small patches, each of which can be thought of as a "word" in the way earlier models treated text; these patches are then converted into vectors and handled exactly the way an ordinary Transformer handles text. If, in natural language processing (NLP), the Transformer's attention mechanism tries to capture relationships between different words in a text, then in computer vision (CV), ViT tries to capture relationships between different parts of an image.


Figure 5. How ViT processes images (source: Are Transformers better than CNN's at Image Recognition? | by Arjun Sarkar | Towards Data Science)
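To make the "an image is worth 16x16 words" idea concrete, here is a minimal sketch of ViT-style patch embedding; the hyperparameters are the common ViT-Base choices and are used purely for illustration.

```python
# A 224x224 image becomes a sequence of 14x14 = 196 "visual words" plus a [CLS] token.
import torch
import torch.nn as nn

patch, d_model = 16, 768
# A stride-16 16x16 convolution is the standard trick for "cut into patches + linear projection".
to_patches = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
pos_embed = nn.Parameter(torch.zeros(1, 14 * 14 + 1, d_model))

image = torch.randn(1, 3, 224, 224)                 # one RGB image
x = to_patches(image)                               # (1, 768, 14, 14)
x = x.flatten(2).transpose(1, 2)                    # (1, 196, 768): a sequence of patch embeddings
x = torch.cat([cls_token, x], dim=1) + pos_embed    # prepend [CLS], add position information
# x can now be fed to a standard Transformer encoder, exactly like a sentence of 197 "words".
```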

After that, Transformer-based models emerged one after another and achieved results beyond CNNs on their respective tasks. So what are the Transformer's advantages? Let's go back to the movie example and look at the difference between a Transformer and a CNN:

Imagine you are a director. To shoot a scene you need to position the actors and place different elements appropriately, for example setting the actors against a suitable background and using suitable lighting so that the whole frame looks harmonious and beautiful. A CNN is like a professional cinematographer: it captures each frame pixel by pixel, first extracting low-level features such as edges and textures, then combining them into higher-level features such as faces and actions, and finally producing a frame. As the movie progresses, the CNN repeats this process until the whole film is shot.

ViT, by contrast, is like an art director: it treats the whole frame as one piece, taking the background, lighting, color and other factors into account, assigning each actor the right position and angle, and composing a complete picture. ViT then aggregates this information into a vector and processes it with a multi-layer perceptron to obtain the frame. As the film progresses, ViT repeats this process until the whole film is created.

Returning to image processing, suppose we have a 224x224 pixel picture of a cat and want to classify it with a neural network. A traditional convolutional neural network would apply multiple convolution and pooling layers to gradually shrink the image, finally obtaining a small feature vector that is classified by a fully connected layer. The problem is that during convolution and pooling we gradually lose information, because we can never consider the relationships among all pixels at once; and the fixed order of convolution and pooling layers prevents global information exchange. In contrast, if we process the image with a Transformer and its self-attention mechanism, we can treat the whole image as a sequence and run self-attention over it directly. This approach preserves relationships between distant pixels and allows global information to flow.

Moreover, since the self-attention computation is parallelizable, we can process the whole image at once, which greatly speeds things up. For example, suppose we have the sentence "I like to eat ice cream", which contains 6 words. With a model based on the self-attention mechanism, the Transformer can:

  • Keep the per-layer computational complexity low: in a self-attention model we only need to compute attention weights between every word and every other word, so the per-layer cost scales roughly as the square of the sequence length times the hidden dimension (O(n²·d)), compared with O(n·d²) for a recurrent layer. When the sequence is shorter than the hidden dimension, as with our 6-word sentence, self-attention is the cheaper option.
  • Maximize the amount of parallelizable computation: a self-attention model can compute the attention weights between every word and all other words at the same time, so the calculation is highly parallelizable, accelerating both training and inference.

However, ViT needs large-scale datasets and high-resolution images to reach its full potential, so although Vision Transformers excel in CV, CNNs are still more widely applied and studied, and retain an advantage in tasks such as object detection and segmentation.

But that doesn't matter; you have done well enough, and your father's original intention in entering CV was never to replace the CNN. He had a more ambitious goal.

The basis of this goal is the “in addition” I mentioned earlier.

First appearance - CLIP [13]

As I said before, Transformer has a more ambitious goal: the "big model", the super-duper-large model. Besides capturing global information better, as discussed above, lower computational complexity and better parallelism became the foundation for supporting large models.

In 2021, while Vision Transformer was making great strides, the GPT team was still busily preparing GPT-3.5. The tireless model worker, Transformer, meanwhile led a new wave: connecting text and images. This wave also fired the first shot of the "big model" campaign outside NLP. At this point, Transformer's weakness in visual tasks turned into a strength here: "ViT requires large-scale datasets and high-resolution images to reach its full potential" can be restated as "ViT can handle large-scale datasets and high-resolution images".

As usual, let's first talk about what CLIP is.

CLIP's full name is Contrastive Language-Image Pre-Training, and its basic idea is, obviously, contrastive learning from the traditional CV field. When we learn something new, we read different books and articles and gather a lot of information. But we don't simply memorize every word and sentence of every book; instead, we try to find similarities and differences among the pieces of information. For example, we might notice that the way a topic is described, and the key concepts presented, differ from book to book, yet the concepts themselves are essentially the same. This way of finding similarities and differences is one of the basic ideas of contrastive learning. We can regard each book or article as a different sample, and books or articles on the same topic as different instances of the same class. Contrastive learning trains the model to distinguish samples from these different classes, thereby learning their similarities and differences.

Next, a little more academically, let’s say you want to train a model to identify car brands. You could have a set of labeled images of cars, each with a brand label, such as "Mercedes-Benz", "BMW", "Audi", etc. In traditional supervised learning, you feed the image and brand label together into the model and let the model learn how to predict the correct brand label.

In contrastive learning, however, you can train the model with unlabeled images. Given a set of unlabeled car photos, you divide them into positive samples and negative samples: positive samples are images of the same brand from different angles, while negative samples are images of different brands. You then train the model so that positive samples of the same brand sit closer together and negative samples of different brands sit further apart. In this way the model learns to extract brand-specific features from images without ever being told the brand label of each image.

Obviously, this is self-supervised learning. CLIP is a similar self-supervised model, except that its goal is to connect language and images so that a computer can understand the relationship between text and pictures.

Imagine you are learning a set of vocabulary lists where each word has its definition and corresponding image. For each word and its corresponding image, you can think of them as a pair. Your task is to find the correlation between these words and images, i.e. which words match which images and which do not.

As shown in Figure 6, for the contrastive learning algorithm, these word and image pairs are the so-called "anchor" (anchor sample) and "positive" (positive sample). "anchor" refers to the object we want to learn, and "positive" is the sample that matches "anchor". The opposite is "negative" (negative sample), that is, a sample that does not match the "anchor".

In contrastive learning, we pair the "anchor" with its "positive" and learn to pull them together, and we pair the "anchor" with "negatives" and learn to push them apart. The process can be understood as finding the similarities between anchor and positive while rejecting the apparent similarities between anchor and negative.


Figure 6. Illustration of contrastive learning [14]. The anchor is the original image; positives are usually crops or rotations of the original image, or known images of the same class; negatives can be defined simply and crudely as unknown images (which might even belong to the same class), or as known images from different classes.

To achieve this goal, CLIP first pre-trains on a huge number of image-text pairs and then applies the pre-trained model to downstream tasks such as classification, retrieval and generation. CLIP processes text and images simultaneously and learns, through training, how to connect them. It uses a Transformer-based text encoder and an image encoder (a CNN or a ViT), trained jointly, and then computes the similarity between the image and text embeddings. CLIP learns to associate images and text with a contrastive objective that maximizes the agreement between image-text pairs that actually occur together in the data and minimizes the agreement between randomly paired images and texts.
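As a rough illustration of that contrastive objective, here is a simplified sketch of a CLIP-style symmetric loss. The encoders themselves are assumed to exist elsewhere; only the loss structure follows the description above.

```python
# A simplified CLIP-style symmetric contrastive loss over a batch of matched image-text pairs.
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """image_features, text_features: (N, d) embeddings for N matched image-text pairs."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(len(logits))                         # the i-th image matches the i-th text
    loss_i = F.cross_entropy(logits, targets)        # pull matched pairs together (image -> text)
    loss_t = F.cross_entropy(logits.t(), targets)    # and in the other direction (text -> image)
    return (loss_i + loss_t) / 2

# Usage: feed a batch of paired embeddings produced by the two encoders.
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_loss(img, txt))
```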


Figure 7. CLIP illustration [13]. Compared with Figure 6, it can be loosely understood as a setup where the positives and negatives are both texts.

For example, if we want to use CLIP to determine whether a picture shows a "red beach", we can feed in this text description together with the picture, and CLIP produces a pair of vectors representing their connection. If the distance between the two vectors is very small, the picture is likely to be a "red beach", and vice versa. With this approach, CLIP enables tasks such as image classification and image search.

Coming back to the full name, the last word in CLIP is Pre-training, so it is essentially still a pre-trained model, but one that can serve all kinds of downstream tasks involving matching images with text, such as image classification, zero-shot learning and image captioning. For example, CLIP can classify images against categories given as natural-language labels, such as "a photo of a dog" or "a landscape". CLIP can also be used to caption images, by conditioning a language model on the image features CLIP extracts, and to generate images from text, by conditioning a generative model on the text features CLIP extracts.

DALL-E & Stable Diffusion

With the help of CLIP, a new prince has risen - his name is AIGC (AI generated content). In fact, ChatGPT is essentially a type of AIGC, but in this section, we are mainly talking about AI painting. Let’s first take a look at the development history of the small family of AI painting:

  • In 2021.01, OpenAI released DALL-E [15] (an AI painting model), which modified GPT-3 so that it generates images instead of text (an image Transformer network).
  • Almost at the same time (2021.01), OpenAI released CLIP [13]
  • In 2021.12, the CompVis group at LMU Munich (later joined by Stability AI and Runway) published the latent diffusion work that grew into Stable Diffusion [17], which has kept iterating new versions. It uses a frozen CLIP text encoder to condition the model on text prompts. Stable Diffusion breaks image generation into an iterative "diffusion" process at inference time: starting from pure noise, it gradually removes the noise, steering the image ever closer to the provided text description.
  • In 2022.04, DALL-E 2 [16] was released. It can create realistic images and artwork from natural-language descriptions. DALL-E 2 uses a two-part model consisting of a prior and a decoder: the prior maps a text prompt to a CLIP image embedding, and the decoder is a diffusion model that generates an image conditioned on that embedding. DALL-E 2 can also perform outpainting, inpainting and variations of existing images.

The lineage of this branch is clear: the eldest brother CLIP connected images and text, his twin brother DALL-E seized the moment to propose the text-to-image task, a distant cousin, Stable Diffusion, improved the image-generation algorithm for that task, and finally DALL-E 2 learned from all of them, combining the strengths of CLIP and the diffusion approach to complete its own AI painting system.

For the original DALL-E, assume you are a painter, and DALL-E is your toolbox. In this metaphor, there are two main tools in the toolbox: one is the brush and the other is the palette.

Brush is DALL-E's decoder that converts a given text description into an image. The palette is DALL-E's encoder, which can convert any text description into a feature vector.

When you get a text description, you will first use the color palette to generate a feature vector. You can then take your paintbrush and use the feature vectors to generate an image that matches the description. You'll use a finer brush when you need detail, and a coarser brush when you don't.

Unlike a human painter, DALL-E uses neural networks in place of the brush and palette. The network is an image Transformer: the text is tokenized, the image is compressed by a discrete VAE into a grid of image tokens, and a GPT-3-style decoder-only Transformer is trained to predict the image tokens autoregressively after the text tokens. At generation time, DALL-E decodes candidate image-token sequences that match the input text (for example with beam-search-style decoding, as described below) and feeds them to the VAE decoder to produce the final images. Note that DALL-E does not contain CLIP itself; CLIP is used afterwards to rank the candidate images by how well they match the text.

As for the beam search used during generation, it is essentially a greedy-style search algorithm that finds a good sequence within a limited set of candidates. The basic idea is that each time the current sequences are extended, only the k candidates with the highest probability are kept (k is called the beam width) and the other low-probability candidates are discarded. This shrinks the search space and improves efficiency and accuracy. The concrete steps for generating an image with beam search in DALL-E are as follows (a minimal sketch of the algorithm follows the list):

  • Encode the input text description into vectors and use them as the initial input of the Transformer model.
  • Generate the image sequence token by token, starting from a special start symbol. Each time a token is generated, the Transformer predicts the probability distribution of the next token, and the k most probable candidates are kept as extensions of the current sequences.
  • For each extended sequence, compute its cumulative probability; keep the k sequences with the highest probability and discard the rest.
  • Repeat steps 2 and 3 until a special end symbol is generated or the maximum length is reached.
  • Return the sequence with the highest probability and decode it into the final image.
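Here is the minimal beam-search sketch promised above. `log_prob_fn` is a hypothetical stand-in for a next-token model (DALL-E would use its Transformer over image tokens here), and the toy model at the end exists only to make the function runnable.

```python
# A generic beam-search sketch over any next-token model.
import math

def beam_search(log_prob_fn, start_token, end_token, beam_width=3, max_len=20):
    beams = [([start_token], 0.0)]                       # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:                     # finished sequences are carried over unchanged
                candidates.append((seq, score))
                continue
            for token, logp in log_prob_fn(seq):         # expand with every possible next token
                candidates.append((seq + [token], score + logp))
        # keep only the k highest-scoring sequences, discard the rest
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0]                                      # the most probable sequence found

# Toy usage with a fake "model" that always prefers token 1.
def toy_model(seq):
    return [(1, math.log(0.6)), (2, math.log(0.3)), ("<end>", math.log(0.1))]

print(beam_search(toy_model, "<start>", "<end>"))
```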

How does Stable Diffusion paint the same picture? When we want to create a work of art, we usually need a good composition and some concrete elements to build it from. Stable Diffusion splits image generation into two parts: a diffusion process and a reconstruction (denoising) process. Think of the diffusion process as mixing a pile of scattered brushes, paints and a canvas: more and more elements slowly appear on the canvas, and during this process we neither know what the final picture will look like nor where each element will end up, but we can keep adding and adjusting elements until the painting is complete. The input text description is like a rough brief of the work we want, used throughout to steer the match between the description and the image being generated; the process resembles continuously revising and adjusting elements so that they better fit the picture we have in mind. In the end, the generated image closely matches the text description, rendering the artwork we imagined.

As shown in Figure 8, the diffusion model here is a generative model that learns the data distribution by gradually adding noise to the data and then learning to reverse that process to recover the original data. Stable Diffusion uses a pre-trained variational autoencoder (VAE) to encode images into low-dimensional latent vectors, and a denoising network with attention (a UNet with Transformer-style cross-attention blocks) to generate images from those latents. It also uses a frozen CLIP text encoder to turn text prompts into embeddings that condition the diffusion model.


Figure 8. The Stable Diffusion process. The upper arrow: noise is added to a picture step by step until it becomes pure noise. The lower arrow: the noise is then removed step by step to reconstruct the original picture. (Source: From DALL·E to Stable Diffusion: how do text-to-image generation models work? | Tryolabs)
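To make the two arrows in Figure 8 concrete, here is a heavily simplified DDPM-style sketch. `noise_predictor` is a hypothetical trained model; real systems such as Stable Diffusion run this loop in the VAE latent space and condition the predictor on a text embedding.

```python
# Upper arrow: forward noising. Lower arrow: reverse denoising from pure noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def forward_noise(x0, t, noise):
    """Upper arrow: jump straight to step t by mixing the image with Gaussian noise."""
    return alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * noise

@torch.no_grad()
def sample(noise_predictor, shape):
    """Lower arrow: start from pure noise and remove a little noise at each step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = noise_predictor(x, t)              # model's guess of the noise contained in x
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn(shape)   # fresh randomness: why each run differs
    return x
```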

It is worth noting that the diffusion process in Stable Diffusion is stochastic, so the generated image is different each time, even for the same text description. This randomness makes the outputs more diverse, but it also adds uncertainty. To make generation more stable, Stable Diffusion relies on techniques such as a carefully scheduled, gradually increasing noise level during the forward process and many denoising steps during reconstruction to improve image quality.

Compared with DALL-E, Stable Diffusion made considerable progress:

  • Resolution: Stable Diffusion can generate images of up to 1024×1024 pixels, while the original DALL-E generated 256×256 pixel images.
  • Speed: Stable Diffusion needs many iterative denoising steps to produce an image and is therefore slower, while DALL-E produces its image in a single decoding pass.
  • Flexibility: Stable Diffusion can outpaint, inpaint and modify existing images, while DALL-E can only generate images from text prompts.
  • Realism: Stable Diffusion can produce more realistic and detailed images, especially for complex and abstract descriptions, whereas DALL-E sometimes generates images that violate physics or common sense.

This is also why DALL-E 2 incorporated a diffusion model into its design.

The hidden powerhouse - GPT-3.5 [18] & InstructGPT [19]

While the other princes were busy with their reforms, the GPT team had also been quietly working hard. As mentioned at the beginning, GPT-3 already had strong capabilities when it was released, but the way it had to be used was not exactly friendly to non-technical people, so the waves it made stayed within the technical community, which was not particularly excitable to begin with, and even that enthusiasm gradually faded because of its high usage fees.

Transformer was very dissatisfied. GPT thought it over: time to reform!

The first one to respond to the call for reform and take the first step was GPT 3.5:

## "I'm stupid and can't think of any good way to reform, so let's lay a solid foundation first."

So GPT-3.5 built on GPT-3 and used training data known as Text Code: on top of text data it added programming-code data. Put simply, it trained on a larger dataset. This lets the model understand and generate code better, increasing its diversity and creativity. Text Code is text-and-code training data collected and curated from the web by OpenAI. It has two parts: text and code. Text is content described in natural language, such as articles, comments and conversations; code is content written in programming languages such as Python, Java and HTML.

Text Code training data allows the model to understand and generate code better, improving its diversity and creativity. In programming tasks, for example, the model can generate code from a text description with good correctness and readability; in content-generation tasks, it can generate text from a code description with good consistency and even humor. Text Code data also helps the model handle multi-lingual, multi-modal and multi-domain data and tasks: in translation tasks it can translate accurately and fluently based on correspondences between languages, and in image-generation tasks it can generate images from text or code descriptions with good clarity and fidelity.

The second person to respond to the call was Instruct GPT, who discovered a new problem:

"If we want to be at one with human beings, we need to listen to their opinions more effectively."

Thus the famous new foreign aid arrived: the RLHF training strategy. RLHF stands for Reinforcement Learning from Human Feedback. Its core idea is to give the model instructions during training and reward or penalize it based on its outputs, so that the model learns to follow instructions better and becomes more controllable and trustworthy. In fact, GPT-3.5 also used human feedback; so what changed once reinforcement learning was added?

  • GPT-3.5's human feedback was used directly to fine-tune the model's parameters, whereas InstructGPT's RLHF uses the feedback to train a reward model, and then uses that reward model to guide the policy's behavior.
  • GPT-3.5's human feedback was based on evaluating single outputs, while InstructGPT's RLHF is based on comparisons among multiple outputs.
  • GPT-3.5's human feedback was applied once, while InstructGPT's RLHF can be iterated many times, continually collecting new comparison data, training new reward models and optimizing new policies.

In other words, less human labor is required, yet it brings greater benefit to the model.


Figure 9. The RLHF process (source: GPT-4 (openai.com))

As shown in Figure 9, the RLHF training strategy has two stages: pre-training and fine-tuning. In the pre-training stage, the model learns the basic knowledge and rules of language through unsupervised learning on the same kind of data as GPT-3. In the fine-tuning stage, the model uses human-labeled data and reinforcement learning to learn how to produce appropriate outputs given instructions.

The human-labeled data consists of two parts: instructions and feedback. Instructions are tasks described in natural language, such as "Write a poem about spring" or "Tell me a joke about a dog". Feedback is provided by human annotators who score or rank the model's outputs, for example grading them from "1" (poor) to "5" (excellent), and it reflects the quality and reasonableness of those outputs.

In the fine-tuning stage, the model is trained with a reinforcement-learning algorithm of the Actor-Critic family. The Actor is the generator, producing outputs from instructions; the Critic is the evaluator, estimating the reward of those outputs from the feedback signal. Actor and Critic cooperate and compete, continually updating their parameters to increase the reward.
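As an illustration of the comparison-based feedback described above, here is a sketch of the pairwise reward-model loss commonly used in RLHF pipelines. `reward_model` is a hypothetical scoring network, and the real InstructGPT pipeline is considerably more involved.

```python
# Pairwise (Bradley-Terry style) reward-model objective: prefer the human-chosen answer.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_batch, rejected_batch):
    """chosen_batch / rejected_batch: encodings of the preferred and dispreferred
    responses to the same prompts, shape (N, ...)."""
    r_chosen = reward_model(chosen_batch)        # (N,) scalar rewards
    r_rejected = reward_model(rejected_batch)    # (N,)
    # maximize the margin between the human-preferred and the rejected answer
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then scores the policy's outputs during the reinforcement-learning phase (the Actor-Critic / PPO step), replacing per-sample human labels.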

The RLHF strategy makes the model follow instructions better and improves its controllability and trustworthiness. In writing tasks, for instance, the model can generate texts of different styles and topics as instructed, with good coherence and logic; in dialogue tasks, it can generate replies with different emotions and tones, with good relevance and politeness.

Finally, after the reforms and the accumulation of its predecessors, ChatGPT, the nimbler younger son of the GPT family, judged that the time was right. Building on InstructGPT, it launched a dialogue mode much closer to the way humans communicate, and it set off a huge wave in human society (hundreds of millions of users), for free. After several years of lying low, the GPT family finally made a splash, became the most favored prince of the Transformer family, leapt straight into the succession battle and won the title of crown prince.

At the same time, for ChatGPT, being crown prince is not everything. ChatGPT has inherited the Transformer's enormous ambition:

"The current situation It’s too chaotic. A powerful dynasty doesn’t need so many princes. It’s time to unify them. “

Unify the princes – Big Model Era

GPT-4: "This era is the era of large models. I said so." (just kidding)

Today's ChatGPT is already based on GPT-4. Because OpenAI is wary of how quickly competitors might respond, most of GPT-4's technical details remain closed. But from its capabilities you can already see the GPT family's ambition to unify the princes: beyond text dialogue, GPT-4 added the ability to understand images. Having learned from its years of dormancy that big models are justice, the GPT family now wants to extend that truth to every field.

If we dig into the reasoning behind that principle, it probably comes down to how large models are trained. GPT-3 is one of the largest language models to date, with 175 billion parameters, 100 times more than its predecessor GPT-2 and 10 times more than the previously largest comparable NLP model; it can reasonably be regarded as the pioneer of large pre-trained models.

So, let’s first take a look at how GPT-3’s model architecture and training methods achieve such scale and performance:

  • Distributed training: GPT-3 uses distributed training, meaning the model and data are spread across multiple compute nodes that coordinate and synchronize through communication protocols. This exploits the compute and memory of many nodes, speeds up training and supports larger models and datasets.
  • GPT-3 used roughly 2,000 GPU nodes for distributed training; each node has multiple GPUs, each with the same amount of memory.
  • GPT-3 combines two forms of distributed training: data parallelism and model parallelism.
  • Data parallelism splits the data into subsets; each node processes one subset and updates its copy of the model's parameters, which are then synchronized across all nodes.
  • Model parallelism splits the model into parts; each node computes the outputs and gradients for its part, and those outputs and gradients are passed among the nodes.
  • GPT-3 uses a hybrid of the two: data parallelism within each node and model parallelism across nodes. This makes full use of GPU compute and communication bandwidth while reducing communication overhead and memory usage.
  • Activation checkpointing: during the forward pass, only the activations of selected layers are saved rather than every layer's. This saves GPU memory, since activations account for most of it. During the backward pass, whenever the activations of an unsaved layer are needed, they are recomputed instead of being read from memory. Some computation time is traded for memory, allowing larger models and batch sizes (a minimal sketch appears after this list).
  • Sparse attention: when computing self-attention, the model attends over only part of the input sequence rather than all of it. This reduces computation and memory, because self-attention's cost scales quadratically with sequence length. GPT-3 uses a sparse pattern based on local windows and global blocks: the input sequence is divided into blocks, and each block attends only to a few neighboring blocks plus some selected global blocks. This lets the model capture both local and global information while reducing complexity and memory usage.
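To illustrate the activation-checkpointing idea from the list above, here is a minimal PyTorch sketch; the layer sizes are arbitrary and the small block is just a stand-in for a Transformer layer.

```python
# Activation checkpointing: the block's intermediate activations are not stored during the
# forward pass; they are recomputed during backward, trading compute for memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)   # activations inside `block` are not kept
loss = y.pow(2).mean()
loss.backward()                                 # recomputation of the block happens here
```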

Seeing this, ChatGPT frowned slightly, seemingly dissatisfied with the GPT-3 plan: "This is not enough."

"Large models are indeed the current trend, but we should not blindly pursue scale just for the sake of competition. Before training a large model, we need to consider more details and technical challenges to ensure that it can run stably and efficiently, and Produce useful results."

"First of all, choosing appropriate training hyperparameters and model initialization is very critical. The selection of hyperparameters such as learning rate, batch size, and number of iterations is critical to the convergence of the model Speed, stability and performance have a major impact. Model initialization determines the weight values ​​before training starts, which will affect the quality of the final results. These parameters need to be carefully adjusted based on empirical experiments or theoretical analysis to ensure the best performance of the model .”

"Secondly, in order to obtain high throughput and avoid bottlenecks, we need to optimize various aspects of the training process, such as hardware configuration, network bandwidth, data loading speed, model architecture, etc. Optimizing these aspects can significantly improve The processing speed and efficiency of the model. For example, using a faster storage device or data format can reduce data loading time; using a larger batch size or gradient accumulation can reduce communication overhead; using a simpler or sparser model can reduce calculation time Etc."

"Finally, when training large models, you may encounter various instability and failure situations, such as numerical errors, overfitting, hardware failures, and data quality issues. And so on. In order to avoid or recover from these problems, we need to closely monitor the behavior and performance of the model and use debugging tools and techniques to identify and fix any errors or defects. In addition, we can also use various safety measures and protection mechanisms, such as Clipping, regularization, discarding, noise injection, data filtering, data enhancement, etc. to improve the robustness and reliability of the model."

"In this era, large models are indeed is important, but simply pursuing scale will not allow the model to produce useful results. Only through thoughtful training and optimization can large models truly realize their potential and bring more value to humans."

The crown prince is right.

The decline of powerful princes - BERT

In the end, a starved camel is still bigger than a horse. Although BERT has recently been overshadowed by GPT, it remains a powerful prince, and even under GPT's unstoppable rise BERT still keeps its own fiefdom. When people talk about natural language processing models, BERT (Bidirectional Encoder Representations from Transformers) was once extremely popular because it performed so well on so many tasks. When it was first released it was nearly unbeatable, even more successful than GPT. That is because BERT was designed with different goals and strengths.

BERT's goal is to push context modeling to a whole new level, to better support downstream tasks such as text classification and question answering. It achieves this by pre-training a bidirectional Transformer encoder, for example by predicting masked words using the context on both sides. Because the encoder can look at both the left and right sides of the input sequence at once, it produces a better contextual representation, so BERT models context better and performs better on downstream tasks.
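As a quick illustration of that bidirectional masked-language-model objective, here is a sketch using the Hugging Face `transformers` pipeline, assuming the library is installed and the public `bert-base-uncased` checkpoint can be downloaded.

```python
# BERT sees the words on BOTH sides of [MASK] at once, the "context from left and right" described above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("I bought my favorite [MASK] and I love you so much!"):
    print(prediction["token_str"], round(prediction["score"], 3))
```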

Over time, however, the GPT series caught up, and GPT-3 surpassed BERT on many tasks. One likely reason is that the GPT models are designed with generative tasks in mind, such as text generation and dialogue, while BERT focuses more on classification and question answering. The GPT models also use far more parameters and far more training data, which lets them achieve better performance across a wider range of tasks.

Of course, BERT is still a very useful model, especially for tasks that classify text or answer questions, while the GPT series is better suited to generative tasks such as text generation and dialogue systems. Both have their own strengths and limitations, and we should choose the appropriate model for the task at hand.

The battle for the direct descendant - the menacing Segment Anything Model (SAM) [20]

As mentioned before, while the elder brother GPT was working hard in silence, the model worker Transformer stirred things up in both the CV field (ViT) and the multimodal field (CLIP). In the end, though, both served as stepping stones: their experience was handed by old father Transformer to the favored prince GPT, culminating in the so-called grand unification of GPT-4.

ViT and CLIP, with Transformer blood in their veins, were naturally unhappy: "Are princes and generals born to their rank? If the eldest brother can learn from us, we can learn from him too."

"However, he is too powerful in the field of NLP. We need to find a new battlefield."

So, SAM was born. On the official website, they describe it like this:

Segment Anything Model (SAM): a new AI model from Meta AI that can "cut out" any object, in any image, with a single click

Simply put, we can think of SAM as an efficient "image-cropping master" that can precisely identify and segment objects in an image from a variety of input prompts. For example, when we click a point in the image with the mouse, SAM cuts out the object at that point like an experienced painter; when we type the word "cat", SAM finds and cuts out every cat in the image like a clever detective; and when we give SAM a detection box, it precisely cuts out the object inside the box like a skilled surgeon. SAM's zero-shot generalization ability makes it a true "universal cropping master": whether the objects are common ones like cars, trees and buildings, or rare ones like dinosaurs, aliens and magic wands, SAM can identify and cut them out effortlessly. This powerful capability comes from its advanced model design and enormous dataset. I picked four rather complex scenes from the original paper (Figure 10) to illustrate what SAM can do.


Figure 10. Examples of SAM's results. Every object in the picture can be extracted and edited (each shown in a different color), which is like having an efficient Photoshop master (image-editing master) on call.
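To show what "a single click" looks like in code, here is a sketch based on my reading of Meta's segment-anything repository; the checkpoint path and the dummy image are placeholders, so treat the exact API as an assumption and check the official repo before relying on it.

```python
# Prompting SAM with one foreground click (sketch; paths and image are placeholders).
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # hypothetical local checkpoint file
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for an RGB image loaded with OpenCV/PIL
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),   # one click at pixel (x=320, y=240)
    point_labels=np.array([1]),            # 1 = foreground point, 0 = background point
    multimask_output=True,                 # return several candidate masks for an ambiguous click
)
print(masks.shape, scores)                 # e.g. (3, 480, 640) boolean masks with confidence scores
```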

To put it simply: in the past, when someone came to us excitedly with a request, we always had to ask, somewhat helplessly, "Hold on, what kind of data can you provide?" Now that is no longer necessary, at least in the CV field, which brings things much closer to what non-technical people imagine AI to be.

To realize the powerful capabilities described above, let's see how ViT and CLIP openly conspired:

ViT: "Although I mainly did image classification before, my architecture is also suitable for image segmentation, because I use the Transformer to break an image into a series of patches and process them in parallel. If SAM builds on my strengths, it can inherit my parallel processing and global attention and achieve efficient image segmentation."

CLIP: "Okay, then I will take my joint training method to invest. Based on this idea, SAM can also handle different types of input prompts (question prompts and visual prompts)."

And so the SAM model architecture took shape (Figure 11), with a ViT as the image encoder and CLIP used to encode the text prompts. The idea is good, but how to pull it off? Learn from the big brother, of course!

"We want to use pre-trained language models for image segmentation tasks, just like using text prompts (prompt) to let the language model generate or predict text. With CLIP, Our hints can be very rich, which can be some points, boxes, masks, and Text, which tell the language model what to segment in the image. Our goal is, given For any prompt, you can get a valid segmentation mask (segmentation result). A valid mask means that even if the prompt is ambiguous (for example, a shirt or a person), the output should be a reasonable mask for one of the objects. This is like the big brother GPT (Language model) can also give a coherent response to an ambiguous prompt. We choose this task because it allows us to pre-train the language model in a natural way and achieve zero-shot transfer through prompts to different segmentation tasks."


Figure 11. SAM model architecture

As for the results, the powerful capabilities described earlier confirm the feasibility of the idea. It must be said, however, that although SAM no longer needs task-specific retraining, it still has limitations, much as ChatGPT did when it first launched. In the Limitations section of the paper, the authors clearly point out shortcomings in details, connectivity and boundaries, as well as challenges in interactive segmentation, real-time operation, text prompts, and semantic and panoptic segmentation, while also acknowledging the advantages of some domain-specific tools.

For example, I ran two simple tests in the demo: one was lesion detection in medical images, where the lesions were too small and hard to pick out; the other was portrait cutout, where the result looks good at first glance but the hair is still not very natural, and cut marks are visible on close inspection.

Still, this is a good start. These two have only just set up shop and are still working hard; what more could you ask for? So let's wait and see how this battle over the family inheritance plays out!

Summary

The huge Transformer family obviously cannot be covered in a single article. Looking at the Transformer-based results, we can see continuous innovation in the field: Vision Transformer (ViT) demonstrates the successful application of the Transformer to computer vision, processing image data directly without hand-crafted feature engineering; DALL-E and CLIP apply the Transformer to image generation and image classification, showing its strength in visual-semantic understanding; Stable Diffusion proposes a diffusion process that models probability distributions and can be applied to image generation and editing. Together, these results reveal the broad application prospects of the Transformer, and we have to admit that one day, perhaps, "Attention really is all you need."

In short, we can see from these results the vitality of continued innovation in the field of artificial intelligence. Whether it is GPT or BERT, or Vision Transformer, DALL-E, CLIP, Stable diffusion, etc., these achievements represent the latest progress in the field of artificial intelligence.

As for the final exam (ChatGPT), the current situation is roughly this:

The true top students attended every class this semester; when they open the book they can recall the teacher's voice and expression while explaining each knowledge point, and they are already planning next semester's study schedule.

The pseudo-top students came to class every day and sat in the front row, but open the textbook and they are lost; they start cramming alongside the scumbags, "one book a day, one semester a week". The only difference is that their textbooks are not brand new and they retain a faint memory of the contents, so it does not quite count as learning everything from scratch.

As for the real scumbags...

"Knowledge comes, knowledge comes, knowledge comes from all directions"

In fact, I think that whether you are a pseudo-top student or a scumbag, you should stay calm before the final exam: review what was taught this semester, borrow notes from the top students, and perhaps even choose to defer the exam. For the top students, speed comes naturally; for the pseudo-top students and the scumbags, haste only does harm.

In the competition in the field of artificial intelligence, continuous innovation is crucial. Therefore, as researchers, we should pay close attention to the latest developments in this field and maintain a humble and open mind to promote the continuous progress of the field of artificial intelligence.

