
Transformers Review: From BERT to GPT4


Artificial intelligence has become one of the most talked about topics in recent years, and services that were once considered purely science fiction are now becoming a reality thanks to the development of neural networks. From conversational agents to media content generation, artificial intelligence is changing the way we interact with technology. In particular, machine learning (ML) models have made significant progress in the field of natural language processing (NLP). A key breakthrough is the introduction of "self-attention" and the Transformers architecture for sequence processing, which allows several key issues that previously dominated the field to be solved.


In this article, we will look at the revolutionary Transformers architecture and how it is changing NLP, we will also take a comprehensive review of Transformers from BERT to Alpaca models, highlighting the main characteristics of each model and its potential applications.

BERT-like text models

The first part covers models based on the Transformer encoder, used for vectorization, classification, sequence labeling, QA (question answering), NER (named entity recognition), etc.

1. BERT Google / 2018

Transformer encoder with wordpiece tokenization (30K vocabulary). The input embedding is the sum of three vectors: a token embedding, a trainable position embedding, and a segment embedding (first or second text). The model input is the CLS token embedding followed by the embeddings of the first text and of the second text.

BERT has two training tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, 15% of tokens are masked: 80% of those are replaced by the MASK token, 10% by random tokens, and 10% remain unchanged. The model predicts the correct tokens, and the loss is computed only on these 15% of masked tokens. In NSP, the model predicts whether the second text follows the first; the prediction is made on the output vector of the CLS token.
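A minimal Python sketch of this 80/10/10 masking rule; the MASK id, vocabulary size and the use of -100 as the "ignored" label value are illustrative assumptions:

```python
import random

MASK_ID = 103        # assumed [MASK] token id
VOCAB_SIZE = 30_000  # wordpiece vocabulary size

def mask_for_mlm(token_ids, mask_prob=0.15):
    """Apply BERT-style MLM corruption: 80% [MASK], 10% random token, 10% unchanged."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                    # loss is computed only on these positions
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with a random token
            # remaining 10%: keep the original token
    return inputs, labels
```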

To speed up training, the first 90% of training is performed with a sequence length of 128 tokens, and the remaining 10% with 512 tokens to obtain effective position embeddings for longer sequences.

2. RoBERTa Facebook / 2019

An improved version of BERT that is trained only on MLM (NSP was found to be less useful) and on longer sequences (512 tokens). Dynamic masking is used (different tokens are masked each time the same data is processed again), and the training hyperparameters are carefully chosen.

3. XLM Facebook / 2019

In the original XLM, all languages share a common BPE vocabulary.

XLM has two training tasks: MLM and translation. Translation is essentially the same as MLM on a pair of texts, but the texts are parallel translations of each other, with random masking, and the segment embeddings encode the language.

4. Transformer-XL Carnegie Mellon University / 2019

This model is designed to process long sequences and has two main ideas: segment-level recurrent processing and relative position encoding.

Long text is split into segments and processed one segment at a time. The outputs of the previous segment are cached; when computing self-attention in the current segment, the keys and values are computed from the outputs of both the current and the previous segment (simply concatenated together). Gradients are also computed only within the current segment.
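A rough single-head sketch of this segment recurrence (PyTorch), assuming pre-built projection matrices; relative position encoding, multiple heads and causal masking are omitted:

```python
import torch

def segment_self_attention(x_cur, mem, W_q, W_k, W_v):
    """Keys/values are computed over the cached previous segment plus the current one,
    queries only over the current segment; the cache is detached so gradients stay
    within the current segment."""
    ctx = torch.cat([mem.detach(), x_cur], dim=0)   # [M + L, d]: previous + current segment
    q = x_cur @ W_q                                  # [L, d]
    k, v = ctx @ W_k, ctx @ W_v                      # [M + L, d]
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    out = attn @ v
    new_mem = x_cur.detach()                         # cached as `mem` for the next segment
    return out, new_mem
```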

This approach does not work with absolute positions, so the attention weight formula is re-parameterized: the absolute position encoding vectors are replaced by a fixed sinusoidal matrix based on the distance between token positions and by trainable vectors common to all positions.

5. ERNIE Tsinghua University, Huawei / 2019

ERNIE embeds information about named entities from a knowledge graph into BERT. The input consists of a set of text tokens and a set of entity tokens (each entity token represents an entire entity). Text tokens are encoded by BERT. On top of BERT there is a set of K encoder blocks (accounting for about 3% of the network parameters). In these blocks:

  • The update vectors of the text tokens and the original vectors of the entity tokens are first computed independently;
  • The entity vectors are matched to the first text token at which they appear;
  • The combination is passed through a GeLU activation and used to obtain new hidden representations of the text tokens;
  • The new vectors of the text and entity tokens are obtained from these hidden representations and passed as input to the next encoder block.

During pre-training, three losses are computed: MLM, NSP, and entity prediction from tokens (as in an autoencoder). The autoencoder uses the following rules:

  • In 5% of cases, the entity is replaced with an incorrect entity but the match is kept, and the model must predict the correct entity;
  • In 15% of cases, the match is removed and the model must predict the entity from the text alone;
  • In the remaining cases, the input is left unchanged.

The pre-trained model can be fine-tuned like a regular BERT model (using the CLS token). Additional procedures can also be used during fine-tuning to determine relationships between entities and their types.

6. XLNet Carnegie Mellon University / 2019

XLNet addresses several problems with the BERT training procedure:

  • During training, the loss is computed only on the masked tokens.
  • Only individual tokens are masked, so predicting one masked token does not affect the predictions of the others.
  • The MASK tokens that the model sees during training never occur in real applications.

XLNet is based on Transformer-XL but uses the permutation language modeling (PLM) task instead, in which it learns to predict tokens in short contexts rather than using MASK tokens directly. This ensures that gradients are computed for all tokens and removes the need for a special mask token.

The tokens of the context are shuffled (for example, the i-th token can be predicted from the (i-2)-th and (i-1)-th tokens), but their positions are still known. This is not possible with standard positional encodings (including Transformer-XL's). When trying to predict the probability of a token given part of the context, the model should not know the token itself but should know the token's position in the context. To solve this, the self-attention is split into two streams:

  • At each token position there are two vectors instead of one: a content vector and a query vector.
  • The content vector contains complete information about the token, while the query vector contains only positional information.
  • Both vectors are updated from the context: the content stream attends to content vectors including its own token, while the query stream attends only to the content vectors of the other tokens in its context.
  • The query vector thus receives no information about the content of its own token but knows everything about the context, while the content vector contains complete information.

During fine-tuning, if the query vectors are ignored, the model works like a regular Transformer-XL.

In practice, the context must be long enough for the model to learn correctly. XLNet was trained on the same amount of data as RoBERTa with similar results, but due to the complexity of the implementation it never became as popular as RoBERTa.

7. ALBERT Google / 2019

ALBERT simplifies BERT without sacrificing quality:

  • Parameters are shared across the encoder blocks; it has been shown that the self-attention weights can be shared, but sharing the weights of the fully connected layers leads to a loss of quality.
  • Compared with BERT, smaller input embeddings and larger hidden-layer vectors are used. This is achieved with an additional projection matrix at the network input, which also decouples the embedding size from the size of the hidden representation (see the sketch after this list).
  • The model has 18x fewer parameters and runs 1.7x faster.
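A minimal PyTorch sketch of that factorized embedding, with illustrative sizes:

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """ALBERT-style embedding factorization: a small embedding dimension E is projected
    up to the hidden dimension H, so the V x H table becomes V x E plus E x H."""
    def __init__(self, vocab_size=30_000, embed_dim=128, hidden_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.project = nn.Linear(embed_dim, hidden_dim)

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))
```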

The model is trained on MLM and Sentence Order Prediction (SOP).

8. DistilBERT Hugging Face / 2019

Another way to optimize BERT is distillation:

  • The number of encoder blocks is halved;
  • Three loss components: MLM, cross-entropy with the teacher model's output distribution, and cosine distance between the corresponding layer outputs (sketched below);
  • The model is 40% smaller and 60% faster than the teacher and maintains 97% of its quality across a variety of tasks.
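A sketch of such a three-part objective (PyTorch); the distillation temperature, the KL formulation of the soft cross-entropy and the equal weighting of the terms are illustrative assumptions, not the published coefficients:

```python
import torch.nn.functional as F

def distil_loss(student_logits, teacher_logits, mlm_labels,
                student_hidden, teacher_hidden, T=2.0):
    """MLM cross-entropy + soft cross-entropy against the teacher + cosine distance
    between corresponding hidden states."""
    mlm = F.cross_entropy(student_logits, mlm_labels, ignore_index=-100)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    cos = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return mlm + soft + cos
```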

9. LaBSE Google / 2020

A multilingual vectorization model based on BERT. It is trained on MLM and TLM (20% of tokens are masked) and then fine-tuned. It supports over 100 languages and has a 500K-token vocabulary.

10. ELECTRA Google, Stanford University / 2020

A generative adversarial approach is used to accelerate BERT training:

  • Two BERT-like models are trained: a small generator and a main discriminator;
  • The generator is trained on MLM and then fills in the masked tokens;
  • The discriminator is trained to predict which tokens of the generator's output are original and which were replaced (the replaced-token detection task, sketched below);
  • After training, the generator is discarded and the discriminator is used for fine-tuning.
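A small PyTorch sketch of how the discriminator's inputs and targets can be built from the generator's samples; names and shapes are illustrative:

```python
import torch

def replaced_token_detection_targets(original_ids, mask_positions, generator_samples):
    """The generator's samples replace the masked tokens; the discriminator must then
    flag every token that differs from the original sequence."""
    corrupted = original_ids.clone()
    corrupted[mask_positions] = generator_samples          # fill the masks with generator output
    is_replaced = (corrupted != original_ids).float()      # 1.0 where the token was changed
    return corrupted, is_replaced                          # per-token binary classification targets
```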

Trained on the same amount of data as RoBERTa or XLNet, the model learns to a similar level of quality faster than BERT, RoBERTa and ALBERT, and the longer it is trained, the better it performs.

11. DeBERTa Microsoft / 2020

Another model that separates the content and the position of each token into two separate vectors:

  • The position vectors are shared across all layers and are relative, i.e. there is one for every possible distance between tokens;
  • Two new weight matrices, K_pos and Q_pos, are added for them;
  • The attention weight computation is modified and simplified to the sum of three products: Q_cont * K_cont + Q_cont * K_pos + K_cont * Q_pos (see the sketch after this list);
  • As in ALBERT, a projection matrix is used to decouple the embedding size from the size of the hidden token representation.
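A sketch of that simplified score (PyTorch), ignoring scaling and the gathering of relative-position rows; q_c/k_c are content projections and q_r/k_r are position projections:

```python
import torch

def disentangled_scores(q_c, k_c, q_r, k_r):
    """Sum of content-to-content, content-to-position and position-to-content terms,
    mirroring Q_cont*K_cont + Q_cont*K_pos + K_cont*Q_pos from the text."""
    c2c = q_c @ k_c.transpose(-1, -2)
    c2p = q_c @ k_r.transpose(-1, -2)
    p2c = k_c @ q_r.transpose(-1, -2)
    return c2c + c2p + p2c
```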

Models similar to GPT and T5

Models based on full Transformers. Their range of applications is very broad: in addition to the tasks of the previous section, they cover conversational agents, machine translation, logical and mathematical reasoning, code analysis and generation, and text generation in general. The largest and "smartest" models are usually based on decoder architectures. Such models often perform well in few-shot and zero-shot modes without fine-tuning.

1. GPT-2 OpenAI / 2019

The decoder is trained on the causal LM task (predicting the next token from the left context). Architecturally there are only minor changes: the cross-attention layer is removed from each decoder block, and pre-LayerNorm is used (layer normalization at the input of each sub-block, with an additional final normalization).

The tokenizer is a byte-level BPE (50K vocabulary); merges across character categories are restricted so that similar strings such as "dog", "dog!" and "dog." do not receive separate vocabulary entries. The maximum sequence length is 1024. The layer outputs for all previously generated tokens are cached.
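A sketch of the causal LM objective itself, assuming logits of shape [batch, seq, vocab]:

```python
import torch.nn.functional as F

def causal_lm_loss(logits, token_ids):
    """Each position predicts the next token, so logits and targets are shifted by one."""
    shift_logits = logits[:, :-1, :]       # predictions for positions 0 .. n-2
    shift_labels = token_ids[:, 1:]        # the token that actually follows each position
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))
```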

2. T5 Google / 2019

The full transformer is pre-trained on MLM (15% of tokens are masked out), where masked spans are replaced with sentinel codes and the output is the sequence of sentinels each followed by the span it replaces.

Relative position encoding is used: positions are encoded by learnable embeddings, where each "embedding" is just a scalar that is added to the corresponding logit when computing the attention weights.

The matrix of these scalars, B, is shared across layers but differs between self-attention heads.

Each layer considers 128 distances between tokens and zeros out the rest, which allows for inference on longer sequences compared to those seen during training.
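A sketch of such scalar relative-position biases (PyTorch). T5 actually buckets larger distances logarithmically; here they are simply clipped, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """One learnable scalar per head per (clipped) relative distance, added to the
    raw attention logits before the softmax."""
    def __init__(self, num_heads=8, max_distance=128):
        super().__init__()
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, q_len, k_len):
        rel = torch.arange(k_len)[None, :] - torch.arange(q_len)[:, None]  # pairwise offsets
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias(rel).permute(2, 0, 1)  # [heads, q_len, k_len], added to attention logits
```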

Tokenization uses sentencepiece (32K vocabulary), with a maximum sequence length of 512 during pre-training.

3. BART Facebook / 2019

Another full transformer, but with GeLU activations instead of ReLU. It is trained to predict the original text from noisy text (denoising autoencoding) with the following noise types (a couple of which are sketched below):

  • Token masking
  • Token deletion
  • Text infilling (a span of tokens is replaced by a single mask token)
  • Permuting the order of sentences
  • Making a random token the beginning of the sequence (document rotation)

Byte-level BPE is used (vocabulary size 50K).
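Two of these noising transforms, sketched in Python with illustrative probabilities and span lengths:

```python
import random

def infill_span(tokens, mask_token="<mask>"):
    """Text infilling: replace one random span with a single mask token."""
    start = random.randrange(len(tokens))
    span_len = random.randint(0, 3)                     # illustrative span length
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

def rotate_document(tokens):
    """Document rotation: a random token becomes the new start of the sequence."""
    pivot = random.randrange(len(tokens))
    return tokens[pivot:] + tokens[:pivot]

# Example: noisy = rotate_document(infill_span("the cat sat on the mat".split()))
```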
4. CTRL Salesforce / 2019

A prefix control-code token (for example, a domain or style tag placed before the input text) is used to steer the decoder's generation. During training the codes are assigned to the appropriate texts, and during inference they are used to generate text in the corresponding style. The model is trained on causal LM, and no additional losses are used. Tokenization is BPE with a 250K vocabulary.

5. GPT-3 OpenAI / 2020

This is a GPT-2 model with the Sparse Transformer attention architecture and an increased sequence length of 2048 tokens. Remember the saying: when in doubt, just ask GPT-3.

6. mT5 Google / 2020

Based on the T5 model with similar training, but using multilingual data. The ReLU activations were replaced with GeGLU, and the vocabulary was expanded to 250K tokens.

7. GLaM Google / 2021

This model is conceptually similar to Switch Transformer but focuses more on few-shot operation rather than fine-tuning. Models of different sizes use from 32 to 256 experts per expert layer, with K = 2. Relative position encoding from Transformer-XL is used. Less than 10% of the network's parameters are activated when processing a token.

8. LaMDA Google / 2021

A GPT-like model. It is a conversational model pre-trained on causal LM and fine-tuned on generative and discriminative tasks. It can also make calls to external systems (search, translation).

9. GPT-NeoX-20B EleutherAI / 2022

This model is similar to GPT-J and also uses rotary position embeddings (RoPE). The model weights are stored in float16. The maximum sequence length is 2048.

10. BLOOM BigScience / 2022

This is the largest open-source model, covering 46 natural languages and 13 programming languages. It was trained on a large aggregated dataset called ROOTS, which includes approximately 500 open datasets.

11. PaLM Google / 2022

This is a large multilingual decoder model trained with Adafactor, with dropout disabled during pre-training and set to 0.1 during fine-tuning.

12. LLaMA Meta / 2023

An open large GPT-like LM for scientific research that has been used to train several instruction-following models. It uses pre-LayerNorm, SwiGLU activations and RoPE position embeddings. Because its weights are available, it has become one of the main vehicles for open-source models to catch up with the proprietary giants.

Instruction models for text

These models are used to calibrate model outputs (e.g. with RLHF) to improve the quality of responses in dialogue and task solving.

1. InstructGPT OpenAI / 2022

This work adapts GPT-3 to follow instructions efficiently. The model is fine-tuned on a dataset of prompts and answers that humans rated as good according to a set of criteria. Based on InstructGPT, OpenAI created the model we now know as ChatGPT.

2. Flan-T5 Google / 2022

An instruction-tuned version of T5. On some tasks, Flan-T5 11B outperformed PaLM 62B without such fine-tuning. These models have been released as open source.

3. Sparrow DeepMind / 2022

The base model is obtained by fine-tuning Chinchilla on selected high-quality dialogues, with the first 80% of the layers frozen. The model is then trained further with a large prompt that guides it through a conversation. Several reward models are also trained on top of Chinchilla. The model can access a search engine and retrieve snippets of up to 500 characters, which can become part of the response.

During the inference process, the reward model is used to rank candidates. Candidates are either generated by the model or obtained from the search, and then the best one becomes the response.

4. Alpaca Stanford University / 2023

An instruction model built on top of LLaMA. The main focus is the process of building a dataset using GPT-3:

  • The goal is to obtain a set of Task-Input-Output triples, where Input can be empty.
  • Humans generate 175 task prompts with answers, which are fed into GPT-3, and GPT-3 generates new tasks.
  • The generation process is iterative, and at each step, some task examples from humans and some from previously generated task examples are provided.
  • GPT-3 divides the generated tasks into classification tasks or non-classification tasks, and generates different inputs and outputs based on this.
  • Triples are filtered based on quality and dissimilarity to existing triples in the database.

A total of 52K unique triples were generated and used to fine-tune LLaMA 7B.

5. Koala UC Berkeley / 2023

This is a fine-tuning of LLaMA on instruction data, but unlike Alpaca above, not only on data generated by large models such as GPT-3. The dataset is composed of:

  • 30K instruction-and-answer samples about mathematics, poetry and dialogue;
  • 52K samples from the Alpaca dataset;
  • 160K model responses with user preferences for helpfulness and harmlessness;
  • 20K model responses with user questions and ratings;
  • 93K summaries with user ratings of their quality.

There is no quality gain compared to GPT-3, but in blind tests users preferred Koala's answers to Alpaca's.

Models for generating images from text

Image generators driven by text descriptions. Diffusion models combined with transformers dominate this field, enabling not only image generation but also content manipulation and resolution enhancement.

1. DALL-E OpenAI / 2021

Training proceeds in two stages: first a tokenization model for images is trained, then a joint generative model of text and images is learned.

In the first stage, a dVAE is trained that transforms the image from 256x256x3 space to 32x32xdim and back, where dim is the dimension of the hidden representation vectors. There are 8192 such token vectors in total, which are used further in the model.

The main model is a sparse transformer decoder. Taking text tokens and image tokens as input, the model learns their joint distribution (causal LM), after which image tokens can be generated from text, and the dVAE then generates an image from those tokens. The loss weight for text tokens is 1/8 and for image tokens 7/8.

For text tokens there are regular and positional embeddings; for image tokens there are regular, column-position and row-position embeddings. The maximum length of the text token sequence is 256, and tokenization is BPE (16K vocabulary).

2. GLIDE OpenAI / 2021

A diffusion model (DM) that operates at the pixel level and is controlled by text. It is based on a U-Net architecture with convolutions, attention and residual connections. Various methods are used to control generation, one of which relies on the scalar product of the image and text vectors obtained with CLIP.

3. Latent Diffusion [Stable Diffusion] CompVis [Stability AI] / 2021 [2022]

A diffusion model that works in latent space rather than pixel space. It mainly consists of two models:

  • A VAE autoencoder for dimensionality reduction into the latent space and reconstruction of the internal representation;
  • A DM that generates in that latent space.

The autoencoder is trained in a GAN-like manner, with a discriminator on its outputs and additional regularization pushing the latent representation toward a standard normal distribution.

Generation happens with the DM in the latent space: if the condition is a single vector, it is concatenated with the latent vector at the input of a denoising step; if it is a sequence of vectors, it is used for cross-attention in the different U-Net layers (sketched below). For text prompts, CLIP vectors are used.
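A minimal sketch of that cross-attention conditioning (PyTorch): the latent features supply the queries, the condition sequence (e.g. CLIP text vectors) supplies keys and values; dimensions and names are illustrative:

```python
import torch
import torch.nn as nn

class CrossAttentionCondition(nn.Module):
    """Single-head cross-attention from latent tokens to a conditioning sequence."""
    def __init__(self, latent_dim=320, cond_dim=768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim)
        self.to_k = nn.Linear(cond_dim, latent_dim)
        self.to_v = nn.Linear(cond_dim, latent_dim)

    def forward(self, latent_tokens, cond_tokens):
        q, k, v = self.to_q(latent_tokens), self.to_k(cond_tokens), self.to_v(cond_tokens)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v   # injected back into the U-Net layer's features
```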

This general model can be trained for different tasks: text-to-image, colorization, inpainting, super-resolution.

4. Imagen Google / 2022

The main idea behind Imagen is that increasing the size of the text encoder can bring more benefits to the generative model than increasing the size of the DM. So CLIP was replaced with T5-XXL.

Models for generating text from images

The models in this section are often called multimodal models because they generate text while being able to analyze data of different natures. The generated text can be natural language or a set of commands, such as those for a robot.

1. CoCa Google / 2022

A separate image encoder (ViT or CNN) plus a shared decoder, in which the first half processes only the text and the second half processes the text together with the output of the image encoder.

The 288x288 image is cut into 18x18 patches, each of which the encoder converts into a vector; a shared vector is also obtained by attention pooling over all of these vectors.

The output of the first half of the decoder is the text vectors plus a CLS token vector at the end of the sequence; tokenization uses sentencepiece (64K vocabulary). Text and image vectors are merged in the second half of the decoder via cross-attention.

Two weighted losses are combined (sketched after this list):

  • A contrastive loss on the similarity between the image's attention-pool vector and the CLS token vector of the paired description text;
  • An autoregressive loss on the entire decoder output (conditioned on the image).
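A sketch of the two objectives (PyTorch), with an illustrative temperature and equal weighting of the terms:

```python
import torch
import torch.nn.functional as F

def coca_losses(image_pool_vec, text_cls_vec, decoder_logits, caption_ids, temperature=0.07):
    """Contrastive loss between image attention-pool vectors and text CLS vectors,
    plus an autoregressive captioning loss on the decoder output."""
    img = F.normalize(image_pool_vec, dim=-1)
    txt = F.normalize(text_cls_vec, dim=-1)
    sim = img @ txt.T / temperature                          # [batch, batch] similarities
    targets = torch.arange(sim.size(0))
    contrastive = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.T, targets)) / 2
    captioning = F.cross_entropy(decoder_logits[:, :-1].reshape(-1, decoder_logits.size(-1)),
                                 caption_ids[:, 1:].reshape(-1))
    return contrastive + captioning
```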

During the fine-tuning process, the image encoder can be frozen and only the attention pool can be fine-tuned.

2. PaLM-E Google / 2023

The image is encoded by ViT, and the resulting vectors, together with the text tokens and commands, are fed into PaLM, which generates the output text.

PaLM-E is used for all tasks including VQA, object detection and robot operation.

3. GPT-4 OpenAI / 2023

This is a closed model with few known details. Presumably it has a decoder with sparse attention and multimodal inputs. It uses autoregressive training and RLHF fine-tuning, with sequence lengths from 8K to 32K.

It has been tested on human exams in zero-shot and few-shot settings and reaches human-like levels. It can solve image-based problems (including mathematical ones) both instantly and step by step, understand and interpret images, and analyze and generate code. It also works across different languages, including low-resource ones.

Summary

The following are brief conclusions. They may be incomplete or simply incorrect and are provided for reference only.

Once graphics cards were no longer needed for crypto mining, large models appeared in droves, and their sizes have kept growing. However, simply stacking more layers and growing datasets has given way to a range of better techniques that allow quality improvements (the use of external data and tools, improved network structures, and new fine-tuning techniques). A growing body of work also shows that the quality of training data matters more than its quantity: correct selection and curation of datasets can reduce training time and improve the quality of results.

OpenAI is now moving toward closed source: it tried, unsuccessfully, to withhold the GPT-2 weights, and GPT-4 is a black box. The recent trend of improving and optimizing the fine-tuning cost and inference speed of open-source models has largely reduced the value of large private models as products, and open-source models are quickly catching up with the giants in quality, which again makes leapfrogging possible.

The final summary of the open source models is as follows:

  • Among the encoder models, XLM-RoBERTa and LaBSE are considered reliable multilingual solutions;
  • Among the open generative models, the most interesting are LLaMA and the models from EleutherAI (all of which have fine-tuned versions), Dolly-2 and BLOOM (which also have instruction-tuned variants);
  • For code, SantaCoder models are not bad, but overall quality clearly lags behind ChatGPT/GPT-4;
  • Transformer-XL and Sparse Transformer implement techniques used in other models and are worth studying closely.

The above are for reference only.

