Transformers Review: From BERT to GPT-4
Artificial intelligence has become one of the most talked-about topics in recent years, and services that were once considered pure science fiction are becoming reality thanks to the development of neural networks. From conversational agents to media content generation, artificial intelligence is changing the way we interact with technology. In particular, machine learning (ML) models have made significant progress in natural language processing (NLP). A key breakthrough was the introduction of "self-attention" and the Transformer architecture for sequence processing, which made it possible to solve several key issues that had previously dominated the field.
In this article, we will look at the revolutionary Transformer architecture and how it is changing NLP, and we will take a comprehensive look at Transformer-based models from BERT to Alpaca, highlighting the main characteristics of each model and its potential applications.
The first part covers models based on the Transformer encoder, which are used for vectorization, classification, sequence labeling, QA (question answering), NER (named entity recognition), and similar tasks.
BERT uses a Transformer encoder with WordPiece tokenization (30K vocabulary). The input embedding is the sum of three vectors: a token vector, a trainable position vector, and a segment vector (first text or second text). The input to the model is the CLS token embedding followed by the embeddings of the first and second texts.
BERT has two training tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, 15% of the tokens are selected for masking; of these, 80% are replaced by the MASK token, 10% are replaced by random tokens, and 10% are left unchanged. The model predicts the correct tokens, and the loss is computed only on these 15% of selected tokens. In NSP, the model predicts whether the second text follows the first. The prediction is made on the output vector of the CLS token.
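For illustration, here is a minimal sketch of this masking scheme in PyTorch. The [MASK] token id, the vocabulary size, and the list of special-token ids are placeholder assumptions; only the 15% / 80% / 10% / 10% proportions come from the description above.

```python
import torch

def mask_for_mlm(input_ids, mask_token_id=103, vocab_size=30000,
                 mlm_prob=0.15, special_ids=(0, 101, 102)):
    """BERT-style MLM corruption: select 15% of tokens; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    labels = input_ids.clone()

    # choose the 15% of positions to predict (skipping special tokens)
    candidate = ~sum(input_ids == s for s in special_ids).bool()
    selected = (torch.rand_like(input_ids, dtype=torch.float) < mlm_prob) & candidate
    labels[~selected] = -100   # loss will be computed only on the selected positions

    corrupted = input_ids.clone()
    rnd = torch.rand_like(input_ids, dtype=torch.float)
    corrupted[selected & (rnd < 0.8)] = mask_token_id            # 80% -> [MASK]
    random_ids = torch.randint(vocab_size, input_ids.shape)
    replace_random = selected & (rnd >= 0.8) & (rnd < 0.9)       # 10% -> random token
    corrupted[replace_random] = random_ids[replace_random]
    return corrupted, labels                                     # remaining 10% untouched
```

A cross-entropy loss with ignore_index=-100 then realizes "loss only on the 15% of masked tokens".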
To speed up training, first 90% of the training is performed on a sequence length of 128 tokens, and then the remaining 10% of the time is spent training the model on 512 tokens to obtain more effective position embeddings.
RoBERTa is an improved version of BERT that is trained only on MLM (NSP was found to be less useful) and on longer training sequences (512 tokens). It uses dynamic masking (different tokens are masked when the same data is seen again), and the training hyperparameters are carefully chosen.
XLM has two training tasks: MLM and translation. The translation task is essentially MLM on a pair of texts, but the texts are parallel translations of each other, with random masking, and the segment embeddings encode the language.
4. Transformer-XL Carnegie Mellon University / 2019
Long text is divided into segments and processed one segment at a time. The outputs of the previous segment are cached, and when computing self-attention in the current segment, the keys and values are computed from the outputs of both the current and the previous segment (simply concatenated together). Gradients are also computed only within the current segment.
This approach does not work with absolute position encodings. Therefore, the attention-weight formula is re-parameterized in the model: the absolute position encoding vectors are replaced by a fixed matrix based on the sine of the distance between token positions and a trainable vector common to all positions.
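A minimal sketch of the segment-level recurrence described above, assuming a single attention layer built on torch.nn.MultiheadAttention: keys and values are computed over the cached previous segment concatenated with the current one, and detach() keeps gradients inside the current segment. The causal mask and the relative-position re-parameterization are omitted for brevity.

```python
import torch
import torch.nn as nn

d_model, n_heads, seg_len = 512, 8, 128
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

def segment_step(segment, memory):
    """segment: (batch, seg_len, d_model); memory: cached hidden states of the
    previous segment at this layer, or None for the first segment."""
    if memory is not None:
        # keys/values see [previous segment; current segment]; no gradient into the cache
        kv = torch.cat([memory.detach(), segment], dim=1)
    else:
        kv = segment
    out, _ = attn(query=segment, key=kv, value=kv)
    return out, segment        # cache this segment's hidden states for the next step

x = torch.randn(2, 4 * seg_len, d_model)   # a long sequence processed segment by segment
memory, outputs = None, []
for seg in x.split(seg_len, dim=1):
    out, memory = segment_step(seg, memory)
    outputs.append(out)
```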
5. ERNIE Tsinghua University, Huawei / 2019
6. XLNet Carnegie Mellon University / 2019
XLNet is based on Transformer-XL, but uses a permutation language modeling (PLM) task in which it learns to predict tokens in short contexts instead of masking them directly. This ensures that gradients are computed for all tokens and removes the need for special mask tokens.
The order in which tokens are predicted is permuted (for example, the i-th token might be predicted from the (i-2)-th and (i-1)-th tokens), but their positions are still known. This cannot be done with the usual position encodings (including Transformer-XL's): when predicting the probability of a token given part of the context, the model must not see the token itself, but it must know the token's position in the context. To solve this problem, the self-attention is split into two streams:
During fine-tuning, if you ignore the query vector, the model will work like a regular Transformer-XL.
In practice, the context must be long enough for the model to learn correctly. XLNet was trained on the same amount of data as RoBERTa with similar results, but because of the complexity of the implementation it never became as popular as RoBERTa.
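To make the two streams concrete, here is a toy sketch that only builds the attention masks for a given permutation (all other details are omitted): the content stream at a position may see itself plus everything earlier in the permutation, while the query stream sees only the earlier tokens and receives the position, but never the content, of the token being predicted.

```python
import torch

def two_stream_masks(perm):
    """perm[k] = index of the token predicted at step k of the factorization order.
    Returns boolean masks of shape (n, n); mask[i, j] == True means
    position i may attend to position j."""
    n = len(perm)
    order = torch.empty(n, dtype=torch.long)
    order[perm] = torch.arange(n)            # order[i] = step at which token i is predicted

    earlier = order.unsqueeze(1) > order.unsqueeze(0)         # j is predicted before i
    content_mask = earlier | torch.eye(n, dtype=torch.bool)   # content stream sees itself too
    query_mask = earlier                                      # query stream never sees itself
    return content_mask, query_mask

content_mask, query_mask = two_stream_masks(torch.tensor([2, 0, 3, 1]))
```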
ALBERT simplifies BERT without sacrificing quality.
The model is trained on MLM and Sentence Order Prediction (SOP).
Another way to optimize BERT is distillation.
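A minimal sketch of a typical distillation objective for such models: the student matches the teacher's softened output distribution (KL divergence at temperature T) in addition to the usual hard-label cross-entropy. The temperature and the 50/50 weighting are illustrative assumptions, not values from the text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL (scaled by T^2, as is standard) plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```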
A multilingual vectorization model based on BERT. It is trained on MLM and TLM (20% of tokens are masked) and then fine-tuned. It supports over 100 languages and uses a 500K-token vocabulary.
ELECTRA accelerates BERT training with a generative adversarial approach.
The amount of training data is the same as for RoBERTa or XLNet, and the model reaches a similar level of quality faster than BERT, RoBERTa, and ALBERT. The longer it is trained, the better it performs.
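A rough sketch of the replaced-token-detection idea behind this setup: a small generator fills in the masked positions, and the discriminator (the model that is actually kept) classifies every token as original or replaced. The generator and discriminator modules are placeholders, not a real ELECTRA implementation.

```python
import torch
import torch.nn.functional as F

def electra_step(input_ids, masked_ids, mask_positions, generator, discriminator):
    """input_ids: original tokens; masked_ids: copy with [MASK] at mask_positions
    (a boolean tensor of the same shape)."""
    # 1) the generator proposes plausible tokens for the masked positions
    gen_logits = generator(masked_ids)                           # (batch, seq, vocab)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = input_ids.clone()
    corrupted[mask_positions] = sampled[mask_positions]

    # 2) the discriminator labels each token: original or replaced?
    is_replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted)                       # (batch, seq)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    # the generator itself is trained with an ordinary MLM loss on the masked positions
    gen_loss = F.cross_entropy(gen_logits[mask_positions], input_ids[mask_positions])
    return gen_loss + 50.0 * disc_loss      # heavy weight on the discriminator loss
```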
DeBERTa is another model that separates the content and the position of each token into two separate vectors.
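A simplified sketch of what this separation looks like inside attention: the score between two positions is the sum of content-to-content, content-to-position, and position-to-content terms computed from the two kinds of vectors. Relative-position bucketing is omitted, so this is only an illustration of the idea, not DeBERTa's exact formula.

```python
import torch

def disentangled_scores(Hq, Hk, Pq, Pk):
    """Hq, Hk: content projections (seq, d); Pq, Pk: position projections (seq, d).
    Returns unnormalized attention scores of shape (seq, seq)."""
    d = Hq.size(-1)
    c2c = Hq @ Hk.T      # content-to-content
    c2p = Hq @ Pk.T      # content-to-position
    p2c = Pq @ Hk.T      # position-to-content
    return (c2c + c2p + p2c) / (3 * d) ** 0.5

seq, d = 16, 64
scores = disentangled_scores(torch.randn(seq, d), torch.randn(seq, d),
                             torch.randn(seq, d), torch.randn(seq, d))
attn = scores.softmax(dim=-1)
```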
The second part covers models based on the full Transformer architecture and on its decoder. Their range of applications is very broad: in addition to the tasks from the previous section, it includes conversational agents, machine translation, logical and mathematical reasoning, code analysis and generation, and text generation in general. The largest and "smartest" models are usually based on decoder architectures. Such models often perform well in few-shot and zero-shot modes even without fine-tuning.
GPT-2's decoder is trained on a causal LM task (predicting the next token from the left context). From an architectural point of view there are minor changes: the cross-attention layer is removed from each decoder block, and LayerNorm is moved to the input of each block (pre-LayerNorm).
The tokenizer is byte-level BPE (50K vocabulary); merges that would produce near-duplicate tokens such as ("dog", "dog!", "dog.") are avoided. The maximum sequence length is 1024. The layer outputs are cached for all previously generated tokens.
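A minimal sketch of the causal LM objective itself: the logits at position t are compared with the token at position t+1, so each position predicts its successor from the left context only. The random logits below are a stand-in for a real decoder's output.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, input_ids):
    """logits: (batch, seq, vocab) from a decoder with a causal attention mask."""
    shift_logits = logits[:, :-1, :]     # predictions for positions 0 .. seq-2
    shift_labels = input_ids[:, 1:]      # the tokens that actually follow them
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))

vocab, batch, seq = 50_000, 2, 16
logits = torch.randn(batch, seq, vocab)        # stand-in for model output
input_ids = torch.randint(vocab, (batch, seq))
loss = causal_lm_loss(logits, input_ids)
```

During generation, the per-layer keys and values of already produced tokens are what gets cached, so each new token only attends against the cache plus itself.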
T5 is fully pre-trained on an MLM objective (15% of tokens masked), where masked spans are replaced by special sentinel codes and the model learns to reconstruct the removed tokens.
GPT-3 is a GPT-2 model with a Sparse Transformer attention architecture and an increased sequence length of 2048 tokens.
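A small sketch of the kind of sparse attention pattern meant here: instead of a full causal mask, each position attends to a local window of recent tokens plus a strided subset of earlier positions. The window and stride values are illustrative, not the ones used in GPT-3.

```python
import torch

def sparse_causal_mask(seq_len, window=64, stride=64):
    """True = attention allowed: causal AND (local band OR strided columns)."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    causal = j <= i
    local = (i - j) < window                  # a window of recent tokens
    strided = (j % stride) == (stride - 1)    # every stride-th "summary" position
    return causal & (local | strided)

mask = sparse_causal_mask(2048)
print(mask.float().mean())   # fraction of allowed pairs, far below the dense ~0.5
```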
6. mT5 Google / 2020
Based on the T5 model, with similar training but on multilingual data. ReLU activations were replaced with GeGLU, and the vocabulary was expanded to 250K tokens.
7. GLaM Google / 2021
This model is conceptually similar to Switch Transformer, but focuses more on few-shot operation rather than fine-tuning. Models of different sizes use 32 to 256 expert layers with K=2. Relative position encoding from Transformer-XL is used. When processing a token, less than 10% of the network parameters are activated.
8. LaMDA Google / 2021
A GPT-like conversational model, pre-trained on causal LM and fine-tuned on generative and discriminative tasks. The model can also make calls to external systems (search, translation).
9. GPT-NeoX-20B EleutherAI / 2022
This model is similar to GPT-J and also uses rotary position encoding. The model weights are stored in float16. The maximum sequence length is 2048.
10. BLOOM BigScience / 2022
This is the largest open-source model, covering 46 natural languages and 13 programming languages. To train it, a large aggregated dataset called ROOTS was used, which includes approximately 500 open datasets.
11. PaLM Google / 2022
A large multilingual decoder model, trained with Adafactor, with dropout disabled during pre-training and set to 0.1 during fine-tuning.
12. LLaMA Meta / 2023
An open-source large GPT-like LM intended for scientific research; it has been used to train several instruction models. The model uses pre-LayerNorm, SwiGLU activations, and RoPE position embeddings. Because it is open source, it is one of the main vehicles by which open models can overtake the established players.
Instruction Models for Text
These models are used to calibrate model outputs (e.g., via RLHF) so as to improve response quality in dialogue and task solving.
1. InstructGPT OpenAI / 2022
This work adapts GPT-3 to follow instructions efficiently. The model is fine-tuned on a dataset of prompts and answers that humans rated as good according to a set of criteria. Based on InstructGPT, OpenAI created the model we now know as ChatGPT.
2. Flan-T5 Google / 2022
An instruction-tuned version of T5. On some tasks, Flan-T5 11B outperformed PaLM 62B without such fine-tuning. These models have been released as open source.
3. Sparrow DeepMind / 2022
The base model is obtained by fine-tuning Chinchilla on selected high-quality conversations, with the first 80% of the layers frozen. The model was then further trained with a large prompt to guide it through a conversation. Several reward models are also trained on top of Chinchilla. The model can access a search engine and retrieve snippets of up to 500 characters that can become part of a response. During inference, the reward models are used to rank candidates; candidates are either generated by the model or obtained from the search, and the best one becomes the response.
4. Alpaca Stanford University / 2023
An instruction model built on the LLaMA described above. The main focus is on the process of building the dataset using GPT-3.
A total of 52K unique triples were generated, and LLaMA 7B was fine-tuned on them.
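To illustrate how such (instruction, input, output) triples are typically turned into supervised fine-tuning examples, here is a sketch assuming a Hugging Face-style tokenizer with encode() and eos_token_id. The prompt template and the use of -100 to exclude prompt tokens from the loss are common conventions rather than details taken from the text above.

```python
def build_example(tokenizer, instruction, inp, output):
    """Turn one (instruction, input, output) triple into token ids and LM labels;
    the loss is computed only on the response tokens."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if inp:
        prompt += f"### Input:\n{inp}\n\n"
    prompt += "### Response:\n"

    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(output) + [tokenizer.eos_token_id]

    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids   # ignore the prompt in the loss
    return input_ids, labels
```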
Koala is a fine-tuning of LLaMA on instruction data, but unlike the Alpaca above, the data is not only generated by a large model such as GPT-3; the dataset is composed of several parts.
There is no quality gain over GPT-3, but in blind tests users preferred Koala's answers to Alpaca's.
These models generate images from text descriptions. Diffusion models combined with Transformers dominate this field, enabling not only image generation but also content manipulation and resolution enhancement.
This work (DALL-E) is carried out in two stages: first a tokenizer for images is trained, and then a joint generative model of text and images is learned.
In the first stage, a dVAE is trained, which maps the image from 256x256x3 space to 32x32xdim and back, where dim is the dimension of the hidden representation vector. There are a total of 8192 such token vectors, which are used further in the model.
The main model is a sparse Transformer decoder. Taking text tokens and image tokens as input, it learns a joint distribution (causal LM), after which image tokens can be generated from text; the dVAE then generates an image from these same tokens. The loss weight for text tokens is 1/8, and the loss weight for image tokens is 7/8.
For text tokens there are regular and positional embeddings; for image tokens there are regular, column-position, and row-position embeddings. The maximum length of the text token sequence is 256, and the tokenization is BPE (16K vocabulary).
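A sketch of the weighted joint loss described above, assuming the sequence is the text tokens followed by the image tokens and that targets are already shifted for next-token prediction: the cross-entropy of the text part gets weight 1/8 and the image part 7/8.

```python
import torch
import torch.nn.functional as F

def joint_lm_loss(logits, targets, n_text, text_w=1 / 8, image_w=7 / 8):
    """logits: (batch, seq, vocab); targets: (batch, seq); the first n_text
    positions of the sequence are text tokens, the rest are image tokens."""
    per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                targets.reshape(-1), reduction="none")
    per_token = per_token.view(targets.shape)

    text_loss = per_token[:, :n_text].mean()
    image_loss = per_token[:, n_text:].mean()
    return text_w * text_loss + image_w * image_loss
```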
A diffusion model (DM) that operates at the pixel level and is controlled by text. It is based on the U-Net architecture with convolutions, attention, and residual connections. Different methods are used to control generation, for example the scalar product of image vectors and text vectors obtained with CLIP.
A diffusion model that works in a latent space rather than in pixel space; it mainly consists of two models:
The autoencoder is trained in a GAN-like manner, with a discriminator applied to its outputs and an additional regularization that keeps the latent representation close to a standard normal distribution.
The result is fed into a DM that operates in the latent space: if the condition is a single vector, it is concatenated with the latent vector at the input of the step; if it is a sequence of vectors, it is used for cross-attention in different U-Net layers. For text prompts, CLIP vectors are used.
This general model can be trained for different tasks: text-to-image, colorization, inpainting, and super-resolution.
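A high-level sketch of one training step of such a latent diffusion model. The frozen autoencoder encode(), the noise schedule alphas_cumprod, and the U-Net unet(z_t, t, cond) that receives the conditioning sequence via cross-attention are all placeholders, not a specific library API.

```python
import torch
import torch.nn.functional as F

def latent_diffusion_step(images, text_cond, encode, unet, alphas_cumprod):
    """images: (batch, 3, H, W); text_cond: (batch, n_tokens, d), e.g. CLIP text vectors.
    The DM learns to predict the noise added to the latent, not to the pixels."""
    with torch.no_grad():
        z0 = encode(images)                          # frozen autoencoder: pixels -> latents

    t = torch.randint(len(alphas_cumprod), (z0.size(0),), device=z0.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise     # forward diffusion in latent space

    pred_noise = unet(z_t, t, text_cond)             # conditioning enters via cross-attention
    return F.mse_loss(pred_noise, noise)
```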
The main idea behind Imagen is that increasing the size of the text encoder can bring more benefits to the generative model than increasing the size of the DM. So CLIP was replaced with T5-XXL.
The models in this section are often called multimodal models because they generate text while being able to analyze data of different natures. The generated text can be natural language or a set of commands, such as those for a robot.
The model consists of a separate image encoder (ViT or CNN) and a shared decoder, in which the first half processes the text on its own and the second half processes the text jointly with the output of the image encoder.
The 288x288 image is cut into 18x18 patches, which the encoder converts into vectors, plus a single shared attention-pooled vector over all of these vectors.
The output of the first half of the decoder is the text vectors and a CLS token vector at the end of the sequence; the text is tokenized with SentencePiece (64K vocabulary). Text and image vectors are merged in the second half of the decoder via cross-attention.
The two losses (a contrastive loss between the pooled image vector and the text CLS vector, and a captioning loss on the generated text) are combined with fixed weights.
During fine-tuning, the image encoder can be frozen and only the attention pooling fine-tuned.
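A sketch of how the two objectives can be combined: a symmetric contrastive loss between the pooled image vector and the text CLS vector, plus the usual captioning (causal LM) loss, mixed with fixed weights. The weights and the temperature here are placeholders, since the actual values are not given above.

```python
import torch
import torch.nn.functional as F

def contrastive_captioning_loss(img_vec, txt_cls, caption_logits, caption_targets,
                                w_con=1.0, w_cap=1.0, temperature=0.07):
    """img_vec, txt_cls: (batch, d) pooled image vectors and text CLS vectors.
    caption_logits: (batch, seq, vocab); caption_targets: (batch, seq)."""
    # symmetric InfoNCE between matching image/text pairs within the batch
    img = F.normalize(img_vec, dim=-1)
    txt = F.normalize(txt_cls, dim=-1)
    sim = img @ txt.T / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    con = (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2

    # captioning loss: the multimodal half of the decoder predicts the caption tokens
    cap = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                          caption_targets.reshape(-1))
    return w_con * con + w_cap * cap
```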
The image is encoded by ViT; the resulting output vectors, together with the text tokens and commands, are fed into PaLM, and PaLM generates the output text.
PaLM-E is used for all tasks including VQA, object detection and robot operation.
This is a closed model with few known details. Presumably it has a decoder with sparse attention and multimodal inputs. It uses autoregressive training and RLHF fine-tuning, with sequence lengths from 8K to 32K.
It has been tested on human exams in zero-shot and few-shot settings, reaching human-like levels. It can solve image-based problems (including mathematical ones) both immediately and step by step, understand and interpret images, and analyze and generate code. It also works across different languages, including low-resource ones.
A brief conclusion follows. It may be incomplete or simply incorrect, and is provided for reference only.
Since graphics cards could no longer be used for mining, all kinds of large models have flooded in, and the base size of these models keeps growing. However, simply adding more layers and growing the datasets is being replaced by a variety of better techniques that allow quality improvements (use of external data and tools, improved network structures, and new fine-tuning techniques). A growing body of work also shows that the quality of the training data matters more than its quantity: correct selection and curation of the dataset can reduce training time and improve the quality of the results.
OpenAI has now gone closed source; they already tried, without success, to withhold the GPT-2 weights, but GPT-4 is a black box. In recent months, the trend of improving and optimizing the fine-tuning cost and inference speed of open-source models has greatly reduced the value of large private models as products. Open-source models are also quickly catching up with the giants in quality, which again makes it possible for them to overtake the incumbents.
The final summary of the open source models is as follows:
The above are for reference only.