# A preliminary exploration into the evolution of natural language pre-training technology
Three levels of artificial intelligence:
Computational intelligence: data storage and computing power, where machines already far surpass humans.
Perceptual intelligence: abilities such as vision and hearing; in speech recognition and image recognition, machines are already on par with humans.
Cognitive intelligence: tasks such as natural language processing and commonsense modeling and reasoning, where machines still have a long way to go.
Natural language processing belongs to the realm of cognitive intelligence. Because natural language is abstract, compositional, ambiguous, knowledge-dependent, and constantly evolving, it poses great challenges for machines; some people therefore call natural language processing the crown jewel of artificial intelligence. In recent years, pre-trained language models represented by BERT have emerged, bringing natural language processing into a new era: pre-training a language model and then fine-tuning it for specific tasks. This article attempts to sort out the evolution of natural language pre-training technology in the hope of exchanging ideas and learning together; criticism and corrections of any shortcomings or errors are welcome.
One-hot representation uses a vector the size of the vocabulary to represent a word: the position corresponding to the word is 1 and all other positions are 0. Disadvantages: the vectors are high-dimensional and extremely sparse, and any two different words are orthogonal, so the representation carries no notion of semantic similarity.
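As a quick illustration, here is a minimal Python sketch of the one-hot representation; the toy vocabulary and the word looked up are made-up examples.

```python
import numpy as np

# Toy vocabulary (illustrative assumption).
vocab = ["i", "like", "natural", "language", "processing"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a |V|-dimensional vector with a 1 at the word's index and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word2id[word]] = 1.0
    return vec

print(one_hot("language"))  # [0. 0. 0. 1. 0.]
```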
Distributional semantic hypothesis: words that appear in similar contexts have similar meanings, so the semantics of a word can be represented by its context. Based on this idea, a word can be represented by its context distribution.
Based on a corpus, the contexts of each word are used to build a co-occurrence frequency table; each row of the table is the vector representation of one word. Different choices of context capture different kinds of linguistic information. For example, using the words within a fixed window around the target word as context captures more local information (lexical and syntactic); using the whole document as context captures more of the topical information the word conveys. Disadvantages: high-frequency words dominate the counts, the vectors are high-dimensional and sparse, and the representation only reflects direct co-occurrence, missing higher-order relationships.
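The sketch below builds such a window-based co-occurrence matrix; the toy corpus, the window size of 2, and all variable names are illustrative assumptions rather than any particular implementation.

```python
import numpy as np

corpus = [["i", "like", "natural", "language", "processing"],
          ["i", "like", "deep", "learning"]]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
word2id = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    for i, w in enumerate(sent):
        # Words within the fixed window around position i serve as the context of w.
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[word2id[w], word2id[sent[j]]] += 1

# Each row of `counts` is now the co-occurrence-frequency vector of one word.
```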
TF-IDF representation replaces the raw counts in the word-frequency representation with TF-IDF values, which mainly alleviates the problem of high-frequency words dominating the representation.
PMI representation also alleviates the high-frequency word problem: the counts in the word-frequency representation are replaced with the pointwise mutual information between a word and its context, $\mathrm{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)P(c)}$; in practice negative values are usually clipped to zero (PPMI).
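A minimal sketch of this replacement, assuming a co-occurrence count matrix such as the `counts` array from the sketch above; clipping negative values to zero (PPMI) is a common practical choice.

```python
import numpy as np

def ppmi(counts: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Replace raw co-occurrence counts with (positive) pointwise mutual information."""
    total = counts.sum()
    p_wc = counts / total                              # joint probability P(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total    # marginal P(w)
    p_c = counts.sum(axis=0, keepdims=True) / total    # marginal P(c)
    pmi = np.log2((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0.0)                        # keep only positive PMI values
```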
By applying singular value decomposition (SVD) to the word-frequency (or PMI) matrix, a low-dimensional, continuous, dense vector representation of each word can be obtained, which can be regarded as capturing the latent semantics of the word; this method is also called Latent Semantic Analysis (LSA).
LSA alleviates the problems of high-frequency words, missing higher-order relationships, and sparsity, and works reasonably well with traditional machine learning algorithms, but it also has shortcomings: the SVD is expensive to compute on large matrices, the model is hard to update incrementally when new words or documents arrive, and the resulting vectors are still static.
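A sketch of LSA as a truncated SVD, assuming `matrix` is a word-context matrix such as the (PPMI-weighted) counts built above and `k` is a freely chosen latent dimension.

```python
import numpy as np

def lsa(matrix: np.ndarray, k: int = 2) -> np.ndarray:
    """Return dense, low-dimensional word vectors via truncated SVD of the word-context matrix."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the top-k singular directions; each row is one word's latent-semantic vector.
    return U[:, :k] * S[:k]
```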
The ordered nature of text and the co-occurrence relationships between words provide natural self-supervised learning signals for natural language processing, allowing a system to learn knowledge from raw text without additional manual annotation.
CBOW (Continuous Bag-of-Words) uses the context (a window around the target word) to predict the target word: the vectors of the context words are averaged and then used to predict the probability of the target word.
Skip-gram does the opposite: it uses the target word to predict each of its context words.
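For illustration, a sketch of training both variants with the gensim library; the toy corpus is made up, and the parameter names (`vector_size`, `sg`, ...) follow gensim 4.x and may differ in other versions.

```python
from gensim.models import Word2Vec

sentences = [["i", "like", "natural", "language", "processing"],
             ["i", "like", "deep", "learning"]]

cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)      # sg=0: CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)  # sg=1: Skip-gram

print(skipgram.wv["language"].shape)  # (100,)
```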
GloVe (Global Vectors for Word Representation) uses word vectors to fit the word co-occurrence matrix, performing an implicit matrix factorization. First, a distance-weighted co-occurrence matrix X is constructed from each word's context window; then word and context vectors are trained to fit X.
The loss function is $J = \sum_{i,j} f(X_{ij})\left( \boldsymbol{w}_i^\top \tilde{\boldsymbol{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$, where $f(\cdot)$ is a weighting function that limits the influence of very frequent word pairs.
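A sketch of the weighted least-squares term for a single word-context pair; `x_max = 100` and `alpha = 0.75` are the defaults reported in the GloVe paper, and all variable names are illustrative.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) that caps the influence of very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w, w_ctx, b, b_ctx, x_ij):
    """Squared error between the bilinear score and log co-occurrence, weighted by f(x_ij)."""
    diff = w @ w_ctx + b + b_ctx - np.log(x_ij)
    return glove_weight(x_ij) * diff ** 2
```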
## 2.3 Summary

Word-vector learning exploits the co-occurrence information between words in a corpus; the underlying idea is still the distributional semantic hypothesis. Whether it is Word2Vec, based on local context, or GloVe, based on explicit global co-occurrence statistics, the essence is to aggregate a word's co-occurrence contexts across the whole corpus into its vector representation. These methods achieve good results and train quickly, but the resulting vectors are static: they cannot change with the context in which the word appears.

## 3. The modern era: pre-trained language models

Autoregressive language model: computes the conditional probability of the word at the current position given the preceding sequence.

Autoencoding language model: reconstructs masked words from the context of the corrupted (masked) sequence.

## 3.1 Cornerstone: the Transformer

## 3.1.1 Attention model

The attention model matches a query against a set of keys to obtain a weight distribution and returns the weighted sum of the corresponding values; scaled dot-product attention computes $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$.
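A minimal PyTorch sketch of scaled dot-product attention; tensor shapes and the optional mask argument are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: (batch, length, d_k) tensors; returns the attention-weighted sum of V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5     # (batch, len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)               # attention distribution over keys
    return weights @ V                                # weighted sum of values
```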
## 3.1.2 Multi-head self-attention
When Q, K, and V come from the same vector sequence, it becomes a self-attention model.
Multi-head self-attention sets up several groups of self-attention, concatenates their output vectors, and maps the result back to the Transformer hidden size through a linear projection. Multi-head self-attention can be understood as an ensemble of several self-attention models.
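A sketch of multi-head self-attention using PyTorch's built-in `nn.MultiheadAttention` rather than a from-scratch implementation; the hidden size of 512 and 8 heads are illustrative, not tied to any particular model.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(2, 10, d_model)   # (batch, sequence length, hidden size)
out, attn_weights = mha(x, x, x)  # Q = K = V = x  -> self-attention
print(out.shape)                  # torch.Size([2, 10, 512])
```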
## 3.1.3 Position encoding
The self-attention model does not take the positions of the input vectors into account, yet position information is critical for sequence modeling. Position information can be introduced either through learned position embeddings or through fixed position encodings; the original Transformer uses sinusoidal position encodings.
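A sketch of the sinusoidal position encoding from the original Transformer paper; it assumes an even `d_model` and returns a matrix that is added to the input embeddings.

```python
import torch

def sinusoidal_position_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)        # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-torch.log(torch.tensor(10000.0)) / d_model))   # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe   # added to the input embeddings before the first Transformer layer
```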
## 3.1.4 Others

In addition, the Transformer block also uses residual connections, layer normalization, and other techniques.

## 3.1.5 Advantages and disadvantages

Compared with RNNs, the Transformer can model longer-range dependencies: the attention mechanism reduces the distance between any two words to 1, giving it a stronger ability to model long sequences.
Compared with RNNs, it has more parameters, which makes training harder and requires more training data.
ELMo models forward and backward language models independently with LSTMs.

Forward language model:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, \dots, t_{k-1})$$

Backward language model:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, \dots, t_N)$$

Training maximizes the joint log-likelihood of both directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \dots, t_N) \Big)$$
After ELMo is trained, the following vectors are available for downstream tasks: $h_{k,0}^{LM}$ is the word embedding from the input layer, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ is the concatenation of the forward and backward LSTM outputs at layer $j$.
When used in a downstream task, the vectors from each layer are combined with a learned weighted sum to obtain the ELMo representation, and a scalar weight scales the resulting vector: $\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$.
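A rough sketch of this weighted combination (often called a scalar mix); the layer count, tensor sizes, and random layer outputs are placeholders for real ELMo outputs.

```python
import torch
import torch.nn.functional as F

L = 2                                                    # number of biLSTM layers
layers = [torch.randn(10, 1024) for _ in range(L + 1)]   # layer 0 = token embedding layer
s = torch.nn.Parameter(torch.zeros(L + 1))               # task-specific layer weights (learned)
gamma = torch.nn.Parameter(torch.ones(1))                # task-specific scale (learned)

weights = F.softmax(s, dim=0)
elmo_vec = gamma * sum(w * h for w, h in zip(weights, layers))   # (10, 1024)
```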
The hidden vectors of different layers contain textual information at different levels of granularity: lower layers capture more lexical and syntactic information, while higher layers capture more semantic information.
Model structure
GPT-1 (Generative Pre-Training) is a unidirectional language model that stacks 12 Transformer blocks as a decoder. Each block applies multi-head self-attention, and the output probability distribution is then obtained through a fully connected layer.

Training maximizes the language-model likelihood:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$
Downstream application
For a downstream task with a labeled dataset $\mathcal{C}$, each instance consists of an input token sequence $x^1, \dots, x^m$ and a label $y$. The tokens are first fed into the pre-trained model to obtain the final feature vector $h_l^m$, and the prediction is then obtained through a fully connected layer:

$$P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y)$$

The goal of the downstream supervised task is to maximize:

$$L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m)$$
To prevent catastrophic forgetting, the pre-training loss can be added to the fine-tuning loss with a certain weight, $L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$, where $\lambda$ is 0.5 in the GPT-1 paper.
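A sketch of this combined fine-tuning objective; `lambda_lm` plays the role of the pre-training-loss weight (0.5 in the GPT-1 paper), and the tensor shapes are illustrative.

```python
import torch

def finetune_loss(task_logits, labels, lm_logits, lm_targets, lambda_lm=0.5):
    """Supervised task loss plus a weighted language-model loss to reduce forgetting."""
    task_loss = torch.nn.functional.cross_entropy(task_logits, labels)
    lm_loss = torch.nn.functional.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1))
    return task_loss + lambda_lm * lm_loss
```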
The core idea of GPT-2 can be summarized as: any supervised task is a subset of language modeling; when the model's capacity is very large and the data are rich enough, training the language model alone can cover other supervised learning tasks. GPT-2 therefore made no major structural innovations over GPT-1; it simply used more parameters and a larger training set, aiming to train a model with stronger generalization ability.
On 7 of 8 language-modeling tasks, GPT-2 surpassed the state-of-the-art methods of the time through zero-shot learning alone (though on some tasks it still fell short of supervised models). GPT-2's biggest contribution was to verify that a model trained on massive data with a huge number of parameters can transfer to other kinds of tasks without additional task-specific training.
At the same time, GPT-2 showed that as model capacity and the amount (and quality) of training data increase, there is still room for its potential to grow. Based on this idea, GPT-3 was born.
GPT-3 again leaves the model structure unchanged but further increases the model capacity and the volume and quality of the training data. It is known for its sheer size, and its results are also very strong.
From GPT-1 to GPT-3, as model capacity and training data grew, the linguistic knowledge learned by the models became richer, and the paradigm of natural language processing gradually shifted from "pre-train then fine-tune" to "pre-train then zero-shot/few-shot learning". A drawback of GPT is that it uses a unidirectional language model; BERT showed that a bidirectional language model can further improve results.
XLNet introduces bidirectional contextual information through a permutation language model. It does not introduce special tokens such as [MASK], thus avoiding the mismatch between token distributions in pre-training and fine-tuning. It also uses Transformer-XL as its backbone, which works better on long texts.
The objective of the permutation language model is:

$$\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid x_{z_{<t}}) \right]$$

where $\mathcal{Z}_T$ is the set of all possible permutations of a sequence of length $T$.
To make the prediction aware of the target position, XLNet uses two-stream self-attention: the query stream uses the position information of the word to be predicted but not its content.
When applied to downstream tasks, the query representation is not needed, and no masking is required.
The masked language model (MLM) randomly masks some of the tokens and then uses the contextual information to predict them. MLM has a problem: there is a mismatch between pre-training and fine-tuning, because the [MASK] token never appears during fine-tuning. To mitigate this, BERT does not always replace a "masked" word-piece token with the actual [MASK] token. The training data generator randomly selects 15% of the tokens; of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.
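A sketch of this masking procedure; the token list, vocabulary, and the `None` label convention for unmasked positions are illustrative choices rather than BERT's actual preprocessing code.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Select ~15% of tokens; 80% -> [MASK], 10% -> random token, 10% unchanged."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # the model must reconstruct the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # random replacement
            # else: keep the original token
    return masked, labels
```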
Native BERT masks individual (sub-word) tokens; later variants instead mask whole words or N-gram spans.
Next sentence prediction (NSP): when sentences A and B are selected as a pre-training sample, B is the actual next sentence of A 50% of the time and a random sentence from the corpus the other 50% of the time.
The classic "pre-training model fine-tuning" Paradigm,theme structure is stacked multi-layer Transformers.
RoBERTa (Robustly Optimized BERT Pretraining Approach) does not change BERT drastically; it carefully experiments with each of BERT's design details to find room for improvement.
BERT has a relatively large number of parameters. The main goal of ALBERT (A Lite BERT) is to reduce the parameter count: it factorizes the embedding matrix (a V x E token embedding followed by an E x H projection) and shares parameters across Transformer layers; it also replaces NSP with sentence-order prediction (SOP).
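A sketch of the factorized embedding idea; the sizes V = 30000, E = 128, H = 768 are illustrative, and the parameter-count comparison in the comments is approximate.

```python
import torch
import torch.nn as nn

# V x H parameters become V x E + E x H, which is much smaller when E << H.
V, E, H = 30000, 128, 768
word_embedding = nn.Embedding(V, E)        # low-dimensional token embeddings
embedding_projection = nn.Linear(E, H)     # project up to the Transformer hidden size

token_ids = torch.tensor([[101, 2054, 2003]])
hidden = embedding_projection(word_embedding(token_ids))   # (1, 3, 768)

# Parameter count: 30000*128 + 128*768 ~ 3.9M  vs.  30000*768 ~ 23M unfactorized.
```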
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) introduces a generator-discriminator setup, replacing the generative masked language model (MLM) pre-training task with a discriminative replaced token detection (RTD) task, in which the model decides whether the current token has been replaced; the idea is similar in spirit to a GAN.
The generator predicts the tokens at the masked positions of the input text:

$$p_G(x_t \mid \boldsymbol{x}) = \frac{\exp\big(e(x_t)^\top h_G(\boldsymbol{x})_t\big)}{\sum_{x'} \exp\big(e(x')^\top h_G(\boldsymbol{x})_t\big)}$$

The discriminator takes the generator's output as input and predicts whether the token at each position has been replaced:

$$D(\boldsymbol{x}, t) = \mathrm{sigmoid}\big(w^\top h_D(\boldsymbol{x})_t\big)$$
In addition, several optimizations are used: the token embeddings are shared between the generator and the discriminator, the generator is made much smaller than the discriminator, and the two are trained jointly rather than adversarially.
In downstream tasks, only the discriminator is used; the generator is discarded.
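A sketch of how replaced-token-detection labels can be derived; the token ids are made up, and real ELECTRA training would of course obtain `corrupted_ids` by sampling from the generator at the masked positions.

```python
import torch

def rtd_labels(original_ids, corrupted_ids):
    """1 where the generator's sample differs from the original token, 0 elsewhere."""
    return (original_ids != corrupted_ids).long()

original = torch.tensor([12, 7, 301, 55])
corrupted = torch.tensor([12, 9, 301, 55])   # position 1 was replaced by the generator
print(rtd_labels(original, corrupted))        # tensor([0, 1, 0, 0])
```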
A common strategy for the Transformer to handle long texts is to split the text into fixed-length segments and encode each segment independently, with no information exchanged between segments.
To better model long texts, Transformer-XL introduces two techniques: segment-level recurrence with state reuse, and relative positional encodings.
During training, Transformer-XL also takes input in fixed-length segments. The difference is that the states of the previous segment are cached and reused when computing the hidden states of the current segment, which gives Transformer-XL the ability to model longer-term dependencies.
Let two consecutive segments of length $L$ be $s_\tau = [x_{\tau,1}, \dots, x_{\tau,L}]$ and $s_{\tau+1} = [x_{\tau+1,1}, \dots, x_{\tau+1,L}]$, and let the $n$-th layer hidden state of segment $s_\tau$ be $h_\tau^n \in \mathbb{R}^{L \times d}$, where $d$ is the hidden dimension. The hidden state of segment $s_{\tau+1}$ is computed as:

$$\tilde{h}_{\tau+1}^{n-1} = \big[\mathrm{SG}(h_\tau^{n-1}) \circ h_{\tau+1}^{n-1}\big]$$
$$q_{\tau+1}^n,\; k_{\tau+1}^n,\; v_{\tau+1}^n = h_{\tau+1}^{n-1} W_q^\top,\; \tilde{h}_{\tau+1}^{n-1} W_k^\top,\; \tilde{h}_{\tau+1}^{n-1} W_v^\top$$
$$h_{\tau+1}^n = \text{Transformer-Layer}\big(q_{\tau+1}^n, k_{\tau+1}^n, v_{\tau+1}^n\big)$$

where $\mathrm{SG}(\cdot)$ denotes stop-gradient and $\circ$ denotes concatenation along the length dimension.
Another benefit of segment-level recurrence is faster inference. A vanilla Transformer used autoregressively can only advance one time step at a time, whereas Transformer-XL directly reuses the representations of previous segments instead of recomputing them from scratch, so inference proceeds segment by segment.
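A minimal sketch of the state-reuse step for a single layer: the previous segment's hidden states are detached (stop-gradient) and concatenated in front of the current segment before keys and values are formed; shapes and names are illustrative.

```python
import torch

def extend_with_memory(h_prev, h_curr):
    """h_prev, h_curr: (prev_len, d) and (curr_len, d) hidden states of one layer."""
    memory = h_prev.detach()                   # SG(.): no gradient flows into the cached segment
    return torch.cat([memory, h_curr], dim=0)  # used for keys/values; queries use h_curr only
```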
In the vanilla Transformer, the attention score between positions $i$ and $j$ with absolute position encodings can be expanded as:

$$A_{i,j}^{\mathrm{abs}} = E_{x_i}^\top W_q^\top W_k E_{x_j} + E_{x_i}^\top W_q^\top W_k U_j + U_i^\top W_q^\top W_k E_{x_j} + U_i^\top W_q^\top W_k U_j$$

The problem is that no matter which segment a token belongs to, its position encoding is the same: the Transformer's position encoding is absolute with respect to the segment and has nothing to do with the token's position in the original text. Transformer-XL modifies the terms above and computes attention with relative positions:

$$A_{i,j}^{\mathrm{rel}} = E_{x_i}^\top W_q^\top W_{k,E} E_{x_j} + E_{x_i}^\top W_q^\top W_{k,R} R_{i-j} + u^\top W_{k,E} E_{x_j} + v^\top W_{k,R} R_{i-j}$$

where $R_{i-j}$ is a relative position encoding, $u$ and $v$ are learnable vectors, and $W_{k,E}$ and $W_{k,R}$ are separate projections for content and position.

DistilBERT's student model: a BERT with half the number of Transformer layers of the teacher, with the token-type embeddings and pooler removed, initialized from the teacher's layers.
Teacher model: BERT-base.
The loss function combines three terms: a distillation loss (soft-target cross-entropy between the teacher's and student's output distributions, computed with a temperature), the supervised MLM loss, and a cosine embedding loss that aligns the student's and teacher's hidden states.

Supervised MLM loss: the cross-entropy loss from masked language model training, in which the student must predict the original tokens at the masked positions.
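A sketch of a DistilBERT-style combined loss; the temperature T = 2.0, the equal weighting of the three terms, and the tensor shapes are illustrative assumptions rather than the exact training configuration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, mlm_labels,
                      student_hidden, teacher_hidden, T=2.0):
    """Soft-target distillation + hard-target MLM cross-entropy + cosine alignment of hidden states."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, mlm_labels)
    cos = 1 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return soft + hard + cos   # in practice the three terms are weighted; weights omitted here
```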