"Positional Embedding": The Secret Behind the Transformer
Translator | Cui Hao
Reviewer | Sun Shujuan
The introduction of the Transformer architecture in deep learning undoubtedly paved the way for a quiet revolution, especially in the NLP branch of the field. An indispensable part of the Transformer architecture is "positional embedding", which gives the neural network the ability to understand the order of words in a long sentence and the dependencies between them.
We know that RNNs and LSTMs, which were introduced before the Transformer, can understand the ordering of words even without positional embeddings. An obvious question, then, is why this concept was introduced into the Transformer and why its advantages are so strongly emphasized. This article explains the reasons.
The concept of embedding in NLP
Embedding is a process in natural language processing that is used to convert raw text into mathematical vectors. This is because a machine learning model cannot handle text directly and use it in its internal computations.
The embedding process used by algorithms such as Word2Vec and GloVe is called word embedding or static embedding.
In this way, a text corpus containing a large number of words can be passed to the model for training. The model assigns a mathematical value to each word, under the assumption that words which frequently appear in similar contexts are similar to each other. The resulting values are then used for further calculations.
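As an illustration of this process, here is a minimal sketch using the gensim library (my choice; the article does not name a library) to train Word2Vec embeddings on a tiny invented corpus. With so little data the resulting vectors are essentially arbitrary; the point is only the workflow.

```python
from gensim.models import Word2Vec

# Toy corpus (invented for illustration): each sentence is a list of tokens.
corpus = [
    ["the", "king", "and", "the", "queen", "ruled", "the", "kingdom"],
    ["the", "queen", "spoke", "to", "the", "king"],
    ["the", "king", "and", "the", "queen", "walked", "on", "the", "road"],
]

# Train static word embeddings; vector_size=3 mirrors the 3-D example in the text.
model = Word2Vec(sentences=corpus, vector_size=3, window=3, min_count=1,
                 workers=1, seed=42)

print(model.wv["king"])                      # 3-dimensional vector for "king"
print(model.wv.similarity("king", "queen"))  # cosine similarity between two words
```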
For example, consider a text corpus of three sentences in which the words "king" and "queen" appear frequently in similar contexts. The model will therefore assume that there is some similarity between these words, and when they are converted into mathematical values they are placed a small distance apart in the multidimensional space.
Now assume there is another word, "road". Logically, it does not appear in contexts similar to "king" and "queen" as often, so it is placed far away from them, somewhere else in the space.
In mathematics, a vector is represented by a series of numbers, where each number represents the word's magnitude in a particular dimension. For example, suppose we represent words in a three-dimensional space. Then "king" can be expressed as [0.21, 0.45, 0.67].
The word "Queen" can be expressed as [0.24,0.41,0.62].
The word "Road" can be expressed as [0.97,0.72,0.36].
Need for positional embeddings in Transformer
For example, let us consider the following sentences:
Sentence 1--"Although Sachin Tendulkar did not score 100 runs today, he led the team to victory".
Sentence 2--"Although Sachin Tendulkar scored 100 runs today, he failed to lead the team to victory."
The two sentences look similar because they share most of their words, but their underlying meanings are very different. The ordering and placement of a word such as "not" changes the entire context of the message being conveyed.
Therefore, in NLP projects, understanding positional information is critical. If a model simply uses numbers in a multidimensional space and misreads the context, the consequences can be serious, especially in predictive models.
To overcome this challenge, neural network architectures such as the RNN (Recurrent Neural Network) and the LSTM (Long Short-Term Memory) were introduced. To some extent, these architectures are very successful at capturing positional information. The main secret behind their success is that they learn long sentences by preserving the order of the words. In addition, they keep track of which words appear close to the "word of interest" and which appear far from it.
For example, consider the following sentence--
"Sachin is the greatest cricketer of all time".
Here, the "words of interest" are traversed one by one, in the order in which they appear in the original text. Furthermore, these architectures also learn by remembering which words appear close to the word of interest and which appear far from it.
Through these techniques, RNNs and LSTMs can capture positional information in large text corpora. However, the real problem is that they must traverse the words of a large corpus sequentially. Imagine a very large text corpus with one million words: going through each word in sequence would take a very long time, and committing that much computation time to training a model is sometimes not feasible.
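To make the sequential bottleneck concrete, here is a minimal sketch of a recurrent cell consuming a sentence one token at a time. It uses PyTorch, which the article itself does not mention, and the tensor sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, seq_len = 8, 16, 6   # arbitrary sizes for illustration
tokens = torch.randn(seq_len, embed_dim)    # stand-in for 6 word embeddings

cell = nn.LSTMCell(embed_dim, hidden_dim)
h = torch.zeros(1, hidden_dim)
c = torch.zeros(1, hidden_dim)

# Each step depends on the hidden state produced by the previous step,
# so the loop cannot be parallelized across positions.
for t in range(seq_len):
    h, c = cell(tokens[t].unsqueeze(0), (h, c))

print(h.shape)  # final hidden state summarizing the sequence: torch.Size([1, 16])
```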
To overcome this challenge, a new and more advanced architecture was introduced: the Transformer.

An important feature of the Transformer architecture is that it can learn a text corpus by processing all of the words in parallel. Whether the corpus contains 10 words or 1 million words makes no difference to the Transformer.
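As a rough illustration, here is a minimal sketch of self-attention consuming every position of a sentence in a single call. It uses PyTorch's nn.MultiheadAttention purely as an example; the article does not prescribe any particular library, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

embed_dim, seq_len = 8, 6                   # arbitrary sizes for illustration
x = torch.randn(1, seq_len, embed_dim)      # one sentence, all word embeddings at once

attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)

# A single call attends over every pair of positions in parallel;
# no step-by-step traversal of the sentence is needed.
out, weights = attn(x, x, x)
print(out.shape)      # torch.Size([1, 6, 8])
print(weights.shape)  # torch.Size([1, 6, 6]) -- attention between every pair of words
```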
Now we face the challenge created by processing words in parallel: because all words are accessed simultaneously, information about the dependencies between them is lost, so the model cannot remember how a specific word relates to the words around it. This brings us back to the original challenge, preserving context dependencies while greatly reducing the model's computation/training time. So how were these problems solved? The solution was continuous trial and error.

When this idea was first explored, the researchers were eager to find an optimized method that could preserve positional information within the Transformer structure. The first approach tried was to introduce a new mathematical vector alongside the word vector, containing the index of the word. Suppose the words are represented in a multidimensional space; after the position vector is added, its magnitude and direction may shift the position of each word.

The disadvantage of this technique is that for a particularly long sentence, the position vector grows proportionally. Say a sentence has 25 words: the first word has a position vector of magnitude 0 added to it, and the last word has a position vector of magnitude 24 added to it. This large spread of values causes problems when we project them into higher dimensions.

Another technique, introduced to keep the position vectors small, computes for each word a fractional value relative to the sentence length and uses it as the magnitude of the position vector:

Value = pos / (N - 1)

where pos is the position of a specific word and N is the number of words in the sentence.
With this technique, regardless of the length of the sentence, the maximum magnitude of the position vector is limited to 1. However, there is a significant flaw: if you compare two sentences of different lengths, a word at the same position receives a different embedding value in each. For the model to understand context, a given word or position should map to the same embedding value throughout the text corpus. If the same position receives different values in different sentences, representing the corpus in a multidimensional space becomes a very complex task, and even if such a space were constructed, the distortion would very likely cause the model to break down at some point. This technique was therefore also excluded from the development of Transformer positional embedding.
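For reference, here is a brief Python/NumPy sketch of the two discarded schemes described above; the function names are my own. It shows the unbounded growth of the raw index values and the inconsistency of fraction-based values across sentences of different lengths.

```python
import numpy as np

def index_positions(sentence_len):
    """First attempt: use the raw word index as the position value."""
    return np.arange(sentence_len, dtype=float)

def fractional_positions(sentence_len):
    """Second attempt: scale the index by the sentence length so values stay in [0, 1]."""
    return np.arange(sentence_len, dtype=float) / (sentence_len - 1)

print(index_positions(25))        # 0, 1, ..., 24 -- magnitudes grow with sentence length
print(fractional_positions(5))    # position 2 -> 0.50
print(fractional_positions(11))   # position 2 -> 0.20 -- same position, different value
```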
Finally, the researchers arrived at the technique used in the Transformer architecture, described in the famous paper "Attention Is All You Need".
According to this technique, the researchers recommend a wave-frequency-based (sinusoidal) positional embedding. Following the convention used in the worked example below, it can be written as:

PE(pos, i) = sin(pos / 10000^(2i/d)) for even i
PE(pos, i) = cos(pos / 10000^(2i/d)) for odd i

where:
"pos" is the position or index value of a specific word in the sentence.
"d " is the maximum length/dimension of the vector representing a specific word in the sentence.
"i " represents the index of the embedding dimension of each position. It also means frequency. When i=0 it is considered to be the highest frequency, for subsequent values the frequency is considered to be of decreasing magnitude.
Each frequency i defines a sine or cosine curve over word positions. Since the height of the curve depends on the word's position along the X-axis, the height of the curve can serve as a proxy for the position of the word. If two words have very similar heights, we can take their proximity in the sentence to be high; likewise, if their heights are very different, we can take their proximity in the sentence to be low.

Based on our example text, "Sachin is a great cricketer":

For pos = 0, d = 3, and word-embedding components i[0] = 0.21, i[1] = 0.45, i[2] = 0.67, applying the formula gives:

When i = 0: PE(0,0) = sin(0 / 10000^(2(0)/3)) = sin(0) = 0
When i = 1: PE(0,1) = cos(0 / 10000^(2(1)/3)) = cos(0) = 1
When i = 2: PE(0,2) = sin(0 / 10000^(2(2)/3)) = sin(0) = 0

For pos = 3, d = 3, and word-embedding components i[0] = 0.78, i[1] = 0.64, i[2] = 0.56, applying the formula gives:

When i = 0: PE(3,0) = sin(3 / 10000^(2(0)/3)) = sin(3 / 1) ≈ 0.14
When i = 1: PE(3,1) = cos(3 / 10000^(2(1)/3)) = cos(3 / 464.16) ≈ 1.00
When i = 2: PE(3,2) = sin(3 / 10000^(2(2)/3)) = sin(3 / 215443) ≈ 0

Here, the maximum value is limited to 1 (because we are using the sin/cos functions), so the high-magnitude position vectors that plagued the earlier techniques are no longer a problem.
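As a quick numerical check of the two positions worked out above, the short sketch below recomputes their positional values and also shows how such values are typically combined with the word vectors by elementwise addition; the addition step is standard Transformer practice and mirrors the article's earlier idea of adding a position vector to the word vector.

```python
import numpy as np

def positional_embedding(pos, d):
    # Same sinusoidal convention as the sketch above: sine for even i, cosine for odd i.
    return np.array([np.sin(pos / 10000 ** (2 * i / d)) if i % 2 == 0
                     else np.cos(pos / 10000 ** (2 * i / d))
                     for i in range(d)])

word_vec_pos0 = np.array([0.21, 0.45, 0.67])   # word-embedding components used in the text
word_vec_pos3 = np.array([0.78, 0.64, 0.56])

print(np.round(positional_embedding(0, 3), 2))   # [0. 1. 0.]
print(np.round(positional_embedding(3, 3), 2))   # roughly [0.14, 1.0, 0.0]

# Position-aware representations: word vector plus positional vector.
print(np.round(word_vec_pos0 + positional_embedding(0, 3), 2))   # [0.21 1.45 0.67]
print(np.round(word_vec_pos3 + positional_embedding(3, 3), 2))   # [0.92 1.64 0.56]
```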
Another useful property: words that are close to each other fall at similar heights at the lower frequencies and their heights differ only a little at the higher frequencies, whereas for words that are far apart, the height difference grows much larger as we move toward the higher frequencies.
For example, consider this sentence--"The king and the queen were walking on the road."
The words "King" and "Road" are placed further away.
Consider that after applying the wave frequency formula, the two words are roughly similar in height. As we get to higher frequencies (like 0), their heights will become more different.
The words "king" and "queen" are placed closer together. These two words sit at similar heights at the lower frequencies (such as i = 2 here). As we move toward the higher frequencies (such as i = 0), their height difference increases a little, which keeps them distinguishable.

The point to note is that when the proximity of two words is low, their heights become very different as we move toward the higher frequencies; when the words are very close together, there is only a small difference in their heights as we move toward the higher frequencies.

Summary

Through this article, I hope you have gained an intuitive understanding of the mathematics behind positional embedding in machine learning. In short, we discussed the need for positional information, the approaches that were tried, and the sinusoidal technique that the Transformer finally adopted. For technology enthusiasts interested in natural language processing, I think this material is helpful for understanding such computational methods. For more detail, refer to the famous research paper "Attention Is All You Need".

Translator introduction

Cui Hao, 51CTO community editor and senior architect, has 18 years of software development and architecture experience and 10 years of distributed architecture experience.

Original title: Positional Embedding: The Secret behind the Accuracy of Transformer Neural Networks, Author: Sanjay Kumar