Deep learning is widely used in everyday tasks, especially in areas that involve a degree of "humanness", such as image recognition. Unlike many other machine learning algorithms, the most prominent feature of deep networks is that their performance can keep improving as more data becomes available. So the more data there is, the better the expected performance.
One of the tasks deep networks are best at is machine translation. Deep learning is currently the state-of-the-art approach to this task, and it is practical enough that even Google Translate uses it. Training a machine translation model requires sentence-level parallel data: for each sentence in the source language, there must be a corresponding translated sentence in the target language. It is not hard to see why this is a problem. For some language pairs, it is difficult to obtain large amounts of such data (and hence to use deep learning at all).
How this article is structured
This article is based on a paper recently published by Facebook titled "Unsupervised Machine Translation Using Monolingual Corpora Only". It does not strictly follow the structure of the paper. I have added some of my own interpretation to make the material easier to understand.
Reading this article requires some basic knowledge about neural networks, such as loss functions, autoencoders, etc.
Problems with Machine Translation
As mentioned above, the biggest problem with using neural networks for machine translation is that they require a dataset of sentence pairs in the two languages. Such data is available for widely spoken languages such as English and French, but not for many other language pairs. When parallel data is available for a language pair, translation becomes a supervised task.
Solution
The authors of this paper figured out how to turn this into an unsupervised task. The only thing required is two arbitrary corpora, one in each language, such as any novel in English and any novel in Spanish. Note that the two novels do not need to be translations of each other.
Intuitively, the authors found a way to learn a latent space shared by the two languages.
Autoencoders Overview
Autoencoders are a broad class of neural networks used for unsupervised tasks. They work by learning to reproduce their input at the output. The key to this is a layer in the middle of the network called the bottleneck layer. This layer is meant to capture all the useful information about the input and discard the rest.
A conceptual autoencoder. The module in the middle is the bottleneck layer, which stores the compressed representation.
In short, the space in which the input lies at the bottleneck layer (after it has been transformed by the encoder) is called the latent space.
Denoising Autoencoder
If an autoencoder is trained to reconstruct its input exactly as it was given, it may not learn anything useful. The output can be reconstructed perfectly without any useful features in the bottleneck layer. To solve this problem, we use a denoising autoencoder. First, the actual input is slightly perturbed by adding some noise. The network is then trained to reconstruct the original image (not the noisy version). By learning what is noise and what is not, the network learns the truly useful features of the image.
A conceptual example of a denoising autoencoder. A neural network takes the left image as input and reconstructs the right image. In this case, the green neurons together form the bottleneck layer.
Why learn a common latent space?
A latent space captures the characteristics of the data (in our case, the data is sentences). So if we can obtain a space in which a sentence in language A produces the same features as the equivalent sentence in language B, we can translate between the two languages. Since the model then has the correct "features", a sentence can be encoded by the encoder for language A and decoded by the decoder for language B, and the pair effectively performs translation.
As you may have guessed, the authors use denoising autoencoders to learn a feature space. They also figured out how to make the autoencoders learn a common latent space (which they call an aligned latent space) in order to perform unsupervised machine translation.
Denoising Autoencoders in Languages
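The authors use denoising autoencoders to learn features in an unsupervised manner. The loss function they define is:
$$\mathcal{L}_{auto}(l) = \mathbb{E}_{x \sim D_l,\ \hat{x} \sim d(e(C(x),\, l),\, l)}\left[\Delta(\hat{x}, x)\right]$$
Equation 1.0 Denoising autoencoder loss function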
Interpretation of Equation 1.0
l is the language (in this setting there are two possible languages). x is the input, and C(x) is the result of adding noise to x. We will get to the noise function C shortly. e() is the encoder and d() is the decoder. The last term, Δ(x̂, x), is the sum of token-level cross-entropy errors. Since we have an input sequence and get an output sequence, we want each token to appear in the correct position, which is why this loss is used. We can think of it as multi-label classification, where the i-th token of the input is compared with the i-th token of the output. A token is a basic unit that cannot be broken down further; in our example, a token is a word. Equation 1.0 is therefore a loss function that drives the network to minimize the difference between its output (given a noisy input) and the original, uncorrupted sentence.
The symbol 𝔼 with ~ in its subscript denotes an expectation. In this context it means that the input is drawn from a distribution that depends on the language l, and the mean of the loss is taken. This is only a mathematical formality; the actual loss during training (the sum of cross-entropies) is computed as usual.
The symbol ~ means "sampled from a probability distribution".
We will not go into this detail here. You can learn more about this notation in Chapter 8.1 of the Deep Learning Book.
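To make the term Δ(x̂, x) concrete, here is a minimal sketch of a sum of token-level cross-entropies. It assumes the decoder outputs, for each position, a probability distribution over words represented as a plain dictionary; the function name, the toy sentence and the probabilities are illustrative and are not taken from the paper.

```python
import math
from typing import Dict, List


def sequence_cross_entropy(predicted: List[Dict[str, float]], target: List[str]) -> float:
    """Sum over positions of -log(probability assigned to the correct token)."""
    if len(predicted) != len(target):
        raise ValueError("one predicted distribution is needed per target token")
    loss = 0.0
    for distribution, token in zip(predicted, target):
        # Probability the model gives to the correct token at this position.
        # The small floor avoids log(0) when the token is missing.
        p = max(distribution.get(token, 0.0), 1e-12)
        loss += -math.log(p)
    return loss


# Toy example: the original (clean) sentence and the decoder's per-position guesses.
target = ["the", "cat", "sleeps"]
predicted = [
    {"the": 0.9, "a": 0.1},
    {"cat": 0.5, "dog": 0.5},
    {"sleeps": 0.8, "runs": 0.2},
]
loss = sequence_cross_entropy(predicted, target)
```

The loss shrinks toward zero as the decoder puts more probability on the correct word at every position, which is the behaviour Equation 1.0 rewards.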
How to add noise
For images, you can add noise by simply adding floating point numbers to the pixels, but for languages, you need to use other methods. Therefore, the authors developed their own system to create noise. They denote their noise function as C(). It takes a sentence as input and outputs a noisy version of the sentence.
There are two different ways to add noise.
First, each word in the input is removed with probability P_wd.
Secondly, each word can be shifted from its original position, subject to the following constraint:
$$|\sigma(i) - i| \le k$$
Here σ(i) is the shifted position of the i-th token. Equation 2.0 therefore means: "A token can move at most k positions to the left or right."
The authors set k to 3 and P_wd to 0.1.
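The two kinds of noise can be sketched in a few lines of Python. This is an illustration and not the authors' code: the function names are made up, and the shuffle uses one simple way of satisfying the constraint, namely sorting the tokens by "index plus uniform noise in [0, k + 1)". Keeping at least one word after dropout is also an assumption made here so that a sentence never becomes empty.

```python
import random
from typing import List, Optional


def drop_words(tokens: List[str], p_wd: float, rng: random.Random) -> List[str]:
    """Remove each token independently with probability p_wd."""
    kept = [t for t in tokens if rng.random() >= p_wd]
    if not kept and tokens:
        # Assumption: never return an empty sentence.
        kept = [rng.choice(tokens)]
    return kept


def shuffled_positions(n: int, k: int, rng: random.Random) -> List[int]:
    """Return sigma, where sigma[i] is the new position of token i and |sigma[i] - i| <= k."""
    # Each index gets a sort key "i + uniform noise in [0, k + 1)".
    keys = [i + rng.random() * (k + 1) for i in range(n)]
    order = sorted(range(n), key=lambda i: keys[i])  # order[new_pos] = old index
    sigma = [0] * n
    for new_pos, old_index in enumerate(order):
        sigma[old_index] = new_pos
    return sigma


def shuffle_words(tokens: List[str], k: int, rng: random.Random) -> List[str]:
    """Move every token to its shifted position sigma[i]."""
    sigma = shuffled_positions(len(tokens), k, rng)
    shuffled = [""] * len(tokens)
    for i, token in enumerate(tokens):
        shuffled[sigma[i]] = token
    return shuffled


def add_noise(tokens: List[str], p_wd: float = 0.1, k: int = 3,
              rng: Optional[random.Random] = None) -> List[str]:
    """The noise function C(): word dropout followed by a local shuffle."""
    if rng is None:
        rng = random.Random()
    return shuffle_words(drop_words(tokens, p_wd, rng), k, rng)


sentence = "the quick brown fox jumps over the lazy dog".split()
noisy_sentence = add_noise(sentence, p_wd=0.1, k=3, rng=random.Random(0))
```

The sort-key trick works because a token at index i can only be overtaken by tokens at most k positions to its right, and can only overtake tokens at most k positions to its left.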
Cross-domain training
To learn to translate between two languages, an input sentence (in language A) must be mapped to an output sentence (in language B) by some process. The authors call this process cross-domain training. First, an input sentence x is sampled. Then the model from the previous iteration, M(), is used to generate a translation y, so that y = M(x). Next, the same noise function C() described above is applied to y, giving C(y). This corrupted translation, which is in language B, is encoded by the encoder for language B and decoded by the decoder for language A, which is trained to reconstruct the original clean sentence x. The model is trained with the same sum of token-level cross-entropy errors as in Equation 1.0.
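The data flow of cross-domain training can be sketched as follows. The sketch follows the formulation in the paper, in which the noisy translation C(M(x)) is the model input and the original sentence x is the reconstruction target. The function name and the two toy stand-ins for M() and C() are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, List, Tuple

Sentence = List[str]


def cross_domain_pair(
    x: Sentence,
    translate: Callable[[Sentence], Sentence],
    add_noise: Callable[[Sentence], Sentence],
) -> Tuple[Sentence, Sentence]:
    """Build one (model input, target) pair for cross-domain training."""
    y = translate(x)        # y = M(x): translation from the previous-iteration model
    noisy_y = add_noise(y)  # C(y): corrupted translation
    return noisy_y, x       # the model must recover the clean x from C(y)


def toy_translate(sentence: Sentence) -> Sentence:
    """Stand-in for M(): upper-casing plays the role of 'the other language'."""
    return [word.upper() for word in sentence]


def toy_noise(sentence: Sentence) -> Sentence:
    """Stand-in for C(): drop the last word of sentences longer than one word."""
    return sentence[:-1] if len(sentence) > 1 else list(sentence)


model_input, target = cross_domain_pair(["the", "cat", "sleeps"], toy_translate, toy_noise)
# model_input == ["THE", "CAT"], target == ["the", "cat", "sleeps"]
```

No parallel data appears anywhere in this procedure: the "translated" side of each pair is produced by the model itself.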
Using adversarial training to learn a common latent space
So far, there has been no mention of how to learn a common latent space. The cross-domain training mentioned above helps to learn a similar space, but a stronger constraint is needed to push the model to learn a similar latent space.
The authors use adversarial training. They use another model (called a discriminator) that takes the output of each encoder and predicts which language the encoded sentence belongs to. The gradients from the discriminator are then used to train the encoder to fool the discriminator. This is conceptually no different from a standard GAN (Generative Adversarial Network). The discriminator receives the feature vector at each time step (because an RNN is used) and predicts which language it comes from.
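The two opposing objectives can be written down for a single latent vector. The sketch below assumes a toy logistic-regression discriminator with made-up weights; a real discriminator would be a neural network applied to the feature vector at every time step, with the losses summed or averaged over time steps.

```python
import math
from typing import List


def discriminator_prob_lang_a(z: List[float], w: List[float], b: float) -> float:
    """Toy logistic discriminator: probability that latent vector z came from language A."""
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 / (1.0 + math.exp(-score))


def discriminator_loss(p_a: float, is_lang_a: bool) -> float:
    """The discriminator is rewarded for naming the true language."""
    p_true = p_a if is_lang_a else 1.0 - p_a
    return -math.log(max(p_true, 1e-12))


def encoder_adversarial_loss(p_a: float, is_lang_a: bool) -> float:
    """The encoder is rewarded when the discriminator names the other language."""
    p_other = 1.0 - p_a if is_lang_a else p_a
    return -math.log(max(p_other, 1e-12))


# A latent vector produced from a language-A sentence (values are illustrative).
z = [2.0, -1.0]
p_a = discriminator_prob_lang_a(z, w=[1.0, 0.5], b=0.0)
d_loss = discriminator_loss(p_a, is_lang_a=True)
e_loss = encoder_adversarial_loss(p_a, is_lang_a=True)
```

Here the discriminator is fairly sure the vector came from language A, so its own loss is low and the encoder's loss is high. Training the encoder to reduce its loss pushes the two languages toward latent representations the discriminator cannot tell apart.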
Putting it all together
The three losses described above (autoencoder loss, translation loss and discriminator loss) are added together, and the weights of all models are updated simultaneously.
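In symbols, the combined objective can be sketched as a weighted sum. The notation is assumed here for illustration: L_auto is the denoising loss of Equation 1.0, L_cd is the cross-domain (translation) loss, L_adv is the adversarial term, and the λ weights are hyperparameters. Setting all three weights to 1 gives the plain sum described above.

$$\mathcal{L} = \lambda_{auto}\left[\mathcal{L}_{auto}(A) + \mathcal{L}_{auto}(B)\right] + \lambda_{cd}\left[\mathcal{L}_{cd}(A \to B) + \mathcal{L}_{cd}(B \to A)\right] + \lambda_{adv}\,\mathcal{L}_{adv}$$

The autoencoder and translation terms each appear twice because both languages, and both translation directions, are trained at the same time.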
Since this is a sequence-to-sequence problem, the authors use long short-term memory (LSTM) networks. Note that there are two LSTM-based autoencoders, one for each language.
At a high level, training this architecture requires three main steps. They follow an iterative training process. The training loop process looks a bit like this:
1. Use the encoder for language A and the decoder for language B to get the translation.
2. Train each autoencoder to be able to regenerate an uncorrupted sentence when given a corrupted sentence.
3. Improve the translation by corrupting the translation obtained in step 1 and reconstructing a clean sentence from it. For this step, the encoder for language A is trained together with the decoder for language B (and likewise the encoder for language B together with the decoder for language A).
Note that even though steps 2 and 3 are listed separately, the weights are updated together.
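The control flow of the loop above can be sketched as a skeleton. Everything here is a placeholder: the translators are plain callables, and `update` stands in for one joint optimisation pass over the denoising, cross-domain and adversarial losses. The point of the sketch is only that each round trains on translations produced by the models of the previous round.

```python
from typing import Callable, List, Tuple

Sentence = List[str]
Translator = Callable[[Sentence], Sentence]
Models = Tuple[Translator, Translator]  # (A -> B, B -> A)
Corpus = List[Sentence]


def iterative_training(
    mono_a: Corpus,
    mono_b: Corpus,
    initial_models: Models,
    update: Callable[[Corpus, Corpus, Corpus, Corpus], Models],
    n_rounds: int,
) -> Models:
    """Alternate between translating with the current models and retraining them."""
    a_to_b, b_to_a = initial_models
    for _ in range(n_rounds):
        # Step 1: translate each monolingual corpus with the current models.
        translated_a = [a_to_b(x) for x in mono_a]  # language-A sentences, now in B
        translated_b = [b_to_a(y) for y in mono_b]  # language-B sentences, now in A
        # Steps 2 and 3: one joint update (hypothetical callable) that uses the
        # monolingual data for denoising and the translations for cross-domain
        # training, and returns the improved translators.
        a_to_b, b_to_a = update(mono_a, mono_b, translated_a, translated_b)
    return a_to_b, b_to_a
```

Passing the initial translators in as an argument makes it explicit that the loop needs some starting translation ability before the first round.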
How to bootstrap this framework
As mentioned above, the model uses its own translations from the previous iteration to improve its translation ability. It is therefore important to have some translation ability before the training loop begins. The authors use FastText to learn a word-level bilingual dictionary. Note that this method is very simple and is only needed to give the model a starting point.
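A word-by-word translator of this kind is easy to sketch. The dictionary below is a hand-written toy and not a learned one, and copying unknown words unchanged is an assumption made here for simplicity.

```python
from typing import Dict, List


def word_by_word_translate(sentence: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Translate each word independently; unknown words are copied unchanged."""
    return [dictionary.get(word, word) for word in sentence]


# Hypothetical English-to-French entries, for illustration only.
toy_en_fr = {"the": "le", "cat": "chat", "eats": "mange"}
initial_translation = word_by_word_translate(["the", "cat", "eats", "fish"], toy_en_fr)
print(initial_translation)  # ['le', 'chat', 'mange', 'fish']
```

Such a translator ignores word order and context, but it is good enough to serve as the initial model M() that the iterative training then improves.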
The entire framework is given in the flow chart below
A high-level view of the entire translation framework
This article has walked through a new technique for performing unsupervised machine translation. It uses several different losses to improve a single task, while using adversarial training to enforce constraints on the behavior of the architecture.