Cited 38,000 times in five years: this is how the Transformer universe has developed
Since it was proposed in 2017, the Transformer model has shown unprecedented strength in fields such as natural language processing and computer vision, and has triggered technological breakthroughs such as ChatGPT. Researchers have also proposed a wide range of variants based on the original model.
As academia and industry continue to propose new models based on the Transformer attention mechanism, it can be difficult to keep track of the whole direction. A recent comprehensive article by Xavier Amatriain, head of AI product strategy at LinkedIn, may help solve this problem.
In the past few years, dozens of models from the Transformer family have appeared one after another, all with interesting and memorable names. The goal of the article is to provide a comprehensive but simple catalog and classification of the most popular Transformer models. It also introduces the most important aspects and innovations in Transformer models.
The paper "Transformer models: an introduction and catalog":
Paper link:
##https://arxiv.org/abs/2302.07730
GitHub: https://github.com/xamat/TransformerCatalogIntroduction: What is Transformer
A Transformer is a class of deep learning models defined by a few architectural features. It first appeared in the famous paper "Attention is All You Need", published by Google researchers in 2017 (the paper has been cited more than 38,000 times in just five years), and in the associated blog post. The Transformer architecture is a specific instance of the encoder-decoder models [2] that had become popular two to three years earlier. Until then, however, attention was just one of the mechanisms used by these models, which were mainly based on LSTM (Long Short-Term Memory) [3] and other RNN (Recurrent Neural Network) [4] variants. The key insight of the Transformer paper is that, as the title suggests, attention can be used as the only mechanism for deriving dependencies between inputs and outputs.

Discussing all the details of the Transformer architecture is beyond the scope of this article; for that, the original paper above and The Illustrated Transformer post are both recommended and both excellent. That said, this article briefly describes the most important aspects, which will also come up in the catalog below. It starts with the basic architecture diagram from the original paper and then expands on the related content.
Encoder/Decoder Architecture
A general encoder/decoder architecture (see Figure 1) consists of two models. The encoder takes the input and encodes it into a fixed-length vector. The decoder takes this vector and decodes it into the output sequence. The encoder and decoder are trained jointly to maximize the conditional log-likelihood of the output given the input. Once trained, the encoder/decoder can generate an output given an input sequence, or score a pair of input/output sequences. In the original Transformer architecture, both the encoder and the decoder had 6 identical layers. Each of those 6 encoder layers has two sub-layers: a multi-head attention layer and a simple feedforward network. Each sub-layer has a residual connection followed by layer normalization. The output size of the encoder is 512. The decoder adds a third sub-layer, another multi-head attention layer over the output of the encoder. In addition, the other multi-head attention layer in the decoder is masked so that positions cannot attend to later positions.
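To make this concrete, here is a minimal sketch of that configuration using PyTorch's built-in nn.Transformer module (PyTorch is used purely for illustration; this is not the paper's original code, and the random tensors below stand in for real embedded token sequences):

```python
import torch
import torch.nn as nn

# Original base configuration: 6 encoder layers, 6 decoder layers,
# model width 512, 8 attention heads, feedforward inner size 2048.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
)

# Toy inputs with shape (sequence length, batch size, d_model).
src = torch.rand(10, 32, 512)  # encoder input sequence
tgt = torch.rand(20, 32, 512)  # decoder input sequence

# Causal mask: each decoder position may only attend to earlier positions.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(20)

out = model(src, tgt, tgt_mask=tgt_mask)  # shape: (20, 32, 512)
```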
Figure 1: Transformer architecture
Figure 2: Attention mechanism
Attention
From the description above, it is clear that the only exotic element of the model architecture is multi-head attention, and, as noted, this is where the full power of the model lies. So what exactly is attention? An attention function is a mapping from a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Transformers use multi-head attention, which is the parallel computation of a specific attention function called scaled dot-product attention. For more details on how the attention mechanism works, see The Illustrated Transformer post again; the diagram from the original paper is reproduced in Figure 2 to convey the main idea. Attention layers have several advantages over recurrent and convolutional networks, the two most important being their lower computational complexity and higher connectivity, which are especially useful for learning long-term dependencies in sequences.
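As a concrete illustration, here is a minimal sketch of scaled dot-product attention, i.e. softmax(QK^T / sqrt(d_k)) · V (a generic PyTorch implementation for illustration, not code from the paper):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query: (..., len_q, d_k); key: (..., len_k, d_k); value: (..., len_k, d_v)
    d_k = query.size(-1)
    # Compatibility function: dot product of the query with each key,
    # scaled by sqrt(d_k) to keep the softmax gradients well-behaved.
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)  # one weight per key-value pair
    return weights @ value               # weighted sum of the values

# Multi-head attention runs this function h times in parallel over
# linearly projected subspaces and concatenates the results.
q = torch.rand(2, 5, 64)
k = torch.rand(2, 7, 64)
v = torch.rand(2, 7, 64)
out = scaled_dot_product_attention(q, k, v)  # shape: (2, 5, 64)
```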
What are Transformers used for and why are they so popular?
The original Transformer was designed for language translation, especially from English to German. But, as the original paper already showed, the architecture generalizes well to other language tasks. This trend quickly caught the attention of the research community. In the following months, the leaderboards of most language-related ML tasks became completely dominated by some version of the Transformer architecture (for example, the famous SQuAD leaderboard, where all the top models are ensembles of Transformers). One of the key reasons Transformers came to dominate most NLP leaderboards so quickly is their ability to adapt quickly to other tasks, i.e., transfer learning. Pretrained Transformer models can be adapted very easily and quickly to tasks they were not trained on, which is a huge advantage. As an ML practitioner, you no longer need to train a large model on a huge dataset: all you need to do is reuse a pretrained model on your task, perhaps slightly adjusting it with a much smaller dataset. The specific technique used to adapt a pretrained model to a different task is called fine-tuning, sketched below.
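As a rough sketch of what fine-tuning looks like in practice, here is an example using the Hugging Face transformers library (the model name, toy dataset, and hyperparameters below are illustrative assumptions, not from the article):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load a pretrained model and attach a fresh 2-class classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# A tiny toy dataset: in practice this would be your (much smaller) task data.
texts = ["a great, moving film", "a dull and pointless plot"]
labels = [1, 0]
enc = tokenizer(texts, truncation=True, padding=True)
train_dataset = [
    {"input_ids": enc["input_ids"][i],
     "attention_mask": enc["attention_mask"][i],
     "labels": labels[i]}
    for i in range(len(texts))
]

# Reuse the pretrained weights; only a light training pass on the new task.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=1),
    train_dataset=train_dataset,
)
trainer.train()
```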
Transformers proved so adaptable that, although they were originally developed for language-related tasks, they were quickly adopted for other tasks, ranging from vision, audio, and music applications all the way to playing chess and doing math.
Of course, none of these applications would have been possible if it weren't for the myriad of tools that let anyone write a few lines of code and use a Transformer. Not only was the Transformer quickly integrated into the major AI frameworks (namely PyTorch and TensorFlow), but entire companies have been built on top of it. Hugging Face, a startup that has raised over $60 million to date, was built almost entirely around the idea of commercializing its open-source Transformers library.
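For example, the "few lines of code" promise looks roughly like this with Hugging Face's pipeline API (the exact model downloaded by default, and hence the output, may vary):

```python
from transformers import pipeline

# Download a pretrained Transformer and run inference in a few lines.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers have taken over the NLP leaderboards."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]  (output may vary)
```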
Finally, it is worth mentioning the impact of GPT-3 on Transformers in the early stages of their popularity. GPT-3 is a Transformer model launched by OpenAI in May 2020, a follow-up to their earlier GPT and GPT-2. The company created a lot of buzz by introducing the model in a preprint, claiming it was so powerful that they couldn't release it to the world. Since then, the model has not only been released, but also commercialized through a massive collaboration between OpenAI and Microsoft. GPT-3 powers over 300 different applications and is fundamental to OpenAI's business strategy (which makes sense for a company that has raised over $1 billion in funding).
RLHF
Recently, reinforcement learning from human feedback (or human preferences), RLHF (also known as RLHP), has become a huge addition to the AI toolkit. The concept was already proposed in the 2017 paper "Deep reinforcement learning from human preferences". More recently, it has been applied to ChatGPT and similar conversational agents such as BlenderBot and Sparrow. The idea is simple: once a language model is pretrained, it can generate different responses to a conversation, and humans rank the results. These rankings (a.k.a. preferences, or feedback) can then be used in a reinforcement learning setting to train a reward model (see Figure 3).
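A minimal sketch of the reward-modeling step might look as follows (assuming PyTorch; the pairwise ranking loss shown is the one commonly used in the RLHF literature, and the reward values are made-up placeholders):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    # Push the reward of the human-preferred response above the
    # rejected one: loss = -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Made-up scalar rewards that a reward model might assign to two
# candidate responses for the same prompts (first one ranked higher).
reward_chosen = torch.tensor([1.3, 0.2])
reward_rejected = torch.tensor([0.7, -0.1])
loss = pairwise_ranking_loss(reward_chosen, reward_rejected)
print(loss)  # lower when the model agrees with the human ranking
```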
Diffusion
Diffusion models have become the new SOTA in image generation, apparently pushing aside previous approaches such as GANs (Generative Adversarial Networks). What are diffusion models? They are a class of latent-variable models trained with variational inference. A network trained this way effectively learns the latent space that these images represent (see Figure 4).
Diffusion models are related to other generative models, such as the famous Generative Adversarial Networks (GANs) [16], which they have replaced in many applications, and especially to (denoising) autoencoders. Some authors even argue that diffusion models are just a specific instance of autoencoders. However, they also acknowledge that the small differences do change their application: from the latent representation of the autoencoder to the purely generative nature of the diffusion model.
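To make the idea concrete, here is a sketch of the forward (noising) process from the common DDPM formulation of diffusion models, which the denoising network learns to invert (a generic sketch, not the specific architecture from Figure 4; the schedule values are illustrative):

```python
import torch

# Variance schedule beta_t and cumulative product of (1 - beta_t),
# following the common DDPM formulation (values are illustrative).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    # Sample x_t ~ q(x_t | x_0) = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
    abar = alphas_cumprod[t]
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

x0 = torch.rand(3, 32, 32)       # a toy "image"
noise = torch.randn_like(x0)
x_t = q_sample(x0, torch.tensor(500), noise)
# A denoising network is trained to predict `noise` from (x_t, t);
# generation then runs the process in reverse, starting from pure noise.
```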
Figure 3: Reinforcement learning with human feedback.
Figure 4: Probabilistic diffusion model architecture, excerpted from "Diffusion Models: A Comprehensive Survey of Methods and Applications"
The models introduced in this article include: