
Jointly produced by Tsinghua and Peking University: a survey to understand the ins and outs of 'Transformer + Reinforcement Learning'

PHPz, 2023-04-13 14:01:03

Since its release, the Transformer model has quickly become a mainstream neural architecture in supervised learning settings in the fields of natural language processing and computer vision.

The Transformer craze has begun to sweep across reinforcement learning (RL) as well. However, because of characteristics specific to RL, such as its need for distinctive features and architecture design, combining Transformers with reinforcement learning has not been smooth, and no paper had yet comprehensively summarized how the area developed.

Recently, researchers from Tsinghua University, Peking University, and Tencent jointly published a survey on combining Transformers with reinforcement learning, systematically reviewing the motivation for using Transformers in reinforcement learning and how the area has developed.


Paper link: https://arxiv.org/pdf/2301.03044.pdf

The paper categorizes existing work, discusses each sub-field in depth, and closes with the future prospects of this research direction.

Transformer with RL

Reinforcement learning (RL) provides a mathematical formalism for sequential decision-making, allowing a model to acquire intelligent behavior automatically.

RL provides a general framework for learning-based control. With the introduction of deep neural networks, the generality of deep reinforcement learning (DRL) has advanced considerably in recent years, but the sample efficiency problem still hinders the widespread application of DRL in the real world.

One effective way to address this problem is to introduce inductive biases into the DRL framework. Among the most important is the choice of function approximator architecture, for example, how the neural network of the DRL agent is parameterized.

However, architecture design in DRL remains underexplored compared with architecture design in supervised learning (SL), and most existing work on RL architectures has been motivated by the success of the (semi-)supervised learning community.

For example, a common practice for handling high-dimensional image-based inputs in DRL is to introduce convolutional neural networks (CNNs); another common practice, for handling partial observability, is to introduce recurrent neural networks (RNNs).

In recent years, the Transformer architecture has revolutionized the learning paradigm across a wide range of SL tasks and has shown better performance than CNNs and RNNs. For example, the Transformer architecture can model longer-range dependencies and scales very well.

Inspired by this success in SL, interest in applying Transformers to reinforcement learning has surged. The idea can be traced back to a 2018 paper in which the self-attention mechanism was used for relational reasoning over structured state representations.

After that, many researchers began to try to apply self-attention to representation learning to extract the relationships between entities, which can lead to better policy learning.
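To make the idea of attending over entities concrete, here is a minimal sketch of single-head scaled dot-product self-attention applied to a handful of entity feature vectors. It is an illustration only, not the method of any specific paper: the entity vectors and the identity projection matrices are made-up values, and `self_attention` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def self_attention(entities, w_query, w_key, w_value):
    """Single-head scaled dot-product self-attention over entity vectors.

    Returns the attended outputs and the attention weights, where
    weights[i][j] says how strongly entity i attends to entity j.
    """
    queries = [matvec(w_query, e) for e in entities]
    keys = [matvec(w_key, e) for e in entities]
    values = [matvec(w_value, e) for e in entities]
    scale = math.sqrt(len(keys[0]))

    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append(
            [sum(a * v[j] for a, v in zip(attn, values))
             for j in range(len(values[0]))]
        )
    return outputs, weights


# Assumed toy data: three 2-D entity features and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
entities = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(entities, identity, identity, identity)
```

In this toy setup the first entity attends more strongly to the second (which has similar features) than to the third, which is the kind of pairwise relationship a policy could exploit downstream. In practice the projection matrices would be learned.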


In addition to state representation learning, previous work has also used Transformers to capture temporal dependencies across multiple steps in order to deal with partial observability.

Recently, offline RL has attracted attention because it can exploit large-scale offline datasets. Related results also show that the Transformer architecture can be used directly as a model for sequential decision-making and can generalize across multiple tasks and domains.
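As a rough illustration of what "sequential decision-making as sequence modeling" can look like, the sketch below turns one offline trajectory into a flat token sequence. It assumes the return-to-go, state, action layout popularized by Decision Transformer (a method the article mentions later); the helper names and the toy trajectory are hypothetical.

```python
def returns_to_go(rewards):
    """Return the sum of rewards from each timestep to the end of the episode."""
    result, running = [], 0.0
    for reward in reversed(rewards):
        running += reward
        result.append(running)
    return result[::-1]


def to_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples into one token list.

    Each token is a (kind, timestep, value) tuple; a real model would embed
    these values instead of keeping them as Python objects.
    """
    sequence = []
    for t, (g, s, a) in enumerate(zip(returns_to_go(rewards), states, actions)):
        sequence.append(("return_to_go", t, g))
        sequence.append(("state", t, s))
        sequence.append(("action", t, a))
    return sequence


# Assumed toy trajectory with three timesteps.
states = ["s0", "s1", "s2"]
actions = ["left", "right", "left"]
rewards = [1.0, 0.0, 2.0]
sequence = to_sequence(states, actions, rewards)
```

A sequence model trained on such data learns to predict the action token given the preceding tokens, so at test time a desired return can be supplied as a conditioning input.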

The purpose of this research paper is to introduce the field of Transformers in Reinforcement Learning (TransformRL).


Although the Transformer is regarded as a foundation model for most current SL research, it remains less explored in the RL community. Compared with SL, using a Transformer as a function approximator in RL raises several distinct problems:

1. The training data of an RL agent is usually a function of the current policy, which introduces non-stationarity into the Transformer's learning process.

2. Existing RL algorithms are usually highly sensitive to design choices during the training process, including network architecture and capacity.

3. Transformer-based architectures often suffer from high computing and memory costs, which means that training and inference are slow and expensive.

For example, in some game AI settings, the efficiency of sample generation greatly affects training performance, and that efficiency depends on the computational cost of the RL policy network and value network.

The future of TransformRL

The paper briefly reviews the progress of Transformers for RL. Its advantages mainly include:

1. Transformers can be used as a powerful module in RL, such as a representation module or world model;

2. Transformer can be used as a sequential decision-maker;

3. Transformer can improve generalization performance across tasks and domains.

Given that Transformers have shown strong performance in the broader artificial intelligence community, the researchers believe that combining Transformers and RL is a promising research direction. Below are some of the future prospects and open questions in this direction.

Combining reinforcement learning and (self-)supervised learning

Tracing the development of TransformRL, we can see that its training methods cover both RL and (self-)supervised learning.

When used as a representation module trained under a traditional RL framework, the optimization of the Transformer architecture is usually unstable. The (self-)supervised learning paradigm can eliminate the deadly triad problem when using Transformers to solve decision-making problems through sequence modeling.

Within the (self-)supervised learning framework, however, policy performance is tightly constrained by the quality of the offline data, and the explicit trade-off between exploitation and exploration no longer exists. Better policies may therefore be learned when RL and (self-)supervised learning are combined in Transformer learning.

Some work has tried supervised pre-training followed by fine-tuning schemes that involve RL, but with a relatively fixed policy, exploration is limited, which is one of the bottlenecks still to be solved.

Along this line, the tasks used for performance evaluation are also relatively simple. Whether Transformers can extend this kind of (self-)supervised learning to larger datasets, more complex environments, and real-world applications also deserves further exploration.

Additionally, the researchers hope that future work will provide additional theoretical and empirical insights into the conditions under which such (self-)supervised learning is expected to perform well.


Connect online and offline learning through Transformer

Stepping into offline RL is a milestone for TransformRL. In practice, however, using Transformers to capture dependencies in decision sequences and to abstract policies relies heavily on the considerable amount of offline data involved.

However, for some decision-making tasks, it is not feasible to get rid of the online framework in practical applications.

On the one hand, expert data is not easy to obtain for some tasks; on the other hand, some environments are open-ended (such as Minecraft), which means policies must be continually adjusted to handle tasks not seen before during online interaction.

Therefore, researchers believe that it is necessary to connect online learning and offline learning.

Most research progress after Decision Transformer focuses on offline learning frameworks, and some work attempts to adopt the paradigm of offline pre-training and online fine-tuning. However, the distribution shift in online fine-tuning still exists in offline RL algorithms, and researchers expect to solve this problem through some special designs of the Decision Transformer.

Furthermore, how to train an online Decision Transformer from scratch is an interesting open question.

Transformer structure tailored for decision-making problems

The Transformer structure in the current Decision Transformer series of methods is mainly a vanilla Transformer, which was originally designed for text sequences and may have some properties that are not suitable for decision problems.

For example, is it appropriate to use a vanilla self-attention mechanism for trajectory sequences? Do different elements in a decision sequence or different parts of the same element need to be distinguished in positional embedding?
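The positional-embedding question can be made concrete with a small sketch contrasting two indexing schemes for a trajectory in which each timestep contributes three elements (for instance return-to-go, state, action). Both function names are hypothetical, and neither scheme is presented here as the right answer; the point is only to show what "distinguishing elements" could mean in terms of indices.

```python
def per_token_positions(num_steps, elements_per_step=3):
    """Text-style indexing: every token gets its own position index."""
    return list(range(num_steps * elements_per_step))


def per_timestep_positions(num_steps, elements_per_step=3):
    """Trajectory-style indexing: tokens from the same timestep share a
    timestep index, and a separate element index marks which part of the
    timestep (e.g. 0 = return-to-go, 1 = state, 2 = action) a token is."""
    timestep_ids = [t for t in range(num_steps) for _ in range(elements_per_step)]
    element_ids = [e for _ in range(num_steps) for e in range(elements_per_step)]
    return timestep_ids, element_ids


token_ids = per_token_positions(2)
timestep_ids, element_ids = per_timestep_positions(2)
```

With two timesteps, the first scheme yields six distinct positions, while the second yields two timestep indices repeated three times each plus an element index. A model could embed either or both; which choice works better for decision problems is exactly the open question raised above.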

In addition, since there are many variants of representing trajectories as sequences in different Decision Transformer algorithms, there is still a lack of systematic research on how to choose among them.

For example, how should robust hindsight information be chosen when deploying such algorithms in industry?

The vanilla Transformer is also computationally expensive, which makes both training and inference costly, and its high memory usage limits the length of the dependencies it can capture.
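A back-of-the-envelope calculation shows why memory limits the usable sequence length. Standard self-attention stores one score for every pair of positions, so the score matrices grow quadratically with sequence length. The sketch below counts only those matrices for a single sequence; the model sizes are assumed values chosen for illustration, and real implementations (including memory-efficient attention variants) can differ substantially.

```python
def attention_matrix_bytes(seq_len, num_heads, num_layers, bytes_per_value=4):
    """Bytes needed to hold all attention score matrices for one sequence.

    Each head in each layer stores a seq_len x seq_len matrix of scores.
    Activations, weights, and batch size are deliberately ignored.
    """
    return seq_len * seq_len * num_heads * num_layers * bytes_per_value


# Assumed configuration: 8 heads, 6 layers, 32-bit floats.
short_context = attention_matrix_bytes(1024, num_heads=8, num_layers=6)
long_context = attention_matrix_bytes(2048, num_heads=8, num_layers=6)

print(short_context / 2**20, "MiB")  # 192.0 MiB
print(long_context / 2**20, "MiB")   # 768.0 MiB
```

Doubling the context from 1024 to 2048 tokens quadruples this cost. For trajectories, where each timestep may occupy several tokens, this growth directly caps how many past steps the model can attend to.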

In order to alleviate these problems, some work in NLP has improved the structure of Transformer, but whether a similar structure can be used for decision-making problems is also worth exploring.

Use Transformer to implement more general agents

In the paper, the review of Transformers for generalist agents shows the potential of Transformers as a general policy.

In fact, the design of the Transformer allows it to process multiple modalities (such as images, video, text, and speech) in a similar way, as sequences of blocks, and it has demonstrated excellent scalability to ultra-large-capacity networks and huge datasets.

Recent work has also made significant progress in training agents capable of performing multimodal and cross-domain tasks.

However, given that these agents are trained on large-scale datasets, it is not yet certain whether they merely memorize the data or can actually generalize effectively.

Therefore, how to learn an agent that can generalize to unseen tasks without strong assumptions is still a question worth studying.

Additionally, researchers are curious whether Transformer is powerful enough to learn a general world model that can be used for different tasks and scenarios.

RL for Transformers

While the article has discussed how RL can benefit from the Transformer model, the reverse direction, using RL to improve Transformer training, remains an interesting open problem that has not been well explored.

Recent work on reinforcement learning from human feedback (RLHF) shows that one can learn a reward model and then use an RL algorithm to fine-tune a Transformer so that the language model is aligned with human intentions.
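The article does not specify how the reward model is trained. One commonly used formulation, shown here purely as an illustrative assumption, fits the reward model on pairs of responses where a human preferred one over the other, minimizing the negative log-sigmoid of the reward difference. The function name and the scalar rewards below are hypothetical.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Negative log-sigmoid of the reward gap: -log sigmoid(r_chosen - r_rejected).

    Written as softplus(-diff) in a numerically stable form, so large
    positive or negative gaps do not overflow.
    """
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss shrinks as the preferred response is scored higher than the other.
agree = pairwise_reward_loss(2.0, 0.0)
tie = pairwise_reward_loss(0.0, 0.0)
disagree = pairwise_reward_loss(0.0, 2.0)
```

Once such a reward model is trained, its scalar output can serve as the reward signal for the RL fine-tuning stage described above.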

In the future, the researchers believe that RL can become a useful tool to further improve Transformer's performance in other fields.


Statement: This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.