StarCraft II cooperative confrontation benchmark surpasses SOTA, new Transformer architecture solves multi-agent reinforcement learning problem
Multi-agent reinforcement learning (MARL) is a challenging problem: it requires not only identifying the policy improvement direction for each agent, but also combining the policy updates of individual agents so that overall performance improves. Recently, initial progress has been made on this problem, with some researchers introducing the centralized training, decentralized execution (CTDE) method, which allows agents to access global information during the training phase. However, these methods cannot cover the full complexity of multi-agent interactions.
In fact, some of these methods have been shown to fail. To address this, the multi-agent advantage decomposition theorem was proposed, and the HATRPO and HAPPO algorithms were derived from it. However, these approaches have limitations of their own: they still rely on carefully designed maximization objectives.
In recent years, sequence models (SM) have made substantial progress in natural language processing (NLP). For example, the GPT series and BERT perform well on a wide range of downstream tasks and achieve strong performance on few-shot generalization tasks.
Sequence models are a natural fit for language tasks because language is inherently sequential. However, sequence methods are not limited to NLP; they serve as widely applicable, general-purpose foundation models. For example, in computer vision (CV), one can split an image into sub-images and arrange them in a sequence as if they were tokens in an NLP task. Well-known recent models such as Flamingo, DALL-E, and GATO all bear the mark of the sequence approach.
With the emergence of network architectures such as the Transformer, sequence modeling has also attracted great attention from the RL community, driving a series of offline RL developments based on the Transformer architecture. These methods show great potential in solving some of the most fundamental RL training problems.
Despite the notable success of these methods, none was designed to model the most difficult aspect of multi-agent systems, and the one unique to MARL: the interaction between agents. In fact, simply giving every agent a Transformer policy and training each one individually is still not guaranteed to improve joint MARL performance. Therefore, although a large number of powerful sequence models are available, MARL has not yet really taken advantage of them.
How can sequence models be used to solve MARL problems? Researchers from Shanghai Jiao Tong University, Digital Brain Lab, Oxford University, and other institutions proposed a new Multi-Agent Transformer (MAT) architecture, which effectively casts cooperative MARL problems as sequence modeling problems. Its task is to map the agents' observation sequence to the agents' optimal action sequence.
The goal of the paper is to build a bridge between MARL and SM in order to unlock the modeling capabilities of modern sequence models for MARL. The core of MAT is an encoder-decoder architecture that uses the multi-agent advantage decomposition theorem to transform the joint policy search problem into a sequential decision-making process. As a result, the multi-agent problem exhibits linear time complexity and, most importantly, MAT is guaranteed to improve monotonically. Unlike previous techniques such as Decision Transformer, which require pre-collected offline data, MAT is trained in an on-policy fashion through online trial and error in the environment.
To validate MAT, the researchers conducted extensive experiments on the StarCraft II, Multi-Agent MuJoCo, Dexterous Hands Manipulation, and Google Research Football benchmarks. The results show that MAT achieves better performance and data efficiency than strong baselines such as MAPPO and HAPPO. In addition, the study shows that MAT performs better on unseen tasks regardless of how the number of agents changes, which makes it an excellent few-shot learner.
In this section, the researchers first introduce the cooperative MARL problem formulation and the multi-agent advantage decomposition theorem, which are the cornerstones of the paper. They then review existing MARL methods related to MAT, which finally leads to the Transformer.
Comparison of the traditional multi-agent learning paradigm (left) and the multi-agent sequence decision-making paradigm (right).
Cooperative MARL problems are usually modeled as decentralized partially observable Markov decision processes (Dec-POMDPs).
The agent evaluates the value of actions and observations through Q_π(o, a) and V_π(o), which are defined as follows.
Theorem 1 (Multi-Agent Advantage Decomposition): Let i_{1:n} be a permutation of the agents. The following equation always holds, without further assumptions.
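For reference, the decomposition is usually written as below. This is a sketch in the notation commonly used for this theorem, and it assumes the multi-agent advantage of a subset of agents i_{1:m}, given the actions of another subset j_{1:h}, is defined as a difference of multi-agent Q-functions; the symbols are not spelled out in the text here.

```latex
% Assumed definition of the multi-agent advantage function
A_{\pi}^{i_{1:m}}\left(o, a^{j_{1:h}}, a^{i_{1:m}}\right)
  = Q_{\pi}^{j_{1:h},\, i_{1:m}}\left(o, a^{j_{1:h}}, a^{i_{1:m}}\right)
  - Q_{\pi}^{j_{1:h}}\left(o, a^{j_{1:h}}\right)

% Theorem 1: the joint advantage is the sum of per-agent advantages,
% each conditioned on the actions of the agents earlier in the permutation
A_{\pi}^{i_{1:n}}\left(o, a^{i_{1:n}}\right)
  = \sum_{m=1}^{n} A_{\pi}^{i_m}\left(o, a^{i_{1:m-1}}, a^{i_m}\right)
```

With these definitions the sum telescopes: each term subtracts the Q-function that the previous term added, leaving the joint Q-value minus V_π(o).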
Importantly, Theorem 1 provides an intuition for how to choose incremental improvement actions.
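To make that intuition concrete, here is a minimal numerical sketch on a made-up one-step game with two agents. The payoff table `Q`, the policies `pi1` and `pi2`, and all function names are hypothetical and chosen only for illustration. The sketch checks that the joint advantage equals the sum of the per-agent advantages, then lets each agent in turn pick an action with positive advantage given the choices already made.

```python
# Hypothetical one-step cooperative game: joint payoff Q[a1][a2].
Q = [[1.0, 0.0, 2.0],
     [0.5, 3.0, -1.0]]
pi1 = [0.5, 0.5]        # agent 1's current policy over its 2 actions
pi2 = [0.2, 0.3, 0.5]   # agent 2's current policy over its 3 actions

# V: expected payoff when both agents follow their current policies.
V = sum(pi1[a1] * pi2[a2] * Q[a1][a2] for a1 in range(2) for a2 in range(3))


def q1(a1):
    """Q-value of agent 1's action, averaging over agent 2's policy."""
    return sum(pi2[a2] * Q[a1][a2] for a2 in range(3))


def adv1(a1):
    """Advantage of agent 1's action on its own."""
    return q1(a1) - V


def adv2(a1, a2):
    """Advantage of agent 2's action, given agent 1 already chose a1."""
    return Q[a1][a2] - q1(a1)


def joint_adv(a1, a2):
    """Advantage of the joint action."""
    return Q[a1][a2] - V


def decomposition_holds():
    """Check joint advantage == sum of sequential per-agent advantages."""
    return all(
        abs(joint_adv(a1, a2) - (adv1(a1) + adv2(a1, a2))) < 1e-12
        for a1 in range(2) for a2 in range(3)
    )


def sequential_choice():
    """Each agent in turn maximizes its own conditional advantage."""
    a1 = max(range(2), key=adv1)
    a2 = max(range(3), key=lambda a: adv2(a1, a))
    return a1, a2
```

Because every per-agent term is positive, the joint advantage of the sequentially chosen action is positive as well. Note that this guarantees an improvement over the current joint policy, not a globally optimal joint action.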
The researchers summarize two current SOTA MARL algorithms, both of which build on Proximal Policy Optimization (PPO). PPO is an RL method known for its simplicity and stable performance.
Multi-Agent Proximal Policy Optimization (MAPPO) is the first and most straightforward method to apply PPO to MARL.
Heterogeneous-Agent Proximal Policy Optimization (HAPPO) is one of the current SOTA algorithms. It makes full use of Theorem 1 to achieve multi-agent trust-region learning with a monotonic improvement guarantee.
Transformer model
Given the sequential property described in Theorem 1 and the principles behind HAPPO, it is natural to consider using the Transformer model to implement multi-agent trust-region learning. By treating a team of agents as a sequence, the Transformer architecture can model teams with a variable number and variety of agents while avoiding the shortcomings of MAPPO and HAPPO.
To realize the sequence modeling paradigm for MARL, the researchers propose the Multi-Agent Transformer (MAT). The idea of applying the Transformer architecture stems from the fact that mapping the agents' input observation sequence (o^{i_1}, ..., o^{i_n}) to the output action sequence (a^{i_1}, ..., a^{i_n}) is a sequence modeling task similar to machine translation. As Theorem 1 implies, action a^{i_m} depends on the previous decisions of all preceding agents, a^{i_{1:m−1}}.
Therefore, as shown in Figure 2 below, MAT contains an encoder that learns a representation of the joint observation and a decoder that outputs the action of each agent autoregressively.
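The autoregressive part can be sketched as a simple loop: each agent's action is sampled given the encoded observations and the actions already chosen by the agents before it. This is a minimal illustration, not MAT's implementation; `decode_actions`, `toy_policy_head`, and the start token value are hypothetical names and choices, and the toy head stands in for a learned decoder.

```python
import random

START_TOKEN = -1  # hypothetical placeholder marking the start of decoding


def decode_actions(obs_repr, policy_head, seed=0):
    """Sample one action per agent, in order, conditioning on earlier actions.

    obs_repr:    list with one encoded observation per agent
    policy_head: callable(obs_repr, previous_actions, m) -> action probabilities
    """
    rng = random.Random(seed)
    actions = [START_TOKEN]
    for m in range(len(obs_repr)):
        probs = policy_head(obs_repr, actions, m)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        actions.append(action)
    return actions[1:]  # drop the start token


def toy_policy_head(obs_repr, previous_actions, m, n_actions=3):
    """Stand-in for a learned decoder: deterministically picks the action
    that follows the most recent one, so the dependence is easy to see."""
    target = (previous_actions[-1] + 1) % n_actions
    return [1.0 if a == target else 0.0 for a in range(n_actions)]
```

For example, `decode_actions(["o1", "o2", "o3", "o4"], toy_policy_head)` yields `[0, 1, 2, 0]`: each agent's choice is determined by the choice of the agent before it. The loop runs once per agent, which is where the linear cost in the number of agents comes from.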
The parameters of the encoder are denoted by φ. The encoder takes the observation sequence (o^{i_1}, ..., o^{i_n}) in any order and passes it through several computational blocks. Each block consists of a self-attention mechanism, a multilayer perceptron (MLP), and residual connections that prevent vanishing gradients and network degradation as depth increases.
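A single block of this kind can be sketched in NumPy as follows. This is a simplified, single-head illustration under stated assumptions: it omits layer normalization and positional information, and the weight names, dimensions, and `init_params` helper are hypothetical rather than taken from MAT.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Unmasked attention: every agent attends to every agent.

    x has shape (n_agents, d_model), one row per agent observation embedding.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v


def encoder_block(x, p):
    """Self-attention and a ReLU MLP, each wrapped in a residual connection."""
    h = x + self_attention(x, p["w_q"], p["w_k"], p["w_v"])
    return h + np.maximum(0.0, h @ p["w_1"]) @ p["w_2"]


def init_params(d_model, d_hidden, seed=0):
    """Small random weights, for illustration only."""
    rng = np.random.default_rng(seed)
    shapes = {
        "w_q": (d_model, d_model),
        "w_k": (d_model, d_model),
        "w_v": (d_model, d_model),
        "w_1": (d_model, d_hidden),
        "w_2": (d_hidden, d_model),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
```

Because this sketch has no positional encoding, reordering the agents in the input simply reorders the rows of the output, which is one way to read "in any order".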
The parameters of the decoder are denoted by θ. The decoder passes the embedded joint action a^{i_{0:m−1}}, m ∈ {1, ..., n} (where a^{i_0} is an arbitrary symbol indicating the start of decoding) through a sequence of decoding blocks. Crucially, each decoding block has a masked self-attention mechanism. To train the decoder, the clipped PPO objective is minimized as follows.
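As a rough illustration of what a clipped PPO objective computes, the sketch below implements the standard PPO clipped surrogate as a loss to minimize. It is not MAT's exact objective: it assumes the per-agent, per-timestep samples have already been flattened into plain lists, and the function name and the default clip range `eps=0.2` are assumptions made for this example.

```python
import math


def clipped_ppo_loss(new_logp, old_logp, advantages, eps=0.2):
    """Negative clipped surrogate, averaged over all samples.

    new_logp:   log-probabilities of the taken actions under the current policy
    old_logp:   log-probabilities under the policy that collected the data
    advantages: advantage estimates for the same samples
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)            # probability ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)     # pessimistic bound
    return -total / len(advantages)
```

When the new and old policies agree, every ratio is 1 and the loss reduces to the negative mean advantage. When a ratio moves outside the range [1 − eps, 1 + eps] in the direction the advantage favors, the clipped term takes over and the sample stops contributing gradient, which is what keeps each update close to the data-collecting policy.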
The detailed data flow in MAT is shown in the following animation.
To evaluate whether MAT meets expectations, the researchers tested it on the StarCraft II Multi-Agent Challenge (SMAC) benchmark (on which MAPPO has shown superior performance) and on the multi-agent MuJoCo benchmark (on which HAPPO has SOTA performance).
In addition, the researchers ran extended tests of MAT on the Bimanual Dexterous Hands Manipulation (Bi-DexHands) and Google Research Football benchmarks. The former offers a range of challenging two-handed tasks, and the latter offers a range of cooperative scenarios within a football game.
Finally, since Transformer models usually show strong generalization on few-shot tasks, the researchers expected MAT to generalize similarly well to unseen MARL tasks. They therefore designed zero-shot and few-shot experiments on SMAC and multi-agent MuJoCo tasks.
As shown in Table 1 and Figure 4 below, on the SMAC, multi-agent MuJoCo, and Bi-DexHands benchmarks, MAT is significantly better than MAPPO and HAPPO on almost all tasks, indicating its strong modeling capability on both homogeneous and heterogeneous agent tasks. Furthermore, MAT also outperforms MAT-Dec, indicating the importance of the decoder architecture in MAT's design.
Similarly, the researchers obtained comparable performance results on the Google Research Football benchmark, as shown in Figure 5 below.
MAT for few-shot learning
The zero-shot and few-shot results for each algorithm are summarized in Table 2 and Table 3, where bold numbers indicate the best performance.
As a control group, the researchers also report the performance of MAT trained from scratch on the same data. As shown in the table below, MAT achieves most of the best results, which demonstrates the strong generalization performance of MAT's few-shot learning.