


LLMs are powerful, but scaling them sustainably requires methods that improve their efficiency. Mixture of Experts (MoE) is an important member of this family of methods.
Recently, the new generation of large models released by major technology companies almost all use the Mixture of Experts (MoE) approach.
The concept of mixture of experts was first introduced in the 1991 paper "Adaptive Mixtures of Local Experts" and has been explored and developed for more than 30 years. In recent years, with the emergence and development of sparsely gated MoE, especially in combination with Transformer-based large language models, this thirty-plus-year-old technique has taken on new life.
The MoE framework is based on a simple yet powerful idea: different parts of the model (called experts) focus on different tasks or different aspects of the data.
Under this paradigm, only the experts relevant to a given input participate in processing it, which keeps the computational cost under control while still benefiting from a large amount of specialized knowledge. MoE can therefore improve the capabilities of large language models without significantly increasing their computational requirements.
As shown in Figure 1, MoE-related research has grown rapidly, especially after the emergence of Mixtral-8x7B and, in 2024, various industrial-scale LLMs such as Grok-1, DBRX, Arctic, and DeepSeek-V2.
This figure comes from a recent MoE survey released by a research team from the Hong Kong University of Science and Technology (Guangzhou). It clearly and comprehensively summarizes MoE-related research and proposes a new taxonomy that groups these studies into three categories: algorithms, systems, and applications.
Paper title: A Survey on Mixture of Experts
Paper address: https://arxiv.org/pdf/2407.06204
This site has compiled the main content of this survey to help readers understand the current state of MoE development; for more details, please read the original paper. In addition, we have also compiled some MoE-related reports at the end of the article.
Background on mixture of experts
In a Transformer-based large language model (LLM), each mixture of experts (MoE) layer usually consists of a set of "expert networks" {f_1, ..., f_N} paired with a "gating network" G.
The gating network usually takes the form of a linear network with a softmax activation function, and its role is to route the input to the appropriate expert networks. The MoE layer is placed inside the Transformer block, where it replaces the feed-forward network (FFN), usually right after the self-attention (SA) sub-layer. This placement is critical because the computational cost of the FFN grows as the model grows. For example, in the 540-billion-parameter PaLM model, 90% of the parameters are located in its FFN layers.
Described mathematically: each expert network f_i (usually a linear-ReLU-linear network) is parameterized by W_i; it receives the same input x and produces an output f_i(x; W_i). Meanwhile, a gating network G with parameters Θ (usually a linear-ReLU-linear-softmax network) produces the output G(x; Θ). Depending on how the gating function is designed, MoE layers can be roughly divided into the following two categories.
Dense MoE
A dense mixture-of-experts layer activates all expert networks {f_1, ..., f_N} in every iteration. This strategy was commonly adopted by early MoE research, and some recent studies, such as EvoMoE, MoLE, LoRAMoE, and DS-MoE, still use dense MoE. Figure 2a shows the structure of a dense MoE layer. The output of a dense MoE layer can therefore be expressed as:

F(x; Θ) = Σ_{i=1}^{N} G(x; Θ)_i · f_i(x; W_i),  with G(x; Θ) = softmax(g(x; Θ)),

where g(x; Θ) is the gating value before the softmax operation.
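To make the dense formulation concrete, here is a minimal PyTorch sketch of a dense MoE layer; the class name, dimensions, and the linear-ReLU-linear expert design are illustrative assumptions rather than the survey's reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoE(nn.Module):
    """Dense MoE: every expert processes every token (illustrative sketch)."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        # Each expert is a linear-ReLU-linear network, mirroring a Transformer FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # Gating network g(x; Θ): a single linear layer producing one score per expert.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate_scores = F.softmax(self.gate(x), dim=-1)                   # G(x; Θ), shape (T, N)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (T, N, d_model)
        # Weighted sum over all experts -- every expert is active for every token.
        return (gate_scores.unsqueeze(-1) * expert_outs).sum(dim=1)
```

Every expert runs on every token, so compute grows linearly with the number of experts; this is exactly the cost that sparse gating, described next, avoids.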
Sparse MoE
Although dense mixture-of-experts layers generally achieve higher prediction accuracy, their computational load is also very high.
To address this problem, Shazeer et al.'s paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" introduced a sparsely gated MoE layer, which activates only a selected subset of experts in each forward pass. This strategy achieves sparsity by computing a weighted sum of the outputs of the top-k experts rather than aggregating the outputs of all experts. Figure 2b shows the structure of such a sparse MoE layer.
Following the framework proposed in that paper, the dense formulation above (Equation 2.2 in the paper) can be modified to reflect the sparse gating mechanism:

G(x; Θ) = softmax(TopK(g(x; Θ) + R_noise, k)),
F(x; Θ) = Σ_{i=1}^{N} G(x; Θ)_i · f_i(x; W_i).

Here, TopK(·, k) keeps only the top-k entries of the vector at their original values and sets all other entries to −∞. After the subsequent softmax operation, all −∞ entries become approximately zero. The hyperparameter k is chosen according to the specific application; common choices are k = 1 or k = 2. Adding the noise term R_noise is a common strategy for training sparsely gated MoE layers: it promotes exploration among experts and improves the stability of MoE training.
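The sketch below illustrates noisy top-k gating under these definitions, loosely following Shazeer et al.'s mechanism; the learned softplus noise scale and the layer names are assumptions of this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Sparse gate: keep the top-k expert scores, send the rest to -inf."""
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts)    # g(x; Θ)
        self.w_noise = nn.Linear(d_model, num_experts)   # learns a per-expert noise scale
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.w_gate(x)
        if self.training:
            # R_noise: Gaussian noise scaled by a learned, softplus-positive factor.
            logits = logits + torch.randn_like(logits) * F.softplus(self.w_noise(x))
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float("-inf"))
        masked.scatter_(-1, topk_idx, topk_vals)          # TopK(·, k): non-selected -> -inf
        return F.softmax(masked, dim=-1)                  # -inf entries become ~0
```

Each row of the returned (tokens × experts) matrix has at most k nonzero weights, so only those k experts need to be evaluated for that token.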
Although sparse gating G(x; Θ) can significantly expand the model's parameter space without a corresponding increase in computational cost, it can also lead to a load-balancing problem: the load is distributed unevenly across experts, with some experts used frequently while others are used rarely or not at all.
To address this problem, each MoE layer integrates an auxiliary loss function whose role is to encourage each batch of tokens to be distributed evenly across the experts. In mathematical form, first define a query batch B = {x_1, x_2, ..., x_T} containing T tokens, and N experts. The auxiliary load-balancing loss is then defined as:

L_load-balancing = N · Σ_{i=1}^{N} D_i · P_i,

where D_i is the fraction of tokens dispatched to expert i, and P_i is the fraction of gating probability assigned to expert i. To ensure that the batch is distributed evenly across the N experts, the load-balancing loss L_load-balancing should be minimized. The optimum is reached when each expert receives an equal share of tokens, D_i = 1/N, and an equal share of gating probability, P_i = 1/N:

L_load-balancing = N · Σ_{i=1}^{N} (1/N) · (1/N) = 1.

At this point, the load is balanced across the experts.
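As a sanity check on this formula, the following sketch computes the auxiliary loss from a batch of gate probabilities, assuming top-1 assignment when counting D_i; the tensor names are illustrative:

```python
import torch

def load_balancing_loss(gate_probs: torch.Tensor) -> torch.Tensor:
    """gate_probs: (T, N) softmax gate probabilities for T tokens and N experts."""
    T, N = gate_probs.shape
    assignment = gate_probs.argmax(dim=-1)                   # expert chosen per token (top-1)
    D = torch.bincount(assignment, minlength=N).float() / T  # D_i: fraction of tokens per expert
    P = gate_probs.mean(dim=0)                               # P_i: mean gate probability per expert
    return N * (D * P).sum()

# A balanced batch (tokens spread evenly, near-one-hot gates) gives a loss of 1;
# collapsing all tokens onto one expert drives the loss toward N.
balanced = torch.eye(8).repeat(16, 1)                 # 128 tokens, each expert gets 16
collapsed = torch.zeros(128, 8); collapsed[:, 0] = 1.0
print(load_balancing_loss(balanced), load_balancing_loss(collapsed))  # ~1.0 and ~8.0
```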
In the following, unless otherwise explicitly stated, the term "MoE" refers only to "sparse MoE".
Classification of mixture of experts
To help researchers navigate the large body of LLM research that uses MoE, the team developed a taxonomy that classifies these models along three dimensions: algorithm design, system design, and applications.
Figure 3 shows this classification method and some representative research results.
The following will provide a comprehensive and in-depth introduction to each category.
Algorithm design of mixture of experts
Gating function
The gating function (also known as the routing function or router) is a fundamental component of all MoE architectures. Its role is to coordinate the use of expert computation and to combine the experts' outputs.
Based on how each input is processed, gating can be divided into three types: sparse, dense, and soft. Sparse gating activates a subset of the experts, dense gating activates all experts, and soft gating covers fully differentiable approaches, including input token fusion and expert fusion. Figure 4 illustrates the various gating functions used in MoE models. The sparse gating function activates only selected experts when processing each input token, which can be viewed as a form of conditional computation.
The gating function can implement various forms of gating decisions, such as binary decisions, sparse or continuous decisions, and stochastic or deterministic decisions; it has been studied in depth and can be trained with various forms of reinforcement learning and backpropagation.
Later, this paradigm became the dominant one in MoE research. Because it selects an expert for each input token, it can be regarded as token-selective gating. The following are the main points of this section; see the original paper for details:
- Other advances in token-selective gating
- Non-trainable token-selective gating
- Expert-selective gating
Dense
Although sparse MoE has the advantage of efficiency, the dense MoE direction continues to see innovation. In particular, dense activation performs well in LoRA-MoE fine-tuning, where the computational overhead of the LoRA experts is relatively low. This approach enables efficient and flexible integration of multiple LoRAs for various downstream tasks: it preserves the generative capability of the original pre-trained model while preserving the unique characteristics of each LoRA for each task.
Soft
For sparse MoE, a fundamental discrete optimization problem is deciding which experts to assign to each token. Ensuring balanced expert participation and minimizing unassigned tokens usually requires heuristic auxiliary losses. The problem is especially pronounced in scenarios involving out-of-distribution data (such as small inference batches, novel inputs, or transfer learning).
Like dense MoE, soft MoE methods use all experts when processing each input, which keeps the model fully differentiable and avoids the inherent problems of discrete expert selection. The difference from dense MoE is that soft MoE alleviates the computational requirements through gated, weighted fusion of either the input tokens or the experts.
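To give a feel for token fusion, here is a minimal sketch in the spirit of soft MoE with one "slot" per expert; the slot embedding, shapes, and one-slot-per-expert simplification are assumptions of this sketch, not a specific published method:

```python
import torch
import torch.nn as nn

class SoftTokenFusionMoE(nn.Module):
    """Soft MoE via token fusion: experts see weighted mixes of tokens, not routed tokens."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # One learnable "slot" embedding per expert.
        self.slot_embed = nn.Parameter(torch.randn(d_model, num_experts) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d_model); logits: (T, num_experts) similarity between tokens and slots
        logits = x @ self.slot_embed
        dispatch = logits.softmax(dim=0)      # per slot: a distribution over tokens
        combine = logits.softmax(dim=1)       # per token: a distribution over slots
        slots = dispatch.t() @ x              # (num_experts, d_model): fused expert inputs
        slot_out = torch.stack([e(s) for e, s in zip(self.experts, slots)], dim=0)
        return combine @ slot_out             # (T, d_model)
```

Because every operation is a softmax-weighted matrix product, gradients flow to all experts and to the dispatch weights without any discrete routing decision.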
Experts
This section will introduce the architecture of the expert network within the MoE framework and discuss the gating function that coordinates the activation of these experts.
Network Types
Since MoE was integrated into the Transformer architecture, it has typically replaced the feed-forward network (FFN) module in these models, and each expert in an MoE layer usually replicates the architecture of the FFN it replaces.
This paradigm of using FFNs as experts is still mainstream, but many improvements to it have also been proposed.
Hyperparameters
The scale of the sparse MoE model is controlled by several key hyperparameters, including:
Number of experts per MoE layer
Size of each expert
How frequently MoE layers are placed throughout the model
The choice of these hyperparameters is crucial as it profoundly affects the performance and computational efficiency of the model in various tasks. Therefore, the optimal hyperparameters are selected based on the specific application requirements and computing infrastructure. Table 2 shows some configurations of models using MoE.
In addition, Table 3 lists the number of parameters and benchmark performance of some recent open source models.
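As a rough, hypothetical illustration of how these hyperparameters trade off total capacity against per-token compute, the sketch below estimates the total versus active expert-FFN parameters for a Mixtral-8x7B-like configuration; all numbers and the SwiGLU parameter count are assumptions for illustration, not official figures:

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    d_model: int = 4096          # hidden size of the Transformer
    d_ffn: int = 14336           # expert (FFN) inner size
    num_layers: int = 32         # Transformer blocks
    num_experts: int = 8         # experts per MoE layer
    top_k: int = 2               # experts activated per token
    moe_every: int = 1           # an MoE layer every `moe_every` blocks

    def ffn_params_per_expert(self) -> int:
        # SwiGLU-style FFN uses three weight matrices: gate, up, and down projections.
        return 3 * self.d_model * self.d_ffn

    def total_expert_params(self) -> int:
        moe_layers = self.num_layers // self.moe_every
        return moe_layers * self.num_experts * self.ffn_params_per_expert()

    def active_expert_params(self) -> int:
        moe_layers = self.num_layers // self.moe_every
        return moe_layers * self.top_k * self.ffn_params_per_expert()

cfg = MoEConfig()
print(f"total expert params:  {cfg.total_expert_params() / 1e9:.1f} B")
print(f"active expert params: {cfg.active_expert_params() / 1e9:.1f} B")
```

With 8 experts and top-2 routing, the expert parameters active per token are roughly a quarter of the total expert parameters, which is the efficiency argument behind sparse MoE.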
Activation function
Sparse MoE models built on dense Transformer architectures adopt activation functions similar to those of leading dense LLMs such as BERT, T5, GPT, and LLaMA. Activation functions have evolved from ReLU to more advanced options such as GeLU, GeGLU, and SwiGLU.
This trend also extends to other components of MoE models, which often incorporate techniques such as Root Mean Square Layer Normalization (RMSNorm), Grouped Query Attention (GQA), and Rotary Position Embedding (RoPE).
Shared Experts
DeepSpeed-MoE innovatively introduced the Residual-MoE architecture, in which each token is processed by a fixed expert plus a gate-selected expert, so that two experts participate in processing at each layer while the communication cost does not exceed that of top-1 gating. This approach treats the gate-selected MoE expert as an error-correcting aid to the fixed dense FFN.
Conditional MoE Routing (CMR), used in NLLB, takes a similar approach, combining the outputs of the dense FFN and the MoE layer.
The paradigm that integrates fixed FFN and sparse MoE is often called shared experts, as shown in Figure 5b.
Models such as DeepSeekMoE, OpenMoE, Qwen1.5-MoE and MoCLE have recently adopted this paradigm, indicating that it is becoming a mainstream configuration. However, DeepSeekMoE and Qwen1.5-MoE use multiple shared experts instead of a single one.
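A minimal sketch of the shared-expert pattern follows; the single shared expert, top-1 routing, and module names are simplifying assumptions rather than the exact DeepSpeed-MoE or DeepSeekMoE design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn(d_model: int, d_hidden: int) -> nn.Module:
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))

class SharedExpertMoE(nn.Module):
    """Every token goes through one fixed (shared) expert plus one routed expert."""
    def __init__(self, d_model: int, d_hidden: int, num_routed_experts: int):
        super().__init__()
        self.shared_expert = ffn(d_model, d_hidden)                        # always active
        self.routed_experts = nn.ModuleList(
            ffn(d_model, d_hidden) for _ in range(num_routed_experts))
        self.gate = nn.Linear(d_model, num_routed_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, d_model)
        probs = F.softmax(self.gate(x), dim=-1)
        top_p, top_i = probs.max(dim=-1)                                   # top-1 routing
        routed = torch.zeros_like(x)
        for i, expert in enumerate(self.routed_experts):
            mask = top_i == i
            if mask.any():
                routed[mask] = expert(x[mask])
        # The shared expert complements (or corrects) the routed expert's contribution.
        return self.shared_expert(x) + top_p.unsqueeze(-1) * routed
```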
Mixture of parameter-efficient experts
Parameter-efficient fine-tuning (PEFT) is a method to improve fine-tuning efficiency. Simply put, PEFT updates only a small part of the parameters of the base model during fine-tuning.
PEFT has been successful, but because of its limited number of trainable parameters and the risk of catastrophic forgetting, it is hard to apply in situations that require generalization to multiple tasks.
To alleviate these limitations, the mixture of parameter-efficient experts (MoPE) was born, integrating the MoE framework with PEFT. MoPE combines MoE's gating mechanism and multi-expert architecture, with each expert built using PEFT techniques. This combination can greatly improve PEFT's performance in multi-task scenarios. In addition, because the experts are built with PEFT, MoPE uses fewer parameters and is far more resource-efficient than a traditional MoE model.
MoPE combines the multi-tasking characteristics of MoE and the resource efficiency of PEFT, which is a very promising research direction. Figure 6 classifies MoPEs according to their position in the Transformer model architecture. For a more detailed introduction to research results on MoPE, please refer to the original paper.
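The sketch below shows one way a MoPE-style layer can look when the experts are LoRA adapters attached to a frozen linear projection; the rank, scaling factor, dense softmax mixing of adapters, and class names are illustrative assumptions rather than a specific published MoPE design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """A single low-rank adapter producing a delta of shape (T, d_out)."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)            # start as a no-op adapter
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(self.A(x)) * self.scale

class MoLoRALayer(nn.Module):
    """Frozen base projection plus a gated mixture of LoRA experts."""
    def __init__(self, d_in: int, d_out: int, num_experts: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)              # only adapters and the gate are trained
        self.experts = nn.ModuleList(LoRAExpert(d_in, d_out) for _ in range(num_experts))
        self.gate = nn.Linear(d_in, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.gate(x), dim=-1)                    # (T, E)
        delta = torch.stack([e(x) for e in self.experts], dim=1)     # (T, E, d_out)
        return self.base(x) + (weights.unsqueeze(-1) * delta).sum(dim=1)
```

Only the adapters and the gate are trained, so the trainable parameter count stays small even as the number of experts grows.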
Training and inference solutions
As mixture of experts continues to progress, so do the related training and inference schemes.
The initial scheme was to train the MoE model from scratch and run inference directly with the trained model configuration.
Now, however, many new paradigms have emerged for MoE training and inference, including ones that combine the advantages of dense and sparse models so that they complement each other.
Figure 7 shows the MoE-related training and inference schemes. The emerging schemes can be divided into three categories:
Dense to sparse: start from dense model training and gradually transition to a sparse MoE configuration (an initialization sketch for this route follows the list below);
Sparse to dense: downgrade a sparse MoE model to a dense form, which makes inference easier to deploy on hardware;
Expert model fusion: integrate multiple pre-trained dense expert models into a single unified MoE model.
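For the dense-to-sparse category, one well-known recipe is sparse upcycling: initialize every expert of the MoE layer from the dense model's trained FFN and then continue training with a freshly initialized gate. The sketch below shows only that initialization step; the function name and copy logic are illustrative assumptions:

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, num_experts: int, d_model: int) -> nn.ModuleDict:
    """Turn one trained dense FFN into an MoE layer whose experts all start as copies of it."""
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    gate = nn.Linear(d_model, num_experts)   # freshly initialized router, trained afterwards
    return nn.ModuleDict({"experts": experts, "gate": gate})

# Hypothetical usage: replace each FFN of a trained dense Transformer block.
dense_ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
moe_layer = upcycle_ffn_to_moe(dense_ffn, num_experts=8, d_model=512)
```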
Technologies derived from MoE
Mixture of experts (MoE) has inspired many variant techniques. For example, Xue et al.'s paper "Go Wider Instead of Deeper" proposes WideNet, which increases model width by replacing the feed-forward network (FFN) with an MoE layer while sharing trainable parameters across Transformer layers, except for the normalization layers.
Other examples include SUT (Sparse Universal Transformer) by Tan et al., MoT (Mixture of Tokens) by Antoniak et al., SMoP (Sparse Mixture-of-Prompts) by Choi et al., Lifelong-MoE by Chen et al., and MoD (Mixture-of-Depths) by Raposo et al.
In summary, the development of MoE-derived techniques reveals a trend: MoE is gaining ever more functionality and becoming increasingly adaptable to different fields.
System design of mixture of experts
While mixture of experts (MoE) can enhance the capabilities of large language models, it also brings new technical challenges because of its sparse and dynamic computational load.
GShard introduced expert parallelism, which dispatches partitioned local tokens subject to the expert-capacity constraint used for load balancing, thereby enabling parallel gating and expert computation. This paradigm has become a fundamental strategy for scaling MoE models efficiently. It can be viewed as an enhanced version of data parallelism: each expert in the MoE layer is assigned to a different device, while all non-expert layers are replicated on every device.
As shown in Figure 8a, the expert-parallel workflow performs the following operations in sequence: gate routing, input encoding, All-to-All dispatch, expert computation, All-to-All combine, and output decoding.
Generally, the GEMM input size needs to be large enough to make full use of the computing device. Input encoding therefore aggregates the input tokens of the same expert into a contiguous memory space, as determined by the "token-expert mapping" produced by gate routing. All-to-All dispatch then sends the input tokens to the corresponding experts on each device, after which the experts perform their local computation. Once the computation is complete, the results are gathered via All-to-All combine, and output decoding restores the original data layout according to the gating indices.
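The single-device sketch below illustrates only the input-encoding and output-decoding steps of this workflow, i.e., grouping tokens contiguously by expert and later restoring their original order; the All-to-All exchanges are marked as comments, and the function names are assumptions:

```python
import torch

def encode_inputs(x: torch.Tensor, expert_idx: torch.Tensor):
    """Sort tokens so each expert's tokens occupy a contiguous block (the 'input encode' step)."""
    order = torch.argsort(expert_idx)            # token positions grouped by expert
    return x[order], order

def decode_outputs(y_sorted: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    """Restore the original token order (the 'output decode' step)."""
    y = torch.empty_like(y_sorted)
    y[order] = y_sorted
    return y

x = torch.randn(6, 4)                            # 6 tokens, hidden size 4
expert_idx = torch.tensor([2, 0, 1, 0, 2, 1])    # gate routing: expert chosen per token
x_sorted, order = encode_inputs(x, expert_idx)
# ... All-to-All dispatch would ship each contiguous block to the device hosting that expert,
# ... each expert then runs its FFN locally, and All-to-All combine ships the results back.
y_sorted = x_sorted * 2                          # stand-in for the expert computation
y = decode_outputs(y_sorted, order)
assert torch.allclose(y, x * 2)                  # original layout restored
```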
In addition, some researchers are exploring the synergy between expert parallelism and other existing parallel strategies (such as tensor, pipeline, and sequence parallelism) to improve the scalability and efficiency of MoE models in large-scale distributed environments.
Some hybrid parallelization examples are given in Figure 8, including (b) data + expert + tensor parallelization, (c) data + expert + pipeline parallelization, (d) expert + tensor parallelization.
It is important to recognize that computational efficiency, communication load, and memory usage interact in complex ways: the choice of distributed parallelization strategy affects them and is in turn affected by the hardware configuration. Therefore, when deploying these strategies in practical applications, careful trade-offs must be made and adjusted for the specific scenario.
The team then discusses the system-design challenges faced in MoE model development, and the research addressing them, in three major areas: computation, communication, and storage. See the original paper for details. Table 4 gives an overview of open-source MoE frameworks.
Applications of mixture of experts
In the field of large language models (LLMs), currently dominated by the Transformer, the mixture of experts (MoE) paradigm is very attractive because it can substantially improve model capabilities without introducing excessive computational requirements at the training and inference stages. This type of technology can significantly improve LLM performance on a variety of downstream tasks, and can even enable AI applications that surpass human-level performance.
There are even rumors that the very powerful GPT-4 may adopt some kind of MoE architecture: composed of 8 experts with 220 billion parameters each, trained on diverse datasets and tasks, and iterating the inference process 16 times. For more details on this rumor, see this site's report "The ultimate 'leak': GPT-4 model architecture, training cost, and dataset information revealed".
It is therefore not surprising that MoE is flourishing in natural language processing, computer vision, recommender systems, and multimodal applications.
These applications essentially either use conditional computation to greatly increase the number of model parameters and thereby improve model performance at a fixed compute cost, or implement dynamic expert selection through the gating mechanism to achieve efficient multi-task learning.
The team also introduces representative MoE applications in each of these fields, which should help readers understand how to use MoE for specific tasks. See the original paper for details.
Challenges and opportunities
Mixture of experts is powerful: it reduces cost and improves performance. The outlook is promising, but challenges remain.
In this section, the team summarizes the key challenges related to MoE and points out future research directions that are expected to yield important results. These challenges and directions are briefly listed below; see the original paper for details.
Training stability and load balancing
Scalability and communication overhead
Expert specialization and collaboration
Sparse activation and computational efficiency
Generalization and robustness
Interpretability and transparency
Optimal expert architecture
Integration with existing frameworks
More: MoE-related reports
Basics:
Frontier:
- The throne of open-source large models changes hands again: the 132-billion-parameter DBRX goes online, with both base and fine-tuned models available
- CVPR 2024 | A general image fusion model based on MoE: adding only 2.8% of parameters to complete multiple tasks
- CVPR 2023 | A model for visual multi-task learning
- Google Gemini 1.5 swiftly released: MoE architecture, million-token context
- Apple's large model MM1 enters the arena: 30 billion parameters, multimodal, MoE architecture, more than half of the authors are Chinese
- Breaking through the training efficiency and performance bottlenecks of MoE: Huawei Pangu's new sparse large-model architecture LocMoE is released
- Mistral open-sources the 8x22B large model, OpenAI updates GPT-4 Turbo vision, and they are all picking on Google