MoE and Mamba collaborate to scale state-space models to billions of parameters


王林

Jan 23, 2024 pm 06:00 PM

The state space model (SSM) has attracted considerable attention as an alternative to the Transformer. Compared with the Transformer, an SSM offers linear-time inference on long-context tasks, supports parallel training, and delivers strong performance. In particular, Mamba, which is built on selective SSMs and a hardware-aware design, has performed especially well and has become one of the strongest alternatives to the attention-based Transformer architecture.

Recently, researchers have also been exploring ways to combine SSMs and Mamba with other methods to build more powerful architectures. For example, Heart of the Machine previously reported that "Mamba can replace Transformer, but they can also be used in combination."

Recently, a Polish research team found that combining SSMs with Mixture of Experts (MoE) is a promising way to scale SSMs up substantially. MoE is a technique commonly used to scale Transformers; the recent Mixtral model, for example, uses it (see the Heart of the Machine article).

The result of this team's research is MoE-Mamba, a model that combines Mamba with a Mixture of Experts layer.

Paper address: https://arxiv.org/pdf/2401.04081.pdf

MoE-Mamba delivers the efficiency gains of both SSMs and MoE. The team also found that MoE-Mamba behaves predictably as the number of experts varies.

The team also ran experiments. The results show that MoE-Mamba needs 2.2 times fewer training steps than Mamba to reach the same performance, which suggests that the new method has potential advantages over Transformer and Transformer-MoE. These preliminary results also point to a promising research direction: SSMs may be scalable to tens of billions of parameters.

State space model

State space models (SSMs) are a class of architectures for sequence modeling. The ideas behind them originate in control theory, and they can be viewed as a combination of RNNs and CNNs. Although they have considerable advantages, some problems have kept them from becoming the dominant architecture for language modeling. However, recent research breakthroughs have allowed deep SSMs to scale to billions of parameters while maintaining computational efficiency and strong performance.
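For reference, the "combination of RNN and CNN" view can be read off the usual linear state-space equations. The notation below is the standard one from the SSM literature, not taken from this article, and is only a sketch of the time-invariant case:

```latex
% Continuous-time system
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)

% After discretization: a recurrence (the RNN view)
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t

% Unrolled with h_{-1} = 0: a convolution (the CNN view)
y_t = \sum_{k=0}^{t} C\,\bar{A}^{k}\,\bar{B}\,x_{t-k}
```

The recurrent form gives constant work per generated token, while the convolutional form allows the whole sequence to be processed in parallel during training.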

Mamba

Mamba is a model built on SSMs. It achieves inference in time linear in the context length, and its hardware-aware design also makes training efficient. Mamba uses a work-efficient parallel scan that mitigates the sequential nature of the recurrence, while fused GPU operations remove the need to materialize the expanded state. Intermediate states needed for backpropagation are not saved but recomputed during the backward pass, which reduces memory requirements. Mamba's advantage over attention is most pronounced at inference time: it lowers computational complexity, and its memory usage does not depend on the context length.
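To illustrate why a recurrence can be computed with a scan at all, here is a minimal sketch in plain Python. It assumes a scalar recurrence h_t = a_t * h_{t-1} + b_t and uses simple recursive doubling, which is not the work-efficient GPU scan that Mamba uses; the function names are made up for this example.

```python
def sequential_scan(a, b, h0=0.0):
    """Reference: h_t = a_t * h_{t-1} + b_t, computed one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine maps h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def scan_by_doubling(a, b, h0=0.0):
    """Inclusive prefix scan over `combine` using recursive doubling.

    Every iteration of the inner loop is independent of the others, so a
    parallel machine could run each round in one step (about log2(n) rounds).
    """
    elems = list(zip(a, b))
    n = len(elems)
    step = 1
    while step < n:
        nxt = list(elems)
        for i in range(step, n):
            nxt[i] = combine(elems[i - step], elems[i])
        elems = nxt
        step *= 2
    # Apply each accumulated affine map to the initial state.
    return [a_t * h0 + b_t for a_t, b_t in elems]


# Hypothetical coefficients, chosen only for the demonstration.
a = [0.5, 0.9, 0.1, 0.7, 0.3]
b = [1.0, 2.0, -1.0, 0.5, 4.0]
reference = sequential_scan(a, b)
scanned = scan_by_doubling(a, b)
```

The key point is that composing affine maps is associative, so the steps can be grouped in any order. The two functions agree up to floating-point rounding.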

Mamba addresses the fundamental trade-off between the efficiency and the effectiveness of sequence models, which highlights the importance of state compression. An efficient model must have a small state, and an effective model must have a state that contains all the key information from the context. Unlike other SSMs, which require time invariance and input invariance, Mamba introduces a selection mechanism that controls how information propagates along the sequence dimension. This design choice is motivated by intuition from synthetic tasks such as selective copying and induction heads, and it lets the model identify and retain critical information while filtering out what is irrelevant.
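The effect of an input-dependent update can be seen in a toy example. The sketch below assumes a scalar gated recurrence h_t = (1 - g_t) * h_{t-1} + g_t * x_t as a stand-in for the selection mechanism; it is not Mamba's actual parameterization, and the token sequence and the gate functions are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_scan(tokens, gate_fn):
    """Run h_t = (1 - g_t) * h_{t-1} + g_t * value_t and return the final state."""
    h = 0.0
    for value, is_content in tokens:
        g = gate_fn(value, is_content)
        h = (1.0 - g) * h + g * value
    return h


# Hypothetical sequence of (value, is_content) pairs; flag 0 marks filler tokens.
tokens = [(3.0, 1), (9.0, 0), (-4.0, 0), (7.0, 1), (5.0, 0), (-8.0, 0)]

# Input-invariant gate: every token is mixed into the state with the same weight.
invariant_state = gated_scan(tokens, lambda value, flag: 0.5)

# Input-dependent gate: close to 1 for content tokens, close to 0 for filler.
selective_state = gated_scan(
    tokens, lambda value, flag: sigmoid(20.0 * (flag - 0.5))
)
```

With the input-dependent gate, the final state stays close to the last content value (7.0) even though filler tokens follow it. With the constant gate, the filler tokens overwrite most of that information.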

Research has found that Mamba can make efficient use of longer contexts (up to 1M tokens), with pre-training perplexity improving as the context length increases. The Mamba model consists of stacked Mamba blocks and has achieved very good results in many different fields, such as NLP, genomics, and audio, where its performance matches or surpasses that of existing Transformer models. Mamba has therefore become a strong candidate for a general sequence modeling backbone. See "Five times the throughput, performance that fully surrounds Transformer: new architecture Mamba detonates the AI circle".

Mixture of Experts

Mixture of Experts (MoE) techniques can greatly increase the number of parameters in a model without affecting the FLOPs required for inference and training. MoE was first proposed by Jacobs et al. in 1991 and was first applied to NLP tasks by Shazeer et al. in 2017.

A key advantage of MoE is that activations are very sparse: for each token processed, only a small fraction of the model's parameters is used. Because of its computational cost, the feedforward layer in the Transformer has become the standard target of several MoE techniques.

The research community has proposed various ways to solve the core problem of MoE, namely the process of assigning tokens to experts, also known as routing. There are currently two basic routing algorithms: Token Choice and Expert Choice. The former routes each token to a fixed number (K) of experts, while the latter fixes the number of tokens routed to each expert.
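The difference between the two routing schemes can be sketched on a small score matrix. The code below is an illustration only: the scores are made up, and real routers compute them with a learned layer and add details such as capacity limits and load-balancing losses.

```python
def token_choice(scores, k):
    """Each token picks its k highest-scoring experts."""
    assignment = []
    for token_scores in scores:
        ranked = sorted(
            range(len(token_scores)), key=lambda e: token_scores[e], reverse=True
        )
        assignment.append(ranked[:k])
    return assignment


def expert_choice(scores, capacity):
    """Each expert picks its `capacity` highest-scoring tokens."""
    num_experts = len(scores[0])
    assignment = []
    for e in range(num_experts):
        ranked = sorted(range(len(scores)), key=lambda t: scores[t][e], reverse=True)
        assignment.append(sorted(ranked[:capacity]))
    return assignment


# Hypothetical router scores: rows are tokens, columns are experts.
scores = [
    [0.70, 0.20, 0.10],
    [0.60, 0.35, 0.05],
    [0.50, 0.10, 0.40],
    [0.05, 0.80, 0.15],
]

per_token = token_choice(scores, k=1)            # experts chosen by each token
per_expert = expert_choice(scores, capacity=2)   # tokens chosen by each expert
```

In this example, Token Choice sends three of the four tokens to expert 0 and none to expert 2, which shows why load balancing matters. Expert Choice gives every expert exactly two tokens, but token 1 and token 3 are each processed twice and the amount of compute per token is no longer fixed.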

Fedus et al. proposed Switch in the 2022 paper "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity". It is a Token Choice architecture that routes each token to a single expert (K=1), and they used it to scale the Transformer to 1.6 trillion parameters. The Polish team also used this MoE design in their experiments.
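A minimal sketch of top-1 (Switch-style) routing follows. It assumes a linear router followed by a softmax, with the chosen expert's output scaled by its router probability. The expert functions and weights are toy placeholders, and the capacity factor and auxiliary load-balancing loss are left out.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_layer(tokens, router_weights, experts):
    """Top-1 routing: each token runs through exactly one expert."""
    outputs, chosen = [], []
    for x in tokens:
        # One router logit per expert: dot product of the token with that expert's row.
        logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
        probs = softmax(logits)
        e = max(range(len(probs)), key=lambda i: probs[i])
        # Scale the expert output by the router probability of the chosen expert.
        outputs.append([probs[e] * v for v in experts[e](x)])
        chosen.append(e)
    return outputs, chosen


# Hypothetical toy experts and router weights, for illustration only.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [-v for v in x],
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[3.0, 0.0], [0.0, 2.0]]

outputs, chosen = switch_layer(tokens, router_weights, experts)
```

Only the selected expert is evaluated for each token, so the compute per token stays the same no matter how many experts the layer holds.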

Recently, MoE has also begun to reach the open-source community, for example with OpenMoE.

Project address: https://github.com/XueFuzhao/OpenMoE

Particularly worth mentioning is Mistral's open-source Mixtral 8×7B, whose performance is comparable to Llama 2 70B while requiring only about one-sixth of the latter's inference compute budget.

Model Architecture

Although Mamba's main underlying mechanism is quite different from the attention mechanism used in the Transformer, Mamba retains the Transformer's high-level, block-based structure. In this paradigm, one or more layers forming identical blocks are stacked on top of each other, and the output of each layer is added to a residual stream (see Figure 2). The final value of this residual stream is then used to predict the next token in the language modeling task.

MoE-Mamba takes advantage of the compatibility of the two architectures. As shown in Figure 2, in MoE-Mamba every other Mamba layer is replaced by a Switch-based MoE feedforward layer.
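The resulting layer layout and the residual stream can be sketched as follows. This is a schematic only: the layer functions are scalar placeholders rather than real Mamba or MoE layers, normalization is omitted, and all names are invented for the example.

```python
def moe_mamba_layout(num_blocks):
    """Alternate layer types: each block is one Mamba layer followed by one MoE layer."""
    layout = []
    for _ in range(num_blocks):
        layout.extend(["mamba", "moe"])
    return layout


def run_residual_stream(x, layers, layout):
    """Add each layer's output back onto the residual stream."""
    for kind in layout:
        x = x + layers[kind](x)
    return x


# Hypothetical stand-ins for the real layers.
layers = {
    "mamba": lambda x: 0.5 * x,   # placeholder for a Mamba layer
    "moe": lambda x: 1.0,         # placeholder for a Switch MoE feedforward layer
}

layout = moe_mamba_layout(num_blocks=2)
final_value = run_residual_stream(2.0, layers, layout)
```

Starting from a pure Mamba stack, this layout corresponds to swapping every second Mamba layer for an MoE layer while leaving the residual connections unchanged.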

However, the team also noted that this design is somewhat similar to one explored in "Mamba: Linear-time sequence modeling with selective state spaces". That model alternates Mamba layers with feedforward layers, but the resulting model is slightly inferior to pure Mamba. This design is denoted Mamba-MLP in Figure 1.

MoE-Mamba separates the unconditional processing of every token performed by the Mamba layer from the conditional processing performed by the MoE layer. The unconditional processing can efficiently integrate the entire context of the sequence into an internal representation, while the conditional processing can apply the most relevant expert to each token. This idea of alternating conditional and unconditional processing has already been applied in some MoE-based models, although those usually alternate standard and MoE feedforward layers.

Key results

Training settings

The team compared five settings: a basic Transformer, Mamba, Mamba-MLP, MoE, and MoE-Mamba.

In most Transformers, the feedforward layer contains 8dm² parameters, where dm is the model dimension, whereas the Mamba paper makes the Mamba layer smaller (about 6dm²), so that two Mamba layers together have about the same number of parameters as one feedforward layer plus one attention layer. To keep the number of active parameters per token roughly the same in Mamba and in the new model, the team scaled each expert feedforward layer down to 6dm². Excluding the embedding and unembedding layers, all models use approximately 26 million parameters per token. Training used 6.5 billion tokens over 100k steps.
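The parameter bookkeeping above can be checked with a few lines of arithmetic. The sketch assumes a standard attention layer with about 4dm² parameters (query, key, value, and output projections), ignores the router's parameters, and uses a made-up model dimension; none of these numbers come from the article.

```python
def layer_params(d_model):
    """Approximate per-layer parameter counts as multiples of d_model squared."""
    return {
        "feedforward": 8 * d_model ** 2,
        "attention": 4 * d_model ** 2,   # assumption: Q, K, V and output projections
        "mamba": 6 * d_model ** 2,
        "expert": 6 * d_model ** 2,      # each expert scaled down to 6 * d_model^2
    }


d_model = 512  # hypothetical value, chosen only for the example
p = layer_params(d_model)

transformer_block = p["feedforward"] + p["attention"]
two_mamba_layers = 2 * p["mamba"]
# With Switch routing (K=1), only one expert is active for each token.
moe_mamba_block_active = p["mamba"] + p["expert"]


def moe_mamba_block_total(num_experts):
    """Total parameters stored in one Mamba + MoE block (router ignored)."""
    return p["mamba"] + num_experts * p["expert"]


tokens_per_step = 6_500_000_000 / 100_000
```

All three block types come out at 12dm² active parameters per token, while the stored parameters of a MoE-Mamba block grow linearly with the number of experts. The last line shows that the quoted token and step counts correspond to about 65,000 tokens per training step.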

The training data set is the English C4 data set, and the task is next-token prediction. Text is tokenized with the GPT2 tokenizer. Table 3 gives the complete list of hyperparameters.

Results

Table 1 gives the training results. MoE-Mamba performs significantly better than the regular Mamba model.

Notably, MoE-Mamba reached the same performance as regular Mamba in just 46% of the training steps. Because the learning rate was tuned for regular Mamba, MoE-Mamba can be expected to perform even better if the training procedure is optimized for it.
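The 46% figure and the "2.2 times fewer training steps" quoted earlier are two views of the same ratio, as a quick check shows:

```latex
\frac{\text{steps}_{\text{Mamba}}}{\text{steps}_{\text{MoE-Mamba}}}
  \approx \frac{1}{0.46} \approx 2.17 \approx 2.2
```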

Ablation Study

To evaluate whether Mamba scales well as the number of experts grows, the researchers compared models with different numbers of experts.

Figure 3 shows the training runs for different numbers of experts.

Table 2 gives the results after 100k steps.

These results show that the newly proposed method scales well with the number of experts. With 8 or more experts, the final performance of the new model is better than that of regular Mamba. Since Mamba-MLP is worse than regular Mamba, MoE-Mamba with a small number of experts can be expected to perform worse than Mamba. The new method gave the best results with 32 experts.

Statement

This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.
