Mixture of experts with more initiative: perceiving multiple modalities and acting accordingly. Meta proposes a modality-aware mixture of experts

王林

Aug 11, 2024 pm 01:02 PM

In a mixture of experts, each expert still needs its own specialty.

For current mixed-modal foundation models, a common architectural design is to fuse modality-specific encoders or decoders. This approach has limitations: it restricts how well the model can integrate information across modalities, and it makes it difficult to generate output that contains multiple modalities.

To overcome this limitation, the Chameleon team at Meta FAIR proposed a single Transformer architecture in the recent paper "Chameleon: Mixed-modal early-fusion foundation models". Trained with a next-token prediction objective, it models mixed-modal sequences composed of discrete image and text tokens, which allows seamless reasoning and generation across modalities.

After pre-training on approximately 10 trillion mixed-modal tokens, Chameleon demonstrates a broad range of vision and language capabilities and handles a variety of downstream tasks well. Its performance is particularly strong when generating long mixed-modal answers, where it even beats commercial models such as Gemini 1.0 Pro and GPT-4V. However, for a model like Chameleon, in which the modalities are fused from the early stages of training, scaling up its capabilities requires a large amount of compute.

To address this problem, the Meta FAIR team explored routed sparse architectures and proposed MoMa, a modality-aware mixture-of-experts architecture.

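Paper title: MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Paper link: https://arxiv.org/pdf/2407.21770

Prior research shows that this type of architecture can effectively scale unimodal foundation models and can also improve multimodal contrastive learning models. Using it to train models that fuse modalities early, however, remains a topic with both opportunities and challenges, and it has seen little study so far.

The team's work builds on one insight: different modalities are inherently heterogeneous. Text and image tokens have different information densities and redundancy patterns.

While integrating these tokens into a unified fusion architecture, the team also proposes to further optimize the framework by adding modality-specific modules. The team calls this concept modality-aware sparsity (MaS). It lets the model better capture the characteristics of each modality, while partial parameter sharing and the attention mechanism maintain strong cross-modal integration.

Earlier work such as VLMo, BEiT-3, and VL-MoE adopted the mixture-of-modality-experts (MoME) approach to train vision-language encoders and masked language models. The FAIR team pushes the applicable range of MoE a step further.

Model architecture

Early fusion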

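The new model builds on Chameleon's early-fusion architecture, which represents images and text as a series of discrete tokens within a single Transformer. The core of Chameleon is a Transformer-based model that applies self-attention over the combined sequence of image and text tokens. This lets the model capture complex relationships both within and across modalities. The model is trained with a next-token prediction objective and generates text and image tokens autoregressively.

In Chameleon, images are tokenized by a learned image tokenizer that encodes a 512 × 512 image into 1024 discrete tokens using a codebook of size 8192. Text is tokenized by a BPE tokenizer with a vocabulary size of 65,536, which includes the image tokens. This unified tokenization lets the model handle arbitrary sequences of interleaved image and text tokens seamlessly.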
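As a rough illustration of these numbers, the sketch below works out the token grid implied by 1024 tokens per 512 × 512 image and shows one way image codes could share a single vocabulary with text tokens. The article only states the sizes; the vocabulary layout (`IMAGE_TOKEN_OFFSET`, with image codes in the last 8192 ids), the square-grid reading, and both helper functions are assumptions made for illustration.

```python
import math

IMAGE_SIZE = 512          # image side length in pixels (from the article)
TOKENS_PER_IMAGE = 1024   # discrete tokens per image (from the article)
CODEBOOK_SIZE = 8192      # image codebook size (from the article)
VOCAB_SIZE = 65_536       # BPE vocabulary size, image tokens included (from the article)

# Assumption: image codes occupy the last CODEBOOK_SIZE ids of the vocabulary.
IMAGE_TOKEN_OFFSET = VOCAB_SIZE - CODEBOOK_SIZE

# Assumption: the 1024 tokens form a square grid over the image.
grid_side = math.isqrt(TOKENS_PER_IMAGE)   # tokens per side of the grid
patch_side = IMAGE_SIZE // grid_side       # pixels per side covered by one token


def interleave(text_ids, image_codes):
    """Hypothetical helper: place text ids and image codes in one token sequence."""
    assert all(0 <= code < CODEBOOK_SIZE for code in image_codes)
    return list(text_ids) + [IMAGE_TOKEN_OFFSET + code for code in image_codes]


def is_image_token(token_id):
    """Modality check under the assumed vocabulary layout."""
    return token_id >= IMAGE_TOKEN_OFFSET
```

With a shared id space like this, the modality of every token can be read directly from its id, which is the information a modality-aware router needs.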

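With this approach, the new model inherits the advantages of unified representation, flexibility, high scalability, and support for end-to-end learning.

On this basis (Figure 1a), to further improve the efficiency and performance of the early-fusion model, the team also introduces modality-aware sparsity.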

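Width scaling: modality-aware mixture of experts

The team proposes a width-scaling approach: integrating modality-aware block sparsity into the standard mixture-of-experts (MoE) architecture.

The approach rests on the insight that tokens from different modalities have distinct characteristics and information densities.

By building a separate group of experts for each modality, the model can develop specialized processing paths while retaining its ability to integrate information across modalities.

Figure 1b shows the key components of this mixture of modality-aware experts (MoMa). In short, experts are first grouped by modality, then hierarchical routing is applied (modality-aware routing followed by intra-modality routing), and finally the experts are selected. See the original paper for the detailed procedure.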
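The two routing levels can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `moma_route`, the linear routers, and the fixed per-expert `capacity` are hypothetical, and expert-choice selection is used for the intra-modality step because the article mentions expert-choice MoE modules elsewhere.

```python
import numpy as np


def moma_route(hidden, is_image, text_router, image_router, capacity):
    """Return {(modality, expert_index): array of selected token indices}.

    hidden:       (num_tokens, d) token representations
    is_image:     (num_tokens,) boolean modality mask
    *_router:     (d, num_experts) routing weights for each modality group (assumed linear)
    capacity:     number of tokens each expert selects (expert-choice routing)
    """
    assignments = {}
    groups = {"text": (~is_image, text_router), "image": (is_image, image_router)}
    for modality, (mask, router) in groups.items():
        # Level 1, modality-aware routing: tokens only reach their own expert group.
        token_ids = np.flatnonzero(mask)
        if token_ids.size == 0:
            continue
        # Level 2, intra-modality routing: score the group's tokens against each expert.
        scores = hidden[token_ids] @ router            # (group_tokens, num_experts)
        k = min(capacity, token_ids.size)
        for expert in range(router.shape[1]):
            # Expert-choice: each expert picks its k highest-scoring tokens.
            top = np.argsort(-scores[:, expert])[:k]
            assignments[(modality, expert)] = token_ids[top]
    return assignments
```

Because the first level is decided by the token's modality alone, a text token can never land on an image expert; only the second level is learned.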

Overall, for an input token x, the MoMa module has a formal definition, which is given as an equation in the original paper.

After the MoMa computation, the team further applies residual connections and Swin Transformer normalization.

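Mixture-of-Depths (MoD)

Earlier researchers have also explored introducing sparsity into the depth dimension, either by randomly dropping certain layers or by using learnable routers.

The team follows the second approach and integrates the recently proposed mixture-of-depths (MoD) technique. For more on MoD, see the earlier report "DeepMind upgrades the Transformer: forward-pass FLOPs can be cut by up to half".

Specifically, in each MoD layer, MoD is integrated before the mixture-of-experts (MoE) routing, which ensures that MoD is applied to the whole batch of data before the modalities are separated.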
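A minimal sketch of that ordering, assuming a linear MoD router and a simple unweighted residual update. The article only says that MoD selection happens on the whole batch before MoE routing; the names `mod_layer`, `mod_router`, and `keep_ratio` are hypothetical, and `layer_fn` stands in for whatever block follows (for example, the modality-aware MoE).

```python
import numpy as np


def mod_layer(hidden, mod_router, layer_fn, keep_ratio):
    """Apply layer_fn only to the top-scoring tokens; the rest skip the layer.

    hidden:      (num_tokens, d) representations for the whole batch
    mod_router:  (d,) weights scoring how much each token needs this layer (assumed linear)
    layer_fn:    callable mapping (n, d) -> (n, d), e.g. the MoE block
    keep_ratio:  fraction of tokens processed by the layer
    """
    scores = hidden @ mod_router
    k = max(1, int(round(keep_ratio * hidden.shape[0])))
    selected = np.argsort(-scores)[:k]     # chosen over all tokens, before any modality split
    output = hidden.copy()                 # unselected tokens pass through unchanged
    # Selected tokens receive the layer output through a residual connection.
    output[selected] = hidden[selected] + layer_fn(hidden[selected])
    return output, selected
```

Only the `selected` tokens would then be split by modality and handed to the expert groups, so the depth-wise skip reduces the work done by every expert.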

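Inference

At inference time, the expert-choice routing of MoE and the layer-wise token selection routing of MoD cannot be used directly, because top-k selection over a batch of data breaks causality.

To preserve causality during inference, and inspired by the MoD paper mentioned above, the team introduces an auxiliary router. Its job is to predict, from a token's hidden representation alone, how likely that token is to be selected by a given expert or layer.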
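The idea can be sketched as a small binary classifier trained on the selections made by top-k routing. Everything specific here is an assumption for illustration: the logistic form, the plain gradient-descent training loop, the 0.5 threshold, and the names `train_aux_router` and `causal_select`. The article states only that the auxiliary router predicts selection from the token's hidden representation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_aux_router(hidden, selected_mask, steps=500, lr=0.5):
    """Fit a logistic predictor of 'was this token selected by top-k routing?'."""
    weights = np.zeros(hidden.shape[1])
    bias = 0.0
    targets = selected_mask.astype(float)
    for _ in range(steps):
        probs = sigmoid(hidden @ weights + bias)
        error = probs - targets            # gradient of binary cross-entropy w.r.t. the logits
        weights -= lr * hidden.T @ error / len(targets)
        bias -= lr * error.mean()
    return weights, bias


def causal_select(token_hidden, weights, bias, threshold=0.5):
    """Inference-time decision for one token, using only its own hidden state."""
    return sigmoid(token_hidden @ weights + bias) > threshold
```

Unlike top-k over a batch, `causal_select` never looks at other tokens, so the decision for a token cannot depend on tokens that come after it.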

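Upcycling

Training an MoE architecture from scratch poses a particular difficulty for optimizing the representation space and the routing mechanism. The team observes that the MoE router is responsible for partitioning the representation space among the experts. In the early stages of training, however, this representation space is not yet optimal, so the routing function learned from it is suboptimal as well.

To overcome this limitation, they propose an upcycling method based on the paper "Sparse upcycling: Training mixture-of-experts from dense checkpoints" by Komatsuzaki et al.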

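Specifically, an architecture with one FFN expert per modality is trained first. After a preset number of steps, the model is upcycled: each modality-specific FFN is converted into an expert-choice MoE module, and every expert is initialized from the expert trained in the first stage. The learning rate scheduler is reset while the data loader state from the previous stage is kept, which ensures that second-stage training continues on fresh data.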
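The initialization step can be sketched as a plain copy of stage-one weights into every expert of the matching modality group. The dictionary layout, the parameter names `w_in` and `w_out`, and the function name `upcycle` are hypothetical; resetting the learning rate scheduler and restoring the data loader state are not shown.

```python
import copy

import numpy as np


def upcycle(dense_ffn_by_modality, experts_per_modality):
    """Turn one FFN per modality into a group of identically initialized experts.

    dense_ffn_by_modality: {"text": {"w_in": ..., "w_out": ...}, "image": {...}}
    Returns {"text": [expert_0, expert_1, ...], "image": [...]}.
    """
    return {
        # Deep copies, so that the experts can diverge during second-stage training.
        modality: [copy.deepcopy(ffn) for _ in range(experts_per_modality)]
        for modality, ffn in dense_ffn_by_modality.items()
    }


# Stage-one checkpoint with one FFN per modality (random weights as stand-ins).
rng = np.random.default_rng(0)
stage_one = {
    name: {"w_in": rng.normal(size=(4, 8)), "w_out": rng.normal(size=(8, 4))}
    for name in ("text", "image")
}
stage_two = upcycle(stage_one, experts_per_modality=4)
```

Since all experts in a group start out identical, something has to break the symmetry between them in the second stage; routing noise is one way to do that.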

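To encourage the experts to specialize further, the team also augments the MoE routing function with Gumbel noise, so that the new router can sample experts in a differentiable way.

Combined with the Gumbel-Sigmoid technique, this upcycling method overcomes the limitations of the learned router and improves the performance of the proposed modality-aware sparse architecture.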
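For readers unfamiliar with the trick, here is a minimal sketch of a common Gumbel-Sigmoid formulation: the router logit is perturbed by the difference of two Gumbel samples and passed through a temperature-scaled sigmoid. The exact form used in the paper may differ; the function name, the `temperature` parameter, and the `eps` guard are assumptions.

```python
import numpy as np


def gumbel_sigmoid(logits, rng, temperature=1.0):
    """Noisy relaxation of a binary selection that stays smooth in the logits."""
    eps = 1e-10                                   # guards against log(0)
    u1 = rng.uniform(size=logits.shape)
    u2 = rng.uniform(size=logits.shape)
    g1 = -np.log(-np.log(u1 + eps) + eps)         # Gumbel(0, 1) sample
    g2 = -np.log(-np.log(u2 + eps) + eps)         # second, independent sample
    return 1.0 / (1.0 + np.exp(-(logits + g1 - g2) / temperature))
```

The output is a soft selection weight between 0 and 1. In an autodiff framework the same expression lets gradients flow back to the router logits, and the noise makes different experts receive different tokens even when their logits start out equal.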

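Efficiency optimization

To support distributed training of MoMa, the team uses Fully Sharded Data Parallel (FSDP). Compared with a regular MoE, however, this setup brings its own efficiency challenges, including load balancing and the efficiency of expert execution.

For load balancing, the team developed a balanced data mixing method that keeps the text-to-image data ratio on each GPU consistent with the expert ratio.

For the efficiency of expert execution, the team explored several strategies that can speed up the experts of different modalities:

Restrict the experts of each modality to be homogeneous, and forbid routing text tokens to image experts and vice versa;
Use block sparsity to improve execution efficiency;
When the number of modalities is limited, run the experts of different modalities sequentially.

Because each GPU processes enough tokens in the experiments, hardware utilization is not a major problem even when multiple batched matrix multiplications are used. The team therefore considers sequential execution the better choice at the current experimental scale.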

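Other optimizations

To further improve throughput, the team also applied several other optimization techniques.

These include general optimizations such as reducing gradient communication volume and automated GPU kernel fusion. The team also applied graph optimization through torch.compile.

In addition, they developed optimizations specific to MoMa, including reusing modality token indices across layers to minimize device synchronization between CPU and GPU.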

Experiments

Setup

The experiments use the same pre-training dataset and preprocessing as Chameleon. To evaluate scaling performance, the team trained the models on more than 1 trillion tokens.

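Table 1 gives the detailed configurations of the dense and sparse models.

Scaling performance at different compute levels

The team analyzed the scaling performance of different models at several compute levels. These compute levels (FLOPs) correspond to dense models of three sizes: 90M, 435M, and 1.4B.

The results show that a sparse model can match the pre-training loss of a dense model with equivalent FLOPs while using only 1/η of the total FLOPs (η denotes the pre-training speed-up factor).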
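One way to read the speed-up factor is as a ratio of training compute at equal loss. The notation below is introduced here for illustration and is not taken from the paper: C_dense(L) and C_sparse(L) denote the training FLOPs a dense and a sparse model need to reach the same pre-training loss L.

```latex
\eta \;=\; \frac{C_{\text{dense}}(\mathcal{L})}{C_{\text{sparse}}(\mathcal{L})}
\qquad\Longleftrightarrow\qquad
C_{\text{sparse}}(\mathcal{L}) \;=\; \frac{1}{\eta}\, C_{\text{dense}}(\mathcal{L})
```

Under this reading, η = 2 would mean the sparse model reaches the dense model's loss with half the training FLOPs.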

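Modality untying

Introducing modality-specific expert groups improves pre-training efficiency across model sizes, and it is especially beneficial for the image modality. As Figure 3 shows, the moe_1t1i configuration, which uses 1 image expert and 1 text expert, significantly outperforms the corresponding dense model.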

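Increasing the number of experts in each modality group improves model performance further.

Mixture of depths and experts

The team observed that training loss converges faster when MoE, MoD, or their combination is used. As Figure 4 shows, adding MoD to the moe_1t1i architecture (mod_moe_1t1i) substantially improves performance across model sizes.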

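Moreover, across model sizes and modalities, mod_moe_1t1i matches or even exceeds moe_4t4i, which indicates that introducing sparsity along the depth dimension can also improve training efficiency.

On the other hand, the gains from stacking MoD and MoE gradually diminish.

Scaling the number of experts

To study the effect of scaling the number of experts, the team ran further ablation experiments. They explored two scenarios: assigning the same number of experts to each modality (balanced) and assigning a different number of experts to each modality (imbalanced). The results are shown in Figure 5.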

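In the balanced setting, Figure 5a shows that training loss drops noticeably as the number of experts increases. Text and image losses, however, follow different scaling patterns, which suggests that the inherent characteristics of each modality lead to different sparse modeling behavior.

In the imbalanced setting, Figure 5b compares three configurations with the same total number of experts (8). The more experts a modality has, the better the model generally performs on that modality.

Upcycling

The team also verified the effect of the upcycling method described above. Figure 6 compares the training curves of different model variants.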

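The results show that upcycling does improve training further: when the first stage lasts 10k steps, upcycling yields a 1.2× FLOPs gain, and when it lasts 20k steps, the gain is 1.16×.

In addition, the performance gap between the upcycled model and the model trained from scratch keeps widening as training progresses.

Throughput analysis

Sparse models often do not deliver performance gains immediately, because they add dynamism and the data balancing problems that come with it. To quantify the impact of the proposed method on training efficiency, the team compared the training throughput of different architectures in controlled experiments. The results are shown in Table 2.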

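Compared with dense models, modality-based sparsity achieves a better quality-throughput trade-off and scales reasonably as the number of experts grows. On the other hand, although the MoD variants achieve the best absolute loss, they also tend to be more computationally expensive because of the additional dynamism and imbalance.

Inference-time performance

The team also evaluated the models on held-out language modeling data and on downstream tasks. The results are shown in Tables 3 and 4.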

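As Table 3 shows, by using multiple image experts, the 1.4B MoMa 1t1i model outperforms the corresponding dense model on most metrics. The only exceptions are the image-to-text conditional perplexity metrics on COCO and Flickr. Scaling the number of experts further also improves performance, with 1.4B MoE 8x achieving the best image-to-text performance.

In addition, as Table 4 shows, the 1.4B MoE 8x model is also very strong on text-to-text tasks. 1.4B MoMa 4t4i performs best on all conditional image perplexity metrics, and its text perplexity is also very close to that of 1.4B MoE 8x on most benchmarks.

Overall, the 1.4B MoMa 4t4i model gives the best modeling results on data that mixes the text and image modalities.

For more details, please read the original paper.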
