GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better-AI-php.cn

Home

Technology peripherals

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better

王林

Jul 17, 2023 pm 04:57 PM

ModelTuning

Since the advent of GPT-4, people have been amazed by its powerful emergence capabilities, including excellent language understanding capabilities, generation capabilities, logical reasoning capabilities, etc. These capabilities make GPT-4 one of the most cutting-edge models in machine learning. However, OpenAI has not disclosed any technical details of GPT-4 so far.

Last month, George Hotz mentioned GPT-4 in an interview with an AI technology podcast called Latent Space, saying that GPT-4 is actually is a hybrid model. Specifically, George Hotez said that GPT-4 uses an integrated system composed of 8 expert models, each of which has 220 billion parameters (slightly more than the 175 billion parameters of GPT-3), and these Models are trained on different data and task distributions.

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better

Interview from Latent Space.

This may be just a speculation by George Hotez, but this model does have some legitimacy. Recently, a paper jointly published by researchers from Google, UC Berkeley, MIT and other institutions confirmed that the combination of hybrid expert model (MoE) and instruction tuning can significantly improve the performance of large language models (LLM).

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better Picture

Paper address: https://arxiv.org/pdf/2305.14705.pdf

The sparse mixed expert model is a special neural network architecture that can add learnable parameters to large language models (LLM) without increasing the cost of inference. Instruction tuning is a technique for training LLM to follow instructions. The study found that MoE models benefited more from instruction tuning than dense models, and therefore proposed to combine MoE and instruction tuning.

The study was conducted empirically in three experimental settings, including

in the absence of instruction tuning. Direct fine-tuning of a single downstream task;
After instruction tuning, perform in-context few-sample or zero-sample generalization on the downstream task;
Instruction tuning is followed by further fine-tuning of individual downstream tasks.

In the first case, the MoE model is generally inferior to a dense model with the same computational power. However, with the introduction of instruction tuning (the second and third cases), FLAN-MoE_32B (Fine-tuned LAnguage Net, abbreviated as Flan, is an instruction-tuned model, Flan-MoE is instruction tuning). Excellent MoE) outperforms FLAN-PALM_62B on four benchmark tasks, but only uses one-third of the FLOPs.

As shown in the figure below, before using instruction tuning, MoE→FT is not as good as T5→FT. After instruction tuning, Flan-MoE→FT outperforms Flan-T5→FT. MoE gains more from instruction tuning (15.6) than dense models (10.2):

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better Picture

It seems that GPT -4 There is some basis for adopting a hybrid model. MoE can indeed gain greater benefits from instruction tuning:

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better Picture

Method Overview

The researchers used sparse activation MoE (Mixture-of-Experts) in the FLAN-MOE (a set of sparse mixed expert models fine-tuned by instructions) model. Additionally, they replaced the feedforward components of other Transformer layers with MoE layers.

Each MoE layer can be understood as an "expert", and then the softmax activation function is used to model these experts to obtain a probability distribution.

Although each MoE layer has many parameters, the experts are sparsely activated. This means that for a given input token, only a limited subset of experts can complete the task, thus providing greater capacity to the model.

For a MoE layer with E experts, this effectively provides O (E^2) different feedforward network combinations, allowing for greater computational flexibility.

Since FLAN-MoE is an instruction-tuned model, instruction tuning is very important. This study fine-tuned FLAN-MOE based on the FLAN collective data set. Furthermore, this study adjusted the input sequence length of each FLAN-MOE to 2048 and the output length to 512.

Experiments and Analysis

On average, Flan-MoE performs better across all model scales without adding any additional computation. Better than its dense counterpart (Flan-T5).

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better Pictures

Number of experts. Figure 4 shows that as the number of experts increases, initially the model benefits from a richer set of specialized subnetworks, each capable of handling a different task or aspect in the problem space. This approach makes MoE highly adaptable and efficient in handling complex tasks, thereby improving performance overall. However, as the number of experts continues to increase, the model performance gains begin to decrease, eventually reaching a saturation point.

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better Picture

Figure 3 and Table 1 provide a detailed study of how different routing decisions affect instruction tuning performance: via FLAN- A comparison between the Switch and FLAN-GS strategies shows that activating more experts improves performance across the four benchmarks. Among these benchmarks, the MMLU-Direct model shows the most significant improvement, increasing from 38.0% to 39.9% for BASE/LARGE-sized models.

Notably, instruction tuning significantly amplified the performance of the MoE model in preserving MMLU, BBH, and internal QA and inference benchmarks compared to dense models of equivalent capacity . These advantages are further amplified for larger MoE models. For example, instruction tuning improves performance by 45.2% for ST_32B, while for FLAN-PALM_62B this improvement is relatively small at about 6.6%.

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better

When doing model extensions, Flan-MoE (Flan-ST-32B) outperforms Flan-PaLM-62B.

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better Picture

#In addition, this study freezes the gating function, expert module and MoE of the given model. Some analytical experiments were conducted on parameters. As shown in Table 2 below, experimental results show that freezing the expert module or MoE component has a negative impact on model performance.

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better

On the contrary, the freeze gating function will slightly improve the model performance, although it is not obvious. The researchers speculate that this observation is related to the underfitting of FLAN-MOE. The study also conducted ablation experiments to explore the fine-tuning data efficiency ablation study described in Figure 5 below.

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better

Finally, in order to compare the gap between direct fine-tuning of MoE and FLAN-MOE, this study conducted single-task fine-tuning of MoE, single-task Experiments were conducted on fine-tuned FLAN-MoE and dense models, and the results are shown in Figure 6 below:

GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better

Interested readers can read the original text of the paper to learn more More research content.

The above is the detailed content of GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete

An easy-to-understand explanation of how to save conversation history (conversation log) in ChatGPT!May 16, 2025 am 05:41 AM

Various ways to efficiently save ChatGPT dialogue records Have you ever thought about saving a ChatGPT-generated conversation record? This article will introduce a variety of saving methods in detail, including official functions, Chrome extensions and screenshots, etc., to help you make full use of ChatGPT conversation records. Understand the characteristics and steps of various methods and choose the one that suits you best. [Introduction to the latest AI proxy "OpenAI Operator" released by OpenAI] (The link to OpenAI Operator should be inserted here) Table of contents Save conversation records using ChatGPT Export Steps to use the official export function Save ChatGPT logs using Chrome extension ChatGP

Create a schedule with ChatGPT! Explaining prompts that can be used to create and adjust tablesMay 16, 2025 am 05:40 AM

Modern society has a compact pace and efficient schedule management is crucial. Work, life, study and other tasks are intertwined, and prioritization and schedules are often a headache. Therefore, intelligent schedule management methods using AI technology have attracted much attention. In particular, ChatGPT's powerful natural language processing capabilities can automate tedious schedules and task management, significantly improving productivity. This article will explain in-depth how to use ChatGPT for schedule management. We will combine specific cases and steps to demonstrate how AI can improve daily life and work efficiency. In addition, we will discuss things to note when using ChatGPT to ensure safe and effective use of this technology. Experience ChatGPT now and get your schedule

How to connect ChatGPT with spreadsheets! A thorough explanation of what you can doMay 16, 2025 am 05:39 AM

We will explain how to link Google Sheets and ChatGPT to improve business efficiency. In this article, we will explain in detail how to use the add-on "GPT for Sheets and Docs" that is easy for beginners to use. No programming knowledge is required. Increased business efficiency through ChatGPT and spreadsheet integration This article will focus on how to connect ChatGPT with spreadsheets using add-ons. Add-ons allow you to easily integrate ChatGPT features into your spreadsheets. GPT for Shee

6 Investor Predictions For AI In 2025May 16, 2025 am 05:37 AM

There are overarching trends and patterns that experts are highlighting as they forecast the next few years of the AI revolution. For instance, there's a significant demand for data, which we will discuss later. Additionally, the need for energy is d

Use ChatGPT for writing! A thorough explanation of tips and examples of prompts!May 16, 2025 am 05:36 AM

ChatGPT is not just a text generation tool, it is a true partner that dramatically increases writers' creativity. By using ChatGPT for the entire writing process, such as initial manuscript creation, ideation ideas, and stylistic changes, you can simultaneously save time and improve quality. This article will explain in detail the specific ways to use ChatGPT at each stage, as well as tips for maximizing productivity and creativity. Additionally, we will examine the synergy that combines ChatGPT with grammar checking tools and SEO optimization tools. Through collaboration with AI, writers can create originality with free ideas

How to create graphs in ChatGPT! No plugins required, so it can be used for Excel too!May 16, 2025 am 05:35 AM

Data visualization using ChatGPT: From graph creation to data analysis Data visualization, which conveys complex information in an easy-to-understand manner, is essential in modern society. In recent years, due to the advancement of AI technology, graph creation using ChatGPT has attracted attention. In this article, we will explain how to create graphs using ChatGPT in an easy-to-understand manner even for beginners. We will introduce the differences between the free version and the paid version (ChatGPT Plus), specific creation steps, and how to display Japanese labels, along with practical examples. Creating graphs using ChatGPT: From basics to advanced use ChatG

Pushing The Limits Of Modern LLMs With A Dinner Plate?May 16, 2025 am 05:34 AM

In general, we know that AI is big, and getting bigger. It’s fast, and getting faster. Specifically, though, not everyone’s familiar with some of the newest hardware and software approaches in the industry, and how they promote better results. Peopl

Archive your ChatGPT conversation history! Explaining the steps to save and how to restore itMay 16, 2025 am 05:33 AM

ChatGPT Dialogue Record Management Guide: Efficiently organize and make full use of your treasure house of knowledge! ChatGPT dialogue records are a source of creativity and knowledge, but how can growing records be effectively managed? Is it time-consuming to find important information? don’t worry! This article will explain in detail how to effectively "archive" (save and manage) your ChatGPT conversation records. We will cover official archive functions, data export, shared links, and data utilization and considerations. Table of contents Detailed explanation of ChatGPT's "archive" function How to use ChatGPT archive function Save location and viewing method of ChatGPT archive records Cancel and delete methods for ChatGPT archive records Cancel archive Delete the archive Summarize Ch

See all articles