


Does GPT-4 use a hybrid of large models? Research shows that MoE plus instruction tuning indeed makes large models perform better
Since the advent of GPT-4, people have been amazed by its powerful emergent capabilities, including excellent language understanding, generation, and logical reasoning. These capabilities make GPT-4 one of the most advanced models in machine learning. However, OpenAI has not disclosed any technical details of GPT-4 so far.
Last month, George Hotz mentioned GPT-4 in an interview on the AI podcast Latent Space, saying that GPT-4 is actually a hybrid model. Specifically, he said GPT-4 uses an ensemble of 8 expert models, each with 220 billion parameters (slightly more than GPT-3's 175 billion), and that these models are trained on different data and task distributions.
Interview from Latent Space.
This may be just speculation on George Hotz's part, but the idea has some plausibility. Recently, a paper jointly published by researchers from Google, UC Berkeley, MIT, and other institutions confirmed that combining mixture-of-experts (MoE) models with instruction tuning can significantly improve the performance of large language models (LLMs).
Paper address: https://arxiv.org/pdf/2305.14705.pdf
The sparse mixture-of-experts model is a special neural network architecture that can add learnable parameters to large language models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. The study found that MoE models benefit more from instruction tuning than dense models do, and therefore proposed combining MoE with instruction tuning.
The study conducted empirical evaluations in three experimental settings:
- Direct fine-tuning on a single downstream task, without instruction tuning;
- Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks;
- Instruction tuning followed by further fine-tuning on individual downstream tasks.
In the first setting, MoE models are generally inferior to dense models with the same computational capacity. With the introduction of instruction tuning (the second and third settings), however, FLAN-MoE_32B (FLAN, short for Fine-tuned LAnguage Net, denotes an instruction-tuned model; FLAN-MoE is an instruction-tuned MoE model) outperforms FLAN-PaLM_62B on four benchmark tasks while using only one-third of the FLOPs.
As shown in the figure below, before instruction tuning, MoE→FT is not as good as T5→FT; after instruction tuning, Flan-MoE→FT outperforms Flan-T5→FT. MoE gains more from instruction tuning (+15.6) than dense models do (+10.2):
It seems there is some basis for GPT-4 adopting a hybrid model: MoE can indeed gain greater benefits from instruction tuning:
Method Overview
The researchers used sparsely activated MoE (Mixture-of-Experts) in FLAN-MoE, a set of instruction-fine-tuned sparse mixture-of-experts models. Specifically, they replaced the feed-forward components of Transformer layers with MoE layers.
Each MoE layer contains a set of "experts", and a gating function with a softmax activation models a probability distribution over these experts.
Although each MoE layer has many parameters, the experts are sparsely activated. This means that for a given input token, only a limited subset of experts is used, giving the model greater capacity without a proportional increase in computation.
For an MoE layer with E experts, this effectively provides O(E^2) different feed-forward network combinations, allowing for greater computational flexibility.
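The routing scheme described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the dimensions, the initialization, and the top-2 routing are illustrative assumptions, and production MoE layers add load balancing and capacity limits that are omitted here. It shows the two pieces the text names: a softmax gate producing a probability distribution over experts, and sparse activation of only the top-scoring experts per token (with top-2 routing, the number of possible expert pairs is E(E-1)/2, hence the O(E^2) combinations).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class SparseMoELayer:
    """Illustrative top-2 routed MoE feed-forward layer (hypothetical sizes)."""
    def __init__(self, d_model, d_ff, num_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router maps a token representation to one logit per expert.
        self.router = rng.normal(0, 0.02, size=(d_model, num_experts))
        # Each expert is an independent two-layer feed-forward network.
        self.w_in = rng.normal(0, 0.02, size=(num_experts, d_model, d_ff))
        self.w_out = rng.normal(0, 0.02, size=(num_experts, d_ff, d_model))

    def forward(self, x):
        # x: (tokens, d_model). Softmax over router logits gives the
        # probability distribution over experts described in the article.
        probs = softmax(x @ self.router)                  # (tokens, E)
        top = np.argsort(-probs, axis=-1)[:, :self.top_k]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            # Sparse activation: only the top-k experts run for this token.
            gates = probs[t, top[t]]
            gates = gates / gates.sum()                   # renormalize over top-k
            for gate, e in zip(gates, top[t]):
                h = np.maximum(x[t] @ self.w_in[e], 0.0)  # expert FFN with ReLU
                out[t] += gate * (h @ self.w_out[e])
        return out, probs

layer = SparseMoELayer(d_model=8, d_ff=16, num_experts=4)
tokens = np.random.default_rng(1).normal(size=(3, 8))
out, probs = layer.forward(tokens)
```

Note that although the layer holds parameters for all 4 experts, each token only ever touches 2 of them, which is why MoE adds capacity without a matching increase in per-token compute.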
Since FLAN-MoE is an instruction-tuned model, instruction tuning is essential. The study fine-tuned FLAN-MoE on the FLAN collection of datasets, and set the input sequence length of each FLAN-MoE model to 2048 tokens and the output length to 512.
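As a small illustration of those sequence-length settings, the sketch below truncates an already-tokenized (instruction, target) pair to the lengths the paper reports. The tokenization itself and the function name are assumptions for illustration; the paper's actual preprocessing pipeline is not described in this article.

```python
# Lengths reported for FLAN-MoE fine-tuning: 2048-token inputs, 512-token outputs.
MAX_INPUT_LEN = 2048
MAX_OUTPUT_LEN = 512

def prepare_example(instruction_tokens, target_tokens):
    """Truncate a tokenized (instruction, target) pair to the FLAN-MoE lengths.

    Hypothetical helper: assumes token IDs are already produced by some
    tokenizer; real pipelines may also pad, pack, or add special tokens.
    """
    inputs = instruction_tokens[:MAX_INPUT_LEN]
    targets = target_tokens[:MAX_OUTPUT_LEN]
    return inputs, targets
```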
Experiments and Analysis
On average, Flan-MoE outperforms its dense counterpart (Flan-T5) at all model scales without adding any additional computation.
Number of experts. Figure 4 shows that as the number of experts increases, the model initially benefits from a richer set of specialized subnetworks, each capable of handling a different task or region of the problem space. This makes MoE highly adaptable and efficient on complex tasks, improving overall performance. However, as the number of experts continues to grow, the performance gains diminish and eventually reach a saturation point.
Figure 3 and Table 1 provide a detailed study of how different routing decisions affect instruction-tuning performance: a comparison of the FLAN-Switch and FLAN-GS strategies shows that activating more experts improves performance across the four benchmarks. Among these benchmarks, the MMLU-Direct model shows the most significant improvement, increasing from 38.0% to 39.9% for BASE/LARGE-sized models.
Notably, compared with dense models of equivalent capacity, instruction tuning significantly amplified the MoE models' performance on held-out MMLU, BBH, and internal QA and reasoning benchmarks. These advantages are further amplified for larger MoE models: instruction tuning improves performance by 45.2% for ST_32B, while for FLAN-PaLM_62B the improvement is a relatively small 6.6%.
When scaling up the models, Flan-MoE (Flan-ST-32B) outperforms Flan-PaLM-62B.
In addition, the study ran analytical experiments in which the gating function, the expert modules, and the MoE parameters of a given model were frozen. As shown in Table 2 below, the results show that freezing the expert modules or the MoE components hurts model performance.
In contrast, freezing the gating function slightly improves model performance, though not significantly. The researchers speculate that this observation is related to the underfitting of FLAN-MoE. The study also conducted the fine-tuning data-efficiency ablation study shown in Figure 5 below.
Finally, to compare the gap between directly fine-tuned MoE and FLAN-MoE, the study ran experiments on single-task fine-tuned MoE, single-task fine-tuned FLAN-MoE, and dense models; the results are shown in Figure 6 below:
Interested readers can read the original paper to learn more about the research.
The above is the detailed content of GPT-4 uses hybrid large models? Research proves that MoE+ instruction tuning indeed makes large models perform better. For more information, please follow other related articles on the PHP Chinese website!


