
MoE Large Model Making Guide: Zero-Based Manual Building Methods, Master-Level Tutorials Revealed

WBOY
2024-01-30 14:42:15

The legendary "secret weapon" behind GPT-4, the MoE (Mixture of Experts) architecture, is now something you can build yourself!

A machine learning expert on Hugging Face has shared how to build a complete MoE model from scratch.


The author calls the project MakeMoE, and it walks through the whole process, from building self-attention to assembling a complete MoE model.

According to the author, MakeMoE was inspired by and built on makemore, a project by OpenAI founding member Andrej Karpathy.

makemore is a teaching project for natural language processing and machine learning, intended to help learners understand and implement some basic models.

Similarly, MakeMoE helps learners gain a deeper understanding of the mixture-of-experts model through a step-by-step build.

So, what exactly does this build-it-by-hand guide cover?

Building an MoE model from scratch

Compared with Karpathy's makemore, MakeMoE replaces the standalone feedforward network with a sparse mixture of experts and adds the necessary gating logic.

In addition, because the ReLU activation function is used, makemore's default initialization is replaced with Kaiming He initialization.
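As a quick illustration, the snippet below shows one common way to apply Kaiming (He) initialization to linear layers that feed into ReLU activations in PyTorch; the helper name and the choice of kaiming_normal_ are assumptions for this sketch, not necessarily the author's exact code.

```python
import torch.nn as nn

def kaiming_init_(module: nn.Module):
    # Illustrative sketch: initialize Linear weights for ReLU layers with Kaiming (He) init.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Hypothetical usage: model.apply(kaiming_init_)
```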


If you want to create a MoE model, you must first understand the self-attention mechanism.

The model first applies linear transformations to the input sequence to produce query (Q), key (K), and value (V) representations.

These are then used to compute attention scores, which determine how much attention the model pays to each position in the sequence when generating each token.

To preserve the model's autoregressive property during text generation, that is, predicting the next token only from tokens that have already been generated, the author uses multi-head causal self-attention.

This mechanism uses a mask to set the attention scores of future positions to negative infinity, so that their weights become zero after the softmax.

The multi-head setup lets the model run several such attention computations in parallel, with each head focusing on different aspects of the sequence.
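To make this concrete, here is a minimal PyTorch sketch of a causal self-attention head and a multi-head wrapper. The hyperparameter names (n_embd, head_size, block_size, n_head) and the dropout values are assumptions for illustration, not necessarily the author's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    """One causal self-attention head (illustrative sketch)."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: each position may only attend to itself and earlier positions.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                            # x: (B, T, n_embd)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Scaled dot-product attention scores.
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5          # (B, T, T)
        # Future positions get -inf so their softmax weights become zero.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                               # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads run in parallel; their outputs are concatenated and projected."""
    def __init__(self, n_embd, n_head, block_size, dropout=0.1):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(
            [CausalSelfAttentionHead(n_embd, head_size, block_size, dropout)
             for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)          # (B, T, n_embd)
        return self.dropout(self.proj(out))
```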


With the self-attention mechanism in place, the next step is to create the expert module. The "expert module" here is a multi-layer perceptron (MLP).

Each expert contains a linear layer that maps the embedding vector to a larger dimension, a nonlinear activation function (ReLU), and another linear layer that maps the vector back to the original embedding dimension.

This design lets each expert specialize in different aspects of the input, while a gating network decides which experts should be activated when generating each token.
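As a sketch (assuming a 4x hidden expansion and dropout, which are illustrative choices rather than the author's exact settings), one expert can be written as:

```python
import torch.nn as nn

class Expert(nn.Module):
    """One expert: an MLP that expands the embedding, applies ReLU, and projects back."""
    def __init__(self, n_embd, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),    # map embedding to a larger hidden dimension
            nn.ReLU(),                        # nonlinear activation
            nn.Linear(4 * n_embd, n_embd),    # map back to the original embedding dimension
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```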


So, the next step is to build the component that allocates and manages the experts: the gating network.

The gating network here is also implemented as a linear layer, which maps the output of the self-attention layer to a vector with one entry per expert.

The output of this linear layer is a score vector, where each score represents the importance of the corresponding expert for the token currently being processed.

The gating network takes the top-k values of this score vector, records their indices, and uses these k largest scores to weight the outputs of the corresponding experts, as sketched below.
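A minimal sketch of such a top-k gating layer is shown here; the class name TopKRouter and the hyperparameters num_experts and top_k are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Score each token against all experts, keep only the top-k scores,
    and softmax over them to get sparse routing weights (illustrative sketch)."""
    def __init__(self, n_embd, num_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(n_embd, num_experts)   # one score per expert

    def forward(self, x):                            # x: (B, T, n_embd)
        logits = self.gate(x)                        # (B, T, num_experts)
        top_logits, indices = logits.topk(self.top_k, dim=-1)
        # Keep only the top-k scores; everything else is -inf so softmax zeroes it out.
        masked = torch.full_like(logits, float('-inf')).scatter(-1, indices, top_logits)
        weights = F.softmax(masked, dim=-1)          # sparse routing weights
        return weights, indices
```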


To encourage exploration during training, the author also introduces noise, which prevents all tokens from being routed to the same few experts.

This is typically done by adding random Gaussian noise to the score vector.
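One common way to implement this (a hedged sketch, not necessarily the author's exact code) is to add Gaussian noise scaled by a second learned linear layer before taking the top-k:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    """Top-k gating with Gaussian noise added to the scores to encourage exploration
    (illustrative sketch; the learned per-expert noise scale is an assumption)."""
    def __init__(self, n_embd, num_experts, top_k):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(n_embd, num_experts)    # routing scores
        self.noise = nn.Linear(n_embd, num_experts)   # learned noise scale per expert

    def forward(self, x):                             # x: (B, T, n_embd)
        logits = self.gate(x)
        # Add scaled Gaussian noise to the scores before selecting the top-k experts.
        noisy_logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        top_logits, indices = noisy_logits.topk(self.top_k, dim=-1)
        masked = torch.full_like(noisy_logits, float('-inf')).scatter(-1, indices, top_logits)
        weights = F.softmax(masked, dim=-1)
        return weights, indices
```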


With the routing weights obtained, the model multiplies each token's top-k weights by the outputs of the corresponding top-k experts and sums them, forming a weighted combination that becomes the layer's output.

Finally, putting these modules together yields a complete MoE model.
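Putting the pieces together, a sparse MoE layer could look roughly like the sketch below, reusing the Expert and NoisyTopKRouter classes from the earlier sketches; the dense loop over experts is written for clarity rather than efficiency.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Route each token to its top-k experts and sum their weighted outputs
    (illustrative sketch built from the earlier Expert and NoisyTopKRouter sketches)."""
    def __init__(self, n_embd, num_experts, top_k):
        super().__init__()
        self.router = NoisyTopKRouter(n_embd, num_experts, top_k)
        self.experts = nn.ModuleList([Expert(n_embd) for _ in range(num_experts)])

    def forward(self, x):                              # x: (B, T, n_embd)
        weights, _ = self.router(x)                    # (B, T, num_experts), mostly zeros
        out = torch.zeros_like(x)
        # Dense loop over experts for readability; only the top-k weights are nonzero.
        for i, expert in enumerate(self.experts):
            out = out + weights[..., i:i+1] * expert(x)
        return out
```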

The author provides the corresponding code for the entire process; you can learn more in the original article.

In addition, the author has produced an end-to-end Jupyter notebook that can be run directly while studying each module.

If you are interested, go check it out!

Original address: https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch
Notebook version (GitHub): https://github.com/AviSoori1x/makeMoE/tree/main

