
Inferable on a single 4090 server: 200-billion-parameter sparse large model "Tiangong MoE" is open source

WBOY · Original · 2024-06-05 22:14:46

In the wave of large models, training and deploying state-of-the-art dense LLMs poses huge challenges in computational requirements and associated costs, especially at scales of tens or hundreds of billions of parameters. To address these challenges, sparse models such as Mixture-of-Experts (MoE) models have become increasingly important. These models offer an economically viable alternative by distributing computation across specialized sub-models, or "experts," and can potentially match or even exceed the performance of dense models at a much lower resource cost.

On June 3, important news came from the field of open-source large models: Kunlun Wanwei announced the open sourcing of Skywork-MoE, a 200-billion-parameter sparse large model that maintains strong performance while greatly reducing inference costs.

Skywork-MoE is expanded from an intermediate checkpoint of Kunlun Wanwei's previously open-sourced Skywork-13B model. It is the first open-source hundred-billion-parameter MoE model to fully apply and implement MoE Upcycling technology, and the first open-source hundred-billion-parameter MoE model to support inference on a single server with 4090 GPUs.

What makes it even more appealing to the large-model community is that Skywork-MoE's model weights and technical report are fully open source and free for commercial use, with no application required.

  • Model weight download address:

○ https://huggingface.co/Skywork/Skywork-MoE-base

○ https://huggingface.co/Skywork/Skywork-MoE-Base-FP8

  • Model open-source repository: https://github.com/SkyworkAI/Skywork-MoE

  • Model technical report: https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdf

  • Model inference code: (supports 8-bit quantized loading and inference on an 8x4090 server) https://github.com/SkyworkAI/vllm

Skywork-MoE is currently the largest open-source MoE model that can run inference on an 8x4090 server. An 8x4090 server provides 192 GB of GPU memory in total. Under FP8 quantization (the weights occupy 146 GB), and using the non-uniform Tensor Parallel inference method pioneered by the Kunlun Wanwei team, Skywork-MoE can reach a throughput of 2200 tokens/s at a suitable batch size.

For the complete inference framework code and installation environment, see: https://github.com/SkyworkAI/Skywork-MoE
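As a rough illustration, the snippet below sketches how such an 8-GPU FP8 deployment might be launched through the Skywork vLLM fork, assuming the fork follows upstream vLLM's LLM/SamplingParams interface; the model name comes from the links above, but the exact arguments the fork accepts may differ.

```python
# Hypothetical usage sketch: loading Skywork-MoE-Base-FP8 on an 8x4090 server
# through the Skywork vLLM fork (https://github.com/SkyworkAI/vllm).
# Assumes the fork keeps upstream vLLM's LLM / SamplingParams interface;
# the exact flags it supports may differ.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Skywork/Skywork-MoE-Base-FP8",  # FP8 weights, ~146 GB total
    tensor_parallel_size=8,                # spread across the 8 RTX 4090s
    quantization="fp8",                    # 8-bit quantized loading/inference
    trust_remote_code=True,
)

sampling = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["The mixture-of-experts architecture"], sampling)
print(outputs[0].outputs[0].text)
```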

Skywork-MoE Introduction

The Skywork-MoE model open-sourced this time belongs to the R&D model series of Tiangong 3.0 and is the mid-range model (Skywork-MoE-Medium). The model has 146B total parameters and 22B activated parameters, with 16 Experts in total, each 13B in size, and 2 Experts activated per token.
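For readers less familiar with this routing pattern, here is a minimal, generic top-2 MoE layer in PyTorch with 16 experts. It only illustrates the "16 experts, 2 activated per token" layout described above; it is not the Skywork-MoE implementation, and the hidden sizes are placeholders.

```python
# Generic top-2 MoE layer: 16 experts, 2 activated per token.
# Purely illustrative of the routing pattern described in the article,
# not the Skywork-MoE implementation; dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # Gating Layer
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: [tokens, d_model]
        logits = self.gate(x)                  # [tokens, n_experts]
        probs = F.softmax(logits, dim=-1)
        topv, topi = probs.topk(self.top_k, dim=-1)   # pick top-2 experts per token
        topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize the 2 weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e         # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += topv[mask, k:k+1] * expert(x[mask])
        return out

moe = Top2MoE()
y = moe(torch.randn(8, 1024))   # 8 tokens through the MoE layer
print(y.shape)                  # torch.Size([8, 1024])
```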

It is understood that Tiangong 3.0 has also trained two other MoE models, 75B (Skywork-MoE-Small) and 400B (Skywork-MoE-Large), which are not included in this open-source release.

Kunlun Wanwei evaluated Skywork-MoE on the current major mainstream model evaluation benchmarks. At the same activated parameter count of 20B (i.e., the same inference compute), Skywork-MoE's capabilities are at the forefront of the industry, close to those of a 70B dense model, reducing the model's inference cost by nearly 3x.


It is worth noting that Skywork-MoE's total parameter count is about 1/3 smaller than that of DeepSeekV2, achieving similar capabilities with a smaller parameter size.

Technical Innovation

To address the difficulty of training MoE models and their poor generalization performance, Skywork-MoE designed two training optimization algorithms:

Gating Logits Normalization operation

Skywork-MoE adds a normalization operation to the token-routing logits of the Gating Layer, making the Gating Layer's parameter learning lean more toward the selected top-2 experts and increasing the MoE model's confidence in its top-2 choices.
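The sketch below shows one plausible form of such a normalization (standardize the logits, then rescale with a sharpening factor lam before the softmax); the exact formulation used by Skywork-MoE is given in the technical report.

```python
# Sketch of a gating-logit normalization step before top-2 routing.
# One plausible form (standardize the logits, then rescale with a sharpening
# factor lam); see the technical report for the exact formulation used.
import torch
import torch.nn.functional as F

def normalized_gate_probs(logits: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """logits: [tokens, n_experts] raw outputs of the Gating Layer."""
    mu = logits.mean(dim=-1, keepdim=True)
    sigma = logits.std(dim=-1, keepdim=True)
    z = lam * (logits - mu) / (sigma + 1e-6)   # normalized, sharpened logits
    return F.softmax(z, dim=-1)                # more confident top-2 weights

probs = normalized_gate_probs(torch.randn(4, 16), lam=2.0)
topv, topi = probs.topk(2, dim=-1)             # top-2 experts per token
```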

Adaptive Aux Loss

Different from the traditional aux loss with a fixed coefficient (a fixed hyperparameter), Skywork-MoE lets the model adaptively select appropriate aux loss coefficients at different stages of MoE training, keeping the Drop Token Rate within a suitable range. This both achieves a balanced expert distribution and lets experts learn in a differentiated way, improving the model's overall performance and generalization. In the early stage of MoE training, parameters are not yet well learned and the Drop Token Rate is too high (token distribution is too uneven), so a larger aux loss is needed to help with token load balancing; in the later stage, the Skywork-MoE team wants to retain a certain degree of differentiation among Experts to keep the Gating from tending to route tokens randomly, so a smaller aux loss is needed to reduce the correction.
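The update rule below is only a sketch of this idea, assuming a simple feedback scheme that raises the aux-loss coefficient when the observed Drop Token Rate is above a target band and lowers it when the rate falls below; the actual schedule used by Skywork-MoE is described in the technical report.

```python
# Sketch of an adaptive aux-loss coefficient driven by the observed
# Drop Token Rate. The multiplicative update rule and target band here are
# assumptions, not the exact schedule used by Skywork-MoE.
def update_aux_coeff(coeff: float,
                     drop_token_rate: float,
                     target_low: float = 0.01,
                     target_high: float = 0.03,
                     step: float = 1.2,
                     min_coeff: float = 1e-4,
                     max_coeff: float = 1e-1) -> float:
    if drop_token_rate > target_high:      # early training: too many dropped tokens
        coeff *= step                      # push harder for load balance
    elif drop_token_rate < target_low:     # late training: experts already balanced
        coeff /= step                      # relax so experts stay differentiated
    return min(max(coeff, min_coeff), max_coeff)

# Example: coefficient drifts up while drops are frequent, down once they subside.
coeff = 1e-2
for rate in [0.08, 0.06, 0.02, 0.005, 0.004]:
    coeff = update_aux_coeff(coeff, rate)
    print(f"drop rate {rate:.3f} -> aux coeff {coeff:.4f}")
```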


Training Infra

Efficiently performing large-scale distributed training of MoE models is a difficult challenge. Skywork-MoE proposes two important parallel optimization designs, achieving a training MFU of 38% on a thousand-GPU cluster, where the MFU is computed from the theoretical compute of the 22B activated parameters.
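As a back-of-the-envelope reference, MFU under this convention can be estimated with the standard ~6·N·tokens-per-second approximation for transformer training FLOPs, counting only the 22B activated parameters. The throughput and per-GPU peak numbers in the sketch below are placeholders chosen purely to show the formula, not measured Skywork-MoE figures.

```python
# Back-of-the-envelope MFU estimate using the standard ~6 * N * tokens/sec
# approximation for transformer training FLOPs, counted over activated
# parameters only (22B), as described above. Throughput and peak-FLOPs
# numbers are placeholders, not measured Skywork-MoE figures.
def mfu(activated_params: float, tokens_per_sec: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved = 6.0 * activated_params * tokens_per_sec   # training FLOPs/s actually used
    available = n_gpus * peak_flops_per_gpu              # cluster peak FLOPs/s
    return achieved / available

print(mfu(activated_params=22e9,
          tokens_per_sec=2.8e6,          # placeholder cluster-wide token throughput
          n_gpus=1000,                   # "thousand-GPU" cluster
          peak_flops_per_gpu=989e12))    # illustrative BF16 peak per GPU
```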

Expert Data Parallel

Different from the existing EP (Expert Parallel) and ETP (Expert Tensor Parallel) designs in the Megatron-LM community, the Skywork-MoE team proposed a parallel design called Expert Data Parallel (EDP). This approach can still partition the model efficiently when the number of Experts is small, and the all2all communication introduced by the Experts can be largely optimized and overlapped. Compared with EP's constraint on the number of GPUs and ETP's inefficiency on thousand-GPU clusters, EDP better addresses the parallelism pain points of large-scale distributed MoE training. At the same time, EDP is simple, robust, easy to scale, and can be implemented and verified quickly.


The simplest EDP example: with two cards, TP = 2 and EP = 2, where the Attention part uses Tensor Parallel and the Expert part uses Expert Parallel.
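The toy layout below only restates the two-card split from this example (TP = 2 for the Attention part, EP = 2 across the 16 Experts) by listing which shards each rank would hold; it is an illustration of the figure, not the actual EDP implementation.

```python
# Toy illustration of the two-card layout described above: attention weights
# split with Tensor Parallel (TP=2), the 16 experts split with Expert Parallel
# (EP=2). This mirrors the example only; it is not the EDP implementation.
N_EXPERTS = 16
TP = 2   # tensor-parallel degree for the Attention part
EP = 2   # expert-parallel degree for the Expert part

for rank in range(2):                       # two GPUs
    tp_shard = rank % TP                    # which slice of the attention weights
    experts = [e for e in range(N_EXPERTS) if e % EP == rank % EP]
    print(f"rank {rank}: attention TP shard {tp_shard}/{TP}, holds experts {experts}")
```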

Non-uniform split pipeline parallel

Because of the Embedding computation in the first stage, the Loss computation in the last stage, and the Pipeline Buffer, evenly dividing the Layers under pipeline parallelism leads to an obvious imbalance in compute load and GPU memory load across stages. The Skywork-MoE team proposed a non-uniform pipeline-parallel partitioning and recomputation layer allocation method that makes the overall compute/memory load more balanced and improves end-to-end training throughput by about 10%.


Comparison of pipeline bubbles under uniform and non-uniform partitioning: for a 24-layer LLM, (a) is a uniform partition into 4 stages with [6, 6, 6, 6] layers per stage; (b) is the optimized non-uniform partition into 5 stages with [5, 5, 5, 5, 4] layers per stage. During the phase when the pipeline is full, the non-uniform partition has fewer bubbles.
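A rough way to see why the non-uniform split helps is to charge the first and last stages for their extra Embedding/Loss work and compare the most loaded stage under each split. In the sketch below only the layer counts come from the comparison above; the extra-cost weights are made up for illustration.

```python
# Rough comparison of per-stage load for the uniform [6,6,6,6] and
# non-uniform [5,5,5,5,4] splits of a 24-layer LLM. The extra costs charged
# to the first stage (Embedding) and last stage (Loss) are made-up weights;
# only the layer counts come from the comparison above.
def stage_loads(layers_per_stage, embed_cost=2.0, loss_cost=2.0):
    loads = [float(n) for n in layers_per_stage]
    loads[0] += embed_cost        # first stage also computes the Embedding
    loads[-1] += loss_cost        # last stage also computes the Loss
    return loads

for split in ([6, 6, 6, 6], [5, 5, 5, 5, 4]):
    loads = stage_loads(split)
    print(split, "-> per-stage load", loads, "| bottleneck", max(loads))
# The bottleneck stage (which sets the steady-state pipeline step time) is
# lighter under the non-uniform split, so there are fewer bubbles.
```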

In addition, Skywork-MoE ran a series of Scaling Law based experiments to explore which factors affect the performance of Upcycling versus From Scratch MoE training.


A rule of thumb that can be followed: if the FLOPs for training the MoE model are more than 2x those for training the Dense model, it is better to train the MoE From Scratch; otherwise, choosing Upcycling to train the MoE can significantly reduce training costs.
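Stated as code, the rule of thumb reduces to a single ratio check; the 2x threshold comes from the article, and the numbers in the example call are placeholders.

```python
# The rule of thumb as a one-line decision: if the MoE training budget exceeds
# 2x the Dense training budget (in FLOPs), train From Scratch; otherwise
# Upcycle. The example FLOP counts are placeholders.
def choose_moe_init(moe_train_flops: float, dense_train_flops: float) -> str:
    return "from-scratch" if moe_train_flops > 2.0 * dense_train_flops else "upcycling"

print(choose_moe_init(moe_train_flops=5e23, dense_train_flops=3e23))  # -> upcycling
```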

