A single 4090 inferable, 200 billion sparse large model 'Tiangong MoE' is open source

In the wave of large models, training and deploying state-of-the-art dense LLMs poses huge challenges in computational requirements and cost, especially at the scale of tens or hundreds of billions of parameters. To address these challenges, sparse models such as Mixture-of-Experts (MoE) models have become increasingly important. They offer an economically viable alternative by distributing computation across specialized sub-models, or "experts," and can potentially match or even exceed the performance of dense models at far lower resource requirements.

On June 3, important news came from the open-source large model field: Kunlun Wanwei announced the open sourcing of Skywork-MoE, a 200-billion-parameter sparse large model that greatly reduces inference costs while maintaining strong performance.

Skywork-MoE is upcycled from an intermediate checkpoint of Kunlun Wanwei's previously open-sourced Skywork-13B model. It is the first open-source hundred-billion-scale MoE model to fully apply and implement MoE Upcycling technology, and also the first open-source hundred-billion-scale MoE model that supports inference on a single server equipped with 4090 GPUs.

What makes this even more noteworthy for the large-model community is that Skywork-MoE's model weights and technical report are fully open source and free for commercial use, with no application required.

  • Model weight download addresses:

○ https://huggingface.co/Skywork/Skywork-MoE-base

○ https://huggingface.co/Skywork/Skywork-MoE-Base-FP8

  • Model open-source repository: https://github.com/SkyworkAI/Skywork-MoE

  • Model technical report: https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdf

  • Model inference code (supports 8-bit quantized loading and inference on an 8x4090 server): https://github.com/SkyworkAI/vllm

Skywork-MoE is currently the largest open-source MoE model that can run inference on an 8x4090 server. The 8x4090 server has 192 GB of GPU memory in total. Under FP8 quantization (the weights occupy 146 GB), and using the non-uniform Tensor Parallel inference method pioneered by the Kunlun Wanwei team, Skywork-MoE can reach a throughput of 2200 tokens/s at a suitable batch size.
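As a rough sanity check on the memory figures above, here is a minimal sketch assuming 1 byte per parameter under FP8 and 24 GB of memory per RTX 4090, ignoring activation and KV-cache memory:

```python
# Back-of-the-envelope memory estimate for serving Skywork-MoE on 8x RTX 4090.
total_params = 146e9            # 146B total parameters
bytes_per_param_fp8 = 1         # FP8 stores one byte per weight
weight_gb = total_params * bytes_per_param_fp8 / 1e9

num_gpus = 8
gpu_memory_gb = 24              # per RTX 4090
cluster_memory_gb = num_gpus * gpu_memory_gb

print(f"weights: ~{weight_gb:.0f} GB of {cluster_memory_gb} GB total")
# -> weights: ~146 GB of 192 GB total, leaving ~46 GB for activations and KV cache
```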

For the complete inference framework code and environment setup, see: https://github.com/SkyworkAI/Skywork-MoE

Skywork-MoE Introduction

The open-sourced Skywork-MoE model belongs to the R&D model series of Tiangong 3.0 and is the mid-range model (Skywork-MoE-Medium). The model has 146B total parameters and 22B activated parameters, with 16 Experts in total, each 13B in size, of which 2 Experts are activated for each token.

Tiangong 3.0 has also trained two other MoE models, a 75B (Skywork-MoE-Small) and a 400B (Skywork-MoE-Large), which are not included in this open-source release.

Kunlun Wanwei evaluated Skywork-MoE on the major mainstream model benchmarks. At the same activated parameter count of about 20B (i.e., the same inference compute), Skywork-MoE's capabilities are at the forefront of the industry, close to a 70B dense model, cutting the model's inference cost to roughly one third.


It is worth noting that Skywork-MoE's total parameter count is about one third smaller than that of DeepSeek-V2, achieving similar capabilities with a smaller parameter size.

Technical Innovation

To address the difficulty of training MoE models and their tendency toward poor generalization, Skywork-MoE designed two training optimization techniques:

Gating Logits Normalization

Skywork-MoE adds a normalization step to the token-routing logits of the Gating Layer. This biases the Gating Layer's parameter learning toward the selected top-2 experts and increases the MoE model's confidence in its top-2 choices.
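The following is a minimal PyTorch-style sketch of this idea. The exact formulation is given in the technical report; here we assume the normalization standardizes the gating logits (zero mean, unit variance, scaled by a factor lambda) before the softmax and top-2 selection, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def normalized_top2_gating(hidden, gate_weight, lam=1.0, eps=1e-6):
    # Raw gating logits: one score per expert for each token.
    logits = hidden @ gate_weight                    # [num_tokens, num_experts]
    # Assumed normalization step: standardize logits over the expert
    # dimension, then rescale. This sharpens the softmax so the chosen
    # top-2 experts receive higher-confidence gate values.
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    logits = lam * (logits - mean) / (std + eps)
    probs = F.softmax(logits, dim=-1)
    # Route each token to its top-2 experts.
    top2_probs, top2_idx = probs.topk(2, dim=-1)
    return top2_probs, top2_idx
```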

Adaptive Aux Loss

Unlike a traditional aux loss with a fixed coefficient (a fixed hyperparameter), Skywork-MoE lets the model adaptively choose an appropriate aux loss coefficient at different stages of MoE training. This keeps the Drop Token Rate within a suitable range, achieving balanced expert load while still allowing experts to differentiate, which improves the model's overall performance and generalization. In the early stage of MoE training, parameters are not yet well learned and the Drop Token Rate is too high (token routing is highly uneven), so a larger aux loss is needed to help balance the token load. In the later stage, the Skywork-MoE team wants to preserve a certain degree of differentiation between Experts and avoid the Gating drifting toward random token assignment, so a smaller aux loss is needed to reduce over-correction.
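A minimal sketch of how such an adaptive coefficient could be driven by the observed Drop Token Rate (the target, step size, and bounds below are illustrative placeholders, not the values used by Skywork-MoE):

```python
def update_aux_loss_coeff(drop_token_rate, coeff,
                          target_rate=0.01, step=1.2,
                          min_coeff=1e-4, max_coeff=1e-1):
    # If too many tokens are being dropped, routing is unbalanced:
    # increase the aux-loss coefficient to push the load back toward balance.
    if drop_token_rate > target_rate:
        return min(coeff * step, max_coeff)
    # Otherwise decay the coefficient so experts can stay differentiated
    # and gating does not drift toward uniform/random assignment.
    return max(coeff / step, min_coeff)
```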


Training Infra

Efficient large-scale distributed training of MoE models is a hard problem. Skywork-MoE proposes two important parallelization optimizations that achieve 38% MFU training throughput on a thousand-GPU (kilo-card) cluster, where MFU is computed against the theoretical compute of the 22B activated parameters.
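For reference, MFU of this kind is usually estimated from the activated parameter count using the standard approximation of about 6 FLOPs per parameter per token for a forward plus backward pass. A sketch with placeholder throughput and hardware numbers (not reported figures):

```python
def model_flops_utilization(activated_params, tokens_per_second,
                            num_gpus, peak_flops_per_gpu):
    # Achieved training FLOPs per second / theoretical peak FLOPs of the cluster.
    achieved = 6 * activated_params * tokens_per_second
    return achieved / (num_gpus * peak_flops_per_gpu)

# Placeholder example: 22B activated parameters, hypothetical cluster numbers.
print(model_flops_utilization(22e9, tokens_per_second=9e5,
                              num_gpus=1000, peak_flops_per_gpu=312e12))
# ~0.38 with these made-up numbers
```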

Expert Data Parallel

Different from the existing EP (Expert Parallel) and ETP (Expert Tensor Parallel) designs in the Megatron-LM community, the Skywork-MoE team proposed a parallel design called Expert Data Parallel (EDP). This scheme can still partition the model efficiently when the number of Experts is small, and the all2all communication introduced by the Experts can be largely optimized and overlapped. Compared with EP's constraints on the number of GPUs and ETP's inefficiency on kilo-card clusters, EDP better addresses the parallelism pain points of large-scale distributed MoE training. At the same time, EDP is simple, robust, easy to scale, and can be implemented and verified quickly.


This is the simplest EDP example. With two cards, TP = 2 and EP = 2: the attention part uses Tensor Parallel, and the Expert part uses Expert Parallel.
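To make the two-card example concrete, here is a toy sketch of the weight placement it implies (purely illustrative, not the actual implementation):

```python
def edp_layout(num_experts=16, num_gpus=2, tp_size=2, ep_size=2):
    # In this toy two-GPU case: every attention matrix is sharded 1/tp_size
    # per GPU (Tensor Parallel), while the experts are partitioned across
    # the same GPUs (Expert Parallel), num_experts / ep_size full experts each.
    experts_per_gpu = num_experts // ep_size
    return {
        gpu: {
            "attention_shard": f"1/{tp_size} of each attention weight matrix",
            "experts": list(range(gpu * experts_per_gpu, (gpu + 1) * experts_per_gpu)),
        }
        for gpu in range(num_gpus)
    }

print(edp_layout())
# GPU 0 holds experts 0-7, GPU 1 holds experts 8-15; attention is split across both.
```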

Non-uniform split pipeline parallel

Because of the Embedding computation in the first stage, the Loss computation in the last stage, and the Pipeline Buffers, evenly dividing the layers under pipeline parallelism leads to a clear imbalance in compute and GPU-memory load across stages. The Skywork-MoE team proposed a non-uniform pipeline-parallel partitioning and recomputation-layer allocation method that balances the overall compute/memory load and improves end-to-end training throughput by about 10%.


Comparison of pipeline bubbles under uniform and non-uniform partitioning: for a 24-layer LLM, (a) is the uniform partition into 4 stages, with [6, 6, 6, 6] layers per stage; (b) is the optimized non-uniform partition into 5 stages, with [5, 5, 5, 5, 4] layers per stage. During the steady phase when the pipeline is full, the non-uniform partition has fewer bubbles.
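A simplified way to see why the non-uniform split helps: with a uniform split the first and last stages also pay for the embedding and loss computation, so they become the slowest links once the pipeline is full. The sketch below uses illustrative cost units (one layer = 1, embedding and loss each roughly one layer), not measured numbers:

```python
def stage_times(stage_layers, embed_cost=1.0, loss_cost=1.0, layer_cost=1.0):
    # Per-microbatch time of each pipeline stage; the first stage also pays
    # the embedding cost and the last stage also pays the loss cost.
    times = [n * layer_cost for n in stage_layers]
    times[0] += embed_cost
    times[-1] += loss_cost
    return times

def steady_state_period(stage_layers):
    # When the pipeline is full, throughput is limited by the slowest stage.
    return max(stage_times(stage_layers))

print(steady_state_period([6, 6, 6, 6]))      # uniform, 4 stages    -> 7.0
print(steady_state_period([5, 5, 5, 5, 4]))   # non-uniform, 5 stages -> 6.0
```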

In addition, Skywork-MoE ran a series of experiments based on Scaling Laws to explore which factors constrain the performance of MoE models trained via Upcycling versus from scratch.


A practical rule of thumb: if the FLOPs budget for training the MoE model is more than twice that spent on training the Dense model, it is better to train the MoE from scratch; otherwise, Upcycling from the Dense checkpoint can significantly reduce training cost.
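Expressed as a simple decision rule (a sketch of the rule of thumb above, not a function from the Skywork-MoE codebase):

```python
def choose_moe_training_strategy(moe_train_flops, dense_train_flops):
    # Prefer from-scratch training when the MoE budget exceeds ~2x the FLOPs
    # already spent on the dense model; otherwise upcycling the dense
    # checkpoint significantly reduces training cost.
    if moe_train_flops > 2 * dense_train_flops:
        return "from_scratch"
    return "upcycling"
```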
