
Technical details of ByteDance's 10,000-GPU cluster disclosed: GPT-3 training completed in 2 days, compute utilization surpasses NVIDIA Megatron-LM

WBOY | 2024-03-01

As the technical analysis of Sora unfolds, the importance of AI infrastructure becomes increasingly prominent.

A new paper from ByteDance and Peking University has attracted attention against this backdrop:

The paper discloses that the 10,000-GPU cluster built by ByteDance can complete the training of a GPT-3-scale model (175B) within 1.75 days.


Specifically, ByteDance proposes a production system called MegaScale, which aims to address the efficiency and stability challenges of training large models on a 10,000-GPU cluster.

When training a 175-billion-parameter large language model on 12,288 GPUs, MegaScale achieved a model FLOPs utilization (MFU) of 55.2%, 1.34 times that of NVIDIA Megatron-LM.

The paper also reveals that, as of September 2023, ByteDance had built Ampere-architecture GPU (A100/A800) clusters with more than 10,000 cards and is currently building large-scale Hopper-architecture (H100/H800) clusters.

A production system for the 10,000-GPU cluster

In the era of large models, the importance of GPUs needs no elaboration.

But large-model training cannot simply be kicked off once enough cards are available: when a GPU cluster reaches the 10,000-card scale, achieving efficient and stable training is itself a challenging engineering problem.

The first challenge: efficiency.

Training a large language model is not a simple parallel task. The model must be distributed across many GPUs, and these GPUs need to communicate frequently to jointly advance the training process. Beyond communication, factors such as operator optimization, data preprocessing, and GPU memory consumption all affect model FLOPs utilization (MFU), the metric used to measure training efficiency.

MFU is the ratio of actual throughput to the theoretical maximum throughput.
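
As a rough illustration of how MFU is commonly estimated, the sketch below uses the standard approximation of about 6 FLOPs per parameter per token for a forward-plus-backward pass; the throughput figure is invented for illustration and is not taken from the paper.

```python
# Rough MFU estimate for a dense decoder-only model.
# Assumes the common "6 * N FLOPs per token" approximation (forward + backward);
# peak_tflops_per_gpu is the per-GPU datasheet peak for the chosen precision.

def estimate_mfu(params: float, tokens_per_second: float,
                 num_gpus: int, peak_tflops_per_gpu: float) -> float:
    """Return model FLOPs utilization as a fraction in [0, 1]."""
    achieved_flops = 6.0 * params * tokens_per_second           # FLOPs/s actually delivered
    theoretical_flops = num_gpus * peak_tflops_per_gpu * 1e12   # cluster peak FLOPs/s
    return achieved_flops / theoretical_flops

# Illustrative numbers only (the throughput value is assumed, not from the paper):
# a 175B-parameter model on 12288 A100-class GPUs, 312 TFLOPS BF16 peak per GPU.
print(estimate_mfu(params=175e9, tokens_per_second=2.0e6,
                   num_gpus=12288, peak_tflops_per_gpu=312))    # prints roughly 0.55
```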

The second challenge: stability.

Training a large language model takes a very long time, which means that failures and slowdowns during training are not uncommon.

The cost of a failure is high, so shortening fault recovery time becomes especially important.

To address these challenges, ByteDance researchers built MegaScale and have deployed it in ByteDance's data centers to support the training of various large models.

MegaScale builds on NVIDIA Megatron-LM.

Specific improvements include the co-design of algorithms and system components, overlapping communication with computation, operator optimization, data pipeline optimization, and network performance tuning:

  • Algorithm optimization: The researchers introduced a parallelized Transformer block, sliding window attention (SWA), and the LAMB optimizer into the model architecture to improve training efficiency without sacrificing model convergence (a sketch of the parallel block appears after this list).
  • Communication overlap: Based on a detailed analysis of the operations of each compute unit under 3D parallelism (data parallelism, pipeline parallelism, tensor parallelism), the researchers designed techniques that effectively reduce the latency of operations on non-critical execution paths and shorten the per-iteration time of model training.
  • Efficient operators: The GEMM operators were optimized, and operations such as LayerNorm and GeLU were fused to reduce the overhead of launching multiple kernels and to improve memory access patterns.
  • Data pipeline optimization: Data preprocessing and loading were optimized to reduce GPU idle time, using asynchronous data preprocessing and the elimination of redundant data loaders.
  • Collective communication group initialization: The initialization of NVIDIA's multi-GPU communication library NCCL in distributed training was optimized. Without optimization, initialization on a 2,048-GPU cluster takes 1,047 seconds; after optimization it drops to under 5 seconds, and initialization on a 10,000-GPU cluster drops to under 30 seconds.
  • Network performance tuning: The researchers analyzed inter-machine traffic under 3D parallelism and designed solutions to improve network performance, including network topology design, reducing ECMP hash conflicts, congestion control, and retransmission timeout settings.
  • Fault tolerance: In a 10,000-GPU cluster, software and hardware failures are unavoidable. The researchers designed a training framework for automatic fault identification and rapid recovery, including diagnostic tools that monitor system components and events and high-frequency checkpointing of the training process.
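
The algorithm-level change in the first item is easiest to see in code. Below is a minimal PyTorch sketch of the parallel Transformer block idea (popularized by models such as GPT-J and PaLM): the attention and MLP branches share one LayerNorm output and their results are summed into the residual, instead of being applied sequentially. This illustrates the general technique rather than ByteDance's implementation; the class name and dimensions are invented, and sliding window attention and LAMB are not shown.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel Transformer block: the attention and MLP branches read the
    same LayerNorm output and are summed into the residual, rather than
    being applied one after the other."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                                    # single shared LayerNorm
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)                   # residual sums both branches

# Quick shape check with arbitrary sizes.
block = ParallelBlock(d_model=512, n_heads=8, d_ff=2048)
y = block(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```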

The paper notes that MegaScale can automatically detect and recover from more than 90% of software and hardware failures.
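
The recovery side of this is typically built around frequent checkpointing plus automatic restart. The sketch below shows a generic resume-from-latest-checkpoint training loop in PyTorch; the directory layout, save interval, and helper names are illustrative assumptions rather than MegaScale internals.

```python
import glob
import os
import torch

CKPT_DIR = "checkpoints"   # hypothetical location, not from the paper
SAVE_EVERY = 100           # frequent saves shorten the work lost on a failure

def latest_checkpoint():
    """Return the newest checkpoint path, or None if none exist."""
    paths = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")))
    return paths[-1] if paths else None

def train(model, optimizer, data_iter, total_steps):
    os.makedirs(CKPT_DIR, exist_ok=True)
    start_step = 0

    # On (re)start, resume from the newest checkpoint if one exists.
    ckpt_path = latest_checkpoint()
    if ckpt_path is not None:
        state = torch.load(ckpt_path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_step = state["step"] + 1

    for step in range(start_step, total_steps):
        loss = model(next(data_iter)).mean()   # placeholder training step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % SAVE_EVERY == 0:             # high-frequency checkpointing
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step},
                       os.path.join(CKPT_DIR, f"step_{step:08d}.pt"))
```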


Experimental results show that MegaScale achieved 55.2% MFU when training a 175B large language model on 12,288 GPUs, 1.34 times the compute utilization of Megatron-LM.

The paper also reports MFU comparison results for training a 530B large language model.

One More Thing

Just as this technical paper was sparking discussion, news also emerged about ByteDance's Sora-like product:

Jianying, ByteDance's video-editing app, has opened an invitation-only beta test of its Sora-like AI video tool.


It seems the foundation has been laid. Are you looking forward to ByteDance's large-model products?

Paper address: https://arxiv.org/abs/2402.15627

