Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training-AI-php.cn

Home

Technology peripherals

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Aug 08, 2024 pm 07:51 PM

projectByteDance

As the iteration speed of large models becomes faster and faster, the scale of training clusters becomes larger and larger, and high-frequency software and hardware failures have become pain points that hinder the further improvement of training efficiency. The checkpoint system is responsible for the status during the training process. Storage and recovery have become the key to overcoming training failures, ensuring training progress and improving training efficiency.

Recently, the ByteDance Beanbao model team and the University of Hong Kong jointly proposed ByteCheckpoint. This is a large model checkpointing system native to PyTorch, compatible with multiple training frameworks, and supports efficient reading and writing of checkpoints and automatic re-segmentation. Compared with existing methods, it has significant performance improvements and ease-of-use advantages. This article introduces the challenges faced by Checkpoint in improving large model training efficiency, summarizes ByteCheckpoint’s solution ideas, system design, I/O performance optimization technology, and experimental results in storage performance and read performance testing.

Meta officials recently disclosed the failure rate of Llama3 405B training on 16384 H100 80GB training clusters - in just 54 days, 419 interruptions occurred, with an average crash every three hours, attracting the attention of many practitioners. .

As a common saying in the industry says, the only certainty for large-scale training systems is software and hardware failure. As the training scale and model size increase, overcoming software and hardware failures and improving training efficiency have become important influencing factors for large model iterations.

Checkpoint has become the key to improving training efficiency. In the Llama training report, the technical team mentioned that in order to combat the high failure rate, frequent checkpoints need to be performed during the training process to save the status of the model, optimizer, and data reader during training to reduce the loss of training progress.

The ByteDance Beanbao large model team and the University of Hong Kong recently released the results - ByteCheckpoint, a PyTorch native, compatible with multiple training frameworks, and a large model Checkpointing system that supports efficient reading and writing of Checkpoint and automatic re-segmentation.

Compared with the baseline method, ByteCheckpoint improves performance by up to 529.22 times on checkpoint saving and up to 3.51 times on loading. The minimalist user interface and Checkpoint automatic re-segmentation function significantly reduce user acquisition and usage costs and improve the ease of use of the system.

The results of the paper have now been made public.

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

ByteCheckpoint: A Unified Checkpointing System for LLM Development
Paper link: https://team.doubao.com/zh/publication/bytecheckpoint-a-unified-checkpointing-system-for-llm-development?view_from =research

Technical challenges of Checkpoint technology in large model training

Current Checkpoint related technologies face a total of four challenges in supporting large model training efficiency:

The existing system design has flaws, which significantly increases the additional I/O overhead of training

In the process of training industrial-level large language models (LLM), the training status needs to pass checkpoint technology (Checkpointing) for saving and persistence. Typically, a Checkpoint consists of 5 parts (model, optimizer, data reader, random number and user-defined configuration). This process often brings minutes-level blockage to training, seriously affecting training efficiency.

In large-scale training scenarios using remote persistent storage systems, the existing Checkpointing system does not fully utilize the GPU to CPU memory copy (D2H copy), serialization, local save, and upload to storage during the Checkpoint save process. Execution independence of each stage of the system.

In addition, the parallel processing potential of different training processes sharing Checkpoint access tasks has not been fully explored. These system design deficiencies increase the additional I/O overhead caused by Checkpoint training.

Checkpoint is difficult to re-segment, and the development and maintenance overhead of manual segmentation script is too high

In different training stages of LLM (pre-training to SFT or RLHF) and different tasks (from When migrating Checkpoints between training tasks (pulling Checkpoints from different stages for automatic evaluation), it is usually necessary to re-segment the Checkpoints stored in the persistent storage system (Checkpoint Resharding) to adapt to the new parallelism configuration of the downstream tasks. and quotas for available GPU resources.

Existing checkpointing systems [1, 2, 3, 4] all assume that the parallelism configuration and GPU resources remain unchanged during storage and loading, and cannot handle the need for checkpoint re-segmentation. A common solution currently in the industry is to customize Checkpoint merging or re-splitting scripts for different models. This method brings a lot of development and maintenance overhead, and has poor scalability.

The Checkpoint modules of different training frameworks are fragmented, which brings challenges to the unified management and performance optimization of Checkpoint

On the training platform in the industry, engineers and scientists often work together based on task characteristics , select the appropriate framework (Megatron-LM [5], FSDP [6], DeepSpeed [7], veScale [8, 9]) for training, and save the Checkpoint to the storage system. However, these different training frameworks have their own independent checkpoint formats and reading and writing modules. The checkpoint module designs of different training frameworks are different, which brings challenges to the unified checkpoint management and performance optimization of the underlying system.

Users of distributed training systems face multiple problems

From the perspective of users of training systems (AI research scientists or engineers), when users use distributed training systems, The checkpoint direction is often troubled by three problems:

1) How to store checkpoints efficiently and save checkpoints without affecting training efficiency.

2) How to re-segment the Checkpoint and read it correctly according to the new parallelism for the Checkpoint stored under one degree of parallelism.

3) How to upload the trained products to a cloud storage system (HDFS, S3, etc.) and manually manage multiple storage systems, which is costly for users to learn and use.

In response to the above problems, the ByteDance Beanbao model team and the laboratory of Professor Wu Chuan of the University of Hong Kong jointly launched ByteCheckpoint.

ByteCheckpoint is a high-performance distributed checkpointing system that is unified with multiple training frameworks, supports multiple storage backends, and has the ability to automatically re-segment checkpoints. ByteCheckpoint provides a simple and easy-to-use user interface, implements a large number of I/O performance optimization technologies to improve storage and reading checkpoint performance, and supports flexible migration of checkpoints in tasks with different parallelism configurations.

System design

Storage architecture

ByteCheckpoint adopts a metadata/tensor data separated storage architecture to realize the decoupling of Checkpoint management and training framework and parallelism .

Tensor slices (Tensor Shard) of models in different training frameworks and optimizers are stored in storage files, and meta information (TensorMeta, ShardMeta, ByteMeta) is stored in a globally unique metadata file.

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

When using different parallelism configurations to read Checkpoint, as shown in the figure below, each training process only needs to set the query meta-information according to the current parallelism to obtain the storage location of the tensor required by the process. Then read directly according to the position to realize automatic checkpoint re-segmentation.

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

Clever solution to irregular tensor segmentation

When different training frameworks are running, they often flatten the shape of the tensor in the model or optimizer into one dimension, thereby improving the set Communication performance. This flattening operation brings the challenge of irregular tensor sharding (Irregular Tensor Sharding) to Checkpoint storage.

As shown in the figure below, in Megatron-LM (distributed large model training framework developed by NVIDIA) and veScale (PyTorch native distributed large model training framework developed by ByteDance), the model parameters correspond to The optimizer state will be flattened into one dimension, merged, and then split according to data parallelism. This results in tensors being irregularly split into different processes, and the meta-information of tensor slices cannot be represented using offset and length tuples, making storage and reading difficult.

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

The problem of irregular tensor segmentation also exists in the FSDP framework.

To eliminate irregularly cut tensor slices, the FSDP framework will perform all-gather set communication and D2H copy operations on one-dimensional tensor slices on all processes before storing Checkpoints to obtain complete irregular cuts. divided tensor. This solution brings huge communication and frequent GPU-CPU synchronization overhead, which seriously affects the performance of Checkpoint storage.

To address this problem, ByteCheckpoint proposed the Asynchronous Tensor Merging technology.

ByteCheckpoint first finds out the irregularly divided tensors in different processes, and then uses asynchronous P2P communication to distribute these irregular tensors to different processes for merging. All P2P communication waits (Wait) and tensor D2H copy operations for these irregular tensors are postponed until they are about to enter the serialization phase, thereby eliminating frequent synchronization overhead and increasing communication with other Checkpoint storage processes. Perform overlap.

System Architecture

The following figure shows the system architecture of ByteCheckpoint:

The API layer provides simple, easy-to-use and unified reading and writing for different training frameworks ( Save ) and read (Load) interface.

The Planner layer will generate access plans for different training processes based on the access objects, and hand them over to the Execution layer to perform the actual I/O tasks.

The Execution layer performs I/O tasks and interacts with the Storage layer, using various I/O optimization technologies for high-performance checkpoint access.

The Storage layer manages different storage backends and performs corresponding optimizations according to different storage backends during I/O tasks.

The layered design enhances the scalability of the system to support more training frameworks and storage backends in the future.

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

API use cases

ByteCheckpoint’s API use cases are as follows:

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

ByteCheckpoint は最小限の API を提供し、ユーザーの開始コストを削減します。チェックポイントを保存および読み取りする場合、ユーザーはストレージ関数とロード関数を呼び出して、保存および読み取りの対象となるコンテンツ、ファイルシステムパス、およびさまざまなパフォーマンス最適化オプションを渡すだけで済みます。

I/Oパフォーマンス最適化技術

チェックポイントストレージ最適化

パイプライン実行

teCheckpoint 完全に非同期のストレージパイプラインを設計しました ( Save Pipeline) は、チェックポイントストレージのさまざまなステージ (P2P テンソル転送、D2H レプリケーション、シリアル化、ローカルファイルシステムの保存およびファイルシステムのアップロード) を分割して、効率的なパイプラインの実行を実現します。

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

繰り返しのメモリ割り当てを避ける

D2H コピープロセスでは、ByteCheckpoint は固定メモリプール (Pinned Memory Pool) を使用し、繰り返しメモリ割り当ての時間オーバーヘッドを削減します。

さらに、高頻度のストレージシナリオで固定メモリプールのリサイクルを同期的に待機することによって生じる追加の時間オーバーヘッドを削減するために、ByteCheckpoint は固定メモリプールに基づくピンポンバッファリングメカニズムを追加します。 2 つの独立したメモリプールが交互に読み取りバッファと書き込みバッファの役割を果たし、後続の I/O 操作を実行する GPU および I/O ワーカーと対話して、ストレージ効率をさらに向上させます。

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

負荷分散

データ並列 (Data-Parallel または DP) トレーニングでは、モデルは異なるデータ並列プロセスグループ (DP グループ) 間で冗長化され、負荷分散アルゴリズムが採用され、冗長性が均等に分散されます。ストレージ用にテンソルをさまざまなプロセスグループにモデル化し、チェックポイントのストレージ効率を効果的に向上させます。図に示すように、チェックポイントを読み取るために並列処理を変更する場合、新しいトレーニングプロセスは、オリジナルテンソルスライスからその一部を読み取ります。

ByteCheckpoint は、オンデマンドの部分ファイル読み取り (部分ファイル読み取り) テクノロジーを使用して、リモートストレージから必要なファイルの断片を直接読み取り、不要なデータのダウンロードや読み取りを回避します。

データ並列 (Data-Parallel または DP) トレーニングでは、モデルは異なるデータ並列プロセスグループ (DP グループ) 間で冗長であり、異なるプロセスグループは同じテンソルスライスを繰り返し読み取ります。大規模なトレーニングシナリオでは、さまざまなプロセスグループがリモート永続ストレージシステム (HDFS など) に同時に大量のリクエストを送信するため、ストレージシステムに大きな負荷がかかります。

繰り返しのデータ読み取りを排除し、トレーニングプロセスによって HDFS に送信されるリクエストを削減し、読み込みパフォーマンスを最適化するために、ByteCheckpoint は同じテンソルスライス読み取りタスクを異なるプロセスに均等に分散し、フェッチ中にリモートファイルを読み取ります。、GPU 間のアイドル帯域幅はテンソルスライスの送信に使用されます。 experimperimperimperimperimperimperimpalimpalimpalimpalimpermental configuration densegptモデルとsparsegptモデル（GPT-3 [10]構造に基づいて実装）を使用し、異なるモデルパラメーター量と異なるものを使用します。トレーニングフレームワーク ByteCheckpoint のチェックポイントアクセスの正確性、ストレージパフォーマンス、および読み取りパフォーマンスが、さまざまなサイズのトレーニングタスクで評価されました。実験構成と正確性テストの詳細については、完全な論文を参照してください。

ストレージパフォーマンステスト

ストレージパフォーマンステストでは、チームはトレーニングプロセス中にさまざまなモデルサイズとトレーニングフレームワークを比較し、合計 50 ステップまたは 100 ステップごとに Checkpoint、Bytecheckpoint、および Baseline メソッドを保存しました。トレーニングによって発生するブロック時間 (チェックポイントの停止)。

書き込みパフォーマンスの徹底的な最適化のおかげで、ByteCheckpoint は、576 カードの SparseGPT 110B - Megatron-LM トレーニングタスクで、ベースラインより 10% 高いパフォーマンスを達成しました。パフォーマンスの向上は 66.65 ～ 74.55 倍であり、256 カードの DenseGPT 10B-FSDP トレーニングタスクでは 529.22 倍のパフォーマンス向上に達することもあります。

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

読み取りパフォーマンステスト

読み取りパフォーマンステストでは、チームはダウンストリームタスクの並列性に基づいて、チェックポイントを読み取るさまざまなメソッドの読み込み時間を比較しました。 ByteCheckpoint は、ベースライン手法と比較して 1.55 ～ 3.37 倍のパフォーマンス向上を達成します。

チームは、ByteCheckpoint のパフォーマンス向上が Megatron-LM ベースライン手法と比較してより顕著であることを観察しました。これは、Megatron-LM が、チェックポイントを新しい並列処理構成に読み取る前に、オフラインスクリプトを実行して分散チェックポイントを再シャードする必要があるためです。対照的に、ByteCheckpoint は、オフラインスクリプトを実行せずに自動チェックポイント再セグメント化を直接実行し、読み取りを効率的に完了できます。

Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training

最後に、ByteCheckpoint の将来の計画に関して、チームは 2 つの側面から開始したいと考えています:

まず、超大規模 GPU クラスタートレーニングタスクの効率的なチェックポイント設定をサポートするという長期目標を達成します。

2 番目に、大規模モデルのトレーニングのライフサイクル全体のチェックポイント管理を実現し、事前トレーニング (Pre-Training) から教師あり微調整 (SFT)、強化学習 (RLHF) までのすべてのシナリオでチェックポイントをサポートします。）と評価（Evaluation）およびその他のシナリオ。

チーム紹介

ByteDance Beanbao ビッグモデルチームは 2023 年に設立されました。業界で最先端の AI ビッグモデルテクノロジーを開発し、世界クラスの研究チームになることに尽力しています。技術と社会の発展に貢献します。

現在、チームには引き続き優秀な人材が入社しており、そのキーワードはハードコア、オープン、そして革新的な精神に満ちており、チームメンバーを奨励し、前向きな職場環境を作り出すことに尽力しています。学び、成長し続けること、そして挑戦を恐れず卓越性を追求すること。

革新的な精神と責任感を持った技術人材と協力して、大規模モデルトレーニングの効率向上を促進し、より多くの進歩と結果を達成したいと考えています。

参考資料

^{[1] Mohan、Jayashree、Amar Phanishaye、Vijay Chidambaram: 頻繁な{きめ細かな{DNN} チェックポイント。」ファイルおよびストレージテクノロジー (FAST 21)。}

^{[2] アイゼンマン、アサフ他「{Check-N-Run}: 第 19 回 USENIX シンポジウム」ネットワーク化されたシステムの設計と実装 (NSDI 22) について。}

[3] Wang、Zhuang 他「Gemini: 第 29 回シンポジウムの議事録」オペレーティングシステムの原則について。2023 年。

[4] Gupta、Tanmaey、他「ジャストインタイムチェックポイント: ディープラーニングトレーニングの失敗からの低コストのエラー回復」システム。2024.

[5] Shoeybi、Mohammad、他「Megatron-lm: モデル並列処理を使用した数十億パラメータ言語モデルのトレーニング。arXiv プレプリント arXiv:1909.08053 (2019)」

^{[6] Zhao、Yanli、他「Pytorch fsdp: 完全にシャード化されたデータの並列スケーリングに関する経験。」arXiv プレプリント arXiv:2304.11277 (2023)。}

^{[7] Rasley、Jeff、他。「Deepspeed: システムの最適化により、1,000 億を超えるパラメーターを使用した深層学習モデルのトレーニングが可能になります。2020 年の第 26 回 ACM SIGKDD 国際会議の議事録。}

^{」[8] Jiang、Ziheng、他。「{MegaScale}: 大規模な言語モデルのトレーニングを 10,000 個を超える {GPU} に拡張する。」第 21 回 USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)。}

^{[9] veScale: A PyTorch Native LLM。トレーニングフレームワーク https://github.com/volcengine/veScale}

^{[10] Brown、Tom、他「言語モデルはニューラル情報処理システムの進歩 33 (2020)」 : 1877-1901.}

The above is the detailed content of Llama3 training crashes every 3 hours? Big bean bag model and HKU team improve crispy Wanka training. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Why Companies Are Losing AI Talent — And How Leaders Can Stop ItMay 14, 2025 am 04:43 AM

But what if this isn’t a hiring crisis at all? What if it’s a leadership one? While the spotlight has been on salaries and skills shortages, some experts argue that it isn’t just that AI professionals are hard to hire, but also that they’re easy to

Creating minutes with ChatGPT! Examples of transcription and prompts [Free materials provided]May 14, 2025 am 04:41 AM

Creating meeting minutes is an essential task for many businesses, but manual labor is a huge amount of time and effort required. What's attracting attention is the automation of minutes using OpenAI's natural language processing tool ChatGPT. This article will explain in detail how to efficiently create minutes by combining transcription tools such as ChatGPT and Whisper. Automating minutes preparation offers a variety of benefits, such as significantly reducing work time and reducing internal burdens. On the other hand, appropriate operation is required, including security measures and the need for human checks.

Explaining how to create PowerPoint slides in ChatGPT! [With prompt]May 14, 2025 am 04:39 AM

Creating presentation materials requires a lot of time and effort, but using ChatGPT can improve efficiency. This article explains how to quickly and effectively create PowerPoint materials using ChatGPT. We will introduce various approaches, from automatic generation of VBA macros to using ChatGPT add-in to linking with plugins. The specific steps are explained in an easy-to-understand manner, so even beginners can easily try it out. Please use this article as a reference to experience creating next-generation presentations along with ChatGPT.

What is the task function in ChatGPT? A thorough explanation of efficient task management methods, examples of use, and promptsMay 14, 2025 am 04:35 AM

While improving daily work efficiency is a goal everyone needs, it is often faced with the difficulty of task management. This is where the focus is on ChatGPT, a new solution that utilizes AI. This article provides detailed information on the benefits of task management using ChatGPT, specific ways to use it, and how to improve efficiency by checking progress and self-reflection. It ranges from creating lists, categorization, creating schedules, and setting project milestones. Let's aim to further improve productivity with this new approach. Op

RSAC 2025: Agentic AI, Identity And The New Rules Of Cyber DefenseMay 14, 2025 am 04:34 AM

This year’s conference ended with a mix of urgency and cautious optimism. The theme, “Many Voices. One Community,” reflected a core idea: cybersecurity is moving too fast for any one group to manage alone. The field now faces bigger attack surfaces,

How do I search and connect to ChatGPT online? We also introduce ways that don't require plugins!May 14, 2025 am 04:32 AM

Some people may want to deepen their use of ChatGPT, but are troubled by outdated data. In fact, ChatGPT can utilize real-time Internet information. In this article, we will explain four main ways to connect ChatGPT to the Internet: plugins, web browsing features, Bing Copilot, and Google Gemini. By using these, you can maximize the power of ChatGPT in business and learning. OpenAI's latest AI agent, OpenAI

What is OpenAI Operator? A thorough explanation of the main functions, usage, and safety measuresMay 14, 2025 am 04:30 AM

In recent years, the evolution of AI technology has been remarkable, with major advances being made, especially in the field of AI agents. Among these, OpenAI's "Operator" has attracted a lot of attention, with innovative features that set it apart from previous agents. In this article, we will provide a detailed explanation of OpenAI Operator, from its mechanism, as well as its wide range of safety initiatives and future prospects. Click here for more information about OpenAI's latest AI agent, OpenAI Deep Research ⬇️ [ChatG

A thorough explanation of how to use ChatGPT for text miningMay 14, 2025 am 04:29 AM

Text mining using ChatGPT: Efficient data analysis Text mining, which extracts useful information from a vast amount of unstructured data, is dramatically streamlined using AI technologies such as ChatGPT. In this article, we will explain the text mining method using ChatGPT, with specific examples. Master a variety of approaches, including document summaries, keyword occurrence rate analysis, and user review classification, and aim to improve the efficiency of data analysis. table of contents What is text mining? Text mining with ChatGPT: Practice

See all articles