DeepSeek #OpenSourceWeek Day 6: Inference System Overview

On Day 6 of #OpenSourceWeek, DeepSeek presented an in-depth overview of the DeepSeek-V3/R1 inference system. This article digs into the system’s design principles, optimization strategies, and performance statistics, highlighting the significant advances made in throughput and latency optimization.

Table of contents

  • System Design Principles
  • Prefilling and Decoding Phases
  • Diagram of DeepSeek’s Online Inference System
  • Performance Statistics
  • Conclusion

System Design Principles

The primary objectives of the DeepSeek-V3/R1 inference system are higher throughput and lower latency. To meet these goals, DeepSeek implemented a sophisticated architecture that leverages cross-node Expert Parallelism (EP). This approach not only improves the efficiency of GPU matrix computations but also optimizes overall system performance.

Expert Parallelism (EP)

  • Batch Size Scaling: EP allows for significant scaling of the batch size, which is crucial for maximizing GPU utilization and throughput.
  • Memory Access Reduction: By distributing experts across multiple GPUs, each GPU processes only a small subset of experts, which reduces memory access demands and consequently lowers latency.

However, the implementation of EP introduces complexities, particularly in terms of cross-node communication and the need for effective load balancing across different Data Parallelism (DP) instances.
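To make the idea concrete, here is a minimal, purely illustrative sketch of expert-parallel routing: tokens are assigned to experts, and each (token, expert) pair is bucketed by the GPU that owns that expert, so every device touches only a small subset of experts. The sizes, the random gate, and the function names are assumptions for illustration; the real system shards experts across nodes and moves activations with all-to-all communication.

```python
# Toy sketch of cross-node Expert Parallelism (EP): each GPU owns only a small
# slice of the routed experts, so routed tokens are bucketed by owner GPU.
# Sizes and the random gate are illustrative assumptions, not DeepSeek's values.
import numpy as np

NUM_EXPERTS = 256                           # total routed experts (illustrative)
NUM_GPUS = 32                               # EP degree, e.g. EP32 during prefill
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS   # each GPU holds just 8 experts

def route_tokens(num_tokens: int, top_k: int = 8) -> dict:
    """Assign each token to top_k experts with a random stand-in gate, then
    bucket the (token, expert) pairs by the GPU that owns each expert."""
    rng = np.random.default_rng(0)
    per_gpu = {gpu: [] for gpu in range(NUM_GPUS)}
    for tok in range(num_tokens):
        for expert in rng.choice(NUM_EXPERTS, size=top_k, replace=False):
            owner = int(expert) // EXPERTS_PER_GPU   # GPU holding this expert
            per_gpu[owner].append((tok, int(expert)))
    return per_gpu

buckets = route_tokens(1024)
# Each GPU sees only ~1/NUM_GPUS of the (token, expert) work on average.
print(len(buckets[0]), "token-expert pairs routed to GPU 0")
```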

Addressing Challenges of EP

To tackle these challenges, DeepSeek focuses on three key strategies:

  • Scaling Batch Size: By ensuring a sufficiently large overall batch size, the system can maintain high throughput and low latency, even with the model’s inherent sparsity.
  • Hiding Communication Latency: The team employs a dual-batch overlap strategy during the prefill and decode phases, executing microbatches alternately so that communication costs are hidden behind computation.
  • Load Balancing: The team strives to balance computational and communication loads across all GPUs so that no single GPU becomes a bottleneck (a toy balancing sketch follows this list).
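As a toy illustration of the load-balancing goal, the sketch below greedily places expert replicas on whichever GPU currently has the least expected load. This is not DeepSeek’s production balancing algorithm; the loads, GPU count, and function name are assumptions.

```python
# Toy greedy load balancer: place the heaviest experts first, each on the GPU
# with the least accumulated load, so no device becomes a bottleneck. This is
# an illustration only, not DeepSeek's production balancing algorithm.
import heapq

def balance_experts(expert_loads, num_gpus):
    """expert_loads[i] = expected token load of expert i; returns the list of
    expert ids assigned to each GPU."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]      # (current load, gpu id)
    heapq.heapify(heap)
    placement = [[] for _ in range(num_gpus)]
    for expert in sorted(range(len(expert_loads)), key=lambda i: -expert_loads[i]):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (load + expert_loads[expert], gpu))
    return placement

# Example: 16 experts with skewed loads spread over 4 GPUs.
print(balance_experts([10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 1, 1, 1, 1, 1, 1], 4))
```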

Prefilling and Decoding Phases

The architecture of DeepSeek-V3/R1 employs different degrees of parallelism during the prefill and decode phases (a quick arithmetic check follows the list):

  • Prefilling Phase: Utilizes Routed Expert EP32 and MLA/Shared Expert DP32, with each deployment unit spanning 4 nodes and 32 redundant routed experts.
  • Decoding Phase: Employs Routed Expert EP144 and MLA/Shared Expert DP144, with each deployment unit spanning 18 nodes.
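Assuming 8 GPUs per H800 node (an assumption; the post does not spell out the per-node GPU count), the deployment-unit sizes line up with the stated parallelism degrees:

```python
# Deployment-unit arithmetic, assuming 8 GPUs per H800 node (an assumption;
# the post does not state the per-node GPU count explicitly).
GPUS_PER_NODE = 8
prefill_nodes, decode_nodes = 4, 18

print("prefill parallelism degree:", prefill_nodes * GPUS_PER_NODE)   # 32  -> EP32 / DP32
print("decode parallelism degree: ", decode_nodes * GPUS_PER_NODE)    # 144 -> EP144 / DP144
```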

Communication-Computation Overlapping

To optimize throughput, DeepSeek developed a communication-computation overlapping mechanism. During the prefilling phase, the system alternates between two microbatches, hiding the communication cost of one microbatch behind the computation of the other. In the decoding phase, it subdivides the attention layer into two steps and uses a 5-stage pipeline to achieve seamless overlapping.
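The schedule can be illustrated with a toy Python sketch: while one microbatch computes, the other microbatch’s all-to-all communication runs concurrently, so its cost is hidden. Threads and sleeps stand in for CUDA streams and custom dispatch kernels; the timings and function names are made up for illustration.

```python
# Toy illustration of dual-microbatch overlap: while microbatch A computes,
# microbatch B's all-to-all communication runs concurrently, hiding its cost.
# Threads and sleeps stand in for CUDA streams and custom dispatch kernels.
import threading
import time

def compute(mb):
    time.sleep(0.10)                       # stand-in for attention / MoE compute
    print(f"computed microbatch {mb}")

def communicate(mb):
    time.sleep(0.08)                       # stand-in for all-to-all dispatch/combine
    print(f"communicated microbatch {mb}")

def overlapped_step(compute_mb, comm_mb):
    comm = threading.Thread(target=communicate, args=(comm_mb,))
    comm.start()                           # launch B's communication asynchronously
    compute(compute_mb)                    # compute A in the meantime
    comm.join()                            # communication finishes under the compute

start = time.time()
for a, b in [(0, 1), (1, 0)]:              # alternate the two microbatches
    overlapped_step(a, b)
print(f"elapsed ≈ {time.time() - start:.2f}s (compute-bound; communication hidden)")
```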

Day 6 of #OpenSourceWeek: One More Thing – DeepSeek-V3/R1 Inference System Overview

Optimized throughput and latency via:
• Cross-node EP-powered batch scaling
• Computation-communication overlap
⚖️ Load balancing

Statistics of DeepSeek's Online Service:
⚡ 73.7k/14.8k…

— DeepSeek (@deepseek_ai) March 1, 2025

Diagram of DeepSeek’s Online Inference System

(Diagram: DeepSeek’s online inference system)

This diagram depicts a system with two main components: Prefill and Decode services, each managed by load balancers for parallel processing. The API Server directs requests to these services. Both services utilize an optional external key-value cache (KVCache) for storage. The system is designed for efficient and scalable handling of API requests through parallel processing and caching.
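A minimal sketch of that flow, with all class and function names hypothetical: the prefill service processes the prompt once and stores its KV cache in the external store, and the decode service then reuses that cache while generating tokens.

```python
# Hypothetical sketch of the prefill/decode split with an external KV cache.
# Class and function names are invented to mirror the flow in the diagram.
from typing import Optional

class KVCacheStore:
    """Stand-in for the optional external, on-disk key-value cache."""
    def __init__(self):
        self._store = {}
    def put(self, key: str, kv: list) -> None:
        self._store[key] = kv
    def get(self, key: str) -> Optional[list]:
        return self._store.get(key)

def prefill(prompt: str, cache: KVCacheStore) -> str:
    """Prefill service: process the whole prompt once and persist its KV cache."""
    key = f"kv:{hash(prompt)}"
    if cache.get(key) is None:                        # cache miss: compute the KV entries
        cache.put(key, [float(len(tok)) for tok in prompt.split()])
    return key

def decode(cache_key: str, cache: KVCacheStore, max_tokens: int = 3) -> list:
    """Decode service: generate tokens one at a time, reusing the stored cache."""
    kv = cache.get(cache_key)
    assert kv is not None, "prefill must run before decode"
    return [f"<token_{i}>" for i in range(max_tokens)]

store = KVCacheStore()
key = prefill("explain expert parallelism", store)    # routed to the prefill service
print(decode(key, store))                             # routed to the decode service
```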

Performance Statistics

The performance of the DeepSeek-V3/R1 inference system has been impressive. Over a 24-hour period, the system achieved the following statistics (a quick consistency check follows the list):


  • Total Input Tokens: 608 billion, with 342 billion (56.3%) hitting the on-disk KV cache.
  • Total Output Tokens: 168 billion, with an average output speed of 20–22 tokens per second.
  • Average Throughput: Each H800 node delivered approximately 73.7k tokens/s for input and 14.8k tokens/s for output.
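A quick consistency check using only the figures quoted above:

```python
# Consistency check using only the figures quoted above (all rounded).
input_tokens  = 608e9
cached_tokens = 342e9
output_tokens = 168e9

print(f"KV-cache hit rate: {cached_tokens / input_tokens:.1%}")
# -> 56.2% here; the post quotes 56.3% because the token counts are rounded.
print(f"aggregate output rate: {output_tokens / 86_400 / 1e6:.2f}M tokens/s across all requests")
# (the 20-22 tokens/s figure is per request, not this aggregate number)
```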

Cost and Revenue Analysis

The operational costs and revenue generated by the DeepSeek-V3/R1 system are noteworthy. The total daily cost for running the inference services, assuming a leasing cost of $2 per hour per H800 GPU, amounted to $87,072.

If all tokens were billed at DeepSeek-R1’s pricing, the theoretical total daily revenue would be $562,027, resulting in a remarkable cost-profit margin of 545%. The pricing structure, used in the worked reconstruction after the list, is as follows:

  • R1 Pricing:
    • $0.14/M for input tokens (cache hit)
    • $0.55/M for input tokens (cache miss)
    • $2.19/M for output tokens
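Using the published token counts and the R1 prices above, the theoretical revenue and margin can be roughly reconstructed; the result does not match the quoted figures to the dollar because the published counts are rounded.

```python
# Rough reconstruction of the theoretical daily revenue from the published
# token counts and R1 prices; it will not match $562,027 exactly because the
# published counts are rounded.
cache_hit_tokens  = 342e9            # input tokens served from the KV cache
cache_miss_tokens = 608e9 - 342e9    # remaining input tokens
output_tokens     = 168e9
daily_cost        = 87_072.0         # quoted cost at $2/hour per H800 GPU

revenue = (cache_hit_tokens  / 1e6 * 0.14
           + cache_miss_tokens / 1e6 * 0.55
           + output_tokens     / 1e6 * 2.19)
print(f"theoretical daily revenue: ${revenue:,.0f}")                     # ≈ $562,100
print(f"cost-profit margin: {(revenue - daily_cost) / daily_cost:.0%}")  # ≈ 545%
```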

However, actual revenue is lower due to several factors:

  • DeepSeek-V3’s pricing is significantly lower than R1.
  • Only a subset of services are monetized, with web and app access remaining free.
  • Nighttime discounts are applied during off-peak hours.


Graph Overview

  • Datasets: The graph displays cost (in yellow) and theoretical income (in blue) over 24 hours, from 12:00 to 12:00.
  • Data Trends: Theoretical income shows significant peaks during certain hours, indicating higher potential earnings, while costs remain relatively stable and low in comparison.
  • Time Analysis: Cost remains consistently low, suggesting efficient operations, while theoretical income fluctuates, hinting at varying levels of engagement or activity.

Note: The theoretical income is based on API pricing calculations and does not reflect actual earnings.

For a detailed analysis, please refer to the Day 6 GitHub repository.

Previous Updates:

  • Day 1: Release of FlashMLA
  • Day 2: Release of DeepEP
  • Day 3: Release of DeepGEMM
  • Day 4: Optimized Parallelism Strategies
  • Day 5: Launch of 3FS and Smallpond Framework

Conclusion

The DeepSeek-V3/R1 inference system represents a significant advancement in the field of artificial intelligence, particularly in optimizing throughput and latency. Through the innovative use of cross-node Expert Parallelism, effective load balancing, and communication-computation overlapping, DeepSeek has achieved impressive performance metrics.

As DeepSeek continues to refine its systems and share insights with the community, it is contributing to the broader goal of artificial general intelligence (AGI). The insights gained from this week will not only deepen our understanding but also pave the way for future innovations in AI technology.

DeepSeek encourages the community to engage with these resources, as they provide valuable insight into the ongoing development of the DeepSeek project and its implications for the future of AI.
