Use DDC to build AI networks? This might just be a beautiful illusion


PHPz

May 11, 2023 pm 01:46 PM

ai network


ChatGPT, AIGC, large models... A series of dazzling terms has emerged, and the commercial value of AI has drawn wide attention. As training models grow in scale, the data center network that supports AI computing power has also become a hot topic. Improving computing efficiency, building high-performance networks... Major vendors are showing off their strengths and working hard to open up a "new F1 track" for AI networks in the Ethernet industry. In this AI arms race, DDC has made a high-profile appearance and, seemingly overnight, become synonymous with revolutionary technology for building high-performance AI networks. But is it really as beautiful as it seems? Let us analyze it in detail and judge calmly.

Launched in 2019, DDC in essence replaces the chassis router with box routers

With the rapid growth of DCN (data center network) traffic, the need to upgrade DCI (data center interconnect) networks is becoming increasingly urgent. However, the capacity of a chassis-based (frame-type) DCI router can only be expanded within the limits of the chassis size. Such equipment also consumes a lot of power, so expanding a chassis places high demands on cabinet power and heat dissipation, and retrofitting is expensive. Against this background, in 2019 AT&T submitted a box router specification based on commercial (merchant) chips to OCP and proposed the concept of DDC (Disaggregated Distributed Chassis). Put simply, DDC uses a cluster of several low-power box devices to replace the hardware units of a chassis device, such as service line cards and fabric cards. The box devices are interconnected through cables, and the entire cluster is managed by a centralized or distributed NOS (network operating system), with the aim of breaking through the performance and power bottlenecks of single-chassis DCI equipment.

The advantages claimed for DDC include:

Breaking through the expansion limits of chassis equipment: capacity is expanded through multi-device clusters and is not limited by the chassis size;

Reducing single-point power consumption: multiple low-power box devices are deployed in a distributed manner, which solves the problem of concentrated power consumption and lowers cabinet power and heat dissipation requirements;

Improving bandwidth utilization: compared with the hash-based switching of a traditional Ethernet (ETH) network, DDC uses cell switching and cell-based load balancing, which helps improve bandwidth utilization;

Mitigating packet loss: the large buffers of the devices are used to meet the high convergence ratio requirements of DCI scenarios. First, VOQ (Virtual Output Queue) technology places the packets received from the network into separate virtual output queues. Then a credit mechanism confirms that the receiving end has enough buffer space before the packets are sent, which reduces the risk of packet loss caused by egress congestion.
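The VOQ and credit interaction described above can be sketched in a few lines of Python. This is only a toy model under loud assumptions: the class names `EgressPort` and `Ingress` are hypothetical, one credit is treated as room for exactly one packet, and real chips schedule in hardware at cell or byte granularity.

```python
from collections import deque


class EgressPort:
    """Egress side: hands out credits only while it has free buffer space."""

    def __init__(self, buffer_slots):
        self.free = buffer_slots  # assumption: one slot holds one packet

    def grant(self, requested):
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def drain(self, slots):
        # Packets leave the egress buffer onto the wire, freeing space.
        self.free += slots


class Ingress:
    """Ingress side: one virtual output queue (VOQ) per egress port."""

    def __init__(self, egress_ports):
        self.egress = egress_ports
        self.voq = {name: deque() for name in egress_ports}

    def enqueue(self, dst, packet):
        self.voq[dst].append(packet)

    def schedule(self):
        """Send only as many packets as each egress port grants credits for."""
        sent = []
        for dst, queue in self.voq.items():
            credits = self.egress[dst].grant(len(queue))
            for _ in range(credits):
                sent.append((dst, queue.popleft()))
        return sent


# Port A is congested (2 free slots), port B is not (8 free slots).
ports = {"A": EgressPort(2), "B": EgressPort(8)}
ingress = Ingress(ports)
for i in range(5):
    ingress.enqueue("A", f"a{i}")
    ingress.enqueue("B", f"b{i}")

sent = ingress.schedule()
print(sent)                    # two packets for A, all five for B
print(len(ingress.voq["A"]))   # the rest wait at the ingress instead of being dropped
```

In this sketch the packets that port A cannot accept stay queued at the ingress rather than being dropped at the egress, and traffic for port B is not held up behind them. That is also why the design leans on large ingress buffers.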


The DDC solution was only a flash in the pan in the DCI scenario

The idea seems perfect, but implementation has not been smooth sailing. DriveNets' Network Cloud product is the industry's first and only commercial DDC solution, and its entire software stack is adapted to generic white-box routers. However, no clear sales cases have been seen on the market so far. AT&T, as the proposer of the DDC architecture, carried out a gray-scale (phased) deployment of DDC on its self-built backbone network in 2020, but there has been little follow-up since. Why did this splash make so few waves? It comes down to four major defects of DDC.

Defect 1: Unreliable equipment management and control plane

The components of a chassis device interconnect their control and management planes through a highly integrated, highly reliable PCIe bus, and every device uses a dual main control board design to keep its management and control plane highly reliable. DDC, by contrast, builds a multi-device cluster and carries the cluster's management and control plane over modules and cables, which are failure-prone parts that are simply replaced when they break. Although this breaks through the scale limits of a single device, such an unreliable interconnection method brings great risk to the management and control plane. Even when just two devices are stacked, problems such as split brain and table entries falling out of sync can occur. With DDC's unreliable management and control plane, such problems are even more likely.

Defect 2: Highly complex equipment NOS

The SONiC community has already designed a distributed forwarding chassis based on the VOQ architecture and continues to iterate on it to support DDC. Although white boxes do have many deployment cases, few have taken on the challenge of a "white chassis". Building a remote "white chassis" requires considering not only the state of multiple devices in the cluster and the synchronization and management of table entries, but also the systematic implementation of many practical scenarios across multiple devices, such as version upgrades, rollbacks, and hot patches. DDC raises the complexity requirements on the cluster's NOS exponentially. There are currently no mature commercial cases in the industry, and the development risk is high.

Defect 3: Lack of a maintainability solution

Networks are unreliable, so ETH networks have built up many maintainability and fault-localization features and tools, such as the familiar INT and MOD. These tools can monitor specific flows and identify the characteristics of the flows that lose packets, so that problems can be located and troubleshot. However, the cell used by DDC is only a slice of a packet. It carries no five-tuple information such as IP addresses and cannot be associated with a specific service flow. Once packet loss occurs in DDC, current operation and maintenance methods cannot locate where the loss happened, and a maintenance solution is seriously lacking.
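To make the point about cells concrete, here is a rough Python sketch of slicing a packet into fixed-size cells. Everything in it is an assumption for illustration: the 64-byte `CELL_PAYLOAD`, the `Cell` fields, and the text-encoded stand-in for the five-tuple header are hypothetical and do not describe any vendor's actual cell format.

```python
from dataclasses import dataclass
from typing import List

CELL_PAYLOAD = 64  # hypothetical cell payload size in bytes


@dataclass
class Cell:
    dst_device: int   # fabric-level destination, not an IP address
    seq: int          # position of this slice within the original packet
    payload: bytes    # opaque slice of the packet


def slice_packet(packet: bytes, dst_device: int) -> List[Cell]:
    """Cut a packet into fixed-size cells; the last cell may be shorter."""
    return [
        Cell(dst_device, i, packet[off:off + CELL_PAYLOAD])
        for i, off in enumerate(range(0, len(packet), CELL_PAYLOAD))
    ]


def reassemble(cells: List[Cell]) -> bytes:
    """Rebuild the packet at the egress device from its cells."""
    return b"".join(c.payload for c in sorted(cells, key=lambda c: c.seq))


# Stand-in for a packet header carrying the five-tuple (illustrative encoding).
header = b"src=10.0.0.1,dst=10.0.0.2,proto=udp,sport=4791,dport=4791;"
packet = header + bytes(200)

cells = slice_packet(packet, dst_device=7)
print([header in c.payload for c in cells])  # which cells still contain the header
assert reassemble(cells) == packet
```

In this sketch only the first cell happens to contain the header bytes, and even there the fabric treats them as opaque payload. A cell dropped in the middle of the fabric therefore gives flow-based tools nothing to match on.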

Defect 4: Cost increase

To break through the chassis size limit, DDC has to interconnect the devices in the cluster through high-speed cables and modules. This costs far more than in chassis equipment, where line cards and fabric cards are interconnected through PCB traces and high-speed links, and the larger the scale, the higher the interconnection cost.

At the same time, although DDC reduces the concentration of power consumption at a single point, the overall power consumption of a DDC cluster interconnected through cables and modules is higher than that of chassis equipment. For chips of the same generation, and assuming the DDC cluster devices are interconnected by modules, the cluster consumes 30% more power than chassis equipment.

No reheating of leftovers: the DDC solution is not suitable for AI networks either

The immaturity and imperfection of the DDC solution led to its quiet exit from the DCI scenario. Now, however, it is making a comeback on the back of the AI wave. The author believes that DDC is not suitable for AI networks either, for the reasons analyzed below.

Two core demands of AI networks: high throughput and low latency

The services carried by AI networks are characterized by a small number of flows, each with high bandwidth. At the same time, the traffic is uneven, and many-to-one or many-to-many patterns occur frequently (All-to-All and All-Reduce). As a result, such networks are extremely prone to uneven traffic load, low link utilization, and packet loss caused by frequent congestion, which prevents computing power from being fully released.

DDC only solves the hash problem, while bringing many defects

DDC uses cell switching: it slices packets into cells and sends them using a polling mechanism based on reachability information. The traffic load is distributed across the links in a relatively balanced manner, which makes full use of the bandwidth and goes a long way toward solving the hash problem. Apart from this, however, DDC still has four major defects in the AI scenario.
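The difference between per-flow hashing and cell-based polling can be illustrated with a small simulation. It is a simplified sketch: the function names, the CRC32 stand-in for a switch's hash function, the 256-byte cell size, and the four example flows are all assumptions, and the polling here is a plain round-robin that ignores reachability.

```python
import zlib
from itertools import cycle


def hash_balance(flows, n_links):
    """Per-flow hashing: every byte of a flow follows the same link."""
    load = [0] * n_links
    for five_tuple, size in flows:
        link = zlib.crc32(repr(five_tuple).encode()) % n_links
        load[link] += size
    return load


def cell_spray(flows, n_links, cell_size=256):
    """Cell switching: slice each flow into cells and round-robin them over all links."""
    load = [0] * n_links
    links = cycle(range(n_links))
    for _, size in flows:
        remaining = size
        while remaining > 0:
            chunk = min(cell_size, remaining)
            load[next(links)] += chunk
            remaining -= chunk
    return load


# A few large flows, as is typical of AI training traffic.
flows = [
    (("10.0.0.1", "10.0.1.1", "udp", 4791, 4791), 1_000_000),
    (("10.0.0.2", "10.0.1.2", "udp", 4791, 4791), 1_000_000),
    (("10.0.0.3", "10.0.1.3", "udp", 4791, 4791), 1_000_000),
    (("10.0.0.4", "10.0.1.4", "udp", 4791, 4791), 1_000_000),
]

print("hash :", hash_balance(flows, n_links=4))
print("spray:", cell_spray(flows, n_links=4))
```

With hashing, each link carries whole flows, so with only a handful of flows a single collision leaves one link overloaded and another idle. With cell spraying, the load on any two links differs by no more than a few cells.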

Defect 1: The hardware depends on specific equipment, and the closed private network is not universal

Cell switching and VOQ in the DDC architecture both rely on specific hardware chips, so existing DCN network equipment cannot be reused. The rapid development of ETH networks has benefited from their plug-and-play convenience, generality, and standardization. DDC depends on specific hardware and builds a closed private network through a proprietary switching protocol, which is not universal.

Defect 2: The large cache design increases network costs and is not suitable for large-scale DCN networking

If the DDC solution enters the DCN, then in addition to the high interconnection cost it also carries the cost burden of the chip's large buffer. DCN networks currently use small-buffer devices, with a maximum of only 64 MB, whereas DDC solutions derived from DCI scenarios usually use chips with gigabytes of HBM. Compared with DCI, large-scale DCN networks are more sensitive to network cost.

Defect 3: The static network delay increases and does not match the AI scenario

The goal of a high-performance AI network that releases computing power is to shorten job completion time. DDC's large buffers hold packets, which inevitably increases the static latency of hardware forwarding. At the same time, cell switching, with its slicing, encapsulation, and reassembly of packets, adds further forwarding latency. According to comparative test data, DDC forwarding latency increases by 1.4 times compared with a traditional ETH network.

Defect 4: As the scale of the DC grows, DDC's unreliability problems worsen

Compared with replacing chassis equipment in the DCI scenario, entering the DCN requires DDC to cover a much larger cluster, at least one network POD. This means the "chassis" is stretched further and its components sit further apart. That in turn places higher demands on the reliability of the cluster's management and control plane, on the synchronization and management of the device NOS, and on POD-level operation and maintenance. DDC's various defects will multiply.

DDC is at best a transitional solution

Of course, no problem is unsolvable. If some constraints are accepted, this specific scenario can easily become a stage for major vendors to "show off their skills". But networks pursue reliability, simplicity, and efficiency, and reject complexity. Especially against the current background of "reducing staff and increasing efficiency", the cost of implementing DDC really needs to be considered.

For the problem of network load sharing in AI scenarios, many deployments have already solved it through global static or dynamic orchestration of forwarding paths. In the future, it can also be solved on the end side by network cards that support Packet Spray and out-of-order reassembly. DDC is therefore at best a short-term transitional solution.
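As a rough illustration of the end-side alternative, the following Python sketch sprays packets over several paths and restores their order at the receiver. The names `spray` and `ReorderBuffer`, the random per-path latencies, and the 0.1 time-unit send spacing are hypothetical choices for the example and do not describe any particular network card.

```python
import random


def spray(packets, n_paths, seed=0):
    """Send packets round-robin over n_paths; unequal path latency scrambles arrival order."""
    rng = random.Random(seed)
    path_delay = [rng.uniform(1.0, 5.0) for _ in range(n_paths)]  # hypothetical latencies
    arrivals = []
    for seq, payload in enumerate(packets):
        path = seq % n_paths
        arrival_time = seq * 0.1 + path_delay[path]
        arrivals.append((arrival_time, seq, payload))
    arrivals.sort()
    return [(seq, payload) for _, seq, payload in arrivals]


class ReorderBuffer:
    """Receiver side: release packets to the application strictly in sequence."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}

    def receive(self, seq, payload):
        self.pending[seq] = payload
        delivered = []
        # Drain every packet that is now contiguous with what was already delivered.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


packets = [f"p{i}" for i in range(12)]
arrived = spray(packets, n_paths=4)
print([seq for seq, _ in arrived])  # arrival order on the wire

buffer = ReorderBuffer()
delivered = []
for seq, payload in arrived:
    delivered.extend(buffer.receive(seq, payload))
assert delivered == packets  # the application still sees the original order
```

The load balancing happens per packet in the network, while the reordering cost is paid in the network card, so the switches themselves can remain ordinary Ethernet devices.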

After a deep dive, the driving force behind DDC may be DNX

Finally, let's talk about the mainstream network chip company Broadcom. Its two most familiar product series are StrataXGS and StrataDNX. XGS continues down the high-bandwidth, low-cost route, quickly launching small-buffer, high-bandwidth chips, and continues to dominate DCN market share. StrataDNX, however, carries the cost of large buffers and keeps the myth of VOQ plus cell switching alive, hoping that DDC's entry into the DC will extend its life. There appear to be no such cases in North America, so domestic DDC may be the last straw for DNX to clutch at.

Today, a large number of hardware facilities such as GPUs have been restricted to a certain extent in our country. Do we really need DDC? Let’s leave more opportunities for domestically produced devices!


This article is reproduced from 51CTO.COM.
