Tesla Dojo supercomputing architecture details disclosed for the first time! Going all out on silicon for autonomous driving

PHPz
2023-04-11 21:46:25

To meet the growing demand for artificial intelligence and machine learning models, Tesla created its own artificial intelligence technology to teach Tesla cars to drive themselves.

Recently, Tesla disclosed a large number of details about the Dojo supercomputing architecture at the Hot Chips 34 conference.

Essentially, Dojo is a giant composable supercomputer built on a completely custom architecture that covers everything from compute, networking, and input/output (I/O) silicon to the instruction set architecture (ISA), power delivery, packaging, and cooling. All of this exists to run custom, specific machine learning training algorithms at scale.

Ganesh Venkataramanan is Tesla's senior director of autonomous driving hardware. He leads the Dojo project and previously worked on AMD's CPU design team. At the Hot Chips 34 conference, he and a group of chip, system, and software engineers unveiled many of the machine's architectural features for the first time.

Data Center "Sandwich"

"Generally speaking, the way we make chips is to put them on a package, put the package on a printed circuit board, and then that goes into a system. The system goes into a rack," Venkataramanan said.

But there’s a problem with this process: every time data moves from the chip to the package and off the package, there’s latency and bandwidth loss.

To get around these limitations, Venkataramanan and his team decided to start from scratch.

Thus, Dojo’s training tiles were born.

This is a self-contained computing cluster that takes up half a cubic foot and is capable of 556TFLOPS of FP32 performance in a 15kW liquid-cooled package.

Each tile is equipped with 11GB of SRAM and is connected via a 9TB/s fabric using a custom transport protocol throughout the stack.

Venkataramanan said: "This training tile represents an unmatched level of integration from compute to memory, to power delivery, to communications, without the need for any additional switches."

The core of the training tile is Tesla’s D1, a 50 billion transistor chip based on TSMC’s 7nm process. Tesla says each D1 is capable of achieving 22TFLOPS of FP32 performance at a TDP of 400W.

Tesla then takes 25 D1s, binned as known-good dies, and packages them together using TSMC's system-on-wafer technology to achieve massive compute integration with extremely low latency and extremely high bandwidth.

However, the system-on-wafer design and vertically stacked architecture create challenges for power delivery.

According to Venkataramanan, most current accelerators place the power delivery components directly next to the silicon. He explained that this approach, while effective, means a large portion of the accelerator has to be dedicated to those components, which was impractical for Dojo. Tesla therefore chose to deliver power directly through the bottom of the die.

In addition, Tesla has also developed the Dojo Interface Processor (DIP), which is the bridge between the host CPU and the training processor.

Each DIP has 32GB of HBM, and up to five of these cards can be connected to a training tile at 900GB/s each, for a total of 4.5TB/s. That gives each tile a total of 160GB of HBM.
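
The per-tile totals follow directly from the per-card figures. The short sketch below only redoes that arithmetic; the constant and function names are made up for illustration and do not come from Tesla.

```python
# Figures quoted in the article for one Dojo Interface Processor (DIP) card.
HBM_PER_DIP_GB = 32
LINK_PER_DIP_GBPS = 900   # GB/s between one DIP card and the training tile
MAX_DIPS_PER_TILE = 5


def tile_dip_totals(num_dips: int = MAX_DIPS_PER_TILE) -> tuple[int, float]:
    """Return (total HBM in GB, total DIP-to-tile bandwidth in TB/s)."""
    if not 0 <= num_dips <= MAX_DIPS_PER_TILE:
        raise ValueError("a tile takes at most five DIP cards")
    total_hbm_gb = num_dips * HBM_PER_DIP_GB
    total_bw_tbps = num_dips * LINK_PER_DIP_GBPS / 1000
    return total_hbm_gb, total_bw_tbps


print(tile_dip_totals())  # (160, 4.5)
```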

Tesla's V1 configuration combines six of these tiles (150 D1 dies) in an array supported by four host CPUs, each equipped with five DIP cards, to achieve a claimed exaflop of BF16 or CFP8 performance.

Software

Such a specialized computing architecture requires a specialized software stack. However, Venkataramanan and his team recognized that programmability would determine Dojo's success or failure.

"When we design these systems, ease of programmability by our software counterparts is paramount. Researchers won't wait for your software folks to write a handwritten kernel to accommodate a new algorithm that we want to run."

To achieve this, Tesla abandoned the idea of using handwritten kernels and designed Dojo's architecture around the compiler.

"What we do is use PyTorch. We create an intermediate layer that helps us parallelize to scale out across the hardware underneath it. Underneath everything is compiled code. This is the only way to create a software stack that can adapt to any future workload."

Despite emphasizing the flexibility of the software, Venkataramanan pointed out that the platform, which is running in their lab today, is for now limited to Tesla's own use.

Dojo Architecture Overview

After reading the above, let us take a deeper look at the Dojo architecture.

Tesla has an exascale artificial intelligence system for machine learning. Tesla has enough capital to hire staff and build chips and systems specifically for its own applications, just as it does for its in-car systems.

Tesla is not only building its own AI chip, but also a supercomputer.

Distributed system analysis

Each Dojo node has its own CPU, memory, and communication interface.

Dojo node

This is the processing pipeline of the Dojo processor.

Processing Pipeline

Each node has 1.25MB of SRAM. In AI training and inference chips, a common technique is to co-locate memory with computation to minimize data transfers, which are very expensive from a power and performance perspective.

Node memory

Each node is then connected into a 2D mesh.

Network Interface
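
To make the 2D mesh idea concrete, here is a minimal sketch of how nodes in such a mesh relate to their neighbors. It is an illustration only: the (row, column) coordinates, the function names, and the example mesh size are assumptions, not details from Tesla's presentation.

```python
def mesh_neighbors(row: int, col: int, rows: int, cols: int) -> list[tuple[int, int]]:
    """Return the in-bounds north, south, west, and east neighbors of (row, col)."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def hop_count(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Minimum number of mesh hops between two nodes (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


# A corner node has two neighbors; an interior node has four.
print(mesh_neighbors(0, 0, rows=4, cols=4))  # [(1, 0), (0, 1)]
print(hop_count((0, 0), (3, 2)))             # 5
```

In a mesh like this, the distance a message travels grows with the number of hops, which is why placing data close to the node that uses it matters.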

This is an overview of the data path.

Data Path

Here is an example of what the chip can do with list parsing.

List parsing

Here is more on the instruction set, which is Tesla's own design rather than a typical Intel, Arm, NVIDIA, or AMD CPU/GPU instruction set.

Instruction set

In artificial intelligence, arithmetic formats matter a great deal, especially which formats a chip supports. With Dojo, Tesla can work with common industry formats such as FP32, FP16, and BFP16.

Arithmetic Format

Tesla is also working with a configurable FP8, or CFP8, which comes in 4/3 and 5/2 range options. This is similar to what the NVIDIA H100 Hopper does with FP8. The Untether.AI Boqueria AI accelerator, with its 1,458 RISC-V cores, likewise focuses on different FP8 types.

Arithmetic Format 2
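
The trade-off between the two options can be illustrated with a small decoder. This is a sketch under stated assumptions, not Tesla's definition of CFP8: it reads "4/3" and "5/2" as exponent/mantissa bit counts after one sign bit, treats every bit pattern as a finite number (no infinity or NaN encodings), and uses bias values chosen purely for illustration. The function name `decode_fp8` is hypothetical.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode an 8-bit value laid out as sign | exponent | mantissa.

    Assumption: every encoding is a finite number, and exponent field 0
    denotes subnormals (no implicit leading 1).
    """
    if 1 + exp_bits + man_bits != 8:
        raise ValueError("sign + exponent + mantissa must total 8 bits")
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)


# Largest positive value in each layout, with illustrative biases.
print(decode_fp8(0b0_1111_111, exp_bits=4, man_bits=3, bias=7))    # 480.0
print(decode_fp8(0b0_11111_11, exp_bits=5, man_bits=2, bias=15))   # 114688.0
# Changing the bias slides the representable range without changing the layout.
print(decode_fp8(0b0_1111_111, exp_bits=4, man_bits=3, bias=3))    # 7680.0
```

Under these assumptions, the 4/3 layout offers finer steps between values, the 5/2 layout covers a wider range, and a configurable bias lets the range be shifted to suit the data.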

Dojo also has a separate CFP16 format for higher precision. In all, Dojo supports FP32, BFP16, CFP8, and CFP16.

Arithmetic Format 3

These cores are then integrated into a manufactured die. Tesla's D1 chip is manufactured by TSMC on a 7nm process. Each chip has 354 Dojo processing nodes and 440MB of SRAM.

First integration box: the D1 die
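
The SRAM figures quoted for the node, the die, and the tile are consistent with one another, as a quick calculation shows. The variable names below are illustrative, and decimal units (1GB = 1000MB) are assumed.

```python
# Per-node and per-die figures quoted in the article.
NODES_PER_D1 = 354
SRAM_PER_NODE_MB = 1.25
D1_PER_TILE = 25

sram_per_d1_mb = NODES_PER_D1 * SRAM_PER_NODE_MB         # 442.5, quoted as 440MB
sram_per_tile_gb = sram_per_d1_mb * D1_PER_TILE / 1000   # 11.0625, quoted as 11GB

print(sram_per_d1_mb, sram_per_tile_gb)  # 442.5 11.0625
```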

These D1 chips are packaged onto a Dojo training tile. The D1 dies are tested and then assembled into a 5×5 tile. Each tile has 4.5TB/s of bandwidth per edge. It also has a power delivery envelope of 15kW per module, or roughly 600W per D1 chip, minus whatever the 40 I/O dies use. The comparison shows why something like Lightmatter Passage would be attractive to a company that does not want to design such a thing itself.

Second integration box: the Dojo training tile
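
The "roughly 600W per D1" figure is simply the tile envelope divided by the number of dies, so it is an upper bound rather than a measured draw. The sketch below redoes that division and compares it with the 400W TDP quoted for the D1; the names are illustrative, and the article does not break down how the remaining budget is spent.

```python
TILE_POWER_W = 15_000   # power delivery envelope per training tile
D1_PER_TILE = 25        # 5 x 5 dies per tile
D1_TDP_W = 400          # TDP quoted for a single D1

# Upper bound per die if the whole envelope went to the D1s.
power_per_d1_upper_bound_w = TILE_POWER_W / D1_PER_TILE

# Budget left over when every D1 runs at its quoted TDP. This has to cover
# the I/O dies and everything else on the tile.
remaining_budget_w = TILE_POWER_W - D1_PER_TILE * D1_TDP_W

print(power_per_d1_upper_bound_w, remaining_budget_w)  # 600.0 5000
```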

The Dojo interface processors sit at the edge of the 2D mesh. Each training tile has 11GB of SRAM and 160GB of shared DRAM.

Dojo system topology

Here are the bandwidth figures for the 2D mesh that connects the processing nodes.

Dojo system communication: logical 2D mesh

Each DIP provides a 32GB/s link to the host system.

Dojo system communication: PCIe links between DIPs and hosts

Tesla also has Z-plane links for longer routes. In the rest of the talk, Tesla discussed system-level innovations.

Communication mechanism

These are the latency boundaries for dies and tiles, which is why the two are handled differently in Dojo. The Z-plane links are needed because long routes are expensive.

Dojo system communication mechanism

Any processing node can access data across the system. Each node can push data to, or pull data from, SRAM or DRAM.

Dojo system bulk communication

Dojo uses a flat addressing scheme for communication.

System Network 1

The chips can route around faulty processing nodes in software.

System Network 2

This means that the software must understand the system topology.

System Network 3
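
Routing around faulty nodes in software, given knowledge of the topology, can be sketched with a breadth-first search over a 2D mesh. This is a generic textbook technique used here for illustration, not Tesla's routing algorithm; the function name, the coordinate scheme, and the example mesh size are assumptions.

```python
from collections import deque


def route_around_faults(rows, cols, src, dst, faulty):
    """Shortest path from src to dst on a rows x cols mesh that avoids faulty nodes.

    Nodes are (row, col) tuples and `faulty` is a set of such tuples.
    Returns the path as a list of nodes, or None if dst is unreachable.
    """
    if src in faulty or dst in faulty:
        return None
    parent = {src: None}   # doubles as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:   # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        r, c = node
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            in_bounds = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if in_bounds and nxt not in faulty and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# The direct route along the top row is blocked, so the path detours via row 1.
path = route_around_faults(3, 3, src=(0, 0), dst=(0, 2), faulty={(0, 1)})
print(path)  # [(0, 0), (1, 0), (1, 1), (1, 2), (0, 2)]
```

The search only works because the caller supplies the mesh dimensions and the set of faulty nodes, which is the sense in which software has to know the topology.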

Dojo does not guarantee end-to-end traffic ordering, so packets need to be counted at the destination.

System Network 4

Here is how packet counting forms part of system synchronization.

System synchronization
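
The idea of counting packets at the destination, rather than relying on arrival order, can be sketched as follows. This is a simplified illustration and not Dojo's actual mechanism: the `CountingReceiver` class, the (offset, value) packet shape, and the use of a shuffled list to stand in for an unordered fabric are all assumptions.

```python
import random


class CountingReceiver:
    """Destination that knows how many packets to expect and counts arrivals."""

    def __init__(self, expected_packets: int):
        self.expected = expected_packets
        self.received = 0
        self.buffer = [None] * expected_packets

    def deliver(self, offset: int, value) -> bool:
        """Store one packet at its offset; return True once all have arrived."""
        self.buffer[offset] = value
        self.received += 1
        return self.received == self.expected


packets = [(i, i * i) for i in range(8)]
random.shuffle(packets)   # stand-in for a fabric that may reorder packets

receiver = CountingReceiver(len(packets))
done_flags = [receiver.deliver(offset, value) for offset, value in packets]

# Only the last arrival, whichever packet that is, signals completion.
print(done_flags)
print(receiver.buffer)
```

Because each packet carries its own destination offset, the data ends up in the right place regardless of arrival order, and the count alone tells the receiver when it is safe to proceed.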

The compiler needs to define a tree.

System synchronization 2

Tesla said that one exa-pod has more than 1 million CPUs (or compute nodes). These are large systems.
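
The per-die and per-tile figures quoted earlier give a sense of what "more than 1 million" nodes implies. The sketch below uses only numbers from this article and works out how many training tiles it takes to pass that mark; it does not claim to be the actual exa-pod configuration, and the variable names are illustrative.

```python
import math

NODES_PER_D1 = 354
D1_PER_TILE = 25

nodes_per_tile = NODES_PER_D1 * D1_PER_TILE                    # 8850
tiles_for_one_million = math.ceil(1_000_000 / nodes_per_tile)  # 113

print(nodes_per_tile, tiles_for_one_million)  # 8850 113
```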

Summary

Tesla built Dojo specifically to work at scale. Typically, startups aim to build systems with one or a few AI chips each. Tesla is clearly focused on a much larger scale.

In many ways, it makes sense for Tesla to have a huge AI training ground. What is even more exciting is that, rather than relying only on commercially available systems, it is building its own chips and systems. Part of the scalar-side ISA is borrowed from RISC-V, but Tesla customized the vector side and much of the rest of the architecture, which required a great deal of work.

Statement:
This article is reproduced from 51cto.com. In case of any infringement, please contact admin@php.cn to request removal.