


Google deployed TPU v4, then its most powerful AI chip, in its own data centers as early as 2020, but it was not until April 4 of this year that it disclosed the technical details of this AI supercomputer for the first time.
Paper address: https://arxiv.org/abs/2304.01433
Compared with TPU v3, TPU v4 delivers 2.1x the performance, and when 4,096 chips are combined into a supercomputer, performance rises 10x.
Google also claims its chip is faster and more energy-efficient than NVIDIA's A100: in the paper, Google states that for systems of comparable size, TPU v4 delivers 1.7x the performance of the A100 while improving energy efficiency by 1.9x. In addition, Google's supercomputer is roughly 4.3x to 4.5x faster than the Graphcore IPU Bow.
Google demonstrated the TPU v4 package, as well as four packages mounted on a circuit board.
Like TPU v3, each TPU v4 contains two TensorCores (TCs). Each TC contains four 128x128 matrix multiply units (MXUs), a vector processing unit (VPU) with 128 lanes (16 ALUs per lane), and 16 MiB of vector memory (VMEM).
The two TCs share 128 MiB of common memory (CMEM).
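From these unit counts one can sketch the chip's peak throughput. The clock frequency below is an assumption drawn from Google's publicly reported TPU v4 figure (~1.05 GHz), not from this article; the result lands near the published ~275 TFLOPS bf16 peak.

```python
# Back-of-the-envelope peak throughput for one TPU v4 chip, using the
# unit counts above. CLOCK_HZ is an assumed value from public specs.
TCS_PER_CHIP = 2          # TensorCores per chip
MXUS_PER_TC = 4           # matrix multiply units per TensorCore
MACS_PER_MXU = 128 * 128  # one 128x128 systolic array
FLOPS_PER_MAC = 2         # a multiply-accumulate counts as 2 FLOPs
CLOCK_HZ = 1.05e9         # assumed ~1.05 GHz clock

peak_flops = (TCS_PER_CHIP * MXUS_PER_TC * MACS_PER_MXU
              * FLOPS_PER_MAC * CLOCK_HZ)
print(f"Estimated peak: {peak_flops / 1e12:.0f} TFLOPS")  # ~275 TFLOPS
```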
It is worth noting that the A100 and Google's fourth-generation TPU launched at around the same time, so how does their performance compare?
Google reported the fastest performance of each DSA on five MLPerf benchmarks: BERT, ResNet, DLRM, RetinaNet, and Mask R-CNN. Of these, Graphcore submitted IPU results for BERT and ResNet.
The results of the two systems on ResNet and BERT are shown below. The dotted lines between the points are interpolations based on the number of chips.
MLPerf results for both TPU v4 and A100 scale to larger systems than the IPU (4096 chips vs. 256 chips).
For similarly sized systems, TPU v4 is 1.15x faster than the A100 on BERT and approximately 4.3x faster than the IPU; on ResNet, TPU v4 is 1.67x and about 4.5x faster, respectively.
For power usage on the MLPerf benchmark, the A100 used 1.3x to 1.9x more power on average.
Does peak FLOPS predict actual performance? Many in the machine learning field believe peak floating-point throughput is a good proxy for performance, but in practice it is not.
For example, TPU v4 is 4.3x to 4.5x faster than IPU Bow on two MLPerf benchmarks at the same system size, despite holding only a 1.10x advantage in peak FLOPS.
Similarly, the A100's peak FLOPS is 1.13x that of TPU v4, yet for the same number of chips, TPU v4 is 1.15x to 1.67x faster.
The following figure uses the Roofline model to show the relationship between peak FLOPS and memory bandwidth.
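The Roofline idea can be sketched in a few lines: attainable throughput is capped either by the compute roof (peak FLOPS) or by memory bandwidth times arithmetic intensity (FLOPs performed per byte moved). The peak and bandwidth numbers below are illustrative placeholders, not measurements from the paper.

```python
# Minimal Roofline model. Attainable FLOP/s is the lower of the
# compute roof and the memory-bandwidth slope at a given intensity.
def attainable_flops(peak_flops, mem_bw_bytes, intensity_flops_per_byte):
    return min(peak_flops, mem_bw_bytes * intensity_flops_per_byte)

PEAK = 275e12  # illustrative peak, FLOP/s
BW = 1.2e12    # illustrative HBM bandwidth, bytes/s

# A low-intensity kernel is memory-bound...
print(attainable_flops(PEAK, BW, 10) / 1e12)    # 12.0 TFLOP/s
# ...while a high-intensity one hits the compute roof.
print(attainable_flops(PEAK, BW, 1000) / 1e12)  # 275.0 TFLOP/s
```

This is why two chips with similar peak FLOPS can deliver very different real performance: the benchmark's arithmetic intensity decides which roof applies.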
So, the question is, why doesn’t Google compare with Nvidia’s latest H100?
Google said it did not compare its fourth-generation product to NVIDIA's current flagship H100 because the H100 was built with newer technology and launched after Google's chip.
However, Google hinted that it is developing a new TPU to compete with the NVIDIA H100, without providing details. Google researcher Jouppi told Reuters that Google has "a production line for future chips."
TPU vs GPU
While ChatGPT and Bard battle it out, two behemoths are working hard behind the scenes to keep them running: NVIDIA's CUDA-enabled GPUs (graphics processing units) and Google's custom TPUs (tensor processing units).
In other words, this is no longer about ChatGPT vs. Bard, but TPU vs. GPU, and how efficiently they can do matrix multiplication.
Thanks to their hardware architecture, NVIDIA's GPUs are ideally suited to matrix multiplication, efficiently parallelizing the work across many CUDA cores.
Since 2012, training models on GPUs has therefore been the consensus in deep learning, and it remains so today.
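The reason matrix multiplication parallelizes so well can be seen in a toy sketch: every output element depends only on one row of the first matrix and one column of the second, so all outputs can be computed independently, which is exactly what thousands of CUDA cores (or a TPU's systolic arrays) exploit.

```python
# Each output element (i, j) of a matrix product is independent of the
# others, so a GPU can assign one (i, j) pair per thread.
def matmul(a, b):
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))  # [[19, 22], [43, 50]]
```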
With the launch of NVIDIA DGX, NVIDIA can offer a one-stop hardware-and-software solution for almost any AI task, something competitors cannot match for lack of the relevant intellectual property.
In contrast, Google launched the first-generation tensor processing unit (TPU) in 2016, a custom ASIC (application-specific integrated circuit) optimized for its own TensorFlow framework. This gives the TPU an advantage in AI workloads beyond matrix multiplication, and it can even accelerate fine-tuning and inference tasks.
In addition, researchers at Google DeepMind have found a way to discover better matrix-multiplication algorithms: AlphaTensor.
However, even though Google has achieved good results with its in-house technology and emerging AI-computing optimizations, the long-term, deep collaboration between Microsoft and NVIDIA, building on each company's accumulated industry expertise, has expanded both parties' competitive advantages at the same time.
Fourth generation TPU
Back at the 2021 Google I/O conference, Pichai announced Google's latest-generation AI chip, TPU v4, for the first time.
"This is the fastest system we have ever deployed at Google, and a historic milestone for us."
This improvement has become a key point of competition among companies building AI supercomputers, because large language models like Google's Bard or OpenAI's ChatGPT have seen explosive growth in parameter scale.
This means they far exceed what a single chip can store, and their demand for computing power is a huge "black hole."
So these large models have to be distributed across thousands of chips, and then those chips have to work together for weeks, or even longer, to train the models.
Currently, PaLM, the largest language model Google has publicly disclosed, has 540 billion parameters; it was trained over 50 days, split across two 4,000-chip supercomputers.
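A rough calculation shows why such a model cannot live on one chip: its weights alone dwarf a single accelerator's memory. The 32 GiB HBM figure for TPU v4 is an assumption taken from public specs, and bf16 (2 bytes per parameter) is an assumed storage format; optimizer state and activations would push the real requirement far higher.

```python
# Why a 540B-parameter model must be sharded across many chips.
# HBM_PER_CHIP (32 GiB) and bf16 storage are assumptions, not figures
# from this article.
PARAMS = 540e9
BYTES_PER_PARAM = 2        # bf16
HBM_PER_CHIP = 32 * 2**30  # 32 GiB, assumed

weight_bytes = PARAMS * BYTES_PER_PARAM
chips_for_weights = weight_bytes / HBM_PER_CHIP
print(f"{weight_bytes / 1e12:.2f} TB of weights -> "
      f"at least {chips_for_weights:.0f} chips just to hold them")
```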
Google said its supercomputers can easily reconfigure the connections between chips to avoid problems and perform performance tuning.
Google researcher Norm Jouppi and Google Distinguished Engineer David Patterson wrote in a blog post about the system:
"Circuit switching makes it easy to route around failed components. This flexibility even allows us to change the topology of the supercomputer interconnect to accelerate the performance of machine-learning models."
Although Google is only now releasing details of its supercomputer, it has been online since 2020 in a data center in Mayes County, Oklahoma.
Google said that Midjourney used this system to train its model, whose latest version, V5, has shown everyone its astonishing image generation.
Recently, Pichai said in an interview with the New York Times that Bard will be transferred from LaMDA to PaLM.
Now, with the TPU v4 supercomputer behind it, Bard will only grow stronger.
