
Microsoft revealed that it spent hundreds of millions of dollars to assemble a supercomputer for OpenAI to develop ChatGPT, using tens of thousands of Nvidia chips

王林 (forwarded) · 2023-04-12


News on March 14: Microsoft revealed in a blog post on Monday, US local time, that it had spent hundreds of millions of dollars to assemble an AI supercomputer for OpenAI to help develop the popular chatbot ChatGPT. The supercomputer uses tens of thousands of Nvidia A100 graphics chips, which has allowed OpenAI to train increasingly powerful AI models.

OpenAI has been training ever-larger AI models, which take in more data and learn more and more parameters, the variables an AI system figures out through training and retraining. That means OpenAI needs sustained access to powerful cloud computing services.
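To make "parameters" concrete, here is a minimal, hypothetical sketch in PyTorch (a framework the article does not mention; the model below is a toy, not anything OpenAI trains). Parameters are the trainable variables a model adjusts during training, and counting them is how model size is usually quoted.

```python
# Minimal sketch: "parameters" are the trainable variables a model learns.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
)

num_params = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {num_params:,}")
# ~8.4 million here; GPT-class language models have billions.
```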

To meet that need, when Microsoft invested $1 billion in OpenAI in 2019, it agreed to assemble a massive, cutting-edge supercomputer for the AI research startup. The problem was that Microsoft had nothing like what OpenAI needed, and was not entirely sure it could build such a massive machine without disrupting its Azure cloud service.

To make it work, Microsoft had to find ways to link together tens of thousands of Nvidia's A100 graphics chips, the workhorse for training AI models, and to rethink how servers are placed on racks to prevent power outages. Scott Guthrie, Microsoft's executive vice president for cloud computing and AI, did not disclose the project's exact cost, but suggested it was in the hundreds of millions of dollars.

Nidhi Chappell, general manager of Azure AI infrastructure at Microsoft, said: "We built a system architecture that can operate at very large scale and be very reliable, which is an important reason for ChatGPT's success. It is just one model that came out of this; there will be many, many others."

It was on this infrastructure that OpenAI released ChatGPT, the popular chatbot that attracted more than 1 million users within days of launching last November and is now being folded into other companies' business models. As enterprise and consumer interest grows in generative AI (AIGC) tools like ChatGPT, cloud providers such as Microsoft, Amazon, and Google will face greater pressure to ensure their data centers can supply the enormous computing power required.

Meanwhile, Microsoft has also begun using the infrastructure it built for OpenAI to train and run its own large AI models, including the new Bing search chatbot launched last month, and it sells the system to other customers. The software giant is already working on the next generation of AI supercomputers as part of its expanded partnership with OpenAI, a deal in which Microsoft invested an additional $10 billion.

Guthrie said in an interview: "We didn't build this as something custom for OpenAI. Although it started out custom, we always built it in a general way so that anyone who wants to train a large language model can take advantage of the same improvements. That has really helped us become a better AI cloud at broader scale."

Training a model requires a large number of interconnected graphics processing units, like those in the AI supercomputer Microsoft assembled. Once a model is in use, answering every question users pose, a process called inference, requires a slightly different setup. Microsoft therefore also deploys graphics chips for inference, but those processors, thousands of them, are geographically dispersed across the company's more than 60 data centers. The company said it is now adding Nvidia's latest H100 graphics chip for AI workloads, along with the latest version of InfiniBand networking technology, to share data even faster.
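As a rough illustration of why inference needs "a slightly different setup," here is a hedged PyTorch sketch (an assumption; the article does not say what software Microsoft's inference fleet runs). Serving drops the gradient tracking and optimizer state that training requires, which is part of why deployed chips can be fewer, smaller, and more widely dispersed than a training cluster.

```python
# Hedged sketch: inference runs a model without gradient tracking or
# optimizer state, a much lighter workload than training.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for a trained language model
model.eval()                   # switch off training-only behavior

with torch.inference_mode():   # no gradients are recorded
    query = torch.randn(1, 1024)   # stand-in for an encoded user query
    answer = model(query)
```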

The new Bing is still in a testing phase, and Microsoft is gradually adding users from a waitlist. Guthrie's team holds daily meetings with about 20 employees he calls the "pit crew," a term borrowed from the mechanics who service race cars during pit stops. Their job is to figure out how to bring more computing capacity online quickly and to fix problems as they crop up.

Cloud services depend on thousands of different components, including servers, pipes, the concrete for buildings, and various metals and minerals, and a delay or shortage in any one of them, however minor, can derail the entire project. Recently the pit crew had to deal with a shortage of cable trays, the basket-like fixtures that hold the cables coming off machines, so they designed a new one. Guthrie said they are also working out how to squeeze as many servers as possible into existing data centers around the world, so they do not have to wait for new buildings to be finished.

When OpenAI or Microsoft trains a large AI model, the work happens all at once: it is split across all the GPUs, and at certain points the GPUs need to talk to one another to share what they have computed. For the AI supercomputer, Microsoft had to make sure the networking equipment that handles communication among all the chips could carry that load, and it had to develop software that makes the best use of both the GPUs and the network. The company has now built software that can train models with tens of trillions of parameters.

Because all the machines start up at once, Microsoft also had to think about where to place them and where to put the power supplies; otherwise the data center could end up losing power. The company also has to be able to cool all those machines and chips, said Alistair Speirs, Azure's director of global infrastructure. It uses evaporative cooling, relying on outside air in cooler climates and on high-tech swamp coolers in hot climates.
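Here is a hedged sketch of the "GPUs talk to each other" step, using PyTorch's distributed package (an assumption for illustration; the article does not name the software Microsoft developed). In data-parallel training, each GPU computes gradients on its own shard of the batch, then an all-reduce combines them so every GPU applies the same update. The NCCL backend rides fast interconnects such as NVLink and InfiniBand when they are present, which is exactly the network load the paragraph above describes.

```python
# Hedged sketch of gradient all-reduce in data-parallel training.
# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_sketch.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")      # NCCL uses NVLink/InfiniBand when available
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)

# Stand-in for the gradient each GPU computed on its shard of the batch.
grad = torch.randn(1024, device=f"cuda:{local_rank}")

dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # every GPU exchanges and sums
grad /= dist.get_world_size()                # average across GPUs

dist.destroy_process_group()
```

At real scale this exchange happens for billions of gradient values every step, which is why the network fabric, not just the GPUs, determines how fast training can go.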

Guthrie said Microsoft will keep working on custom server and chip designs, and on ways to optimize its supply chain for speed, efficiency, and cost savings. He said: "The AI models that are amazing the world right now were built on the supercomputer we started building several years ago, and the new models will use the new supercomputer we are training on now. This computer is much larger and enables even more complexity."

Microsoft is already working to make Azure’s AI capabilities more powerful, launching new virtual machines that use Nvidia’s H100 and A100 Tensor Core GPUs, as well as Quantum-2 InfiniBand networking. Microsoft says this will allow OpenAI and other companies that rely on Azure to train larger, more complex AI models.

Eric Boyd, Microsoft's corporate vice president for Azure AI, said in a statement: "We found that we needed to build specialized clusters focused on supporting large-scale training workloads, and OpenAI was one of the early proof points. We worked closely with them to understand the key requirements they had when setting up their training environments, and the other things they needed." (Xiao Xiao)


Statement: This article is reproduced from 51cto.com.