Nvidia releases TensorRT-LLM open source software to improve AI model performance on high-end GPU chips

王林
2023-09-14 12:29:05

Nvidia recently announced TensorRT-LLM, a new open source software suite that expands large language model optimization on Nvidia GPUs and pushes past earlier limits on AI inference performance after deployment.

Generative AI large language models have become popular because of their impressive capabilities. They expand what artificial intelligence can do and are widely used across industries. Users can obtain information by talking to chatbots, summarize large documents, write software code, and discover new ways to understand information.

Ian Buck, vice president of hyperscale and high-performance computing at Nvidia, said: "Large language model inference is becoming increasingly difficult. It is natural for models to become smarter and larger as their complexity increases, but when a model scales beyond a single GPU and must run on multiple GPUs, it becomes a big problem."

In artificial intelligence, inference is the process in which a model handles new data it has never seen before, for example to summarize, generate code, provide suggestions, or answer questions. It is the workhorse of large language models.

As the model ecosystem expands rapidly, models are becoming larger and more capable. This also means that a model can become too large to run on a single GPU and must be split across several. Without tooling, developers and engineers must manually distribute and coordinate the workload to get responses in real time. TensorRT-LLM addresses this by implementing "tensor parallelism", which allows large-scale, efficient inference across multiple GPUs.
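The idea behind tensor parallelism can be illustrated with a toy sketch. This is not TensorRT-LLM code: the function names below are hypothetical, plain Python lists stand in for GPU tensors, and the column-wise split is only one of several possible ways to partition a weight matrix. Each simulated device holds one shard of the weights, computes its slice of the output, and the slices are then joined.

```python
# Toy illustration of tensor parallelism (hypothetical helper names,
# not the TensorRT-LLM API). One weight matrix is split column-wise
# across several simulated devices.

def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, num_devices):
    """Split w into num_devices contiguous column shards of equal width."""
    n = len(w[0])
    if n % num_devices != 0:
        raise ValueError("column count must divide evenly across devices")
    width = n // num_devices
    return [[row[d * width:(d + 1) * width] for row in w] for d in range(num_devices)]

def tensor_parallel_matmul(x, w, num_devices):
    """Each shard computes its slice of the output; the slices are then joined."""
    shards = split_columns(w, num_devices)
    partial_outputs = [matmul(x, shard) for shard in shards]  # one result per device
    return [value for part in partial_outputs for value in part]  # gather step

x = [1.0, 2.0, 3.0]
w = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
]
print(tensor_parallel_matmul(x, w, num_devices=2))  # [7.0, 5.0, 4.0, 10.0]
```

The sharded result matches the unsharded `matmul(x, w)`, which is the point: no single device needs the whole matrix, only its own columns. On real hardware the join is a collective communication step between GPUs.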

In addition, because there are so many different large language models, Nvidia has optimized kernels for today's mainstream ones. The software suite includes fully optimized, ready-to-run versions of large language models, including Meta Platforms' Llama 2, OpenAI's GPT-2 and GPT-3, Falcon, Mosaic MPT, and BLOOM.

"In-flight batching" mechanism to deal with dynamic workloads

By their nature, large language models face highly dynamic workloads, and workload requirements and task usage may change over time. A single model can serve simultaneously as a chatbot that answers questions and as a summarizer for both large and short documents, so output sizes can differ by orders of magnitude.

To cope with these varied workloads, TensorRT-LLM introduces a mechanism called "in-flight batching", an optimized scheduling process that breaks text generation into multiple fragments that can be moved in and out of the GPU, so that an entire batch does not need to finish before a new batch starts.

Previously, if there was a large request, such as summarizing a very large document, everything behind it would have to wait for the process to complete before the queue could move forward.
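The difference between the two scheduling styles can be shown with a small simulation. This is a simplified sketch under stated assumptions, not TensorRT-LLM's scheduler: the function names are hypothetical, each request is reduced to the number of generation steps it needs, and every running request advances by one step per clock tick.

```python
from collections import deque

def _check(lengths, batch_size):
    """Reject inputs that would make the simulation meaningless."""
    if batch_size < 1 or any(n < 1 for n in lengths):
        raise ValueError("batch_size and all request lengths must be >= 1")

def static_batching(lengths, batch_size):
    """Finish step of each request when a batch must fully drain
    before the next batch may start."""
    _check(lengths, batch_size)
    requests = list(enumerate(lengths))
    finish, clock = {}, 0
    for start in range(0, len(requests), batch_size):
        batch = requests[start:start + batch_size]
        for req_id, n in batch:
            finish[req_id] = clock + n
        clock += max(n for _, n in batch)  # the longest request holds the batch
    return dict(sorted(finish.items()))

def in_flight_batching(lengths, batch_size):
    """Finish step of each request when a freed slot is refilled
    from the queue right away."""
    _check(lengths, batch_size)
    waiting = deque(enumerate(lengths))
    running, finish, clock = {}, {}, 0  # running: req_id -> steps remaining
    while waiting or running:
        while waiting and len(running) < batch_size:  # fill free slots
            req_id, n = waiting.popleft()
            running[req_id] = n
        clock += 1  # one generation step for every running request
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] == 0:
                finish[req_id] = clock
                del running[req_id]
    return dict(sorted(finish.items()))

lengths = [8, 2, 2, 2]  # one long request followed by three short ones
print(static_batching(lengths, batch_size=2))     # {0: 8, 1: 2, 2: 10, 3: 10}
print(in_flight_batching(lengths, batch_size=2))  # {0: 8, 1: 2, 2: 4, 3: 6}
```

In this toy run the long request finishes at step 8 either way, but the short requests queued behind it finish at steps 4 and 6 instead of step 10, because they no longer wait for the whole first batch to drain.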

Nvidia has been working with many vendors to optimize TensorRT-LLM, including Meta, Cohere, Grammarly, Databricks and Tabnine. With their help, Nvidia continues to streamline the functionality and toolset within the software suite, including an open source Python application programming interface (API) for defining and optimizing new architectures to customize large language models.

For example, when MosaicML integrated TensorRT-LLM into its existing software stack, it added extra functionality on top of TensorRT-LLM. Naveen Rao, vice president of engineering at Databricks, said the process was very simple.

"TensorRT-LLM is easy to use and rich in features, including token streaming, dynamic batching, paged attention, quantization and more, and it is very efficient," Rao said. "It provides optimal performance for serving large language models on NVIDIA GPUs and allows us to pass the cost savings back to our customers."

Nvidia said that TensorRT-LLM and the benefits it brings, including its batching mechanism, can more than double inference performance for article summarization on the Nvidia H100. In a test with the GPT-J-6B model on CNN/Daily Mail article summarization, the H100 alone was 4 times faster than the A100, and with TensorRT-LLM optimizations enabled it was 8 times faster than the A100.

TensorRT-LLM provides developers and engineers with a deep learning compiler, optimized large language model kernels, pre- and post-processing, multi-GPU/multi-node communication capabilities, and a simple open source API, allowing them to quickly optimize and run inference for large language models in production. As large language models continue to reshape the data center, enterprises' demand for higher performance means that developers, more than ever, need tools that give them the functionality and access to deliver higher-performing results.

The TensorRT-LLM software suite is now available in early access to developers in the Nvidia Developer Program. Next month it will be integrated into the NeMo framework, part of Nvidia AI Enterprise, an end-to-end software platform for production AI.

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it deleted.