H100 inference soars 8x! NVIDIA officially announces open-source TensorRT-LLM, supporting 10+ models
The "GPU poor" are about to bid farewell to their predicament!
Just now, NVIDIA released TensorRT-LLM, an open-source library that accelerates inference for large language models running on the H100.
How big is the speedup? With TensorRT-LLM and its suite of optimizations (including in-flight batching), total model throughput improves by up to 8x.
GPT-J-6B: comparison of A100 and H100, with and without TensorRT-LLM
Taking Llama 2 as another example, TensorRT-LLM delivers 4.6x higher inference performance compared to an A100 alone.
Llama 2 70B: comparison of A100 and H100, with and without TensorRT-LLM
Netizens say the already-powerful H100, combined with TensorRT-LLM, will completely change the state of large language model inference!
TensorRT-LLM: a large-model inference acceleration powerhouse
Because of the enormous parameter counts of today's large models, the difficulty and cost of deployment and inference have remained stubbornly high.
TensorRT-LLM, developed by NVIDIA, aims to dramatically raise LLM throughput on GPUs while lowering cost.
Specifically, TensorRT-LLM packages TensorRT's deep learning compiler, FasterTransformer's optimized kernels, pre- and post-processing, and multi-GPU/multi-node communication into a simple, open-source Python API.
NVIDIA has further enhanced FasterTransformer to make it a productized solution.
In other words, TensorRT-LLM exposes an easy-to-use, open-source, modular Python API. Developers without deep C++ or CUDA expertise can deploy, run, and debug a wide variety of large language models while getting excellent performance and rapid customization.
According to NVIDIA's official blog, TensorRT-LLM improves LLM inference performance on NVIDIA GPUs in four ways.
First, TensorRT-LLM ships with support for more than ten of today's most popular models, which developers can run immediately.
Second, as an open-source software library, TensorRT-LLM allows LLMs to run inference across multiple GPUs and multiple GPU servers simultaneously, connected via NVIDIA's NVLink and InfiniBand interconnects.
Third is "in-flight batching", a new scheduling technique that allows work for different requests to enter and exit the GPU independently of other tasks.
Finally, TensorRT-LLM is optimized to leverage the H100's Transformer Engine, reducing memory usage and latency during model inference.
Let's take a detailed look at how TensorRT-LLM improves model performance.
Rich LLM ecosystem support
TensorRT-LLM provides excellent support for the open-source model ecosystem.
The largest, most advanced language models, such as Meta's Llama 2 70B, require multiple GPUs working in concert to deliver responses in real time.
Previously, to reach peak LLM inference performance, developers had to manually rewrite the AI model, break it into pieces, and then coordinate execution across GPUs.
TensorRT-LLM uses tensor parallelism to distribute the weight matrices across devices, simplifying this process and enabling efficient inference at scale.
Each model can run in parallel on multiple GPUs and multiple servers connected via NVLink, without developer intervention or model changes.
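The core idea behind tensor parallelism can be shown with a toy, pure-Python sketch (the function names here are illustrative, not TensorRT-LLM's API): a weight matrix is split column-wise across "devices", each device computes its slice of the output, and the slices are gathered back together.

```python
# Toy sketch of tensor parallelism. In reality each shard lives on a
# separate GPU and the gather happens over NVLink; here "devices" are
# just separate Python lists.

def matvec(x, W):
    """Compute y = x @ W, where x has length m and W is an m x n matrix
    stored as a list of m rows."""
    n = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n)]

def split_columns(W, num_devices):
    """Split W column-wise into num_devices shards (assumes even division)."""
    step = len(W[0]) // num_devices
    return [[row[d * step:(d + 1) * step] for row in W]
            for d in range(num_devices)]

def tensor_parallel_matvec(x, W, num_devices):
    """Each 'device' multiplies x by its column shard; concatenating the
    partial outputs (an all-gather) reproduces the full result."""
    partials = [matvec(x, shard) for shard in split_columns(W, num_devices)]
    out = []
    for p in partials:
        out.extend(p)
    return out

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(matvec([1, 1], W))                     # [6, 8, 10, 12]
print(tensor_parallel_matvec([1, 1], W, 2))  # [6, 8, 10, 12] — same result
```

Because each shard is only 1/N of the weight matrix, no single device ever has to hold the whole layer, which is what makes models like Llama 2 70B fit across several GPUs.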
As new models and model architectures appear, developers can optimize them using the latest NVIDIA AI kernels open-sourced in TensorRT-LLM.
The supported kernel fusions include the latest FlashAttention implementation, as well as masked multi-head attention for the context and generation phases of GPT model execution.
In addition, TensorRT-LLM also includes fully optimized, ready-to-run versions of many large language models that are currently popular.
These include Meta's Llama 2, OpenAI's GPT-2 and GPT-3, Falcon, Mosaic MPT, BLOOM, and more than ten others, all callable through the easy-to-use TensorRT-LLM Python API.
These features help developers build customized large language models faster and more accurately to meet the needs of every industry.
Nowadays, large language models are used in a wide range of applications.
A single model can serve multiple seemingly unrelated tasks at once, from simple Q&A in a chatbot to document summarization or the generation of long code blocks. Workloads are highly dynamic, and output sizes vary by orders of magnitude across tasks.
This diversity of tasks makes it hard to batch requests and execute them in parallel efficiently, since some requests finish much earlier than others.
To manage these dynamic loads, TensorRT-LLM includes an optimized scheduling technology called "In-flight batching".
The key observation is that an LLM generates its output over many iterations of the model rather than in a single pass.
With in-flight batching, the TensorRT-LLM runtime immediately releases completed sequences from the batch rather than waiting for the entire batch to finish before moving on to the next set of requests.
New requests then begin executing while still-unfinished requests from the previous batch remain in flight.
In-flight batching, together with additional kernel-level optimizations, improves GPU utilization and at least doubles throughput on real-world LLM request benchmarks on the H100.
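The scheduling difference can be illustrated with a tiny simulation (the request lengths and batch size below are hypothetical, and this is a simplification of the real runtime): under static batching a slot stays blocked until the whole batch finishes, while in-flight batching refills a slot the moment its sequence completes.

```python
def static_batching_steps(lengths, batch_size):
    """Total decode iterations when each batch must fully finish before
    the next batch starts. Each iteration emits one token per active
    sequence, so a batch costs as many steps as its longest request."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def in_flight_batching_steps(lengths, batch_size):
    """Total decode iterations when a finished sequence's slot is
    refilled immediately from the waiting queue."""
    queue = list(lengths)
    slots = []  # tokens remaining for each active sequence
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:  # refill free slots
            slots.append(queue.pop(0))
        steps += 1                                # one decode iteration
        slots = [r - 1 for r in slots if r > 1]   # evict finished sequences
    return steps

lengths = [10, 2, 2, 2]   # hypothetical output lengths (tokens) per request
print(static_batching_steps(lengths, 2))     # 12 iterations
print(in_flight_batching_steps(lengths, 2))  # 10 iterations
```

Here in-flight batching finishes in 10 iterations, the minimum possible given the 10-token request, because the short requests slide into the slot freed after the first 2-token sequence completes instead of padding out a whole batch.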
TensorRT-LLM also takes advantage of the H100's Transformer Engine, which effectively reduces memory consumption and latency during large-model inference.
Because LLMs contain billions of weights and activations, they are typically trained and represented with FP16 or BF16 values, each occupying 16 bits of memory.
However, at inference time, most models can be effectively represented with lower precision using quantization techniques, such as 8-bit or even 4-bit integers (INT8 or INT4).
Quantization is the process of lowering the numeric precision of weights and activations without sacrificing model accuracy. Lower precision means each parameter is smaller, so the model occupies less GPU memory.
This makes it possible to run inference for larger models on the same hardware while spending less time on memory operations during execution.
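The memory savings are easy to estimate with back-of-the-envelope arithmetic (weights only; activations, KV cache, and runtime overhead are ignored, and the 70B parameter count is approximate):

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed just to hold the model weights."""
    return num_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

params_70b = 70e9  # roughly a Llama 2 70B-scale model
print(weight_memory_gb(params_70b, 16))  # FP16/BF16: 140.0 GB
print(weight_memory_gb(params_70b, 8))   # FP8/INT8:   70.0 GB
print(weight_memory_gb(params_70b, 4))   # INT4:       35.0 GB
```

At 16 bits per parameter the weights alone exceed a single H100's 80 GB of memory, which is why dropping to 8-bit formats matters so much for fitting and serving such models.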
Through the H100's Transformer Engine, TensorRT-LLM lets users easily convert model weights to the new FP8 format and automatically compiles models to take advantage of optimized FP8 kernels.
And this process requires no code changes! The FP8 data format introduced with the H100 lets developers quantize their models and dramatically cut memory consumption without reducing model accuracy.
Compared with formats such as INT8 or INT4, FP8 quantization retains higher precision while achieving the fastest performance, and it is the most convenient to implement.
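To see why a floating-point 8-bit format can retain more useful precision than an 8-bit integer with a single scale, here is a rough pure-Python emulation. This is only a caricature: real FP8 (E4M3) is approximated by rounding to 3 mantissa bits with exponent range, subnormals, and per-tensor scaling strategies all ignored.

```python
import math

def fake_fp8(x, mantissa_bits=3):
    """Round x to 1 + mantissa_bits significand bits, mimicking the
    *relative* precision of an FP8-style float (exponent limits ignored)."""
    if x == 0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)
    return math.ldexp(round(m * scale) / scale, e)

def int8_quantize(x, scale):
    """Symmetric INT8 quantization with a single per-tensor scale:
    small values get crushed to zero when the scale is set by the max."""
    q = max(-127, min(127, round(x / scale)))
    return q * scale

values = [0.001, 0.5, 100.0]                 # spans 5 orders of magnitude
scale = max(abs(v) for v in values) / 127    # one scale for the whole tensor

for v in values:
    print(v, "->", fake_fp8(v), "(fp8-ish) vs", int8_quantize(v, scale), "(int8)")
```

With one integer scale sized for the largest value, 0.001 quantizes to exactly 0.0, while the float-style rounding keeps every value within a few percent of its original, regardless of magnitude. That constant relative error across magnitudes is the intuition behind FP8's accuracy advantage for weight and activation distributions.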
Although TensorRT-LLM has not been officially released yet, users can apply for early access now.
The application link is as follows:
https://developer.nvidia.com/tensorrt-llm-early-access/join
NVIDIA also said that TensorRT-LLM will be integrated into the NVIDIA NeMo framework soon.
The NeMo framework is part of NVIDIA AI Enterprise, the company's recently launched platform offering enterprise customers a secure, stable, and highly manageable enterprise-grade AI software stack.
Developers and researchers can access TensorRT-LLM through the NeMo framework on NVIDIA NGC or via the project on GitHub.
Note, however, that users must register for the NVIDIA Developer Program to apply for the early-access version.
Users on Reddit had a heated discussion about the release of TensorRT-LLM.
One commenter found it hard to imagine how much better things will get once hardware is optimized specifically for LLMs.
But some netizens believe the real purpose of this release is to help Jensen Huang sell more H100s.
Others disagreed, arguing that TensorRT also benefits users who deploy deep learning locally: anyone with an RTX GPU may benefit from similar products in the future.
Taking a broader view, LLMs may attract a whole series of hardware-level optimizations, perhaps even hardware designed specifically for LLMs. This has happened with many other popular applications, and LLMs will be no exception.