


The inference speed of large models has doubled in just one month!
Recently, NVIDIA announced a "shot in the arm" built specifically for the H100, aimed at speeding up the LLM inference process.
Now you may not have to wait until next year for the GH200 to be delivered.
GPU compute has long constrained the performance of large models, and hardware vendors and users alike want faster speeds.
As the largest hardware supplier behind large models, NVIDIA has been studying how to make large models run faster on its hardware.
Through cooperation with a number of AI companies, NVIDIA has now launched TensorRT-LLM, a large-model inference optimization toolkit (referred to below simply as TensorRT).
TensorRT not only doubles large-model inference speed, it is also very convenient to use.
Without in-depth knowledge of C++ and CUDA, you can quickly customize optimization strategies and run large models faster on the H100.
NVIDIA scientist Jim Fan retweeted the announcement, commenting that NVIDIA's "other advantage" is the supporting software that wrings the most performance out of its GPUs.
NVIDIA injects new vitality into its products through software, living up to Jensen Huang's line that "the more you buy, the more you save." That has not stopped some people from finding the products overpriced, however.
Beyond the price, some netizens also questioned whether it actually delivers:
"We've seen these many-fold performance claims over and over, yet when I run Llama 2 myself, I still only get a few dozen tokens per second."
Whether TensorRT really works will take further testing to confirm. For now, let's take a closer look at it.
Doubling large-model inference speed
How fast does an H100 optimized with TensorRT-LLM actually run large models?
NVIDIA's announcement provides data for two models: Llama 2 and GPT-J-6B.
On the optimized H100, Llama 2 inference runs 4.6 times faster than on the A100, and 1.77 times faster than on the unoptimized H100 of this August.
GPT-J-6B inference runs 8 times faster than on the A100, and 2 times faster than on the unoptimized August version.
TensorRT also provides an open-source, modular Python API for quickly customizing optimization strategies to the requirements of different LLMs.
The API integrates the deep learning compiler with kernel optimizations, pre/post-processing, and multi-node communication.
There are also pre-configured versions of common models such as GPT (2/3) and Llama that can be used out of the box.
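To give a taste of what that looks like in practice, here is a minimal sketch using the high-level `LLM` entry point found in recent TensorRT-LLM releases. The exact classes, arguments, and model path below are assumptions that may differ in the early-access version described in this article.

```python
# Minimal sketch: serving Llama 2 through TensorRT-LLM's Python API.
# Assumes the high-level `LLM` entry point from recent tensorrt_llm releases;
# class names and arguments may differ in the early-access version.
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads a cached) TensorRT engine from a Hugging Face checkpoint.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

params = SamplingParams(max_tokens=64, temperature=0.8)
outputs = llm.generate(["What makes the H100 fast at inference?"], params)

for out in outputs:
    print(out.outputs[0].text)
```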
Through the latest open-source AI kernels in TensorRT, developers can also optimize the models themselves, including FlashAttention, the attention algorithm that dramatically speeds up Transformers.
At its core, TensorRT is a high-performance engine for optimizing deep learning inference, and it speeds up LLM inference through several techniques. Mixed-precision computing converts floating-point calculations to half precision, reducing both the amount of computation and the memory bandwidth required. Dynamic graph optimization selects the optimal network structure based on the characteristics of the input data. Layer fusion merges multiple computing layers into a single, more efficient layer, cutting compute and memory-access overhead. Together, these techniques significantly improve the speed and efficiency of LLM inference.
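As a concrete, purely illustrative example (plain NumPy, not TensorRT code), the snippet below shows the spirit of two of those techniques: computing in half precision so the hardware moves fewer bytes, and fusing the bias-add and activation into a single pass instead of making separate trips over memory.

```python
# Conceptual illustration (NumPy, not TensorRT) of half-precision compute
# and layer fusion.
import numpy as np

x = np.random.randn(4, 256).astype(np.float32)
w = np.random.randn(256, 256).astype(np.float32)
b = np.random.randn(256).astype(np.float32)

def relu(t):
    return np.maximum(t, 0.0)

# Unfused FP32: three separate passes over memory (matmul, bias add, ReLU).
y_unfused = relu(x @ w + b)

# Half precision + "fusion": inputs stored in FP16 move half the bytes, and
# the bias/activation run as one step, as a fused kernel epilogue would.
def fused_linear_relu(x16, w16, b16):
    return relu(x16 @ w16 + b16)

y_fused = fused_linear_relu(x.astype(np.float16), w.astype(np.float16),
                            b.astype(np.float16))

# The FP16 result matches FP32 up to a small rounding error.
print(np.max(np.abs(y_unfused - y_fused.astype(np.float32))))
```

In a real engine the fusion happens at the CUDA-kernel level, so the intermediate matmul output never round-trips through GPU memory at all.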
Where does the speedup come from? The first factor is TensorRT's optimization of multi-node cooperative work.
A model as huge as Llama cannot run on a single card; it takes multiple GPUs working together.
In the past, that meant manually partitioning the model by hand.
With TensorRT, the system can automatically split the model and run it efficiently across multiple GPUs over NVLink.
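The core idea behind such a split is tensor parallelism. Here is a toy, single-process NumPy illustration of a column-parallel split, one common scheme; the real system shards actual CUDA kernels and synchronizes over NVLink.

```python
# Toy sketch of column-parallel tensor parallelism (illustrative only).
import numpy as np

x = np.random.randn(2, 8)    # activations, replicated on every "GPU"
w = np.random.randn(8, 16)   # a weight matrix too large for one card

# Each "GPU" holds half of the output columns.
w_gpu0, w_gpu1 = np.split(w, 2, axis=1)

y_gpu0 = x @ w_gpu0          # computed on GPU 0
y_gpu1 = x @ w_gpu1          # computed on GPU 1

# An all-gather (over NVLink in practice) reassembles the full activation.
y = np.concatenate([y_gpu0, y_gpu1], axis=1)

assert np.allclose(y, x @ w)  # identical to the single-device result
```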
Second, TensorRT uses an optimized scheduling technique called dynamic batching.
During inference, an LLM actually generates text by running the model iteratively, producing one token per iteration.
Dynamic batching evicts a finished sequence from the batch immediately and admits the next waiting request in its place, rather than waiting for the entire batch to finish before processing the next set of requests.
In real-world tests, dynamic batching doubled the GPU's LLM request throughput, significantly reducing operating costs.
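The scheduling idea is easy to simulate. Below is a toy scheduler of my own (not TensorRT-LLM code): each request needs a different number of decoding iterations, and a finished request frees its batch slot at once.

```python
# Toy simulation of dynamic (in-flight) batching: finished sequences leave
# the batch immediately and queued requests take their slots.
from collections import deque

queue = deque([("req-A", 3), ("req-B", 1), ("req-C", 4), ("req-D", 2)])
MAX_BATCH = 2
batch = {}  # request id -> tokens still to generate

step = 0
while queue or batch:
    # Fill any free slots from the waiting queue before each iteration.
    while queue and len(batch) < MAX_BATCH:
        rid, remaining = queue.popleft()
        batch[rid] = remaining

    # One decoding iteration: every active request emits one token.
    for rid in list(batch):
        batch[rid] -= 1
        if batch[rid] == 0:
            del batch[rid]  # evict immediately, freeing the slot
    step += 1
    print(f"step {step}: active={list(batch)} waiting={[r for r, _ in queue]}")
```

With static batching, req-B's slot would sit idle until req-A finished; here req-C starts two steps earlier.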
Another key technique is converting 16-bit floating-point numbers to 8-bit precision, which cuts memory consumption.
Compared with the FP16 used during training, FP8 consumes fewer resources; and compared with INT8, it is more accurate. It can improve performance without hurting model accuracy.
With the Hopper Transformer Engine, the system completes the FP16-to-FP8 conversion and compilation automatically, without any manual changes to the model code.
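For intuition, here is a rough NumPy sketch of the scaling step that underlies such a conversion. NumPy has no FP8 dtype, so the e4m3 constant and helper below are illustrative assumptions; the Transformer Engine performs the actual per-layer format conversion in hardware.

```python
# Rough sketch of the per-tensor scaling behind an FP16 -> FP8 cast.
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the common e4m3 FP8 format

def fp8_scale(x_fp16: np.ndarray) -> float:
    """Scale factor mapping x into FP8's representable range before casting."""
    return FP8_E4M3_MAX / float(np.max(np.abs(x_fp16)))

w = (np.random.randn(4, 1024) * 0.02).astype(np.float16)
s = fp8_scale(w)
w_scaled = np.clip(w.astype(np.float32) * s, -FP8_E4M3_MAX, FP8_E4M3_MAX)

# Hardware would now cast w_scaled to FP8, halving memory versus FP16; at
# inference time the matmul output is multiplied by 1/s to undo the scale.
print(f"scale = {s:.1f}, range = ({w_scaled.min():.1f}, {w_scaled.max():.1f})")
```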
An early-access version of TensorRT-LLM is available for download now; the official release will arrive in a few weeks and will be integrated into the NeMo framework.
One More Thing
Whenever big news breaks, the internet's eagle-eyed "Leeuwenhoeks" are never far behind.
NVIDIA's announcement mentions cooperation with leading AI companies such as Meta, but OpenAI is not among them.
Some netizens spotted this and posted it to the OpenAI forum:
"Come see who didn't get a shout-out from Jensen" (tongue firmly in cheek).
What other "surprises" are you expecting Jensen Huang to bring us?