Running Llama 70B on a single card is faster than dual cards: Microsoft forced FP6 onto the A100 | Open source

FP8 and lower floating-point quantization precisions are no longer the exclusive privilege of the H100!

Jensen Huang may want everyone to use INT8/INT4, but Microsoft's DeepSpeed team went ahead and got FP6 running on the A100 without any official support from NVIDIA.

Test results show that the new method, TC-FPx, runs FP6 quantization on the A100 at speeds close to, and occasionally exceeding, INT4, while delivering higher precision than the latter.

On top of this, there is end-to-end large-model support, which has been open-sourced and integrated into deep learning inference frameworks such as DeepSpeed.

This result has an immediate effect on large-model acceleration: under this framework, running Llama on a single card achieves 2.65 times the throughput of dual cards.

After reading it, one machine learning researcher described Microsoft's work as "crazy."

Memes followed immediately, be like:

NVIDIA: Only the H100 supports FP8.

Microsoft: Fine, I’ll do it myself.

So, what can this framework achieve, and what technology lies behind it?

Running Llama with FP6: a single card is faster than dual cards

Using FP6 precision on the A100 brings kernel-level performance improvements.

The researchers selected linear layers from Llama and OPT models of different sizes and tested them with CUDA 11.8 on an NVIDIA A100-40GB GPU.

Compared with NVIDIA's official cuBLAS (W16A16) and TensorRT-LLM (W8A16), TC-FPx (W6A16) achieves maximum speedups of 2.6x and 1.9x, respectively.

Compared with the 4-bit BitsandBytes (W4A16) method, TC-FPx's maximum speedup is 8.9x.

(W and A denote the weight and activation quantization bit widths, respectively.)

△ Normalized data, with the cuBLAS result as 1

At the same time, the TC-FPx kernel also reduces DRAM accesses and improves DRAM bandwidth utilization, Tensor Core utilization, and ALU and FMA unit utilization.

FP6-LLM, the end-to-end inference framework built on TC-FPx, also brings significant performance improvements to large models. Taking Llama-70B as an example, FP6-LLM on a single card achieves 2.65 times the throughput of FP16 on dual cards, and at batch sizes below 16 its latency is also lower than FP16's.

For OPT-30B, a model with fewer parameters (where FP16 also runs on a single card), FP6-LLM likewise brings a significant throughput improvement and latency reduction.

Moreover, the maximum batch size a single FP16 card supports under these conditions is only 4, whereas FP6-LLM runs normally at a batch size of 16.

So how did the Microsoft team get FP6 quantization running on the A100?

Redesigning the kernel solution

To support precisions including 6-bit, the TC-FPx team designed a unified kernel scheme that handles quantized weights of different bit widths.

Compared with the traditional dual-kernel approach, TC-FPx reduces the number of memory accesses and improves performance by fusing dequantization and matrix multiplication into a single kernel.

The core trick of this low-precision quantization is to "disguise" FP6-precision data as FP16 through dequantization, then hand it to the GPU for computation in FP16 format.
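To make the trick concrete, here is a minimal device-side sketch of the idea, assuming an E3M2 FP6 layout (1 sign, 3 exponent, 2 mantissa bits); the exact format and constants in TC-FPx may differ. The bits are shifted into the FP16 layout, and the exponent-bias mismatch is then fixed with a single multiply:

```cuda
#include <cuda_fp16.h>

// Minimal sketch of "disguising" FP6 as FP16, assuming an E3M2 layout
// (1 sign, 3 exponent, 2 mantissa bits). Not the exact TC-FPx routine.
__device__ __forceinline__ half fp6_e3m2_to_fp16(unsigned int fp6)
{
    unsigned int sign    = (fp6 >> 5) & 0x1;  // sign bit
    unsigned int exp_man =  fp6       & 0x1F; // 3 exponent + 2 mantissa bits

    // Drop the exponent/mantissa field into the FP16 layout: FP16 has a
    // 10-bit mantissa vs. FP6's 2 bits, so shift left by 8.
    unsigned short bits = (unsigned short)((sign << 15) | (exp_man << 8));

    // FP6 (bias 3) and FP16 (bias 15) disagree on the exponent bias;
    // multiplying by 2^(15-3) = 4096 corrects it in one FP16 multiply.
    return __hmul(__ushort_as_half(bits), __float2half(4096.0f));
}
```

Conveniently, the same scale factor also handles subnormal FP6 values, since both formats treat an all-zero exponent the same way.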

At the same time, the team used bit-level pre-packing to address the fact that GPU memory systems are unfriendly to non-power-of-two bit widths (such as 6-bit).

Specifically, bit-level pre-packing reorganizes the weight data before inference, rearranging the 6-bit quantized weights so they can be accessed in a way that is friendly to the GPU memory system.

In addition, since GPU memory systems usually access data in 32-bit or 64-bit chunks, bit-level pre-packing also packs the 6-bit weights so they can be stored and accessed in these aligned chunks.
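As a rough illustration, here is a host-side sketch of the packing step, written under the assumption that weights arrive one 6-bit value per byte and are packed densely into 32-bit words; the real TC-FPx layout additionally interleaves the fragments to match the kernel's per-thread access pattern, which is omitted here:

```cuda
#include <cstdint>
#include <vector>

// Host-side sketch of bit-level pre-packing: 6-bit weights do not align
// with 32-bit memory transactions, so before inference we repack them
// densely into 32-bit words. Interleaving for the kernel's access
// pattern is omitted for clarity.
std::vector<uint32_t> prepack_fp6(const std::vector<uint8_t>& w6)
{
    std::vector<uint32_t> packed((w6.size() * 6 + 31) / 32, 0);
    size_t bitpos = 0;
    for (uint8_t w : w6) {
        uint32_t v = w & 0x3F;                  // keep the low 6 bits
        packed[bitpos / 32] |= v << (bitpos % 32);
        // A 6-bit value can straddle a 32-bit boundary; spill the rest.
        if (bitpos % 32 > 26)
            packed[bitpos / 32 + 1] |= v >> (32 - bitpos % 32);
        bitpos += 6;
    }
    return packed;
}
```

With this dense layout, every 16 weights occupy exactly three 32-bit words (16 x 6 = 96 bits), so all subsequent loads fall on aligned boundaries.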

Once pre-packing is complete, the team uses the parallel processing capability of the SIMT cores to dequantize the FP6 weights in registers in parallel, producing weights in FP16 format.

The dequantized FP16 weights are reconstructed in registers and then sent to the Tensor Cores, which use them to perform the matrix multiplications that complete the linear layer's computation.

Throughout this process, the team exploited the bit-level parallelism of the SIMT cores to make the entire dequantization more efficient.
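A sketch of this fused pattern follows, reusing the hypothetical fp6_e3m2_to_fp16 helper from the earlier sketch. For clarity it stages the dequantized tile through shared memory and the standard wmma API, whereas the actual TC-FPx kernel keeps the FP16 operands in registers; the tile shape, the lane-to-weight mapping, and the assumption that pre-packing gave each lane its own 64-bit slot are all illustrative:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
#include <cstdint>
using namespace nvcuda;

// Illustrative fused dequant + Tensor Core step for one 16x16x16 tile,
// launched with a single warp. Not the actual TC-FPx kernel, which
// keeps the FP16 operands in registers rather than shared memory.
__global__ void fused_dequant_wmma(const uint32_t* __restrict__ packed_w,
                                   const half* __restrict__ act,
                                   float* __restrict__ out)
{
    __shared__ half w_tile[16 * 16];  // dequantized FP16 weights

    // Assume pre-packing gave each of the 32 lanes a 64-bit slot whose
    // low 48 bits hold its 8 FP6 weights (16x16 tile = 256 weights).
    int lane = threadIdx.x & 31;
    uint64_t bits = ((uint64_t)packed_w[2 * lane + 1] << 32) | packed_w[2 * lane];
    for (int i = 0; i < 8; ++i)       // SIMT dequantization in registers
        w_tile[lane * 8 + i] =
            fp6_e3m2_to_fp16((unsigned int)((bits >> (6 * i)) & 0x3F));
    __syncwarp();

    // Tensor Cores consume the reconstructed FP16 weights as usual.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, w_tile, 16);
    wmma::load_matrix_sync(b, act, 16);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(out, c, 16, wmma::mem_row_major);
}
```

Launched as fused_dequant_wmma<<<1, 32>>>(...), this computes one 16x16 output tile; the real kernel tiles over the full linear layer.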

To let the weight-reconstruction task itself run in parallel, the team also used a parallel weight-splicing technique.

Specifically, each weight is divided into several parts whose bit widths are powers of two (for example, splitting 6 into 2 + 4 or 4 + 2).

Before dequantization, the weights are first loaded from shared memory into registers. Since each weight is split into multiple parts, the complete weight must be reconstructed at the register level at runtime.

To reduce this runtime overhead, TC-FPx proposes parallel extraction and splicing of weights: two sets of registers store the segments of 32 FP6 weights, and the complete weights are reconstructed from them in parallel.

At the same time, extracting and splicing the weights in parallel requires the initial data layout to meet specific ordering constraints, so TC-FPx rearranges the weight fragments ahead of time.
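In sketch form, with a 4 + 2 split, reconstruction reduces to shifts, masks, and ORs on two register sets. The per-weight version below is simplified from the bit-parallel form the kernel would actually use, and the fragment ordering is assumed to have been arranged by the pre-packing step:

```cuda
#include <cstdint>

// Sketch of weight splicing under a 4+2 split: one register holds eight
// 4-bit fragments, another holds sixteen 2-bit fragments. Reassembling
// weight i is two shifts, two masks, and an OR, so many weights can be
// spliced independently of each other and in parallel.
__device__ __forceinline__ uint32_t splice_fp6(uint32_t frag4, uint32_t frag2,
                                               int i)  // i in [0, 8)
{
    uint32_t hi = (frag4 >> (4 * i)) & 0xF;  // high 4 bits of weight i
    uint32_t lo = (frag2 >> (2 * i)) & 0x3;  // low 2 bits of weight i
    return (hi << 2) | lo;                   // reassembled 6-bit weight
}
```

With two such register sets, the 32 FP6 weights mentioned above are rebuilt with straight-line integer operations and no data-dependent branching.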

In addition, TC-FPx includes a software pipeline that combines the dequantization step with the Tensor Core matrix multiplication, improving overall execution efficiency through instruction-level parallelism.
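Structurally, such a pipeline is a double-buffered loop in which SIMT instructions dequantize tile k+1 while the Tensor Cores multiply tile k. The sketch below uses hypothetical dequant_tile and mma_tile callables standing in for the steps shown earlier; only the overlap pattern is the point:

```cuda
// Structural sketch of the software pipeline: SIMT dequantization of the
// next tile overlaps with the Tensor Core MMA on the current one, giving
// instruction-level parallelism. dequant_tile/mma_tile are hypothetical
// stand-ins, not TC-FPx's real interfaces.
template <typename DequantFn, typename MmaFn>
__device__ void pipelined_linear(DequantFn dequant_tile, MmaFn mma_tile,
                                 int num_k_tiles)
{
    dequant_tile(/*buf=*/0, /*tile=*/0);       // prologue: first tile
    for (int k = 0; k < num_k_tiles; ++k) {
        int cur = k & 1;
        if (k + 1 < num_k_tiles)
            dequant_tile(cur ^ 1, k + 1);      // SIMT work for tile k+1
        mma_tile(cur, k);                      // Tensor Core work for tile k
    }
}
```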

Paper address: https://arxiv.org/abs/2401.14112
