The PyTorch team re-implemented the "Segment Anything" model to run eight times faster than the original implementation

Generative AI has developed rapidly since the beginning of the year. But we often face a hard problem: how to speed up training, inference, and other workloads for generative AI, especially when using PyTorch.

In this article, researchers from the PyTorch team provide us with a solution. The article focuses on how to use pure native PyTorch to accelerate generative AI models. It also introduces new PyTorch features and practical examples of how to combine them.

What is the result? The PyTorch team says they rewrote Meta's "Segment Anything" model (SAM), producing code that is 8 times faster than the original implementation without losing accuracy, with every optimization done in native PyTorch.


Blog address: https://pytorch.org/blog/accelerating-generative-ai/

After reading this article, you will gain the following understanding:

  • torch.compile: the PyTorch model compiler. PyTorch 2.0 added a new function, torch.compile(), that can accelerate existing models with one line of code;
  • GPU quantization: accelerates the model by reducing the precision of its computations;
  • SDPA (Scaled Dot Product Attention): a memory-efficient attention implementation;
  • Semi-structured (2:4) sparsity: a sparse memory format optimized for GPUs;
  • Nested Tensor: packs {tensor, mask} together to batch non-uniformly sized data, such as images of different sizes, into a single tensor;
  • Triton custom operations: write GPU operations in the Triton Python DSL and integrate them into the various components of PyTorch through custom operator registration.


Increased throughput and reduced memory overhead brought about by PyTorch’s native features.

For more information about SAM, proposed by Meta, see the earlier article "CV no longer exists? Meta releases the "Segment Anything" AI model, and CV may usher in its GPT-3 moment".


Next, we walk through the SAM optimization process, including performance analysis, bottleneck identification, and how these new PyTorch features are combined to solve the problems SAM faces. Along the way we introduce several new PyTorch features: torch.compile, SDPA, Triton kernels, Nested Tensor, and semi-structured sparsity.

The content goes deeper step by step, and the article closes by introducing the fast version of SAM, which interested readers can download from GitHub. In addition, the profiling data was visualized with Perfetto UI to show the practical value of each PyTorch feature.

GitHub address: https://github.com/pytorch-labs/segment-anything-fast (the source code for this project can be found there)

Rewriting the Segment Anything Model (SAM)

The study starts from a SAM baseline that uses the float32 dtype and a batch size of 1, and inspects its kernel trace with PyTorch Profiler.


The article identifies two areas where SAM can be optimized.

The first is the long call to aten::index, the underlying call generated by tensor indexing operations (such as []). However, the actual GPU time spent on aten::index is relatively low. The reason is that aten::index launches two kernels, and a blocking cudaStreamSynchronize happens between them. This means the CPU waits for the GPU to finish processing before the second kernel is launched. To optimize SAM, the article therefore argues that one should aim to eliminate blocking GPU synchronizations that cause idle time.

The second is that SAM spends a lot of GPU time in matrix multiplication (the dark green part of the trace), which is very common in Transformer models. Reducing the GPU time the SAM model spends on matrix multiplication would significantly speed up SAM.

Next, the throughput (img/s) and memory overhead (GiB) of SAM are used to establish a baseline. The optimization process follows.


Bfloat16 half precision (plus GPU synchronization and batching)

To address the second issue, that is, to reduce the time spent on matrix multiplication, the article turns to bfloat16. bfloat16 is a commonly used half-precision type. By lowering the precision of each parameter and activation, it can save a great deal of computing time and memory.
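As a rough illustration of the memory side of this trade-off, the sketch below casts a float32 tensor to bfloat16 and compares storage size and rounding error. The tensor is a hypothetical stand-in for a model weight, not one of SAM's actual parameters.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a model weight; SAM's real weights are not used here.
weight_fp32 = torch.randn(1024, 1024, dtype=torch.float32)
weight_bf16 = weight_fp32.to(torch.bfloat16)

# Storage per tensor: 4 bytes per element for float32, 2 for bfloat16.
bytes_fp32 = weight_fp32.element_size() * weight_fp32.numel()
bytes_bf16 = weight_bf16.element_size() * weight_bf16.numel()

# bfloat16 keeps float32's exponent range but far fewer mantissa bits,
# so each value is rounded. Measure the worst relative rounding error.
abs_error = (weight_bf16.float() - weight_fp32).abs()
max_rel_error = (abs_error / weight_fp32.abs().clamp_min(1e-12)).max().item()

print(f"float32: {bytes_fp32} bytes, bfloat16: {bytes_bf16} bytes")
print(f"max relative rounding error: {max_rel_error:.5f}")
```

For a whole module, the same cast is typically applied with `model.to(torch.bfloat16)`; whether accuracy holds up has to be checked per model.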

Replacing the padding dtype with bfloat16.

In addition, the article found two places that can be optimized to remove GPU synchronization.


Specifically, the study found that in SAM's image encoder, two variables, q_coords and k_coords, act as coordinate scalers, and both are allocated and processed on the CPU. However, once these variables are used to index into rel_pos_resized, the indexing operation automatically moves them to the GPU, causing a GPU synchronization. To solve this, the study points out that this part can be rewritten with the torch.where function.
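The general idea behind a torch.where rewrite can be sketched as follows. This is a generic illustration, not SAM's actual code: the function names are hypothetical. Boolean-mask indexing such as `x[mask]` produces an output whose size depends on the data, so on a GPU the host has to wait for the device before it can continue, while `torch.where` keeps a fixed output shape and needs no such wait.

```python
import torch


def sum_selected_by_indexing(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # x[mask] has a data-dependent shape; on CUDA this forces a
    # host-device synchronization to learn how many elements were selected.
    return x[mask].sum()


def sum_selected_by_where(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # torch.where returns a tensor with the same shape as x, so the
    # kernel can be queued without waiting for the mask's contents.
    return torch.where(mask, x, torch.zeros_like(x)).sum()


torch.manual_seed(0)
x = torch.randn(4, 6)
mask = x > 0

# Both formulations give the same result; only the second avoids
# the data-dependent shape.
print(sum_selected_by_indexing(x, mask), sum_selected_by_where(x, mask))
```

The example runs on the CPU, where no synchronization cost is visible; the difference only shows up in a GPU profile.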

Kernel trace

After applying these changes, we notice noticeable time gaps between individual kernel calls, which is typical of small batch sizes (here 1). To look more closely at this, we next profile SAM inference with a batch size of 8.


When analyzing the time spent per kernel, we notice that most of SAM's GPU time is spent on element-wise kernels and softmax operations.

We can now see that the relative overhead of matrix multiplication is much smaller.


Combining the GPU synchronization fixes with bfloat16 improves SAM performance by 3x.


torch.compile (graph breaks and CUDA graphs)

While studying SAM, the team found that it performs many small operations. The researchers believe a compiler can help by fusing these operations, so they turned to torch.compile, which optimizes the model in the following ways:

  • Fuse sequences of operations such as nn.LayerNorm or nn.GELU into a single GPU kernel;
  • Fuse operations that immediately follow a matrix multiplication kernel to reduce the number of GPU kernel calls.

Through these optimizations, the number of GPU global memory roundtrips is reduced, which speeds up inference. We can now try torch.compile on SAM's image encoder. To maximize performance, the article uses some advanced compilation techniques.
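A minimal sketch of what applying torch.compile to a module might look like is given below. `TinyBlock` is a hypothetical stand-in for an encoder block, not SAM's actual module, and `mode="max-autotune"` is an assumed choice of compile mode rather than something stated in this text.

```python
import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Hypothetical stand-in for one encoder block (not SAM's real code)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LayerNorm, GELU and the residual add are the kind of small
        # operations a compiler can fuse around the two matmuls.
        return x + self.fc2(self.act(self.fc1(self.norm(x))))


def compile_block(block: nn.Module) -> nn.Module:
    # Assumption: "max-autotune" trades longer compile time for a wider
    # search over kernel choices. Compilation is lazy and only happens
    # on the first call of the returned module.
    return torch.compile(block, mode="max-autotune")
```

A typical use would be `fast_block = compile_block(TinyBlock())` followed by a few warm-up calls, since the first invocation pays the compilation cost.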


Kernel trace


According to the results, torch.compile performs very well.


Softmax now takes up a large share of the time, followed by various GEMM variants. The following measurements are for batch sizes of 8 and above.


SDPA: scaled_dot_product_attention

Next, the article experiments with SDPA (scaled_dot_product_attention), focusing on the attention mechanism. In general, a naive attention implementation scales quadratically with sequence length in both time and memory. PyTorch's SDPA operation is built on the memory-efficient attention principles of Flash Attention, FlashAttentionV2, and xFormers, which can significantly speed up attention on the GPU. Combined with torch.compile, this operation allows a common pattern in variants of MultiheadAttention to be expressed and fused. After a small change, the model can now use scaled_dot_product_attention.
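To make the "small change" concrete, the sketch below compares a naive attention implementation, which materializes the full sequence-by-sequence score matrix, with a call to `torch.nn.functional.scaled_dot_product_attention`. The tensor shapes are arbitrary illustrative values, and the code is not taken from SAM.

```python
import math

import torch
import torch.nn.functional as F


def naive_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Materializes a (seq_len x seq_len) score matrix per head, which is
    # where the quadratic time and memory cost comes from.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
# Illustrative shape: (batch, heads, seq_len, head_dim).
q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))

out_naive = naive_attention(q, k, v)
# Same math in one fused call; PyTorch picks the backend that is
# available for the current device and dtype.
out_sdpa = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_naive, out_sdpa, atol=1e-4))
```

On the CPU both paths compute the same result at similar cost; the memory-efficient kernels mainly matter on a GPU with longer sequences.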


Kernel trace

We can now see that the memory-efficient attention kernel takes up a large share of the computation time on the GPU.


With PyTorch's native scaled_dot_product_attention, the batch size can be increased significantly. The measurements that follow are for batch sizes of 32 and above.

Next, the study ran experiments with Triton, NestedTensor, batched predict_torch, int8 quantization, semi-structured (2:4) sparsity, and other operations.

For example, the article uses a custom Triton kernel for the positional encoding and reports measurements at a batch size of 32.

It then adopts Nested Tensor and reports measurements for batch sizes of 32 and above.

After adding quantization, measurements are again reported for batch sizes of 32 and above.

The last topic of the article is semi-structured sparsity. The study shows that matrix multiplication is still a bottleneck. The solution is to use sparsification to approximate matrix multiplication. Sparsifying a matrix (that is, zeroing out values) lets fewer bits be used to store weights and activation tensors. The process of choosing which weights in a tensor to set to zero is called pruning. Pruning the smaller weights can potentially reduce model size without a significant loss of accuracy.

There are many ways to prune, ranging from completely unstructured to highly structured. Unstructured pruning theoretically has minimal impact on accuracy, but GPUs, although very efficient at large dense matrix multiplications, may suffer significant performance degradation in the sparse case. One pruning method recently supported by PyTorch is semi-structured (or 2:4) sparsity, which aims to strike a balance. This sparse storage format reduces the original tensor by 50% while still producing a dense tensor output.


To use this sparse storage format and the associated fast kernels, the next step is to prune the weights. For 2:4 sparsity, the article prunes the two smallest weights in each group of four. Changing the weights from the default PyTorch ("strided") layout to the new semi-structured sparse layout is easy. Implementing apply_sparse(model) takes only 32 lines of Python code.
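The pruning rule itself can be sketched in a few lines. The function below is a hypothetical illustration of magnitude-based 2:4 pruning, not the article's apply_sparse: it zeroes the two smallest-magnitude values in every group of four along the last dimension and leaves the tensor in the ordinary dense layout.

```python
import torch


def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude values in each group of four."""
    if weight.shape[-1] % 4 != 0:
        raise ValueError("last dimension must be a multiple of 4")
    groups = weight.reshape(-1, 4)
    # Indices of the two largest magnitudes in each group are kept.
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return (groups * mask).reshape(weight.shape)


w = torch.tensor([[1.0, -5.0, 0.5, 3.0, 2.0, 2.5, -0.1, 0.2]])
pruned = prune_2_4(w)
print(pruned)
```

This only produces the 2:4 pattern of zeros. Converting the pruned weight to the compressed semi-structured layout is a separate step that needs a supported GPU, so it is left out of this sketch.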


With 2:4 sparsity, we observe peak performance for SAM with vit_b at a batch size of 32.


To summarize: the article describes the fastest implementation of Segment Anything on PyTorch so far. With the help of a series of officially released new features, it rewrites the original SAM in pure PyTorch without losing accuracy.

Interested readers can check the original blog for more information.


Statement: This article is reproduced from 51CTO.COM. In case of any infringement, please contact admin@php.cn to request removal.