
PyTorch 2.0 official version released! One line of code speeds up 2 times, 100% backwards compatible


The official version of PyTorch 2.0 is finally here!


Last December, the PyTorch Foundation released the first preview version of PyTorch 2.0 at PyTorch Conference 2022.

Compared with the 1.x series, 2.0 introduces sweeping changes. The biggest improvement in PyTorch 2.0 is torch.compile.

The new compiler generates code on the fly much faster than the default "eager mode" of PyTorch 1.x, further improving PyTorch's performance.


Alongside 2.0, a series of beta updates to the PyTorch domain libraries has been released, covering the in-tree libraries as well as standalone libraries such as TorchAudio, TorchVision, and TorchText. An update to TorchX was released at the same time as it moves to community-supported mode.


Highlights

- torch.compile is the main API of PyTorch 2.0; it wraps your model and returns a compiled model. torch.compile is a fully additive (and optional) feature, so version 2.0 is 100% backwards compatible.

- TorchInductor, the underlying technology of torch.compile for NVIDIA and AMD GPUs, relies on the OpenAI Triton deep learning compiler to generate high-performance code and hide low-level hardware details. Kernels generated by OpenAI Triton achieve performance comparable to handwritten kernels and specialized CUDA libraries such as cuBLAS.

- Accelerated Transformers introduces high-performance support for training and inference, using a custom kernel architecture for scaled dot product attention (SDPA). The API is integrated with torch.compile(), and model developers can also use the scaled dot product attention kernels directly by calling the new scaled_dot_product_attention() operator.

- The Metal Performance Shaders (MPS) backend provides GPU-accelerated PyTorch training on the Mac platform and adds support for the 60 most commonly used operations, bringing coverage to more than 300 operators.

- Amazon AWS optimized PyTorch CPU inference on C7g instances based on AWS Graviton3. PyTorch 2.0 improves inference performance on Graviton compared to previous versions, including improvements for ResNet-50 and BERT.

- New prototype features and technologies across TensorParallel, DTensor, 2D parallel, TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor.


Compile, and compile again!

The new compiler technologies in PyTorch 2.0 are TorchDynamo, AOTAutograd, PrimTorch, and TorchInductor. All four are written in Python rather than C++.

They also support dynamic shapes, meaning tensors of different sizes can be passed in without triggering recompilation, keeping the stack flexible and easy to learn.
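As an illustration, here is a minimal sketch of what that looks like in practice. The function and input sizes are made up for the example; the dynamic=True flag (part of the torch.compile API) asks the compiler to generalize over input sizes rather than specializing on each one:

```python
import torch

# Hypothetical toy function, just to exercise the compiler.
@torch.compile(dynamic=True)  # request size-generic code instead of per-shape specialization
def scaled_sum(x):
    return (x * 2.0).sum()

# Inputs of different lengths can reuse the compiled code
# instead of triggering a recompilation for every new shape.
print(scaled_sum(torch.randn(8)))
print(scaled_sum(torch.randn(1024)))
```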

• TorchDynamo

TorchDynamo captures PyTorch programs safely using Python Frame Evaluation Hooks, a major innovation that sums up PyTorch's research and development on safe graph capture.

• AOTAutograd

AOTAutograd overloads PyTorch's autograd engine as a tracing autodiff for generating ahead-of-time backward traces.

• PrimTorch

PrimTorch canonicalizes the roughly 2,000 PyTorch operators down to a closed set of about 250 primitive operators, which developers can target to build a complete PyTorch backend. It greatly simplifies the process of writing PyTorch features or backends.

• TorchInductor

TorchInductor is a deep learning compiler that can generate fast code for multiple accelerators and backends. For NVIDIA GPUs, it uses OpenAI Triton as a key building block.
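For context, torch.compile exposes this choice through its backend argument, with TorchInductor ("inductor") as the default. A small sketch; note that torch._dynamo.list_backends() is a semi-internal API and its output varies by installation:

```python
import torch
import torch._dynamo as dynamo

# Inspect which compiler backends this install of TorchDynamo knows about;
# "inductor" is the default.
print(dynamo.list_backends())

# Selecting TorchInductor explicitly is equivalent to the default behavior.
model = torch.nn.Linear(16, 16)
compiled = torch.compile(model, backend="inductor")
print(compiled(torch.randn(4, 16)).shape)
```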

The PyTorch Foundation said that the launch of 2.0 marks a "return from C++ to Python," adding that this is a substantial new direction for PyTorch.

# "We knew the performance limits of "eager execution" from day one. In July 2017, we started our first research project, developing a compiler for PyTorch. The compiler needs to make PyTorch programs run quickly, but not at the expense of the PyTorch experience, while retaining flexibility and ease of use, so that researchers can use dynamic models and programs at different stages of exploration. "

Of course, the non-compiled "eager mode," with its dynamic on-the-fly code execution, is still available in 2.0. Developers can upgrade to compiled mode by adding just one line of code: torch.compile.
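A minimal sketch of that one-line upgrade, with a toy model and random data as placeholders; everything except the torch.compile() call is ordinary PyTorch 1.x-style code:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# The single added line: wrap the model and get back a compiled version.
compiled_model = torch.compile(model)

# Everything downstream stays the same -- optimizer, loss, training loop.
optimizer = torch.optim.SGD(compiled_model.parameters(), lr=0.01)
x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))

loss = nn.functional.cross_entropy(compiled_model(x), y)
loss.backward()
optimizer.step()
```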

Users can expect training with 2.0 to run 43% faster than with 1.0.

This figure comes from the PyTorch Foundation's benchmarks of 163 open-source models run with PyTorch 2.0 on an NVIDIA A100 GPU, covering image classification, object detection, image generation, and a variety of NLP tasks.

These benchmarks fall into three categories: HuggingFace Transformers, TIMM, and TorchBench.


Figure: torch.compile speedup over eager mode on an NVIDIA A100 GPU for different models

According to the PyTorch Foundation, the new compiler runs 21% faster when using Float32 precision mode and 51% faster when using automatic mixed precision (AMP) mode.

Among these 163 models, torch.compile can run normally on 93% of the models.

"In the PyTorch 2.x roadmap, we hope to take the compilation model further and further in terms of performance and scalability. There is still some work It didn't start. Some work couldn't be completed due to insufficient bandwidth."


Training LLMs 2x faster

Beyond that, performance is another major focus of PyTorch 2.0, and one its developers have been keen to promote.

In fact, one of the highlights of the new features is Accelerated Transformers, previously known as Better Transformer.

In addition, the official version of PyTorch 2.0 includes a new high-performance implementation of the PyTorch Transformer API.

One of the goals of the PyTorch project is to make training and deployment of state-of-the-art transformer models easier and faster.

Transformers are the foundational technology powering the modern era of generative artificial intelligence, including OpenAI models such as GPT-3 and GPT-4.


Accelerated Transformers in PyTorch 2.0 uses a custom kernel architecture approach, known as scaled dot product attention (SDPA), to provide high-performance support for training and inference.

Since many kinds of hardware can support Transformers, PyTorch 2.0 supports multiple SDPA custom kernels. Going a step further, PyTorch integrates custom kernel selection logic that picks the highest-performance kernel for a given model and hardware type.

The impact of the acceleration is significant: it helps developers train models faster than in previous iterations of PyTorch.

The new version enables high-performance support for training and inference, using a customized kernel architecture for scaled dot product attention (SDPA) that extends the inference fastpath architecture.

As with the fastpath architecture, the custom kernels are fully integrated into the PyTorch Transformer API, so using the native Transformer and MultiheadAttention APIs enables users to:

- See significant speed improvements;

- Support more use cases, including models using cross-attention, Transformer decoders, and training models;

- Continue to use fastpath inference for fixed- and variable-sequence-length Transformer encoder and self-attention use cases.

To take full advantage of different hardware models and Transformer use cases, multiple SDPA custom kernels are supported, and kernel selection logic picks the highest-performance kernel for a specific model and hardware type.
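The selection normally happens automatically, but PyTorch 2.0 also ships a context manager, torch.backends.cuda.sdp_kernel, for restricting which kernels are eligible, for example when benchmarking. A sketch, assuming a CUDA device with FlashAttention support; the shapes are arbitrary:

```python
import torch
import torch.nn.functional as F

# Arbitrary example shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(4, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(4, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(4, 8, 128, 64, device="cuda", dtype=torch.float16)

# By default PyTorch picks the fastest eligible kernel itself; this context
# manager narrows the choice down to the FlashAttention kernel only.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([4, 8, 128, 64])
```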

In addition to the existing Transformer API, developers can also use the scaled dot product attention kernels directly by calling the new scaled_dot_product_attention() operator. Accelerated PyTorch 2 Transformers are integrated with torch.compile().
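A minimal sketch of the direct call; the (batch, heads, seq_len, head_dim) shapes are arbitrary, and is_causal=True applies the causal mask used by decoder-style models:

```python
import torch
import torch.nn.functional as F

# Arbitrary shapes: batch 2, 4 heads, sequence length 16, head dim 32.
query = torch.randn(2, 4, 16, 32)
key = torch.randn(2, 4, 16, 32)
value = torch.randn(2, 4, 16, 32)

# One fused operator replaces the manual softmax(Q K^T / sqrt(d)) V sequence,
# dispatching to the fastest available SDPA kernel for this device.
out = F.scaled_dot_product_attention(query, key, value, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```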

To get the additional speedup of PT2 compilation (for inference or training) while using the model, preprocess it with model = torch.compile(model).

Using the combination of custom kernels and torch.compile(), substantial speedups have already been achieved in training Transformer models, especially large language models, with Accelerated PyTorch 2 Transformers.


Figure: Using custom kernels together with torch.compile delivers significant speedups for large language model training
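A sketch of that combination at toy scale: the built-in nn.TransformerEncoder, whose attention layers can dispatch to the SDPA kernels internally, wrapped in torch.compile and driven through one training step with made-up data and a stand-in loss:

```python
import torch
import torch.nn as nn

# Toy encoder; its attention can dispatch to the SDPA kernels internally.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = torch.compile(nn.TransformerEncoder(layer, num_layers=4))

optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

x = torch.randn(8, 64, 256)   # (batch, seq_len, d_model), random placeholder data
out = encoder(x)
loss = out.pow(2).mean()      # stand-in loss just to exercise the backward pass
loss.backward()
optimizer.step()
```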

Sylvain Gugger, the primary maintainer of HuggingFace Transformers, wrote in a statement released by the PyTorch project: "With just one line of code to add, PyTorch 2.0 gives a 1.5x to 2.0x speedup when training Transformer models. This is the most exciting thing since the introduction of mixed precision training!"

PyTorch and Google's TensorFlow are the two most popular deep learning frameworks. Thousands of institutions around the world use PyTorch to develop deep learning applications, and its usage is growing.

Luca Antiga, CTO of Lightning AI and one of the primary maintainers of PyTorch Lightning, said the launch of PyTorch 2.0 will help accelerate the development of deep learning and artificial intelligence applications:

"PyTorch 2.0 embodies the future of deep learning frameworks. The ability to capture PyTorch programs out of the box, with effectively no user intervention, together with huge device speedups, opens up a whole new dimension for AI developers."


