Notes on GPU-Mode lecture 1

DDD (original), 2024-11-17 19:21:02

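Profilers

Computer performance is a trade-off between time and memory. Because compute devices are expensive, time is usually the first thing to care about.

Why use a profiler?

CUDA kernel launches are asynchronous, so Python's time module alone cannot measure GPU execution time reliably
Profilers are far more powerful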
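
To make the first point concrete, here is a minimal sketch (not from the lecture) that times one call with the host clock, with and without torch.cuda.synchronize(). The helper name wall_clock_ms and the tensor size are assumptions made for illustration. On a GPU the unsynchronized number mostly reflects launch overhead; on a CPU-only machine both numbers measure the same thing.

```python
import time
import torch

def wall_clock_ms(fn, x, sync):
    """Time one call of fn(x) with the host clock, optionally synchronizing.

    Hypothetical helper for illustration; not part of the lecture code.
    """
    if sync and x.is_cuda:
        torch.cuda.synchronize()  # drain pending GPU work before starting the clock
    t0 = time.perf_counter()
    fn(x)
    if sync and x.is_cuda:
        torch.cuda.synchronize()  # wait until the kernel has actually finished
    return (time.perf_counter() - t0) * 1000.0

# Fall back to the CPU so the sketch still runs without a GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)  # assumed size
torch.square(x)  # warm-up call

launch_only = wall_clock_ms(torch.square, x, sync=False)
full = wall_clock_ms(torch.square, x, sync=True)
print(f"without sync: {launch_only:.3f} ms, with sync: {full:.3f} ms")
```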

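Tools

There are three profilers:

autograd profiler: numerical
PyTorch profiler: visual
NVIDIA Nsight Compute

The autograd profiler approach uses torch.cuda.Event() to measure performance.

The PyTorch profiler uses profile(), a context manager in torch.profiler, to analyze performance.
You can export the result as a .json file and load it into chrome://tracing/ to visualize it.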
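
As a rough sketch of that workflow, the snippet below profiles a single torch.square call and writes a trace file that chrome://tracing/ can open. The function name trace_square, the tensor size, and the output path are assumptions chosen for illustration; the CUDA activity is only enabled when a GPU is present.

```python
import os
import tempfile
import torch
from torch.profiler import profile, ProfilerActivity

def trace_square(x, trace_path):
    """Profile torch.square(x) and export a Chrome trace (hypothetical helper)."""
    activities = [ProfilerActivity.CPU]
    if x.is_cuda:
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities) as prof:
        torch.square(x)
        if x.is_cuda:
            torch.cuda.synchronize()  # make sure the kernel finishes inside the profiled region
    prof.export_chrome_trace(trace_path)
    return prof

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)  # assumed size
trace_path = os.path.join(tempfile.gettempdir(), "square_trace.json")  # assumed path
trace_square(x, trace_path)
print("trace written to", trace_path)
```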

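Demo

The lecture provides a simple program that shows how to use the autograd profiler to analyze the performance of three ways of squaring a tensor:

With torch.square()
With the ** operator
With the * operator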

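import torch

# Assumed definitions, inferred from the "a * a" and "a ** 2" labels in the results
def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

def time_pytorch_function(func, input):
    # CUDA is asynchronous, so Python's time module can't be used directly
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warm up so one-time startup costs are not measured
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    torch.cuda.synchronize()  # wait for the GPU before reading the timer
    return start.elapsed_time(end)  # elapsed time in milliseconds

# Assumed input: the notes do not give the shape of b
b = torch.randn(10000, 10000, device="cuda")

time_pytorch_function(torch.square, b)
time_pytorch_function(square_2, b)
time_pytorch_function(square_3, b)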

The results below were obtained on an NVIDIA T4 GPU.

Profiling torch.square:
Self CPU time total: 10.577ms
Self CUDA time total: 3.266ms

Profiling a * a:
Self CPU time total: 5.417ms
Self CUDA time total: 3.276ms

Profiling a ** 2:
Self CPU time total: 6.183ms
Self CUDA time total: 3.274ms

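It turns out that:

The total CUDA time is lower than the total CPU time for all three methods.
The * operator dispatches to aten::mul (multiplication) rather than aten::pow, and the former is faster. This is probably because multiplication is used far more often than pow, so many developers have spent time optimizing it.
The performance difference on CUDA is minimal. In terms of CPU time, torch.square is the slowest of the three.
aten::square is implemented as a call to aten::pow.
All three methods launch a CUDA kernel named native::vectorized_elementwise_kernel.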

Integrating CUDA kernels in PyTorch

There are several ways to do this:

Use load_inline from torch.utils.cpp_extension
Use Numba, a compiler that turns decorated Python functions into machine code that runs on CPUs and GPUs
Use Triton

With load_inline from torch.utils.cpp_extension, we can load a CUDA kernel as a PyTorch extension by calling load_inline(name, cpp_sources, cuda_sources, functions, with_cuda, build_directory).

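import torch
from torch.utils.cpp_extension import load_inline

# cpp_source and cuda_source are strings holding the C++ declaration and the
# CUDA implementation of square_matrix; they are not shown in these notes.
# Create ./load_inline_cuda beforehand if it does not exist.
square_matrix_extension = load_inline(
    name='square_matrix_extension',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square_matrix'],  # functions to expose to Python
    with_cuda=True,
    extra_cuda_cflags=["-O2"],
    build_directory='./load_inline_cuda',
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
print(square_matrix_extension.square_matrix(a))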
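
The snippet above refers to cpp_source and cuda_source without defining them. The following is one possible sketch of what they could contain, assuming a float32, contiguous 2-D input and an element-wise square; the kernel name square_matrix_kernel, the 16x16 block size, and the helper build_extension are illustrative assumptions, not necessarily what the lecture used. Building requires a CUDA toolchain, so the build step is only attempted when a GPU is visible.

```python
import os
import torch
from torch.utils.cpp_extension import load_inline

# Assumed CUDA source: one thread per matrix element, float32 only
cuda_source = r"""
__global__ void square_matrix_kernel(const float* matrix, float* result,
                                     int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const int height = matrix.size(0);
    const int width = matrix.size(1);
    auto result = torch::empty_like(matrix);

    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);
    return result;
}
"""

# The C++ side only needs the declaration; load_inline generates the bindings
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"

def build_extension(build_dir="./load_inline_cuda"):
    """Compile and load the extension (hypothetical helper)."""
    os.makedirs(build_dir, exist_ok=True)  # make sure the build directory exists
    return load_inline(
        name="square_matrix_extension",
        cpp_sources=cpp_source,
        cuda_sources=cuda_source,
        functions=["square_matrix"],
        with_cuda=True,
        extra_cuda_cflags=["-O2"],
        build_directory=build_dir,
    )

if torch.cuda.is_available():
    ext = build_extension()
    a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device="cuda")
    print(ext.square_matrix(a))
```

The bounds check in the kernel matters because the grid is rounded up to whole 16x16 blocks, so some threads fall outside the matrix.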

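Hands-on

Using the autograd profiler on the mean operation

When using the autograd profiler, remember to:

Warm up the GPU before recording so that it reaches a steady state
Average over multiple runs to get more reliable results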
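
A minimal sketch that applies both tips to torch.mean is shown here. The function name time_mean and the default warmup and runs values are assumptions for illustration. CUDA events are used when the input lives on the GPU; otherwise the sketch falls back to the host clock so it still runs on a CPU-only machine.

```python
import time
import torch

def time_mean(x, warmup=5, runs=20):
    """Average time in milliseconds of torch.mean(x) over `runs` calls.

    Hypothetical helper; warmup and runs are assumed defaults.
    """
    # Warm up so one-time startup costs are not measured
    for _ in range(warmup):
        torch.mean(x)

    if x.is_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(runs):
            torch.mean(x)
        end.record()
        torch.cuda.synchronize()  # wait for the GPU before reading the timer
        return start.elapsed_time(end) / runs

    # CPU fallback: the host clock is fine because CPU ops run synchronously
    t0 = time.perf_counter()
    for _ in range(runs):
        torch.mean(x)
    return (time.perf_counter() - t0) * 1000.0 / runs

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)  # assumed size
print(f"torch.mean: {time_mean(x):.4f} ms per call")
```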

Using the PyTorch profiler on the mean operation
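
One way this could look, as a sketch rather than the lecture's own code: wrap torch.mean in a labelled region and print the per-operator summary table. The helper name profile_mean, the label "mean_region", and the tensor size are assumptions for illustration.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def profile_mean(x):
    """Profile torch.mean(x) and return the profiler object (hypothetical helper)."""
    activities = [ProfilerActivity.CPU]
    if x.is_cuda:
        activities.append(ProfilerActivity.CUDA)

    torch.mean(x)  # warm-up call outside the profiled region

    with profile(activities=activities) as prof:
        with record_function("mean_region"):  # assumed label shown in the table
            torch.mean(x)
            if x.is_cuda:
                torch.cuda.synchronize()
    return prof

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)  # assumed size
prof = profile_mean(x)

# Per-operator summary, sorted by time spent in the operator itself
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```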

Implementing Triton code for torch.mean()
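
The sketch below shows one simple way to compute a mean over all elements with Triton, assuming a CUDA GPU and an installed triton package. Each program sums one block of the flattened input and adds its partial sum to a single accumulator with an atomic add; the host then divides by the element count. The names partial_sum_kernel and triton_mean and the block size of 1024 are assumptions for illustration, and this is not tuned for performance.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def partial_sum_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program handles one contiguous block of the flattened input
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the last, possibly partial, block
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    block_sum = tl.sum(x, axis=0)
    # Accumulate the block's partial sum into a single float32 value
    tl.atomic_add(out_ptr, block_sum)

def triton_mean(x):
    """Mean over all elements of a CUDA tensor (hypothetical helper)."""
    assert x.is_cuda, "this sketch only supports CUDA tensors"
    flat = x.contiguous().view(-1).to(torch.float32)
    n = flat.numel()
    out = torch.zeros(1, device=x.device, dtype=torch.float32)
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    partial_sum_kernel[grid](flat, out, n, BLOCK_SIZE=1024)
    return out[0] / n

if torch.cuda.is_available():
    x = torch.randn(1000, 1000, device="cuda")
    print(triton_mean(x).item(), torch.mean(x).item())
```

Because floating-point addition order differs from torch.mean, the two printed values should agree only up to rounding error.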

References

GPU Mode lectures - GitHub
Event - PyTorch
PyTorch Profiler
NVIDIA Nsight Compute
torch.utils.cpp_extension.load_inline
Triton
