Notes on GPU-Mode lecture 1

DDD (original), 2024-11-17 19:21:02, 1096 views

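Profiler

Computer performance is a trade-off between time and memory. Because compute devices are expensive, time is usually the first thing to optimize.

Why use a profiler?

CUDA is asynchronous, so Python's time module alone cannot time GPU work reliably
Profilers are far more powerful than manual timing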
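To illustrate the first point: a CUDA kernel launch returns to Python before the GPU has finished, so a host-clock reading only becomes meaningful after torch.cuda.synchronize(). The following is a minimal sketch, not code from the lecture; wall_time_ms is a hypothetical helper, and the script falls back to the CPU when no GPU is present (in which case both readings measure the same thing).

```python
import time

import torch


def wall_time_ms(func, x, sync):
    """Time one call of func(x) with the host clock, in milliseconds.

    `sync` controls whether we wait for queued GPU work to finish;
    it has no effect when CUDA is unavailable.
    """
    use_cuda = sync and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()  # drain earlier work before starting the clock
    start = time.perf_counter()
    func(x)
    if use_cuda:
        torch.cuda.synchronize()  # wait for the kernel launched by func(x)
    return (time.perf_counter() - start) * 1000.0


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

# On a GPU the unsynchronized reading covers little more than the launch.
launch_only = wall_time_ms(torch.square, x, sync=False)
full = wall_time_ms(torch.square, x, sync=True)
print(f"no sync: {launch_only:.3f} ms, with sync: {full:.3f} ms")
```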

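Tools

There are three profilers:

autograd profiler: numerical
PyTorch profiler: visual
NVIDIA Nsight Compute

In the autograd profiler demo, torch.cuda.Event() records timestamps on the GPU stream, so elapsed time is measured on the device rather than by the host clock.

The PyTorch profiler analyzes performance through profile(), a context manager in torch.profiler.
You can export the result as a .json file and load it at chrome://tracing/ to visualize it.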
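The export step described above can be sketched as follows. This is an illustrative example rather than the lecture's code: the tensor size, the loop count, and the trace file name square_trace.json are assumptions, and CUDA activity is only recorded when a GPU is available.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)  # also record GPU kernels

x = torch.randn(1024, 1024, device=device)

# Everything executed inside the context manager is recorded.
with profile(activities=activities) as prof:
    for _ in range(3):
        torch.square(x)

# Text summary of the recorded operators.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))

# Write a trace file that chrome://tracing/ can load.
# The location is a placeholder chosen for this sketch.
trace_path = os.path.join(tempfile.gettempdir(), "square_trace.json")
prof.export_chrome_trace(trace_path)
```

Opening chrome://tracing/ and loading the written file shows the recorded operators on a timeline.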

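Demo

The lecture provides a simple program that shows how to use the autograd profiler to compare three ways of squaring a tensor:

with torch.square()
with the ** operator
with the * operator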

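import torch

def time_pytorch_function(func, input):
    # CUDA is asynchronous, so Python's time module cannot be used directly
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warm up so one-time initialization costs are not measured
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    torch.cuda.synchronize()  # wait until `end` has been recorded
    return start.elapsed_time(end)  # elapsed time in milliseconds

# b, square_2 and square_3 are used below but were not defined in these notes;
# the definitions here are assumed (input size chosen for illustration).
b = torch.randn(10000, 10000).cuda()

def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

time_pytorch_function(torch.square, b)
time_pytorch_function(square_2, b)
time_pytorch_function(square_3, b)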

The results below were obtained on an NVIDIA T4 GPU.

Profiling torch.square:
Self CPU time total: 10.577ms
Self CUDA time total: 3.266ms

Profiling a * a:
Self CPU time total: 5.417ms
Self CUDA time total: 3.276ms

Profiling a ** 2:
Self CPU time total: 6.183ms
Self CUDA time total: 3.274ms

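It turns out that:

In every case the CUDA time (about 3.27 ms) is lower than the CPU time.
The * operator performs an aten::multiply operation rather than aten::pow, and the former is faster. This is probably because multiplication is used far more often than pow, so more developer time has gone into optimizing it.
The performance difference on CUDA is minimal. In terms of CPU time, torch.square is the slowest of the three.
aten::square is a call to aten::pow.
All three methods launch a CUDA kernel named native::vectorized_elementwise_kernel.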
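The operator names in the findings above come from the autograd profiler's summary table. A minimal sketch of how such a table could be produced follows; profiled_ops is a hypothetical helper, the tensor size is an assumption, and use_cuda is only passed when a GPU is present (newer PyTorch releases may warn that this flag is deprecated).

```python
import torch

# Enable CUDA timing only when a GPU is present.
profiler_kwargs = {"use_cuda": True} if torch.cuda.is_available() else {}
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)


def profiled_ops(func, x):
    """Run func(x) under the autograd profiler and return the recorded op names."""
    with torch.autograd.profiler.profile(**profiler_kwargs) as prof:
        func(x)
    # The table lists each operator with its self CPU (and CUDA) time.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
    return {event.key for event in prof.key_averages()}


square_ops = profiled_ops(torch.square, x)
pow_ops = profiled_ops(lambda t: t ** 2, x)
mul_ops = profiled_ops(lambda t: t * t, x)

# torch.square should show both aten::square and the aten::pow it calls.
print(sorted(square_ops))
```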

Integrating CUDA kernels in PyTorch

There are several ways to do this:

Use load_inline from torch.utils.cpp_extension
Use Numba, a compiler that turns decorated Python functions into machine code that runs on the CPU and GPU
Use Triton

With load_inline from torch.utils.cpp_extension, a CUDA kernel can be loaded as a PyTorch extension by calling load_inline(name, cpp_sources, cuda_sources, functions, with_cuda, build_directory).

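import os

import torch
from torch.utils.cpp_extension import load_inline

# cpp_source (C++ declaration) and cuda_source (kernel and wrapper) are strings
# that must be defined before this call.
build_dir = './load_inline_cuda'
os.makedirs(build_dir, exist_ok=True)  # make sure the build directory exists

square_matrix_extension = load_inline(
    name='square_matrix_extension',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square_matrix'],
    with_cuda=True,
    extra_cuda_cflags=["-O2"],
    build_directory=build_dir,
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
print(square_matrix_extension.square_matrix(a))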
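The load_inline call above relies on two strings, cpp_source and cuda_source, that these notes do not show. One plausible shape for them is sketched here; the kernel name square_matrix_kernel, the 16x16 block size, and the assumption of a contiguous float32 2-D CUDA tensor are illustrative choices, not taken from the notes. The strings would need to be defined before load_inline is called.

```python
# C++ declaration of the function that load_inline exposes to Python.
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"

# CUDA kernel plus a host-side wrapper.
# Assumes a contiguous float32 2-D tensor that already lives on the GPU.
cuda_source = r"""
__global__ void square_matrix_kernel(const float* matrix, float* result,
                                     int width, int height) {
    // One thread per matrix element.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const int height = static_cast<int>(matrix.size(0));
    const int width = static_cast<int>(matrix.size(1));
    auto result = torch::empty_like(matrix);

    // Round the grid up so every element is covered.
    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);
    return result;
}
"""
```

The bounds check inside the kernel matters because the grid is rounded up, so some threads fall outside the matrix.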

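Hands-on

Use the autograd profiler on the mean operation

When using the autograd profiler, remember to:

Warm up the GPU before recording so that it reaches a steady state
Average over multiple runs for more reliable results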
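Both pieces of advice can be combined in one timing helper. The sketch below is an assumption about how that might look, not the lecture's solution: average_time_ms is a hypothetical name, the warm-up and run counts are arbitrary, and a host-clock fallback is included so the code also runs without a GPU.

```python
import time

import torch


def average_time_ms(func, x, warmup=5, runs=20):
    """Average time of func(x) in milliseconds over `runs` calls after warm-up."""
    for _ in range(warmup):
        func(x)

    if x.is_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(runs):
            func(x)
        end.record()
        torch.cuda.synchronize()  # make sure `end` has been reached
        return start.elapsed_time(end) / runs

    # CPU fallback so the sketch also runs without a GPU.
    begin = time.perf_counter()
    for _ in range(runs):
        func(x)
    return (time.perf_counter() - begin) * 1000.0 / runs


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)

# Compare the built-in mean with a hand-written sum / count.
mean_builtin_ms = average_time_ms(torch.mean, x)
mean_manual_ms = average_time_ms(lambda t: t.sum() / t.numel(), x)
print(f"torch.mean: {mean_builtin_ms:.4f} ms, sum/numel: {mean_manual_ms:.4f} ms")
```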

Use the PyTorch profiler on the mean operation

Implement Triton code for torch.mean()
