GPU-Mode Lecture Notes 1

DDD | Original | 2024-11-17 19:21:02

Notes on GPU-Mode lecture 1

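Profiler

Computer performance is a trade-off between time and memory. Because compute devices are far more expensive, time is usually the first priority.

Why use a profiler?

CUDA kernels run asynchronously, so the Python time module on its own does not give reliable GPU timings
Profilers are far more powerful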

Tools

There are three profilers:

autograd profiler: numerical
PyTorch profiler: visual
NVIDIA Nsight Compute

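In these notes, the autograd profiler is paired with torch.cuda.Event() to measure performance.

The PyTorch profiler analyzes performance through profile(), a context manager in torch.profiler.
The result can be exported as a .json file and loaded into chrome://tracing/ for visualization.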
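
A minimal sketch of that export flow, assuming a square operation as the workload. The tensor size and the file name trace.json are illustrative choices, not taken from the lecture, and the code falls back to CPU-only profiling when no GPU is present.

```python
import os

import torch
from torch.profiler import profile, ProfilerActivity

# Always profile CPU activity; add CUDA activity only when a GPU is present.
activities = [ProfilerActivity.CPU]
device = "cpu"
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    device = "cuda"

# Illustrative input; the size is an assumption.
x = torch.randn(1024, 1024, device=device)

with profile(activities=activities) as prof:
    torch.square(x)

# Write a Chrome-trace JSON file that chrome://tracing/ can load.
trace_path = "trace.json"  # hypothetical file name
prof.export_chrome_trace(trace_path)
print(f"wrote {os.path.getsize(trace_path)} bytes to {trace_path}")
```

Opening chrome://tracing/ and loading the file shows each operator as a block on a timeline.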

Demo

The lecture provides a simple program that shows how to use the autograd profiler to analyze the performance of three ways of computing a square:

torch.square()
the ** operator
the * operator

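import torch

# The two operator-based variants (definitions inferred from the profiling labels below)
def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

# Input tensor on the GPU (the size is an assumption; the notes do not state it)
b = torch.randn(10000, 10000).cuda()

def time_pytorch_function(func, input):
    # CUDA is asynchronous, so the Python time module cannot time it directly
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warmup
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    # Wait for the GPU to finish before reading the elapsed time
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds

print(time_pytorch_function(torch.square, b))
print(time_pytorch_function(square_2, b))
print(time_pytorch_function(square_3, b))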

The results below were measured on an NVIDIA T4 GPU.

Profiling torch.square:
Self CPU time total: 10.577ms
Self CUDA time total: 3.266ms

Profiling a * a:
Self CPU time total: 5.417ms
Self CUDA time total: 3.276ms

Profiling a ** 2:
Self CPU time total: 6.183ms
Self CUDA time total: 3.274ms

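The findings:

The CUDA time is shorter than the CPU time.
The * operator dispatches to aten::multiply rather than aten::pow, and aten::multiply is faster, probably because multiplication is used far more often than pow and more developer effort has gone into optimizing it.
The difference in CUDA time is negligible. Taking CPU time into account, torch.square is the slowest of the three.
aten::square is a call to aten::pow.
All three methods launch a CUDA kernel whose name begins with native::vectorized_elementwise_kernel<4.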
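
The operator-level findings above can be checked by reading the operator names out of an autograd profiler report. This is a sketch under stated assumptions: the helper name profile_op and the tensor size are illustrative, and the profiler is run with default arguments, so it records only the CPU-side operator calls. The option that adds GPU timings is left out because its name differs between PyTorch versions.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Illustrative input; the size is an assumption.
a = torch.randn(1024, 1024, device=device)

def profile_op(fn):
    """Run fn(a) under the autograd profiler and return its summary table."""
    # Warm up so one-time setup cost is not attributed to the profiled call.
    for _ in range(3):
        fn(a)
    with torch.autograd.profiler.profile() as prof:
        fn(a)
    return prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)

square_table = profile_op(torch.square)
mul_table = profile_op(lambda t: t * t)
pow_table = profile_op(lambda t: t ** 2)

print("torch.square:\n", square_table)
print("a * a:\n", mul_table)
print("a ** 2:\n", pow_table)
```

The torch.square report lists both aten::square and aten::pow, while the a * a report lists a multiplication operator instead.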

Integrating CUDA kernels into PyTorch

There are several ways to do this:

Use load_inline from torch.utils.cpp_extension
Use Numba, a compiler that compiles decorated Python functions into machine code that runs on both CPU and GPU
Use Triton

load_inline from torch.utils.cpp_extension loads a CUDA kernel as a PyTorch extension via load_inline(name, cpp_sources, cuda_sources, functions, with_cuda, build_directory).
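
The load_inline call that follows expects two source strings. Here is a sketch of what they might contain for an element-wise square, assuming a contiguous 2-D float32 tensor on the GPU. Only the function name square_matrix comes from the notes; the kernel name square_matrix_kernel and the 16x16 block size are illustrative assumptions.

```python
# CUDA source: the kernel plus a host wrapper that launches it.
# Assumes a contiguous float32 tensor of shape (height, width) on the GPU.
cuda_source = r"""
__global__ void square_matrix_kernel(const float* matrix, float* result,
                                     int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard threads that fall outside the matrix.
    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const int height = matrix.size(0);
    const int width = matrix.size(1);
    auto result = torch::empty_like(matrix);

    // One thread per element, rounded up to whole 16x16 blocks.
    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);
    return result;
}
"""

# C++ source: only the declaration. load_inline generates the Python binding
# for every name listed in `functions`.
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"
```

Keeping the declaration in the C++ string and the definition in the CUDA string lets the generated binding code see the function without having to compile the kernel launch syntax itself.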

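import os

import torch
from torch.utils.cpp_extension import load_inline

# cpp_source and cuda_source are strings holding the C++ declaration and the
# CUDA kernel source; they must be defined before this call.

# Create the build directory up front so the build has somewhere to write.
os.makedirs('./load_inline_cuda', exist_ok=True)

square_matrix_extension = load_inline(
    name='square_matrix_extension',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square_matrix'],  # functions to expose to Python
    with_cuda=True,
    extra_cuda_cflags=["-O2"],
    build_directory='./load_inline_cuda',
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
print(square_matrix_extension.square_matrix(a))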

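Hands-on

Using the autograd profiler on the mean operation

When using the autograd profiler, keep the following in mind:

Warm up the GPU before recording so that it reaches a steady state
Average over multiple runs for more reliable results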
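
Both points can be sketched in one timing helper built on torch.cuda.Event. This is an illustrative version, not the lecture's code: the helper name time_mean_op, the run counts, and the tensor size are assumptions, and it falls back to time.perf_counter when no GPU is available.

```python
import time

import torch

def time_mean_op(x, n_warmup=5, n_runs=20):
    """Return the average time of torch.mean(x) in milliseconds."""
    # Warm up so the measurement reflects steady-state behaviour.
    for _ in range(n_warmup):
        torch.mean(x)

    if x.is_cuda:
        # CUDA is asynchronous, so time with events recorded on the stream.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_runs):
            torch.mean(x)
        end.record()
        # Wait for all queued work before reading the elapsed time.
        torch.cuda.synchronize()
        total_ms = start.elapsed_time(end)
    else:
        # CPU fallback so the sketch also runs without a GPU.
        t0 = time.perf_counter()
        for _ in range(n_runs):
            torch.mean(x)
        total_ms = (time.perf_counter() - t0) * 1000.0

    # Averaging over several runs smooths out run-to-run noise.
    return total_ms / n_runs

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)  # illustrative size
avg_ms = time_mean_op(x)
print(f"torch.mean on {device}: {avg_ms:.4f} ms per call")
```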

Using the PyTorch profiler on the mean operation
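
One possible way to approach this exercise, shown as a sketch rather than the lecture's solution: warm up outside the profiled region, profile several calls to torch.mean, and print the aggregated operator table. The tensor size and run counts are assumptions, and CUDA activity is only recorded when a GPU is present.

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(2048, 2048, device=device)  # illustrative size

# Warm up outside the profiled region.
for _ in range(5):
    torch.mean(x)

with profile(activities=activities) as prof:
    for _ in range(10):
        torch.mean(x)
    if device == "cuda":
        # Make sure the queued kernels finish inside the profiled region.
        torch.cuda.synchronize()

# Aggregate events by operator name and show the most expensive ones.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```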

Implementing Triton code for torch.mean()

References

GPU Mode lectures - GitHub
Event - PyTorch
PyTorch Profiler
NVIDIA Nsight Compute
torch.utils.cpp_extension.load_inline
Triton
