A Comprehensive Guide to Building a Transformer Model with PyTorch


William Shakespeare

Mar 10, 2025 am 09:30 AM

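This tutorial aims to give a thorough understanding of how to build a Transformer model with PyTorch. Transformers are among the most powerful models in modern machine learning. They have reshaped the field, particularly for natural language processing (NLP) tasks such as machine translation and text summarization. Because they handle long-range dependencies well and can be computed in parallel, Transformers have largely replaced long short-term memory (LSTM) networks for these tasks.

The tool used in this guide is PyTorch, a popular open-source machine learning library known for its simplicity, versatility, and efficiency. With its dynamic computation graph and extensive libraries, PyTorch has become a go-to choice for researchers and developers in machine learning and artificial intelligence. For readers who are new to PyTorch, DataCamp's course Introduction to Deep Learning with PyTorch is a recommended starting point.

First introduced in the paper "Attention Is All You Need" by Vaswani et al., Transformers have since become a cornerstone of many NLP tasks thanks to their distinctive design and effectiveness.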

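At the heart of the Transformer is the attention mechanism, specifically self-attention, which lets the model weigh and prioritize different parts of the input. This mechanism is what enables Transformers to manage long-range dependencies in data. Fundamentally, it is a weighting scheme that lets the model focus on different parts of the input when producing an output.

The mechanism lets the model consider the different words or features in the input sequence and assign each one a weight that expresses its importance for producing a given output. For example, in a sentence translation task, while translating a particular word, the model may assign higher attention weights to words that are grammatically or semantically related to the target word. This allows the Transformer to capture dependencies between words or features regardless of how far apart they are in the sequence.

The impact of Transformers on the field of NLP can hardly be overstated. They have outperformed traditional models on many tasks, demonstrating a superior capacity to understand and generate human language in a more nuanced way.

For a deeper understanding of NLP, DataCamp's course Introduction to Natural Language Processing in Python is a recommended resource.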

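Setting up PyTorch

Before building a Transformer, the working environment must be set up correctly. First, PyTorch needs to be installed. PyTorch (stable version 2.0.1 at the time of writing) can be installed easily with the pip or conda package managers.

For pip, use the command:

pip3 install torch torchvision torchaudio

For conda, use the command:

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

For using PyTorch on a CPU only, refer to the PyTorch documentation.

A basic understanding of deep learning concepts is also helpful, since these concepts underpin how Transformers operate. For those who need a refresher, DataCamp's course Deep Learning in Python is a valuable resource that covers the key concepts.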
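Once the installation has finished, a quick check along the following lines can confirm that PyTorch imports and can run a tensor operation. This is only a minimal sketch: the variable names are illustrative, and the device it reports depends entirely on the local machine.

```python
import torch

# Report the installed PyTorch version string.
print("PyTorch version:", torch.__version__)

# True only when a CUDA build of PyTorch and a usable GPU are both present.
cuda_ok = torch.cuda.is_available()
device = torch.device("cuda" if cuda_ok else "cpu")
print("Using device:", device)

# A tiny tensor operation confirms that the chosen device actually works.
x = torch.ones(2, 3, device=device)
total = x.sum().item()
print("Sum of a 2x3 tensor of ones:", total)
```

If the import fails or the sum is not 6.0, the installation step above is the first thing to revisit.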

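Building the Transformer model with PyTorch

To build the Transformer model, the following steps are necessary:

Importing the libraries and modules
Defining the basic building blocks: multi-head attention, position-wise feed-forward networks, positional encoding
Building the encoder block
Building the decoder block
Combining the encoder and decoder layers to create the complete Transformer network

1. Importing the necessary libraries and modules

We start by importing the PyTorch library for core functionality, the neural network module for creating neural networks, the optimization module for training networks, and the data utility functions for handling data. We also import Python's standard math module for mathematical operations and the copy module for creating copies of complex objects.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math
import copy

These tools lay the foundation for defining the model's architecture, managing data, and setting up the training process.

2. Defining the basic building blocks: multi-head attention, position-wise feed-forward networks, positional encoding

Multi-head attention

The multi-head attention mechanism computes the attention between each pair of positions in a sequence. It consists of multiple "attention heads" that capture different aspects of the input sequence.

To learn more about multi-head attention, see the attention mechanisms section of the Large Language Models (LLMs) Concepts course.

Class definition and initialization:

The class is defined as a subclass of PyTorch's nn.Module. Its constructor takes two parameters:

d_model: the dimensionality of the input.

num_heads: the number of attention heads to split the input into.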

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        # Ensure that the model dimension (d_model) is divisible by the number of heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        # Initialize dimensions
        self.d_model = d_model # Model's dimension
        self.num_heads = num_heads # Number of attention heads
        self.d_k = d_model // num_heads # Dimension of each head's key, query, and value
        
        # Linear layers for transforming inputs
        self.W_q = nn.Linear(d_model, d_model) # Query transformation
        self.W_k = nn.Linear(d_model, d_model) # Key transformation
        self.W_v = nn.Linear(d_model, d_model) # Value transformation
        self.W_o = nn.Linear(d_model, d_model) # Output transformation
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Calculate attention scores
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Apply mask if provided (useful for preventing attention to certain parts like padding)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        
        # Softmax is applied to obtain attention probabilities
        attn_probs = torch.softmax(attn_scores, dim=-1)
        
        # Multiply by values to obtain the final output
        output = torch.matmul(attn_probs, V)
        return output
        
    def split_heads(self, x):
        # Reshape the input to have num_heads for multi-head attention
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    def combine_heads(self, x):
        # Combine the multiple heads back to original shape
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        # Apply linear transformations and split heads
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        
        # Perform scaled dot-product attention
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Combine heads and apply output transformation
        output = self.W_o(self.combine_heads(attn_output))
        return output

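The initialization checks that d_model is divisible by num_heads, and then defines the transformation weights for the query, key, value, and output.

Scaled dot-product attention:

Calculating attention scores: attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k). Here, the attention scores are computed by taking the dot product of the queries (Q) and keys (K), and then scaling by the square root of the key dimension (d_k).
Applying the mask: if a mask is provided, positions where the mask equals 0 are filled with -1e9, so that they receive a weight of practically zero after the softmax.
Calculating attention weights: the attention scores are passed through a softmax function to convert them into probabilities that sum to 1.
Calculating the output: the attention weights are multiplied by the values (V) to give the final output.

Splitting heads:

This method reshapes the input x into the shape (batch_size, num_heads, seq_length, d_k). It lets the model process multiple attention heads concurrently, allowing for parallel computation.

Combining heads:

After attention has been applied to each head separately, this method combines the results back into a single tensor of shape (batch_size, seq_length, d_model). This prepares the result for further processing.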
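To make the shapes concrete, the following sketch traces a random tensor through the split, attend, and combine steps outside of any class. It is an illustration under assumptions rather than the tutorial's own code: the standalone function returns the attention weights as well as the output so they can be inspected, and the sizes and the mask are arbitrary choices.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Standalone variant that also returns the attention weights for inspection."""
    d_k = Q.size(-1)
    # Dot product of queries and keys, scaled by sqrt(d_k).
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a very negative score, hence ~0 weight.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

torch.manual_seed(0)
batch_size, seq_length, num_heads, d_k = 2, 5, 4, 8
d_model = num_heads * d_k

x = torch.randn(batch_size, seq_length, d_model)

# Split heads: (batch, seq, d_model) -> (batch, heads, seq, d_k).
heads = x.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)

# Illustrative mask that hides the last position from every query.
mask = torch.ones(1, 1, seq_length, seq_length)
mask[..., -1] = 0

out, weights = scaled_dot_product_attention(heads, heads, heads, mask)

# Combine heads: (batch, heads, seq, d_k) -> (batch, seq, d_model).
combined = out.transpose(1, 2).contiguous().view(batch_size, seq_length, d_model)

print(heads.shape, weights.shape, combined.shape)
```

Each row of the weights sums to 1, and the column belonging to the masked position carries no weight, which is the behaviour the mask step is meant to produce.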

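Forward method:

The forward method is where the actual computation happens:

Apply linear transformations: the queries (Q), keys (K), and values (V) are first passed through linear transformations using the weights defined in the initialization.
Split heads: the transformed Q, K, and V are split into multiple heads using the split_heads method.
Apply scaled dot-product attention: the scaled_dot_product_attention method is called on the split heads.
Combine heads: the results from each head are combined back into a single tensor using the combine_heads method.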


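Apply output transformation: finally, the combined tensor is passed through the output linear transformation.

In summary, the MultiHeadAttention class encapsulates the multi-head attention mechanism commonly used in Transformer models. It splits the input into multiple attention heads, applies attention to each head, and then combines the results. This lets the model capture a variety of relationships in the input data at different scales, which improves its expressive power.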

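Position-wise feed-forward networks

Class definition:

The class is a subclass of PyTorch's nn.Module, which means it inherits all the functionality needed to work with neural network layers.

Initialization:

d_model: the dimensionality of the model's input and output.
d_ff: the dimensionality of the inner layer of the feed-forward network.
self.fc1 and self.fc2: the two linear layers of the network.
self.relu: the ReLU (rectified linear unit) activation function, which introduces non-linearity between the two linear layers.

Forward method:

self.fc1(x): the input is first passed through the first linear layer (fc1).
self.relu(...): the output of fc1 is then passed through the ReLU activation function.
self.fc2(...): the activated output is then passed through the second linear layer (fc2), producing the final output.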
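The description above can be sketched as a small module. The class name PositionWiseFeedForward and the sizes used in the usage lines are assumptions made for illustration; only fc1, fc2, relu, d_model, and d_ff come from the text.

```python
import torch
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    """Sketch of a position-wise feed-forward network (class name is assumed)."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # expand to the inner dimension
        self.fc2 = nn.Linear(d_ff, d_model)   # project back to the model dimension
        self.relu = nn.ReLU()

    def forward(self, x):
        # fc1 -> ReLU -> fc2, applied independently at every sequence position.
        return self.fc2(self.relu(self.fc1(x)))

torch.manual_seed(0)
ffn = PositionWiseFeedForward(d_model=16, d_ff=64)
x = torch.randn(2, 5, 16)          # (batch, seq_length, d_model)
y = ffn(x)
y_single = ffn(x[:, 2:3, :])       # the same network applied to one position only
print(y.shape)
```

Because nn.Linear acts on the last dimension only, running the network on a single position gives the same values as the corresponding slice of the full output, which is what "position-wise" refers to.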
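Positional encoding

Positional encoding is used to inject the position information of each token into the input sequence. It uses sine and cosine functions of different frequencies to generate the positional encoding.

Class definition:

The class is defined as a subclass of PyTorch's nn.Module, allowing it to be used as a standard PyTorch layer.

d_model: the dimensionality of the model's input.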


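max_seq_length: the maximum sequence length for which positional encodings are precomputed.

pe: a tensor filled with zeros that will be populated with the positional encodings.

position: a tensor containing the position index of each position in the sequence.

div_term: a term used to scale the position indices in a specific way.

The sine function is applied to the even indices of pe and the cosine function to the odd indices.

Forward method:

The forward method adds the positional encodings to the input x. It uses the first x.size(1) elements of pe so that the positional encodings match the actual sequence length of x.

Summary:

The positional encoding class adds information about the position of tokens within the sequence. Since the Transformer model lacks inherent knowledge of token order (because of its self-attention mechanism), this class helps the model take the position of tokens into account. The sinusoidal functions are chosen so that the model can easily learn to attend by relative position, as they produce a unique and smooth encoding for each position in the sequence.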
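A minimal sketch of such a class is shown below. It assumes the sinusoidal formulation from "Attention Is All You Need", in which div_term is built from the base 10000, and it assumes an even d_model; the class name PositionalEncoding and the use of register_buffer are illustrative choices rather than details confirmed by the text.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sketch of sinusoidal positional encoding (assumes an even d_model)."""

    def __init__(self, d_model, max_seq_length):
        super().__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        # Frequencies 1 / 10000^(2i / d_model), computed in log space.
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)   # even indices
        pe[:, 1::2] = torch.cos(position * div_term)   # odd indices
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        # Use only the first x.size(1) positions so the shapes match.
        return x + self.pe[:, :x.size(1)]

pos_enc = PositionalEncoding(d_model=16, max_seq_length=100)
x = torch.zeros(2, 10, 16)     # zeros make the added encoding directly visible
y = pos_enc(x)
print(y.shape)
```

Feeding in zeros returns the encoding itself: position 0 holds sin(0) = 0 at even indices and cos(0) = 1 at odd indices, and every sequence in the batch receives the same values.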

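3. Building the encoder blocks

Class definition:

The class is defined as a subclass of PyTorch's nn.Module, which means it can be used as a building block for neural networks in PyTorch.

Initialization:

Parameters:

d_model: the dimensionality of the input.
num_heads: the number of attention heads in the multi-head attention.
d_ff: the dimensionality of the inner layer in the position-wise feed-forward network.
dropout: the dropout rate used for regularization.

Components:

self.self_attn: the multi-head attention mechanism.
self.feed_forward: the position-wise feed-forward neural network.
self.norm1 and self.norm2: layer normalization, applied after the attention and feed-forward steps respectively.
self.dropout: a dropout layer, used to prevent overfitting by randomly setting some activations to zero during training.

Forward method:

Input: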


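x: the input to the encoder layer.

mask: an optional mask to ignore certain parts of the input.

Processing steps:

1. Self-attention: the input x is passed through the multi-head self-attention mechanism.
2. Add and normalize (after attention): the attention output is added to the original input (residual connection), followed by normalization using norm1.
3. Feed-forward network: the output of the previous step is passed through the position-wise feed-forward network.
4. Add and normalize (after feed-forward): as in step 2, the feed-forward output is added to the input of this stage (residual connection), followed by normalization using norm2.

Output: the processed tensor is returned as the output of the encoder layer.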
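The encoder steps can be sketched as follows. To keep the block runnable on its own, it carries condensed copies of the multi-head attention and feed-forward modules; the names EncoderLayer and PositionWiseFeedForward, the dropout applied to each sub-layer output, and the sizes in the usage lines are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Condensed copy of the tutorial's multi-head attention, for self-containment."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model, self.num_heads, self.d_k = d_model, num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch_size, seq_length, _ = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        out = torch.matmul(torch.softmax(scores, dim=-1), V)
        batch_size, _, seq_length, _ = out.size()
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        return self.W_o(out)

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1, self.fc2, self.relu = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model), nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

class EncoderLayer(nn.Module):
    """Sketch of one encoder layer: self-attention and feed-forward, each with add & norm."""
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_output = self.self_attn(x, x, x, mask)        # step 1
        x = self.norm1(x + self.dropout(attn_output))      # step 2 (residual + norm1)
        ff_output = self.feed_forward(x)                   # step 3
        return self.norm2(x + self.dropout(ff_output))     # step 4 (residual + norm2)

torch.manual_seed(0)
layer = EncoderLayer(d_model=32, num_heads=4, d_ff=64, dropout=0.1)
layer.eval()                                   # switch dropout off for a repeatable check
x = torch.randn(2, 6, 32)
mask = torch.ones(2, 1, 1, 6)
mask[:, :, :, 4:] = 0                          # treat the last two positions as padding
out = layer(x, mask)
print(out.shape)
```

With this mask, the outputs at the first four positions do not depend on the contents of the two masked positions, since every query ignores them and the remaining operations act on each position separately.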
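4. Building the decoder blocks

Initialization:

Parameters:

d_model: the dimensionality of the input.
num_heads: the number of attention heads in the multi-head attention.
d_ff: the dimensionality of the inner layer in the feed-forward network.
dropout: the dropout rate for regularization.

Components:

self.self_attn: the multi-head self-attention mechanism for the target sequence.
self.cross_attn: the multi-head attention mechanism that attends to the encoder's output.
self.feed_forward: the position-wise feed-forward neural network.
self.norm1, self.norm2, self.norm3: the layer normalization components.
self.dropout: the dropout layer for regularization.

Forward method:

Input: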
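x: the input to the decoder layer.
enc_output: the output from the corresponding encoder (used in the cross-attention step).
src_mask: the source mask, used to ignore certain parts of the encoder's output.
tgt_mask: the target mask, used to ignore certain parts of the decoder's input.

Processing steps:

1. Self-attention on the target sequence: the input x is processed through the self-attention mechanism.
2. Add and normalize (after self-attention): the output of self-attention is added to the original x, followed by dropout and normalization using norm1.
3. Cross-attention with the encoder output: the normalized output of the previous step is processed through the cross-attention mechanism, which attends to the encoder's output enc_output.
4. Add and normalize (after cross-attention): the output of cross-attention is added to the input of this stage, followed by dropout and normalization using norm2.
5. Feed-forward network: the output of the previous step is passed through the feed-forward network.
6. Add and normalize (after feed-forward): the feed-forward output is added to the input of this stage, followed by dropout and normalization using norm3.

Output: the processed tensor is returned as the output of the decoder layer.

Summary:

The decoder layer class defines a single layer of the Transformer's decoder. It consists of a multi-head self-attention mechanism, a multi-head cross-attention mechanism (which attends to the encoder's output), a position-wise feed-forward neural network, and the corresponding residual connections, layer normalization, and dropout layers. This combination lets the decoder generate meaningful outputs based on the encoder's representations, taking both the target sequence and the source sequence into account. As with the encoder, multiple decoder layers are typically stacked to form the complete decoder part of a Transformer model.

Next, the encoder and decoder blocks are brought together to construct the complete Transformer model.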
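The six decoder steps can be sketched in the same way. The block again includes condensed helper modules so that it runs on its own; the class name DecoderLayer, the helper names, and the tensor sizes and masks in the usage lines are assumptions chosen for illustration.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Condensed copy of the tutorial's multi-head attention, for self-containment."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model, self.num_heads, self.d_k = d_model, num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch_size, seq_length, _ = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        out = torch.matmul(torch.softmax(scores, dim=-1), V)
        batch_size, _, seq_length, _ = out.size()
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        return self.W_o(out)

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1, self.fc2, self.relu = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model), nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

class DecoderLayer(nn.Module):
    """Sketch of one decoder layer: self-attention, cross-attention, feed-forward."""
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)                      # step 1
        x = self.norm1(x + self.dropout(attn_output))                        # step 2
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)   # step 3
        x = self.norm2(x + self.dropout(attn_output))                        # step 4
        ff_output = self.feed_forward(x)                                     # step 5
        return self.norm3(x + self.dropout(ff_output))                       # step 6

torch.manual_seed(0)
layer = DecoderLayer(d_model=32, num_heads=4, d_ff=64, dropout=0.1)
layer.eval()
x = torch.randn(2, 5, 32)             # target-side input, length 5
enc_output = torch.randn(2, 7, 32)    # encoder output, length 7
src_mask = torch.ones(2, 1, 1, 7)                 # no source padding in this toy case
tgt_mask = torch.tril(torch.ones(1, 1, 5, 5))     # lower-triangular causal mask
out = layer(x, enc_output, src_mask, tgt_mask)
print(out.shape)
```

The source and target lengths differ on purpose: cross-attention takes its queries from the decoder and its keys and values from the encoder output, so the result keeps the target length.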
  
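5. Combining the encoder and decoder layers to create the complete Transformer network

Initialization:

The constructor takes the following parameters:

src_vocab_size: the source vocabulary size.
tgt_vocab_size: the target vocabulary size.
d_model: the dimensionality of the model's embeddings.
num_heads: the number of attention heads in the multi-head attention mechanism.
num_layers: the number of layers for both the encoder and the decoder.
d_ff: the dimensionality of the inner layer in the feed-forward network.
max_seq_length: the maximum sequence length for positional encoding.
dropout: the dropout rate for regularization.

It defines the following components:

self.encoder_embedding: the embedding layer for the source sequence.
self.decoder_embedding: the embedding layer for the target sequence.
self.positional_encoding: the positional encoding component.
self.encoder_layers: a list of encoder layers.
self.decoder_layers: a list of decoder layers.
self.fc: the final fully connected (linear) layer that maps to the target vocabulary size.
self.dropout: the dropout layer.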
  
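Generate mask method:

This method creates masks for the source and target sequences, ensuring that padding tokens are ignored and that future tokens are not visible during training for the target sequence.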
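One way such masks could be built is sketched below as a standalone function. It assumes that padding tokens have index 0 and that a mask value of 0 (False) means "do not attend", matching the masked_fill(mask == 0, ...) convention in the attention code; the function signature and the toy sequences are illustrative.

```python
import torch

def generate_mask(src, tgt, pad_idx=0):
    """Build a source padding mask and a combined padding + causal target mask."""
    # (batch, 1, 1, src_len): True where the source token is not padding.
    src_mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)
    # (batch, 1, tgt_len, 1): True where the target (query) token is not padding.
    tgt_pad_mask = (tgt != pad_idx).unsqueeze(1).unsqueeze(3)
    seq_length = tgt.size(1)
    # (1, tgt_len, tgt_len): lower triangle, so position i sees positions <= i only.
    causal_mask = torch.tril(
        torch.ones(1, seq_length, seq_length, dtype=torch.bool, device=tgt.device)
    )
    # Broadcasts to (batch, 1, tgt_len, tgt_len).
    tgt_mask = tgt_pad_mask & causal_mask
    return src_mask, tgt_mask

src = torch.tensor([[5, 7, 9, 0, 0]])   # two padding tokens at the end
tgt = torch.tensor([[3, 4, 6, 0]])      # one padding token at the end
src_mask, tgt_mask = generate_mask(src, tgt)

print(src_mask[0, 0, 0].tolist())
print(tgt_mask[0, 0].int().tolist())
```

In the target mask, each row is a query position: the first three rows open up one more position each, and the last row is entirely masked because that query is itself a padding token. Such rows produce uniform attention weights rather than errors, and their loss contribution is dropped later when the loss function ignores the padding index.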
  
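Forward method:

This method defines the forward pass of the Transformer, taking source and target sequences and producing the output predictions.

Input embedding and positional encoding: the source and target sequences are first embedded using their respective embedding layers and then added to their positional encodings.
Encoder layers: the source sequence is passed through the encoder layers, with the final encoder output representing the processed source sequence.
Decoder layers: the target sequence and the encoder's output are passed through the decoder layers, resulting in the decoder's output.
Final linear layer: the decoder's output is mapped to the target vocabulary size using a fully connected (linear) layer.

Output:

The final output is a tensor representing the model's predictions for the target sequence.

Summary:

The Transformer class brings together the various components of a Transformer model, including the embeddings, positional encoding, encoder layers, and decoder layers. It provides a convenient interface for training and inference, encapsulating the complexities of multi-head attention, feed-forward networks, and layer normalization.

This implementation follows the standard Transformer architecture, making it suitable for sequence-to-sequence tasks such as machine translation and text summarization. The masking ensures that the model respects the causal order within sequences, ignoring padding tokens and preventing information leakage from future tokens.

These sequential steps let the Transformer model process input sequences efficiently and produce the corresponding output sequences.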
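Training the PyTorch Transformer model

Sample data preparation

For illustrative purposes, a dummy dataset is crafted in this example. In a practical scenario, however, a more substantial dataset would be used, and the process would involve text preprocessing as well as creating vocabulary mappings for the source and target languages.

Hyperparameters:

These values define the architecture and behavior of the Transformer model:

src_vocab_size, tgt_vocab_size: the vocabulary sizes for the source and target sequences, both set to 5000.
d_model: the dimensionality of the model's embeddings, set to 512.
num_heads: the number of attention heads in the multi-head attention mechanism, set to 8.
num_layers: the number of layers for both the encoder and the decoder, set to 6.
d_ff: the dimensionality of the inner layer in the feed-forward network, set to 2048.
max_seq_length: the maximum sequence length for positional encoding, set to 100.
dropout: the dropout rate for regularization, set to 0.1.

Creating a Transformer instance:

An instance of the Transformer class is created and initialized with the given hyperparameters. The instance has the architecture and behavior defined by these hyperparameters.

Generating random sample data:

Random source and target sequences are then generated:

src_data: random integers between 1 and src_vocab_size, representing a batch of source sequences with shape (64, max_seq_length).
tgt_data: random integers between 1 and tgt_vocab_size, representing a batch of target sequences with shape (64, max_seq_length).

These random sequences can be used as inputs to the Transformer model, simulating a batch of data with 64 examples and sequences of length max_seq_length (100).

Summary:

This step initializes a Transformer model and generates random source and target sequences that can be fed into it. The chosen hyperparameters determine the specific structure and properties of the Transformer. This setup could be part of a larger script in which the model is trained and evaluated on actual sequence-to-sequence tasks, such as machine translation or text summarization.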
  >
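Training the model

Next, the model is trained on the sample data above. In a real-world scenario, however, a significantly larger dataset would be used, typically partitioned into distinct sets for training and validation.

Loss function and optimizer:

criterion = nn.CrossEntropyLoss(ignore_index=0): defines the loss function as cross-entropy loss. The ignore_index argument is set to 0, meaning the loss does not consider targets with an index of 0 (typically reserved for padding tokens).

optimizer = optim.Adam(...): defines the optimizer as Adam, with a learning rate of 0.0001 and specific beta values.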
  
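Model training mode:

transformer.train(): sets the Transformer model to training mode, enabling behaviors such as dropout that apply only during training.

Training loop:

The code trains the model for 100 epochs using a typical training loop:

for epoch in range(100): iterates over 100 training epochs.
optimizer.zero_grad(): clears the gradients from the previous iteration.
output = transformer(src_data, tgt_data[:, :-1]): passes the source data and the target data (excluding the last token in each sequence) through the Transformer. This is common in sequence-to-sequence tasks, where the target is shifted by one token.
loss = criterion(...): computes the loss between the model's predictions and the target data (excluding the first token in each sequence). The loss is calculated by flattening the batch and sequence dimensions, so that the predictions become a two-dimensional tensor and the targets a one-dimensional tensor, and then applying the cross-entropy loss function.
loss.backward(): computes the gradients of the loss with respect to the model's parameters.
optimizer.step(): updates the model's parameters using the computed gradients.
print(f"Epoch: {epoch+1}, Loss: {loss.item()}"): prints the current epoch number and the loss value for that epoch.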
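The loop described above can be sketched end to end as follows. Note the assumptions: TinySeq2Seq is a hypothetical stand-in with the same call signature as the tutorial's Transformer, not the Transformer itself; the sizes are scaled down so the sketch runs in a moment; and the betas and eps values passed to Adam are illustrative, since the text only mentions "specific beta values".

```python
import torch
import torch.nn as nn
import torch.optim as optim

class TinySeq2Seq(nn.Module):
    """Hypothetical stand-in model, used only to exercise the training loop."""
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model):
        super().__init__()
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.fc = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src, tgt):
        # Summarize the source as one vector and add it to every target position.
        context = self.src_embedding(src).mean(dim=1, keepdim=True)
        return self.fc(self.tgt_embedding(tgt) + context)

torch.manual_seed(0)
src_vocab_size = tgt_vocab_size = 50            # scaled down from 5000
d_model, max_seq_length, batch_size = 16, 10, 8

# Random token ids starting at 1, so index 0 stays reserved for padding.
src_data = torch.randint(1, src_vocab_size, (batch_size, max_seq_length))
tgt_data = torch.randint(1, tgt_vocab_size, (batch_size, max_seq_length))

model = TinySeq2Seq(src_vocab_size, tgt_vocab_size, d_model)
criterion = nn.CrossEntropyLoss(ignore_index=0)
# lr follows the text; betas and eps are assumed values.
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

model.train()
losses = []
for epoch in range(5):                          # 5 epochs instead of 100, for brevity
    optimizer.zero_grad()
    # Decoder input drops the last token; the labels drop the first token.
    output = model(src_data, tgt_data[:, :-1])
    loss = criterion(
        output.contiguous().view(-1, tgt_vocab_size),   # (batch * (seq - 1), vocab)
        tgt_data[:, 1:].contiguous().view(-1),          # (batch * (seq - 1),)
    )
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    print(f"Epoch: {epoch + 1}, Loss: {loss.item():.4f}")
```

The two slices are what implement the one-token shift: at each position the model sees the tokens up to that point and is scored against the token that follows.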
  
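Summary:

This code snippet trains the Transformer model on randomly generated source and target sequences for 100 epochs. It uses the Adam optimizer and the cross-entropy loss function. The loss is printed for each epoch, letting you monitor the training progress. In a real-world scenario, you would replace the random source and target sequences with actual data from your task, such as machine translation.

Transformer model performance evaluation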
  
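After training the model, its performance can be evaluated on a validation or test dataset. The following is an example of how this could be done:

Evaluation mode:

transformer.eval(): puts the Transformer model in evaluation mode. This is important because it turns off certain behaviors, such as dropout, that are used only during training.

Generating random validation data:

val_src_data: random integers between 1 and src_vocab_size, representing a batch of validation source sequences with shape (64, max_seq_length).
val_tgt_data: random integers between 1 and tgt_vocab_size, representing a batch of validation target sequences with shape (64, max_seq_length).

Validation loop:

with torch.no_grad(): disables gradient computation, since gradients are not needed during validation. This can reduce memory consumption and speed up computation.
val_output = transformer(val_src_data, val_tgt_data[:, :-1]): passes the validation source data and the validation target data (excluding the last token in each sequence) through the Transformer.
val_loss = criterion(...): computes the loss between the model's predictions and the validation target data (excluding the first token in each sequence). The loss is calculated by flattening the batch and sequence dimensions and applying the previously defined cross-entropy loss function.
print(f"Validation Loss: {val_loss.item()}"): prints the validation loss value.

Summary:

This code snippet evaluates the Transformer model on a randomly generated validation dataset, computes the validation loss, and prints it. In a real-world scenario, the random validation data should be replaced with actual validation data from the task you are working on. The validation loss indicates how well your model performs on unseen data, which is a key measure of its ability to generalize.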
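The effect of evaluation mode and torch.no_grad() can be seen on a much smaller module than a Transformer. The sketch below uses a hypothetical two-layer stand-in with a dropout layer; the sizes and the dropout probability are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in module: a linear layer followed by dropout.
layer = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.randn(4, 8)

# Training mode: dropout draws a fresh random mask on every call.
layer.train()
train_a, train_b = layer(x), layer(x)

# Evaluation mode: dropout is switched off, so repeated calls agree.
layer.eval()
with torch.no_grad():
    # Inside no_grad, no computation graph is recorded for the outputs.
    eval_a, eval_b = layer(x), layer(x)

print("train-mode outputs identical:", torch.equal(train_a, train_b))
print("eval-mode outputs identical: ", torch.equal(eval_a, eval_b))
print("eval output tracks gradients:", eval_a.requires_grad)
```

The two switches are independent: eval() changes how layers such as dropout behave, while no_grad() only stops gradient bookkeeping, which is why validation code typically uses both.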
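For more details on Transformers and Hugging Face, our tutorial An Introduction to Using Transformers and Hugging Face is useful.

Conclusion and further resources

In conclusion, this tutorial demonstrated how to build a Transformer model using PyTorch, one of the most versatile tools for deep learning. With their capacity for parallelization and their ability to capture long-range dependencies in data, Transformers have immense potential in various fields, especially NLP tasks such as translation, summarization, and sentiment analysis.

For those eager to deepen their understanding of advanced deep learning concepts and techniques, consider exploring the Keras course on DataCamp. You can also read about building a simple neural network with PyTorch in a separate tutorial.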