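This guide builds a Transformer with PyTorch, a popular open-source machine learning library known for its simplicity, versatility, and efficiency. With its dynamic computation graphs and extensive libraries, PyTorch has become a go-to choice for researchers and developers in machine learning and artificial intelligence. For those unfamiliar with PyTorch, DataCamp's course Introduction to Deep Learning with PyTorch is a recommended starting point. Since Vaswani et al. introduced them in the paper "Attention Is All You Need", Transformers have become a cornerstone of many NLP tasks thanks to their distinctive design and effectiveness.
Before building the Transformer, the working environment must be set up correctly. First, PyTorch needs to be installed. PyTorch (stable version 2.0.1 at the time of writing) can be installed with the pip or conda package managers.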

    pip3 install torch torchvision torchaudio
    conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torch.utils.data as data
    import math
    import copy

Multi-head attention

    class MultiHeadAttention(nn.Module):
        def __init__(self, d_model, num_heads):
            super(MultiHeadAttention, self).__init__()
            # Ensure that the model dimension (d_model) is divisible by the number of heads
            assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

            # Initialize dimensions
            self.d_model = d_model  # Model's dimension
            self.num_heads = num_heads  # Number of attention heads
            self.d_k = d_model // num_heads  # Dimension of each head's key, query, and value

            # Linear layers for transforming inputs
            self.W_q = nn.Linear(d_model, d_model)  # Query transformation
            self.W_k = nn.Linear(d_model, d_model)  # Key transformation
            self.W_v = nn.Linear(d_model, d_model)  # Value transformation
            self.W_o = nn.Linear(d_model, d_model)  # Output transformation

        def scaled_dot_product_attention(self, Q, K, V, mask=None):
            # Calculate attention scores
            attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

            # Apply mask if provided (useful for preventing attention to certain parts like padding)
            if mask is not None:
                attn_scores = attn_scores.masked_fill(mask == 0, -1e9)

            # Softmax is applied to obtain attention probabilities
            attn_probs = torch.softmax(attn_scores, dim=-1)

            # Multiply by values to obtain the final output
            output = torch.matmul(attn_probs, V)
            return output

        def split_heads(self, x):
            # Reshape the input to have num_heads for multi-head attention
            batch_size, seq_length, d_model = x.size()
            return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

        def combine_heads(self, x):
            # Combine the multiple heads back to original shape
            batch_size, _, seq_length, d_k = x.size()
            return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)

        def forward(self, Q, K, V, mask=None):
            # Apply linear transformations and split heads
            Q = self.split_heads(self.W_q(Q))
            K = self.split_heads(self.W_k(K))
            V = self.split_heads(self.W_v(V))

            # Perform scaled dot-product attention
            attn_output = self.scaled_dot_product_attention(Q, K, V, mask)

            # Combine heads and apply output transformation
            output = self.W_o(self.combine_heads(attn_output))
            return output

num_heads: the number of attention heads the input is split into.
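The scaled dot-product step of the class above can be tried in isolation. The following is a minimal sketch, not the tutorial's own code: the standalone function, the tensor sizes, and the lower-triangular (causal) mask are assumptions chosen for illustration. It follows the same convention as the class, where a mask value of 0 means "do not attend".

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a very negative score, so softmax gives them ~0 weight.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights

torch.manual_seed(0)
batch, heads, seq_len, d_k = 2, 4, 5, 8  # illustrative sizes (assumed)
q = torch.randn(batch, heads, seq_len, d_k)
k = torch.randn(batch, heads, seq_len, d_k)
v = torch.randn(batch, heads, seq_len, d_k)

# Assumed causal mask: position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

out, weights = scaled_dot_product_attention(q, k, v, causal_mask)
print(out.shape)      # torch.Size([2, 4, 5, 8])
print(weights[0, 0])  # rows sum to 1; entries above the diagonal are 0
```

The mask has shape (seq_len, seq_len) and broadcasts over the batch and head dimensions, which is why the same mask can be shared by every head.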
Apply linear transformations: the queries (Q), keys (K), and values (V) are first passed through linear transformations using the weights defined in the initialization.
Split heads: the transformed Q, K, and V are split into multiple heads using the split_heads method.
Apply scaled dot-product attention: the scaled_dot_product_attention method is called on the split heads.
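The splitting and recombining of heads is purely a reshape, which a quick shape check makes visible. This is a small sketch with assumed sizes; it mirrors the view/transpose calls used by split_heads and combine_heads rather than calling the class itself.

```python
import torch

batch_size, seq_length, d_model, num_heads = 2, 5, 16, 4  # illustrative sizes (assumed)
d_k = d_model // num_heads

x = torch.randn(batch_size, seq_length, d_model)

# split_heads: (batch, seq, d_model) -> (batch, heads, seq, d_k)
split = x.view(batch_size, seq_length, num_heads, d_k).transpose(1, 2)

# combine_heads: (batch, heads, seq, d_k) -> (batch, seq, d_model)
combined = split.transpose(1, 2).contiguous().view(batch_size, seq_length, d_model)

print(split.shape)               # torch.Size([2, 4, 5, 4])
print(torch.equal(combined, x))  # True
```

The call to contiguous() is needed because transpose returns a non-contiguous view, and view requires contiguous memory.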
Position-wise feed-forward network
x: the input to the feed-forward network.
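A position-wise feed-forward network can be sketched as two linear layers with a ReLU in between. The class name, the attribute names fc1/fc2, the d_ff parameter, and the sizes below are assumptions for illustration, not necessarily what the tutorial's own code uses.

```python
import torch
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    """Hypothetical sketch: the same two-layer MLP applied to every position."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)  # expand to the inner dimension
        self.fc2 = nn.Linear(d_ff, d_model)  # project back to the model dimension
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, seq_length, d_model); nn.Linear acts on the last dimension only
        return self.fc2(self.relu(self.fc1(x)))

torch.manual_seed(0)
ff = PositionWiseFeedForward(d_model=16, d_ff=64)  # assumed sizes
x = torch.randn(2, 5, 16)
y = ff(x)
print(y.shape)  # torch.Size([2, 5, 16])
```

Because the layers act only on the last dimension, each position is transformed independently of the others, which is what "position-wise" refers to.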
Positional encoding
Initialization:
d_model: the dimensionality of the model's input.
max_seq_length: the maximum sequence length for which positional encodings are precomputed.
The odd indices of the encoding are filled by the cosine function (the even indices by the sine function).
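The sinusoidal positional encoding can be sketched as follows. This is a hedged reconstruction under stated assumptions: the class name, the buffer name pe, the base constant 10000, and an even d_model are assumed here, and the tutorial's own implementation may differ in detail.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Hypothetical sketch of sinusoidal positional encoding (assumes even d_model)."""

    def __init__(self, d_model, max_seq_length):
        super().__init__()
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        # One frequency per pair of dimensions, decreasing geometrically
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)  # even indices
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices
        # A buffer is saved with the module but is not a trainable parameter
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):
        # x: (batch, seq_length, d_model); add the encodings for the first seq_length positions
        return x + self.pe[:, : x.size(1)]

pos_enc = PositionalEncoding(d_model=16, max_seq_length=50)  # assumed sizes
out = pos_enc(torch.zeros(2, 10, 16))
print(out.shape)  # torch.Size([2, 10, 16])
```

Feeding zeros makes the output equal to the encoding itself, so at position 0 the even indices are 0 (sine of 0) and the odd indices are 1 (cosine of 0).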
Encoder layer
Initialization:
d_model: the dimensionality of the input.
num_heads: the number of attention heads in the multi-head attention.
self.self_attn: the multi-head attention mechanism.
self.feed_forward: the position-wise feed-forward network.
x: the input to the encoder layer.
Feed-forward network: the output of the previous step is passed through the position-wise feed-forward network.
Add & normalize (after feed-forward): as in step 2, the feed-forward output is added to the input of this stage (residual connection), and the result is normalized using norm2.
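The encoder-layer steps described above can be sketched as follows. To keep the block self-contained, this sketch substitutes PyTorch's built-in nn.MultiheadAttention for the tutorial's MultiHeadAttention class and an nn.Sequential for the feed-forward network; the d_ff and dropout parameters and the key_padding_mask argument are assumptions. Note that the built-in module uses the opposite mask convention: True marks positions to ignore.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Hypothetical sketch: self-attention, add & norm, feed-forward, add & norm."""

    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        # Stand-in for the tutorial's MultiHeadAttention class
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Step 1: self-attention over the input sequence
        attn_output, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        # Step 2: residual connection followed by layer normalization (norm1)
        x = self.norm1(x + self.dropout(attn_output))
        # Step 3: position-wise feed-forward network
        ff_output = self.feed_forward(x)
        # Step 4: residual connection followed by layer normalization (norm2)
        return self.norm2(x + self.dropout(ff_output))

torch.manual_seed(0)
layer = EncoderLayer(d_model=16, num_heads=4, d_ff=64, dropout=0.1)  # assumed sizes
layer.eval()  # disable dropout for a deterministic forward pass
x = torch.randn(2, 5, 16)
out = layer(x)
print(out.shape)  # torch.Size([2, 5, 16])
```

The layer preserves the input shape, which is what allows several encoder layers to be stacked one after another.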
Decoder layer
Class definition and initialization
Parameters:
x: the input to the decoder layer.
enc_output: the output from the corresponding encoder (used in the cross-attention step).
src_mask: the source mask, used to ignore certain parts of the encoder's output.
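A decoder layer adds a cross-attention step over enc_output between the self-attention and feed-forward steps. The sketch below is an illustration under assumptions, not the tutorial's code: it again uses the built-in nn.MultiheadAttention as a stand-in, and the names src_padding_mask and tgt_mask, the d_ff and dropout parameters, and the sizes are hypothetical. With the built-in module, True in a mask marks a position that must not be attended to.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Hypothetical sketch: masked self-attention, cross-attention, feed-forward."""

    def __init__(self, d_model, num_heads, d_ff, dropout):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_padding_mask=None, tgt_mask=None):
        # Masked self-attention over the target sequence
        attn_output, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Cross-attention: queries come from the decoder, keys and values from the encoder
        attn_output, _ = self.cross_attn(
            x, enc_output, enc_output, key_padding_mask=src_padding_mask
        )
        x = self.norm2(x + self.dropout(attn_output))
        # Position-wise feed-forward network
        ff_output = self.feed_forward(x)
        return self.norm3(x + self.dropout(ff_output))

torch.manual_seed(0)
layer = DecoderLayer(d_model=16, num_heads=4, d_ff=64, dropout=0.1)  # assumed sizes
layer.eval()  # disable dropout for a deterministic forward pass
x = torch.randn(2, 4, 16)           # target-side input
enc_output = torch.randn(2, 6, 16)  # encoder output (source length 6)
# True above the diagonal: position i cannot see positions after i
tgt_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
out = layer(x, enc_output, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 4, 16])
```

The source and target lengths may differ, since cross-attention only requires that both sides share the same d_model.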
Transformer model
src_vocab_size: the source vocabulary size.
tgt_vocab_size: the target vocabulary size.
d_model: the dimensionality of the model's embeddings.
num_heads: the number of attention heads in the multi-head attention mechanism.
num_layers: the number of layers in both the encoder and the decoder.
Input embedding and positional encoding: the source and target sequences are first embedded using their respective embedding layers, and the embeddings are then added to their positional encodings.
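These sequential steps enable the Transformer model to process input sequences efficiently and produce the corresponding output sequences.
Training the PyTorch Transformer model
Sample data preparation
For illustration, this example uses a dummy dataset. In a practical scenario, a more substantial dataset would be used, and the process would involve text preprocessing and building vocabulary mappings for the source and target languages.
The model is created as an instance of the Transformer class, initialized with the given hyperparameters. The instance has the architecture and behavior defined by those hyperparameters.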
Generating random sample data:
Random source and target sequences are generated as follows:
src_data: random integers between 1 and src_vocab_size, representing a batch of source sequences with shape (64, max_seq_length).
tgt_data: random integers between 1 and tgt_vocab_size, representing a batch of target sequences with shape (64, max_seq_length).
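The random batches described above can be produced with torch.randint. In this sketch the concrete vocabulary sizes and sequence length are assumed values picked for illustration; only the batch size of 64 comes from the text.

```python
import torch

torch.manual_seed(0)

src_vocab_size = 5000  # assumed value
tgt_vocab_size = 5000  # assumed value
max_seq_length = 100   # assumed value
batch_size = 64        # from the shapes given in the text

# torch.randint samples from [low, high), so token ids range from 1 to vocab_size - 1.
# Starting at 1 leaves id 0 free, for example as a padding token.
src_data = torch.randint(1, src_vocab_size, (batch_size, max_seq_length))
tgt_data = torch.randint(1, tgt_vocab_size, (batch_size, max_seq_length))

print(src_data.shape)  # torch.Size([64, 100])
print(tgt_data.shape)  # torch.Size([64, 100])
```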
Training the model
Next, the model is trained on the sample data above. In a real-world scenario, a much larger dataset would be used, typically split into separate sets for training and validation.
optimizer = optim.Adam(...): defines the optimizer as Adam with a learning rate of 0.0001 and specific beta values.
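A training loop with this optimizer setup can be sketched as follows. Several things here are assumptions rather than the tutorial's code: TinySeq2Seq is a hypothetical stand-in built on PyTorch's nn.Transformer (without positional encoding, for brevity), the beta and eps values are illustrative, padding id 0 is assumed for ignore_index, and the sizes and the 5 epochs are kept small so the sketch runs quickly.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
vocab_size, d_model, seq_len, batch_size = 50, 32, 10, 64  # small assumed sizes

class TinySeq2Seq(nn.Module):
    """Hypothetical stand-in for the tutorial's Transformer class."""

    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(vocab_size, d_model)
        self.tgt_emb = nn.Embedding(vocab_size, d_model)
        self.core = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=1, num_decoder_layers=1,
            dim_feedforward=64, dropout=0.1, batch_first=True,
        )
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        t = tgt.size(1)
        # True above the diagonal: a target position cannot attend to later positions
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=tgt.device), diagonal=1)
        out = self.core(self.src_emb(src), self.tgt_emb(tgt), tgt_mask=causal)
        return self.fc(out)

model = TinySeq2Seq()
criterion = nn.CrossEntropyLoss(ignore_index=0)  # assumes id 0 is padding
# lr follows the text; betas and eps are assumed illustrative values
optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

src_data = torch.randint(1, vocab_size, (batch_size, seq_len))
tgt_data = torch.randint(1, vocab_size, (batch_size, seq_len))

model.train()
losses = []
for epoch in range(5):
    optimizer.zero_grad()
    # Teacher forcing: feed the target without its last token, predict it shifted by one
    output = model(src_data, tgt_data[:, :-1])
    loss = criterion(output.reshape(-1, vocab_size), tgt_data[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    print(f"Epoch {epoch + 1}, loss {loss.item():.4f}")
```

On random data the loss carries no real meaning; the loop only illustrates the order of the calls: zero the gradients, run the forward pass, compute the loss, backpropagate, and step the optimizer.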
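The training code runs for 100 epochs on the randomly generated source and target sequences. It uses the Adam optimizer and the cross-entropy loss function. The loss is printed at every epoch so that you can monitor training progress. In a real-world scenario, you would replace the random source and target sequences with actual data from your task, such as machine translation.
Transformer model performance evaluation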
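val_src_data: random integers between 1 and src_vocab_size, representing a batch of validation source sequences with shape (64, max_seq_length).
val_tgt_data: random integers between 1 and tgt_vocab_size, representing a batch of validation target sequences with shape (64, max_seq_length).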
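An evaluation pass over such validation data might look like the sketch below. StandInModel is a hypothetical placeholder that only makes the block runnable (it ignores the source sequence entirely), and the sizes and the padding id 0 are assumptions. The part that carries over to a real model is the pattern of model.eval() combined with torch.no_grad().

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, max_seq_length = 50, 10  # small assumed sizes

class StandInModel(nn.Module):
    """Hypothetical placeholder with the same (src, tgt) -> logits interface."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, 16)
        self.fc = nn.Linear(16, vocab_size)

    def forward(self, src, tgt):
        # Ignores src; it exists only so the evaluation code has something to call
        return self.fc(self.emb(tgt))

model = StandInModel()
criterion = nn.CrossEntropyLoss(ignore_index=0)  # assumes id 0 is padding

val_src_data = torch.randint(1, vocab_size, (64, max_seq_length))
val_tgt_data = torch.randint(1, vocab_size, (64, max_seq_length))

model.eval()           # switch off dropout and other training-only behavior
with torch.no_grad():  # no gradients are needed for evaluation
    val_output = model(val_src_data, val_tgt_data[:, :-1])
    val_loss = criterion(
        val_output.reshape(-1, vocab_size), val_tgt_data[:, 1:].reshape(-1)
    )

print(f"Validation loss: {val_loss.item():.4f}")
```

Disabling gradient tracking saves memory and time during evaluation, and model.eval() ensures that layers such as dropout behave deterministically.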
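If you are eager to deepen your understanding of advanced deep learning concepts and techniques, consider exploring the Keras course on DataCamp. You can also learn how to build a simple neural network with PyTorch in a separate tutorial.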