Hand-tearing Llama3 layer 1: Implementing llama3 from scratch
In this series of articles, we implement llama3 from scratch.
[Figure: the overall architecture of Llama3]
The model parameters of Llama3: let's take a look at the actual values of these parameters in the Llama 3 model.
When instantiating the LlaMa class, the variable max_seq_len defines the context window. There are other parameters in the class, but this one relates most directly to the transformer model. The max_seq_len here is 8K.
The Transformer class is a model with a defined vocabulary and number of layers. Vocabulary here refers to the set of words (and tokens) that the model is able to recognize and process. The layers refer to the transformer blocks (a combination of attention and feed-forward layers) used in the model.
Based on these numbers, LlaMa 3 has a vocabulary of 128K, which is quite large. Furthermore, it has 32 transformer blocks.
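For orientation, the hyperparameters discussed above are usually grouped together in a single config object. The sketch below is illustrative (field names follow the LLaMA-style ModelArgs convention; treat the exact layout as an assumption, not the official code):

```python
from dataclasses import dataclass

@dataclass
class ModelArgs:
    # values quoted in this article for Llama 3 8B
    dim: int = 4096            # feature dimension
    n_layers: int = 32         # number of transformer blocks
    n_heads: int = 32          # attention heads per layer
    n_kv_heads: int = 8        # key/value heads (grouped-query attention)
    vocab_size: int = 128256   # ~128K tokens
    max_seq_len: int = 8192    # 8K context window
```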
[3] Feature-dimension and attention-heads
Feature-dimension and attention-heads are introduced in the Self-Attention module. Feature dimension refers to the vector size of tokens in the embedding space (i.e., the dimension of the input data or embedding vectors), while attention-heads include the QK modules that drive the self-attention mechanism in transformers.
Hidden dimensions refer to the dimension size of the hidden layer in the feed-forward network (Feed Forward). Feed-forward networks usually contain one or more hidden layers, and the dimensions of these hidden layers determine the capacity and complexity of the network. In the Transformer model, the hidden layer dimension of the feed-forward network is usually a multiple of the feature dimension to increase the representation ability of the model. In Llama3, the ffn_dim_multiplier in the config is 1.3, but after the reference implementation's rounding rules the resulting hidden dimension is 14336, roughly 3.5 times the feature dimension of 4096. It should be noted that hidden layers and hidden dimensions are two different concepts.
A higher hidden dimension allows the network to internally create and manipulate richer representations before projecting them back down to the smaller output dimension.
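To make the hidden-dimension arithmetic concrete, here is a small sketch of how the reference implementation derives the FFN hidden size from the config values (dim = 4096, ffn_dim_multiplier = 1.3, multiple_of = 1024); the helper name here is mine, not part of the official code:

```python
def ffn_hidden_dim(dim: int, ffn_dim_multiplier: float, multiple_of: int) -> int:
    # mirrors the computation in the reference LLaMA FeedForward module
    hidden = 4 * dim                            # 16384
    hidden = int(2 * hidden / 3)                # 10922
    hidden = int(ffn_dim_multiplier * hidden)   # 14198
    # round up to the nearest multiple of `multiple_of`
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)  # 14336

print(ffn_hidden_dim(4096, 1.3, 1024))  # 14336
```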
The first matrix is the input feature matrix, which is processed by the attention layer to generate attention-weighted features. In the illustration the input feature matrix is only 5 x 3, but in the real Llama 3 model it grows to 8K x 4096, which is huge.
Next come the hidden layers of the feed-forward network, which grow to 14336 and then fall back to 4096 in the last layer.
LlaMa 3 stacks 32 of the above transformer blocks, with the output of one block passed to the next until the last one is reached.
Once we have all of the above parts in place, it is time to put them together and see how they create the LlaMa effect.
Step 1: First we have our input matrix, of size 8K (context-window) x 128K (vocabulary-size). This matrix undergoes an embedding process that converts this high-dimensional matrix into a lower-dimensional one.
Step 2: In this case, this low-dimensional result becomes 4096, which is the specified dimension of the features in the LlaMa model we saw earlier.
In neural networks, dimensionality enhancement and dimensionality reduction are common operations, and they each have different purposes and effects.
Dimensionality increase is usually to increase the capacity of the model so that it can capture more complex features and patterns. When the input data is mapped into a higher dimensional space, different feature combinations can be more easily distinguished by the model. This is especially useful when dealing with non-linear problems, as it can help the model learn more complex decision boundaries.
Dimensionality reduction is to reduce the complexity of the model and the risk of overfitting. By reducing the dimensionality of the feature space, the model can be forced to learn more refined and generalized feature representations. In addition, dimensionality reduction can be used as a regularization method to help improve the generalization ability of the model. In some cases, dimensionality reduction can also reduce computational costs and improve model operating efficiency.
In practical applications, the strategy of dimensionality increase and then dimensionality reduction can be regarded as a process of feature extraction and transformation. In this process, the model first explores the intrinsic structure of the data by increasing the dimensionality, and then extracts the most useful features and patterns by reducing the dimensionality. This method can help the model avoid overfitting to the training data while maintaining sufficient complexity.
Step 3: This feature is processed through the Transformer block, first by the Attention layer, and then by the FFN layer. The Attention layer processes across features horizontally, while the FFN layer processes across dimensions vertically.
Step 4: Step 3 is repeated for the 32 layers of the Transformer block. Finally, the dimensions of the resulting matrix are the same as those used for the feature dimensions.
Step 5: Finally, this matrix is converted back to the original vocabulary matrix size, which is 128K, so that the model can select and map the words available in the vocabulary.
This is how LlaMa 3 scores high on those benchmarks and creates the LlaMa 3 effect.
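Putting the five steps together as a hedged pseudocode sketch (attribute names such as tok_embeddings, layers, norm and output follow the reference implementation's layout; the real forward pass also threads rotary position information and a causal mask through each block):

```python
import torch

def llama_forward(tokens: torch.Tensor, model) -> torch.Tensor:
    # Steps 1-2: token ids -> 4096-dimensional embeddings
    h = model.tok_embeddings(tokens)      # [seq_len] -> [seq_len, 4096]
    # Steps 3-4: 32 transformer blocks, each attention + feed-forward
    for block in model.layers:
        h = block(h)                      # shape stays [seq_len, 4096]
    h = model.norm(h)                     # final RMS normalization
    # Step 5: project back to the vocabulary to score every possible next token
    logits = model.output(h)              # [seq_len, 4096] -> [seq_len, 128256]
    return logits
```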
Let us briefly summarize a few terms that are easily confused:
Context Window (context-window): the maximum number of tokens the model can accept in a single pass.
In the LlaMa 3-8B model, this parameter is set to 8K (8,192) tokens, that is, Context Window Size = 8K. This means the maximum number of tokens the model can consider in a single pass is 8,192. This is critical for understanding long texts or maintaining the context of long conversations.
Vocabulary-size: the number of distinct tokens the model can recognize, including all possible words, punctuation marks, and special characters. The vocabulary of the model is 128K (128,256 tokens), expressed as Vocabulary-size = 128K. This means the model is able to recognize and process 128,256 different tokens.
Attention Layers: a main component of the Transformer model. It is mainly responsible for processing input data by learning which parts of the input are most important (i.e., which tokens are "attended" to). A model may have multiple such layers, each trying to understand the input data from a different perspective.
The LlaMa 3-8B model contains 32 processing layers, that is, Number of Layers = 32. These layers include multiple Attention Layers and other types of network layers, each of which processes and understands the input data from a different perspective.
Transformer Block: a module containing multiple different layers, usually at least one Attention Layer and one Feed-Forward Network. A model can have multiple transformer blocks, connected sequentially, with the output of each block serving as the input of the next. A transformer block can also be called a decoder layer.
In the context of the Transformer model, usually we say that the model has "32 layers", which can be equivalent to saying that the model has "32 Transformer blocks". Each Transformer block usually contains a self-attention layer and a feed-forward neural network layer. These two sub-layers together form a complete processing unit or "layer".
Therefore, when we say the model has 32 Transformer blocks, we are describing a model composed of 32 such processing units, each capable of performing both self-attention processing and feed-forward network processing of the data. This phrasing emphasizes the hierarchical structure of the model and its processing capabilities at each level.
In summary, "32 layers" and "32 Transformer blocks" are basically synonymous when describing the Transformer model structure. Both mean that the model contains 32 independent data-processing stages, each of which includes self-attention and feed-forward network operations.
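As a minimal sketch of what one such processing unit looks like (simplified: real Llama 3 blocks use RMSNorm, rotary position embeddings, grouped-query attention and a gated SwiGLU feed-forward; generic PyTorch modules stand in here):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified decoder layer: self-attention sub-layer + feed-forward sub-layer."""
    def __init__(self, dim: int, n_heads: int, hidden_dim: int):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim),
        )
        self.attention_norm = nn.LayerNorm(dim)  # Llama actually uses RMSNorm
        self.ffn_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.attention_norm(x)
        attn_out, _ = self.attention(h, h, h)            # self-attention
        x = x + attn_out                                 # residual connection
        return x + self.feed_forward(self.ffn_norm(x))   # feed-forward + residual

block = TransformerBlock(dim=4096, n_heads=32, hidden_dim=14336)
x = torch.randn(1, 17, 4096)      # batch of 1, 17 tokens
print(block(x).shape)             # torch.Size([1, 17, 4096])
```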
Feature-dimension: the dimension of each vector when an input token is represented as a vector in the model.
Each token is converted into a vector containing 4096 features in the model, that is, Feature-dimension = 4096. This high dimension enables the model to capture richer semantic information and contextual relationships.
Attention-Heads: within each Attention Layer there can be multiple Attention-Heads, and each head independently analyzes the input data from a different perspective.
Each Attention Layer contains 32 independent Attention Heads, that is, Number of Attention Heads = 32. These heads analyze input data from different aspects and jointly provide more comprehensive data analysis capabilities.
Hidden Dimensions: this usually refers to the width of the layers in the Feed-Forward Network, that is, the number of neurons in each layer. Typically, Hidden Dimensions is larger than Feature-dimension, which allows the model to create a richer data representation internally.
In the Feed-Forward Networks of Llama 3-8B, the dimension of the hidden layer is 14336, that is, Hidden Dimensions = 14336. This is much larger than the feature dimension, allowing the model to perform deeper feature transformation and learning between its internal layers.
Relationship between Attention Layers and Attention-Heads: Each Attention Layer can contain multiple Attention-Heads.
Numerical relationship: A model may have multiple transformer blocks, each block contains an Attention Layer and one or more other layers. Each Attention Layer may have multiple Attention-Heads. In this way, the entire model performs complex data processing in different layers and heads.
Download the Llama3 model weights via the official link: https://llama.meta.com/llama-downloads/
The following code shows how to use the tiktoken library to load and use a Byte Pair Encoding (BPE) based tokenizer. This tokenizer is designed to process text data, especially for use in natural language processing and machine learning models.
We feed in "hello world!" and see how the tokenizer segments it.
```python
from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
import torch
import json
import matplotlib.pyplot as plt

tokenizer_path = "Meta-Llama-3-8B/tokenizer.model"
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)

tokenizer.decode(tokenizer.encode("hello world!"))
```
Read the model file:

```python
model = torch.load("Meta-Llama-3-8B/consolidated.00.pth")
print(json.dumps(list(model.keys())[:20], indent=4))
```
Overall, this output reveals the key components of a deep learning model based on the Transformer architecture. Such models are widely used in natural language processing tasks, such as text classification, machine translation, and question answering systems. The structure of each layer is almost the same, including the attention mechanism, feed-forward network and normalization layer, which helps the model capture complex features of the input sequence.

View the parameter configuration of the Llama3 model:

```python
with open("Meta-Llama-3-8B/params.json", "r") as f:
    config = json.load(f)
config
```
We use this configuration to infer details of the model, such as:
```python
dim = config["dim"]
n_layers = config["n_layers"]
n_heads = config["n_heads"]
n_kv_heads = config["n_kv_heads"]
vocab_size = config["vocab_size"]
multiple_of = config["multiple_of"]
ffn_dim_multiplier = config["ffn_dim_multiplier"]
norm_eps = config["norm_eps"]
rope_theta = torch.tensor(config["rope_theta"])
```
Next we convert our prompt text into tokens. The code is as follows:
```python
prompt = "the answer to the ultimate question of life, the universe, and everything is "
tokens = [128000] + tokenizer.encode(prompt)
print(tokens)
tokens = torch.tensor(tokens)
prompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]
print(prompt_split_as_tokens)
```
```
[128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
['<|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']
```
Up to this point, our [17x1] tokens become [17x4096], i.e., 17 embeddings (one for each token), each of length 4096.
We can also verify that the sentence we entered consists of 17 tokens.
The code is as follows:
```python
embedding_layer = torch.nn.Embedding(vocab_size, dim)
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)
token_embeddings_unnormalized.shape
```
Next, we normalize the embeddings using RMS normalization; this is the RMSNorm step in the architecture diagram.
The formula used is:

rms_norm(x) = x / sqrt(mean(x²) + ε) · γ

where the mean is taken over the feature dimension, ε is norm_eps (added to avoid division by zero), and γ is the learned scale vector (norm_weights).
The code is as follows:
```python
# def rms_norm(tensor, norm_weights):
#     rms = (tensor.pow(2).mean(-1, keepdim=True) + norm_eps)**0.5
#     return tensor * (norm_weights / rms)

def rms_norm(tensor, norm_weights):
    return (tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)) * norm_weights
```
This code defines a function named rms_norm, which applies RMS (Root Mean Square) normalization to an input tensor. The function takes two arguments: tensor, the input tensor to be normalized, and norm_weights, the weights used for normalization.
The function works as follows: it computes the mean of the squared elements along the last dimension, adds norm_eps for numerical stability, takes the reciprocal square root with torch.rsqrt, multiplies the input tensor by this factor, and finally scales the result by norm_weights.
After normalization, the shape of our data is still [17x4096], the same as the embeddings, except that the values are now normalized.
```python
token_embeddings = rms_norm(token_embeddings_unnormalized, model["layers.0.attention_norm.weight"])
token_embeddings.shape
```
Next, we introduce the implementation of the attention mechanism, the part marked with a red box in the architecture diagram. The computation involves the following steps:
1. Compute the dot product of Q and K.
2. Scale the dot product.
3. Apply the softmax function to obtain the attention weights.
4. Multiply the attention weights by the value matrix V to obtain the output matrix Z.
The figure illustrates the implementation of the multi-head attention mechanism in the Transformer model: starting from the embeddings of the input sentence, splitting them across multiple heads, computing attention, and finally concatenating the results to produce the output. Each step shows in detail how the final output matrix Z is generated from the input matrix X.
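As a minimal, self-contained sketch of those four steps in standard PyTorch (not the article's hand-rolled implementation, which follows below):

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    scores = torch.matmul(Q, K.transpose(-2, -1))     # 1. dot product of Q and K
    scores = scores / (Q.shape[-1] ** 0.5)            # 2. scale by sqrt(head_dim)
    weights = torch.softmax(scores, dim=-1)           # 3. softmax -> attention weights
    return torch.matmul(weights, V)                   # 4. weighted sum of V -> Z

# toy example: 17 tokens, head dimension 128
Q = torch.randn(17, 128)
K = torch.randn(17, 128)
V = torch.randn(17, 128)
Z = scaled_dot_product_attention(Q, K, V)
print(Z.shape)  # torch.Size([17, 128])
```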
When we load the query, key, value, and output vectors from the model, we notice that their shapes are [4096x4096], [1024x4096], [1024x4096], and [4096x4096] respectively.
At first glance this looks strange, because ideally we would want a separate q, k, v, and o for each head.
```python
print(
    model["layers.0.attention.wq.weight"].shape,
    model["layers.0.attention.wk.weight"].shape,
    model["layers.0.attention.wv.weight"].shape,
    model["layers.0.attention.wo.weight"].shape,
)
```
```
torch.Size([4096, 4096]) torch.Size([1024, 4096]) torch.Size([1024, 4096]) torch.Size([4096, 4096])
```
The shape of the query weight matrix (wq.weight) is [4096, 4096]. The shape of the key weight matrix (wk.weight) is [1024, 4096]. The shape of the value weight matrix (wv.weight) is [1024, 4096]. The shape of the output weight matrix (wo.weight) is [4096, 4096].

The output shows that the query (Q) and output (O) weight matrices have the same shape, [4096, 4096]: for both, the input and output feature dimensions are 4096. The key (K) and value (V) weight matrices also share a shape, [1024, 4096]: their input feature dimension is 4096, but their output feature dimension is compressed to 1024.

These shapes reflect how the model designer sized the different parts of the attention mechanism. In particular, the key and value dimensions are reduced because Llama 3 uses grouped-query attention: there are only n_kv_heads = 8 key/value heads (versus 32 query heads), and 8 x 128 = 1024. This cuts the memory and compute cost of keys and values while keeping the query and output projections at full dimensionality to retain more information. This design choice depends on the specific model architecture and application scenario.
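A quick sanity check of where 1024 comes from, using the values read from params.json earlier (n_heads = 32, n_kv_heads = 8):

```python
head_dim = dim // n_heads       # 4096 // 32 = 128
print(n_heads * head_dim)       # 32 * 128 = 4096 -> rows of wq and wo
print(n_kv_heads * head_dim)    # 8 * 128 = 1024  -> rows of wk and wv
```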
Let us use the sentence "I admire Li Hongzhang" as a simplified example to explain the implementation of the attention mechanism in this figure.
Input the sentence: first, we have the sentence "I admire Li Hongzhang". Before processing this sentence, we need to convert each word into a form that can be handled mathematically, i.e., a word vector. This process is called word embedding.
Word embedding: each word, such as "I", "admire", and "Li Hongzhang", is converted into a fixed-size vector. These vectors contain the semantic information of the words.
Split into multiple heads: to allow the model to understand the sentence from different perspectives, we split the vector of each word into multiple parts, here 8 heads. Each head focuses on a different aspect of the sentence.
Calculate attention: for each head, we compute attention. This involves three roles. Taking "I admire Li Hongzhang" as an example, if we want to focus on the word "admire", then "admire" is the query, the other words such as "I" and "Li Hongzhang" are the keys, and the vectors of those words are the values.
Query (Q): the part we use to look for information. Key (K): the part that contains the information. Value (V): the actual information content.
Concatenation and output: after computing attention for each head, we concatenate the results and generate the final output through a weight matrix Wo. This output is used in the next layer of processing or as part of the final result.
The shape issue mentioned in the comments to the figure is about how to store and process these vectors efficiently. In actual code, to improve efficiency, developers may pack the query, key, and value vectors of multiple heads together instead of processing each head individually, as sketched below. This takes advantage of the parallel processing capabilities of modern hardware to speed up computation.
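As a sketch of what "packing the heads together" means in code (using the tensors loaded earlier; the per-head split is done properly in the next section):

```python
head_dim = dim // n_heads   # 4096 // 32 = 128

# one packed matmul produces the queries for all 32 heads at once: [17, 4096]
q_packed = torch.matmul(token_embeddings, model["layers.0.attention.wq.weight"].T)

# the per-head view is recovered afterwards by reshaping: [17, 32, 128]
q_heads = q_packed.view(tokens.shape[0], n_heads, head_dim)
print(q_packed.shape, q_heads.shape)
```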
We continue to use the sentence "I admire Li Hongzhang" to explain the role of the weight matrices WQ, WK, WV, and WO.
In the Transformer model, each word is converted into a vector through word embedding. These vectors are then passed through a series of linear transformations to calculate attention scores. These linear transformations are implemented through the weight matrices WQ, WK, WV and WO.
Throughout this process, WQ, WK, WV, and WO are learned through training. They determine how the model converts the input word vectors into different representations and how it combines these representations to produce the final output. These matrices are the core of the attention mechanism in the Transformer model; they enable the model to capture the relationships between different words in a sentence.
WQ (weight matrix Q), WK (weight matrix K), WV (weight matrix V), and WO (weight matrix O) are parameters of the Transformer model. They are learned during training through optimization methods such as the backpropagation algorithm and gradient descent.
Let’s take a look at how this learning process works:
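The sketch below is a toy illustration (not the article's or Meta's training code, and with deliberately tiny dimensions) of how WQ, WK, WV and WO are adjusted by backpropagation and gradient descent:

```python
import torch
import torch.nn as nn

dim, head_dim = 16, 8                 # toy sizes, not Llama 3's real dimensions
wq = nn.Linear(dim, head_dim, bias=False)
wk = nn.Linear(dim, head_dim, bias=False)
wv = nn.Linear(dim, head_dim, bias=False)
wo = nn.Linear(head_dim, dim, bias=False)

params = [*wq.parameters(), *wk.parameters(), *wv.parameters(), *wo.parameters()]
optimizer = torch.optim.SGD(params, lr=1e-2)

x = torch.randn(5, dim)               # 5 token embeddings
target = torch.randn(5, dim)          # dummy training target

# forward pass: single-head attention built from WQ, WK, WV, WO
q, k, v = wq(x), wk(x), wv(x)
weights = torch.softmax(q @ k.T / head_dim ** 0.5, dim=-1)
out = wo(weights @ v)

loss = nn.functional.mse_loss(out, target)  # how far the output is from the target
loss.backward()       # backpropagation computes gradients for WQ, WK, WV, WO
optimizer.step()      # gradient descent nudges the weight matrices
```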
In this section, we unroll the query vectors across the multiple attention heads; the resulting shape is [32x128x4096]. Here, 32 is the number of attention heads in llama3, 128 is the size of the query vector, and 4096 is the size of the token embedding.
```python
q_layer0 = model["layers.0.attention.wq.weight"]
head_dim = q_layer0.shape[0] // n_heads
q_layer0 = q_layer0.view(n_heads, head_dim, dim)
q_layer0.shape
```
This code reshapes the query (Q) weight matrix of the first layer of the model, decomposing it into the form of multiple attention heads and thereby revealing the dimensions 32 and 128.
The reason the dimensions 32 and 128 appear in this code, but not in the previous snippet, is that this code explicitly decomposes the query weight matrix into multiple attention heads through the reshape operation, with each head having its own dimension. 32 is the number of attention heads in the model, and 128 is the feature dimension allocated to each head. This decomposition implements the multi-head attention mechanism, in which each head can independently attend to different parts of the input, and the outputs of all heads are finally combined to improve the expressive power of the model.
We then access the query weight matrix of the first head of the first layer; the size of this query weight matrix is [128x4096].
```python
q_layer0_head0 = q_layer0[0]
q_layer0_head0.shape
```
Here you can see that the resulting shape is [17x128]: we have 17 tokens, and each token has a query of length 128 (one query per token for this head).
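The multiplication itself is not spelled out above; reconstructed from the description that follows (same variable names as the surrounding text), it would be:

```python
q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)
q_per_token.shape  # torch.Size([17, 128])
```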
This code performs a matrix multiplication, multiplying the token embeddings (token_embeddings) by the transpose (.T) of the query weight matrix of the first head of the first layer (q_layer0_head0) to generate the per-token query vectors (q_per_token).
torch.matmul is the matrix multiplication function in PyTorch; it multiplies two tensors.
token_embeddings should be a tensor of shape [17, 4096], indicating that there are 17 tokens, each token is represented by a 4096-dimensional embedding vector.
q_layer0_head0 is the query weight matrix of the first head of the first layer, and its original shape is [128, 4096]. .T is the transpose operation in PyTorch, which transposes the shape of q_layer0_head0 to [4096, 128].
In this way, the matrix multiplication of token_embeddings and q_layer0_head0.T is the multiplication of [17, 4096] and [4096, 128], and the result is a tensor with shape [17, 128].
The last line prints the shape of the q_per_token tensor, confirming that it is [17, 128].
This means that for every token entered (17 in total), we now have a 128-dimensional query vector. This 128-dimensional query vector is obtained by multiplying the token embedding and the query weight matrix and can be used for subsequent attention mechanism calculations.
In short, this code converts the embedding vector of each token into a query vector through matrix multiplication, preparing for the next step of implementing the attention mechanism. Each token now has a query vector corresponding to it, and these query vectors will be used to calculate attention scores with other tokens.