This article distills the experience its author, Sebastian Raschka, gained from hundreds of experiments. It is well worth reading.
Increasing the amount of data and the number of model parameters is widely recognized as the most direct way to improve neural network performance. Mainstream large models now have hundreds of billions of parameters, and the trend toward ever-larger "large models" will only intensify. This trend brings many compute challenges: fine-tuning a large language model with hundreds of billions of parameters not only takes a long time to train but also requires a great deal of high-performance memory. To bring down the cost of fine-tuning large models, Microsoft researchers developed Low-Rank Adaptation (LoRA). The subtlety of LoRA is that it is equivalent to adding a detachable plug-in to the original large model while the model body itself stays unchanged: LoRA is plug-and-play, lightweight, and convenient. For efficiently fine-tuning a customized version of a large language model, LoRA is one of the most widely used and most effective methods. If you are interested in open-source LLMs, LoRA is a fundamental technique worth learning and should not be missed. Sebastian Raschka, a data science professor at the University of Wisconsin-Madison, has explored LoRA from every angle. Having worked in machine learning for many years, he is passionate about breaking down complex technical concepts. After hundreds of experiments, Sebastian Raschka summarized his experience fine-tuning large models with LoRA and published it in the magazine Ahead of AI.
While preserving the author's original intent, this site has compiled the article as follows: Last month I shared an article about my LoRA experiments, based mainly on the open-source Lit-GPT library that my colleagues and I maintain at Lightning AI, discussing the main experiences and lessons learned from those experiments. In addition, I will answer some frequently asked questions related to LoRA. If you are interested in fine-tuning custom large language models, I hope these insights help you get started quickly. In short, the main points I discuss in this article include:
- Although LLM training (or, more generally, training models on GPUs) has unavoidable randomness, the results of multiple runs are still remarkably consistent.
- If you are limited by GPU memory, QLoRA provides a cost-effective compromise: it saves 33% of memory at the cost of a 39% increase in runtime.
- When fine-tuning an LLM, the choice of optimizer is not a major factor in the results. Whether it is AdamW, SGD with a scheduler, or AdamW with a scheduler, the impact on the outcome is minimal.
- Although Adam is often considered a memory-hungry optimizer because it introduces two extra values for every model parameter, this does not significantly affect the peak memory requirements of an LLM. Most of the memory is allocated to large-matrix multiplications rather than to holding the extra parameters.
- For static datasets, iterating over the data many times, as in multi-epoch training, may not help. It often leads to overfitting and worsens the results.
- If you incorporate LoRA, make sure it is applied across all layers, not only the key and value matrices, to maximize model performance.
- Tuning the LoRA rank and choosing an appropriate α value is crucial. As a tip, try setting α to twice the rank value.
- A single GPU with 14 GB of RAM can efficiently fine-tune models with up to 7 billion parameters in a matter of hours.
- With a static dataset, it is not possible to turn an LLM into an "all-rounder" that performs well on every baseline task. Solving this requires diversifying the data sources or using techniques beyond LoRA.
Also, I will answer ten frequently asked questions about LoRA. If readers are interested, I will write another, more comprehensive introduction to LoRA, including detailed code that implements LoRA from scratch. Today's article focuses mainly on key questions around using LoRA. Before we officially start, let's add some background knowledge.
Due to GPU memory limits, updating model weights during training is costly. For example, suppose we have a 7B-parameter language model represented by a weight matrix W. During backpropagation, the model needs to learn a matrix ΔW that updates the original weights so as to minimize the loss. The weight update is: W_updated = W + ΔW. If the weight matrix W contains 7B parameters, the weight update matrix ΔW also contains 7B parameters, and computing ΔW is very expensive in both compute and memory. LoRA, proposed by Edward Hu et al., decomposes the weight change ΔW into a low-rank representation. More precisely, it never explicitly computes ΔW; instead, LoRA learns a decomposed representation of ΔW during training, as shown in the figure below. This is the secret of how LoRA saves computational resources.
As shown above, decomposing ΔW means representing the large matrix ΔW with two smaller LoRA matrices, A and B. If A has the same number of rows as ΔW and B has the same number of columns as ΔW, we can write the decomposition as ΔW = AB. (AB is the matrix product of A and B.) How much memory does this save? It depends on the rank r, which is a hyperparameter. For example, if ΔW has 10,000 rows and 20,000 columns, it requires storing 200,000,000 parameters. If we choose A and B with r=8, then A has 10,000 rows and 8 columns, and B has 8 rows and 20,000 columns, giving 10,000×8 + 8×20,000 = 240,000 parameters, roughly 830 times fewer than 200,000,000. Of course, A and B cannot capture all the information covered by ΔW, but this is by design in LoRA. When using LoRA, we assume that the model's W is a large full-rank matrix that collects all the knowledge from the pre-training dataset; when we fine-tune the LLM, we do not need to update all the weights and can capture the core information of the update with far fewer weights than ΔW. This is how the low-rank update is implemented through the matrices A and B.
Although randomness is unavoidable when training LLMs (or any model) on a GPU, running multiple experiments with LoRA shows that the final LLM benchmark results are remarkably consistent across different test sets. This is a good basis for further comparative studies.
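To make the parameter arithmetic concrete, here is a minimal sketch of a LoRA-style layer in PyTorch, using the 10,000 × 20,000 example above (an illustration only, not the Lit-GPT implementation; the initialization and weight layout are simplifying assumptions):

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Minimal LoRA sketch: W stays frozen, only A and B are trained."""
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # Frozen pretrained weight W (here laid out for x @ W).
        self.W = nn.Parameter(torch.randn(in_dim, out_dim), requires_grad=False)
        # Low-rank factors: A is (in_dim, rank), B is (rank, out_dim), so A @ B has the shape of ΔW.
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.scaling = alpha / rank

    def forward(self, x):
        # W_updated = W + ΔW, with ΔW approximated by A @ B
        return x @ self.W + (x @ self.A @ self.B) * self.scaling

layer = LoRALayer(in_dim=10_000, out_dim=20_000, rank=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 10,000*8 + 8*20,000 = 240,000 instead of 200,000,000
```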
Please note that the consistency results above were obtained under default settings, with a small rank of r=8. Experimental details can be found in my other article. Article link: https://lightning.ai/pages/community/lora-insights/
QLoRA computation-memory trade-off
QLoRA, short for quantized LoRA, was proposed by Tim Dettmers et al. and is a technique that further reduces the memory footprint during fine-tuning. During backpropagation, QLoRA quantizes the pretrained weights to 4 bits and uses paged optimizers to handle memory spikes. I found that QLoRA saves 33% of GPU memory compared with regular LoRA. However, training time increases by 39% because of the additional quantization and dequantization of the pretrained model weights.
Default LoRA with 16-bit floating-point precision:
- Training time: 1.85 hours
QLoRA with 4-bit NormalFloat (NF4):
- Training time: 2.79 hours
- Memory usage: 14.18 GB
In addition, I found that model performance was barely affected, which shows that QLoRA training is a feasible alternative to regular LoRA training for getting around the common GPU memory bottleneck.
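The experiments here use Lit-GPT, but as a rough illustration of the same idea with a different stack, this is approximately how 4-bit NF4 quantization is enabled through Hugging Face transformers and bitsandbytes (the model name is a placeholder, and flag names may differ across library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style setup: the frozen pretrained weights are stored in 4-bit NF4
# and dequantized to bf16 on the fly for matrix multiplications.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # placeholder checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters (e.g., via the peft library) would then be added on top of this
# quantized base model and trained in 16-bit precision.
```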
A learning rate scheduler lowers the learning rate over the course of training to improve the model's convergence and avoid excessively large loss values.
Cosine annealing is a scheduler that follows a cosine curve to adjust the learning rate. It starts with a higher learning rate and then decreases smoothly, gradually approaching 0 in a cosine-like pattern. A common variant of cosine annealing is the half-period variant, where only half a cosine cycle is completed during training, as shown in the figure below.
In my experiments, I added a cosine annealing scheduler to the LoRA fine-tuning script, and it significantly improved the performance of SGD. However, it brings little benefit to the Adam and AdamW optimizers; the results barely change after adding it.
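As a minimal sketch of how such a half-cycle cosine schedule is typically attached to SGD in PyTorch (toy model, dummy loss, and placeholder step counts, not the actual fine-tuning script):

```python
import torch

model = torch.nn.Linear(10, 10)                       # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Half-cycle cosine annealing: decay the learning rate toward zero
# over the total number of training steps.
num_steps = 1_000                                     # placeholder
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)

for step in range(num_steps):
    x = torch.randn(8, 10)
    loss = model(x).pow(2).mean()                     # dummy loss to drive updates
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                  # update the learning rate each step
```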
In the next section, I will discuss a potential advantage of SGD over Adam. The Adam and AdamW optimizers are popular in deep learning, but if we are training a 7B-parameter model, Adam tracks an additional 14B values during training, which is equivalent to doubling the model's parameter count, all else being equal. SGD keeps no such additional state during training, so what advantage does it actually give over Adam in terms of peak memory?
In my experiments, training the 7B-parameter Llama 2 model with AdamW and LoRA defaults (r=8) required 14.18 GB of GPU memory. Training the same model with SGD required 14.15 GB. Compared with AdamW, SGD saves only 0.03 GB of memory, a negligible difference. Why so little? Because LoRA has already drastically reduced the number of trainable parameters. For example, with r=8, only 4,194,304 of the 7B Llama 2 model's 6,738,415,616 parameters are trainable LoRA parameters. At first glance 4,194,304 parameters may still sound like a lot, but they only occupy 4,194,304 × 2 × 16 bits = 134.22 megabits = 16.78 megabytes. (The observed difference of 0.03 GB ≈ 30 MB comes from the additional overhead of storing and copying the optimizer state.) Here, the factor 2 is the number of extra values Adam stores per parameter, and 16 bits is the default precision of the model weights.
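A quick back-of-the-envelope check of the arithmetic above (just the calculation, not a memory profiler):

```python
# AdamW keeps two extra states (first and second moments) per trainable parameter.
trainable_lora_params = 4_194_304       # LoRA parameters for Llama 2 7B with r=8
extra_states_per_param = 2
bits_per_value = 16                     # 16-bit training precision

extra_bits = trainable_lora_params * extra_states_per_param * bits_per_value
print(extra_bits / 1e6)                 # ~134.22 megabits of optimizer state
print(extra_bits / 8 / 1e6)             # ~16.78 megabytes
```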
However, if we increase the LoRA rank r from 8 to 256, the advantage of SGD over AdamW becomes apparent:
- Using AdamW will occupy 17.86 GB of memory
- Using SGD will occupy 14.46 GB
Therefore, when the matrices get larger, the memory saved by SGD starts to matter. Because SGD does not need to store additional optimizer state, it can save more memory than optimizers such as Adam when handling large models. This is an important advantage for memory-constrained training tasks.
In traditional deep learning, we usually iterate over the training set multiple times, and each pass is called an epoch. When training a convolutional neural network, for example, you typically run it for hundreds of epochs. Does multi-epoch training also help instruction fine-tuning? The answer is no: when I doubled the number of iterations over the 50k-example Alpaca instruction fine-tuning dataset, model performance dropped.
Therefore, I concluded that multiple passes over the data may not benefit instruction fine-tuning. I observed the same behavior on the 1k-example LIMA instruction fine-tuning set. The drop in model performance is probably caused by overfitting, and the exact reasons still need further investigation.
Using LoRA in more layers
The table below shows experiments in which LoRA is applied only to selected weight matrices (that is, the key and value matrices in each transformer layer). In addition, we can enable LoRA for the query weight matrix, the projection layer, the other linear layers between the multi-head attention blocks, and the output layer.
If we add LoRA to these additional layers, the number of trainable parameters for the 7B Llama 2 model increases fivefold, from 4,194,304 to 20,277,248. Applying LoRA to more layers can significantly improve model performance, but it also requires more memory. In addition, I only explored two settings: (1) LoRA enabled only for the query and value weight matrices, and (2) LoRA enabled for all layers; the effect of enabling LoRA for other combinations of layers deserves further study. If we knew, for example, whether using LoRA in the projection layer benefits the training results, we could optimize the model better and improve its performance.
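For reference, here is an illustrative layer-selection configuration in the spirit of the lora_* flags discussed in the FAQ later in this article, corresponding to setting (2), LoRA for all layers. Treat the names as illustrative rather than a verbatim snippet from the Lit-GPT repository:

```python
# Illustrative LoRA configuration: which weight matrices get LoRA adapters.
# The flag names mirror the lora_* settings mentioned in the FAQ below;
# check the Lit-GPT fine-tuning script for the exact current interface.
lora_r = 8
lora_alpha = 16
lora_dropout = 0.05

lora_query = True        # attention query projection
lora_key = True          # attention key projection
lora_value = True        # attention value projection
lora_projection = True   # attention output projection
lora_mlp = True          # feed-forward / MLP layers
lora_head = True         # output (LM) head
```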
Balancing LoRA hyperparameters: r and alpha
As stated in the paper that proposed LoRA, LoRA introduces an additional scaling factor. This coefficient is used to apply the LoRA weights to the pretrained weights during the forward pass. The scaling involves the rank parameter r discussed previously as well as another hyperparameter α (alpha), and is applied as follows:
scaling = alpha / r
weight += (lora_B @ lora_A) * scaling
As the formula shows, the larger alpha is, the greater the influence of the LoRA weights. In my earlier experiments I used r=8 and alpha=16, which gives a scaling of 2. When fine-tuning large models with LoRA, it is a common rule of thumb to set alpha to twice r. But I was curious whether this rule still holds for larger values of r.
I also tried r=32, r=64, r=128, and r=512 but omit those results here for clarity; r=256 did indeed give the best results, and choosing alpha = 2r did provide the optimal outcome.
Training a 7B-parameter model on a single GPU
LoRA allows us to fine-tune a large language model at the 7B-parameter scale on a single GPU. In this particular case, with the best QLoRA settings (r=256, alpha=512) and the AdamW optimizer, fine-tuning on 50k training examples (the Alpaca dataset here) took about 3 hours on an A100 and used 17.86 GB of memory.
In the remainder of this article, I will answer other questions you may have.
Q1: How important is the dataset? The dataset is crucial. I used the Alpaca dataset with 50k training examples, and I chose it because it is so popular. Since this article is already very long, test results on more datasets are not discussed here. Alpaca is a synthetic dataset that may be a bit outdated by today's standards. Data quality is critical. For example, in June I wrote a post discussing the LIMA dataset, a curated dataset of just one thousand examples. Article link: https://magazine.sebastianraschka.com/p/ahead-of-ai-9-llm-tuning-and-dataset
As the title of the paper that proposed LIMA says: for alignment, less is more. Although LIMA contains less data than Alpaca, the 65B Llama model fine-tuned on LIMA outperformed the one fine-tuned on Alpaca. Using the same configuration (r=256, alpha=512), I obtained model performance on LIMA similar to that on Alpaca, which has 50 times as much data.
Q2: Is LoRA suitable for domain adaptation? I don't have a clear answer to this question yet. As a rule of thumb, knowledge is usually absorbed from the pre-training dataset, and instruction fine-tuning mainly helps the LLM follow instructions better. Since compute is a key factor limiting the training of large language models, LoRA could also be used to further pretrain an existing pretrained LLM on domain-specific datasets. It is also worth noting that my experiments included two arithmetic benchmarks, and on both the LoRA-fine-tuned model performed significantly worse than the pretrained base model. I speculate that this is because the Alpaca dataset lacks corresponding arithmetic examples, causing the model to "forget" its arithmetic knowledge. Further research is needed to determine whether the model actually "forgot" this knowledge or simply stopped responding to the corresponding instructions. However, one conclusion can be drawn here: when fine-tuning an LLM, it is a good idea for the dataset to contain examples of every task we care about.
Q3: How do we determine the optimal value of r? I don't yet have a good answer to this. Determining the optimal r requires case-by-case analysis for each LLM and each dataset. I suspect that an r that is too large leads to overfitting, while an r that is too small may fail to capture the diversity of tasks in the dataset, and that the more task types a dataset contains, the larger the required r. For example, if I only need the model to perform basic two-digit arithmetic, a small r might suffice. However, this is only my hypothesis and needs further verification.
Q4: Does LoRA need to be enabled for all layers? I only explored two settings: (1) LoRA enabled only for the query and value weight matrices, and (2) LoRA enabled for all layers. The effect of enabling LoRA for other combinations of layers deserves further study. If we knew whether using LoRA in the projection layer benefits the training results, we could optimize the model better and improve its performance. Considering the various settings (lora_query, lora_key, lora_value, lora_projection, lora_mlp, lora_head), there are 64 combinations to explore, as the sketch below illustrates.
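The count of 64 simply comes from six independent on/off choices; a throwaway sketch that enumerates them (flag names taken from the settings listed above):

```python
from itertools import product

flags = ["lora_query", "lora_key", "lora_value",
         "lora_projection", "lora_mlp", "lora_head"]

# Every on/off combination of the six LoRA placement flags: 2**6 = 64 settings.
combinations = [dict(zip(flags, choice)) for choice in product([False, True], repeat=6)]
print(len(combinations))  # 64
```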
Q5: How do we avoid overfitting? Generally speaking, a larger r is more likely to cause overfitting, because r determines the number of trainable parameters. If your model is overfitting, first consider lowering r or increasing the dataset size. You can also try increasing the weight decay of the AdamW or SGD optimizer, or increasing the dropout value of the LoRA layers. I did not explore LoRA's dropout parameter in my experiments (I used a fixed dropout rate of 0.05); it is also a question worth studying.
Q6: Are there other optimizers worth considering? Sophia, released in May this year, is worth trying. Sophia is a scalable stochastic second-order optimizer for language-model pretraining. According to the paper "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training", Sophia is twice as fast as Adam and can achieve better performance. In short, unlike Adam, Sophia normalizes by gradient curvature rather than gradient variance. Paper link: https://arxiv.org/abs/2305.14342
Q7: Are there other factors that influence memory usage? Besides precision and quantization settings, model size, batch size, and the number of trainable LoRA parameters, the dataset also affects memory usage. Llama 2 has a block size of 4096 tokens, meaning it can process sequences of up to 4096 tokens at a time. With masking applied to the subsequent tokens, the training sequences become shorter, which can save a lot of memory. For example, the Alpaca dataset is relatively small, with a longest sequence of 1304 tokens. When I tried other datasets whose longest sequence was 2048 tokens, memory usage jumped from 17.86 GB to 26.96 GB.
Q8: Compared with full fine-tuning and RLHF, what advantages does LoRA offer? I did not experiment with RLHF, but I did try full fine-tuning. Full fine-tuning required at least 2 GPUs, occupied 36.66 GB per GPU, and took 3.5 hours to complete. However, its benchmark results were not good, probably due to overfitting or suboptimal hyperparameters.
Q9: Can LoRA weights be combined? Yes. During training, we keep the LoRA weights separate from the pretrained weights and add them in during each forward pass. Imagine a real-world application with multiple sets of LoRA weights, one set per user of the application; storing these weights separately makes sense to save disk space. Moreover, the pretrained weights and the LoRA weights can be merged after training to create a single model, so we don't have to apply the LoRA weights on every forward pass:
weight += (lora_B @ lora_A) * scaling
We can update the weights in the way shown above and save the merged weights. Similarly, we can keep adding many LoRA weight sets:
weight += (lora_B_set1 @ lora_A_set1) * scaling_set1
weight += (lora_B_set2 @ lora_A_set2) * scaling_set2
weight += (lora_B_set3 @ lora_A_set3) * scaling_set3
...
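As a minimal, self-contained illustration of the merge on a toy nn.Linear layer (not the Lit-GPT merge_lora.py script referenced below; the dimensions and values are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16, bias=False)   # stands in for a pretrained weight matrix

r, alpha = 4, 8
scaling = alpha / r
lora_A = torch.randn(r, 16) * 0.01      # (rank, in_features)
lora_B = torch.randn(16, r) * 0.01      # (out_features, rank)

x = torch.randn(2, 16)
# Unmerged: add the LoRA contribution on every forward pass.
out_unmerged = layer(x) + (x @ lora_A.t() @ lora_B.t()) * scaling

# Merged: fold the LoRA update into the weights once, then drop lora_A / lora_B.
with torch.no_grad():
    layer.weight += (lora_B @ lora_A) * scaling

out_merged = layer(x)
print(torch.allclose(out_unmerged, out_merged, atol=1e-6))  # True
```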
I have not run experiments to evaluate merging multiple LoRA weight sets this way, but it is already possible via the scripts/merge_lora.py script provided in Lit-GPT. Script link: https://github.com/Lightning-AI/lit-gpt/blob/main/scripts/merge_lora.py
Q10: How does layer-wise optimal rank adaptation perform? For simplicity, we usually set the same learning rate for every layer of a deep neural network. The learning rate is a hyperparameter we need to optimize, and, taking this further, we can choose a different learning rate for each layer (in PyTorch, this is not very complicated; see the sketch after the links below). However, this is rarely done in practice because it adds extra cost and there are already many other hyperparameters to tune in deep neural networks. Analogously to choosing different learning rates for different layers, we can also choose different LoRA r values for different layers. I haven't tried this yet, but there is an article that describes the method in detail: "LLM Optimization: Layer-wise Optimal Rank Adaptation (LORA)". In theory, this approach sounds promising and offers plenty of room for hyperparameter optimization. Article link: https://medium.com/@tom_21755/llm-optimization-layer-wise-optimal-rank-adaptation-lora-1444dfbc8e6a
Original link: https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms?continueFlag=0c2e38ff6893fba31f1492d815bf928b
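Regarding the parenthetical in Q10: per-layer learning rates in PyTorch are indeed straightforward to set up through optimizer parameter groups. A generic sketch with a toy model and illustrative rates (not tied to LoRA or Lit-GPT):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64),   # "lower" layer
    nn.ReLU(),
    nn.Linear(64, 10),   # "upper" layer
)

# One parameter group per layer, each with its own learning rate.
optimizer = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": 1e-5},
    {"params": model[2].parameters(), "lr": 1e-4},
])
```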