


In 2018, Google released BERT, which immediately surpassed the state-of-the-art (SOTA) results on 11 NLP tasks and became a new milestone for the NLP community. The structure of BERT is shown in the figure below: the left side is the pre-training process, and the right side is the fine-tuning process for specific downstream tasks such as text classification, part-of-speech tagging, and question answering. BERT can be fine-tuned on different tasks without changing its structure. Its "pre-trained language model + downstream-task fine-tuning" design delivered strong results, and this paradigm has since become the mainstream training approach in NLP.
BERT structure diagram: the left side shows the pre-training process, and the right side shows the fine-tuning process for specific tasks
However, with the emergence of large-scale models such as GPT-3, the parameter counts of large language models (LLMs) have grown to the point where full fine-tuning on consumer-grade hardware is no longer feasible. The following table compares the CPU/GPU memory consumption of full fine-tuning and parameter-efficient fine-tuning on an A100 GPU (80 GB of GPU memory) paired with 64 GB or more of CPU memory.
Comparison of memory usage between full parameter fine-tuning and parameter-efficient fine-tuning
In addition, full fine-tuning can also cause the model to lose diversity and suffer from serious forgetting. How to fine-tune models efficiently has therefore become a focus of industry research, which in turn has created room for the rapid development of parameter-efficient fine-tuning techniques.
Parameter-efficient fine-tuning refers to fine-tuning only a small number of model parameters, or a small number of additional parameters, while keeping the vast majority of the pre-trained LLM's parameters frozen. This greatly reduces computation and storage costs while achieving performance comparable to full fine-tuning. In some cases, parameter-efficient fine-tuning even outperforms full fine-tuning and generalizes better to out-of-domain scenarios.
Parameter-efficient fine-tuning techniques can be roughly divided into three categories, as shown in the figure below: adding additional parameters (A), selecting a subset of parameters to update (S), and introducing reparameterization (R). Methods that add additional parameters fall into two main subcategories: adapter-like methods and soft prompts.
Common parameter-efficient fine-tuning techniques include BitFit, Prefix Tuning, Prompt Tuning, P-Tuning, Adapter Tuning, LoRA, and others. The following chapters explain some of the mainstream methods in detail.
Common parameter-efficient fine-tuning techniques and methods
BitFit/Prefix/Prompt Fine-tuning Series
BitFit
Although full fine-tuning for each task is very effective, it also produces a uniquely large model per task, which makes it difficult to infer what changed during fine-tuning, difficult to deploy, and especially difficult to maintain as the number of tasks grows.
Ideally, we would like an efficient fine-tuning method that meets the following conditions:
- matches the results of full fine-tuning;
- changes only a small number of model parameters;
- allows training data to arrive as a stream rather than all at once, making hardware deployment easier;
- keeps the set of changed parameters consistent across different downstream tasks.
Whether these conditions can be met depends on the extent to which fine-tuning guides the learning of new abilities versus exposing abilities already acquired during pre-training. Earlier efficient fine-tuning methods such as Adapter Tuning and Diff Pruning only partially satisfy these requirements, whereas BitFit, a sparse fine-tuning method with even fewer trainable parameters, satisfies all of them.
BitFit is a sparse fine-tuning method that updates only the bias parameters (or a subset of them) during training. For Transformer models, most of the transformer-encoder parameters are frozen, and only the bias parameters and the classification-layer parameters of the specific task are updated. The biases involved include those used when computing the query, key and value, the bias used when merging the outputs of multiple attention heads, the biases in the MLP layers, and the biases in the LayerNorm layers, as shown in the figure below.
The PLM module represents a particular PLM sub-layer, such as attention or FFN. The orange blocks in the figure indicate the trainable prompt vectors, and the blue blocks indicate the frozen pre-trained model parameters.
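As a minimal sketch of the BitFit idea (assuming a Hugging Face BERT classifier; the model name, label count, and learning rate are illustrative, not taken from the original article), training can be restricted to bias terms plus the task-specific head by freezing everything else:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# BitFit: freeze everything except bias terms and the task-specific classifier head.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".bias") or name.startswith("classifier.")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable), "trainable parameters")
```

A normal training loop then updates only these bias and classifier parameters; all other weights keep their pre-trained values.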
In models such as BERT-Base and BERT-Large, bias parameters account for only about 0.08%~0.09% of the model's total parameters. However, comparing BitFit, Adapter and Diff-Pruning on BERT-Large over the GLUE benchmark shows that BitFit, despite having far fewer trainable parameters, matches Adapter and Diff-Pruning in effectiveness and is even slightly better on some tasks.
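The 0.08%~0.09% figure is easy to verify; the following sketch (using Hugging Face Transformers, with an illustrative model name) counts the share of bias parameters in BERT-Base:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
total = sum(p.numel() for p in model.parameters())
bias = sum(p.numel() for n, p in model.named_parameters() if n.endswith(".bias"))
print(f"bias parameters: {bias:,} / {total:,} = {bias / total:.4%}")  # on the order of 0.09%
```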
The experimental results show that, compared with full-parameter fine-tuning, BitFit updates only a very small fraction of the parameters yet achieves good results on multiple datasets. Although it does not quite match full fine-tuning, it is far better than the Frozen baseline, which fixes all model parameters. Comparing the parameters before and after BitFit training also reveals that many bias parameters barely change, such as the biases involved in computing the keys, while the biases that change most noticeably are those used when computing the queries and those in the FFN layer that expands the feature dimension from N to 4N. Updating only these two kinds of bias parameters still achieves good results; conversely, fixing either of them significantly hurts the model.
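A hedged sketch of this stricter variant (parameter names follow the Hugging Face BERT implementation and are stated as an assumption, not as the paper's code) trains only the query bias and the intermediate FFN bias that expands the hidden size from N to 4N:

```python
# Keep only b_q (query bias) and the intermediate FFN bias trainable, freeze the rest.
# In practice the task-specific classification head would normally stay trainable too.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("attention.self.query.bias") or name.endswith(
        "intermediate.dense.bias"
    )
```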
Prefix Tuning
The work before Prefix Tuning mainly involved manually designing discrete templates or automatically searching for them. With manually designed templates, the final performance of the model is extremely sensitive to the template: adding a word, omitting a word, or changing a word's position can cause large swings. Automated template search is expensive, and the discrete tokens it finds may well be suboptimal. Moreover, the traditional fine-tuning paradigm fine-tunes the pre-trained model separately for each downstream task, so a fine-tuned copy of the weights must be stored per task; this is both slow to train and costly to store. Motivated by these two points, Prefix Tuning keeps the pre-trained LM fixed and adds trainable, task-specific prefixes, so that only a small prefix needs to be stored per task and the fine-tuning cost is low. At the same time, this prefix consists of continuous, differentiable virtual tokens (soft prompts / continuous prompts), which are easier to optimize and work better than discrete tokens.
So what does the prefix mean? The role of the prefix is to guide the model to extract the information in x that is relevant to generating y. For example, for a summarization task, after fine-tuning the prefix can encode that the current task is "summarization-style" and guide the model to extract the key information from x; for a sentiment classification task, the prefix can guide the model to extract the sentiment-related semantics in x, and so on. This explanation is not entirely rigorous, but it gives a rough sense of what the prefix does.
Prefix Tuning constructs a set of task-specific virtual tokens as a prefix in front of the input tokens; during training, only the prefix parameters are updated, while the other parameters of the PLM stay fixed. Different model architectures require different prefixes:
- For autoregressive architectures: a prefix is added in front of the sentence, giving z = [PREFIX; x; y]. With the LM fixed, an appropriate prefix acts as context that guides generation (analogous to in-context learning in GPT-3).
- For encoder-decoder architectures: prefixes are added to both the encoder and the decoder, giving z = [PREFIX; x; PREFIX'; y]. The prefix on the encoder side guides the encoding of the input, and the prefix on the decoder side guides the generation of subsequent tokens.
In the upper part (fine-tuning), all Transformer parameters (the red boxes) are updated and a complete copy of the model must be stored for each task. In the lower part (prefix tuning), the Transformer parameters are frozen and only the prefix (the red box) is optimized.
This method is similar in spirit to constructing a prompt, except that a prompt is a manually constructed, "explicit" hint whose parameters cannot be updated, whereas a prefix is a learnable, "implicit" hint. In addition, to prevent direct optimization of the prefix parameters from causing unstable training and performance degradation, an MLP is placed in front of the prefix layer; after training, only the prefix parameters are kept. Ablation experiments also show that tuning the embedding layer alone is not expressive enough and leads to a significant performance drop, so prefix parameters are added at every layer, which is a major change.
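For intuition, here is a minimal PyTorch sketch (not the authors' code; the class name, hyper-parameters, and shapes are illustrative assumptions) of how such a reparameterized prefix can be produced: a small embedding table is passed through an MLP and reshaped into per-layer key/value prefixes that are concatenated with the frozen model's own keys and values in every attention layer.

```python
import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    """Reparameterized prefix: an embedding table passed through an MLP,
    producing per-layer key/value prefixes (illustrative sketch)."""

    def __init__(self, prefix_len=20, n_layers=12, n_heads=12, head_dim=64, hidden=512):
        super().__init__()
        self.prefix_len, self.n_layers = prefix_len, n_layers
        self.n_heads, self.head_dim = n_heads, head_dim
        d_model = n_heads * head_dim
        self.prefix_tokens = nn.Embedding(prefix_len, d_model)
        # MLP reparameterization used during training to stabilize optimization.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_layers * 2 * d_model),
        )

    def forward(self, batch_size):
        idx = torch.arange(self.prefix_len)
        out = self.mlp(self.prefix_tokens(idx))              # (prefix_len, n_layers * 2 * d_model)
        out = out.view(self.prefix_len, self.n_layers, 2, self.n_heads, self.head_dim)
        out = out.permute(1, 2, 3, 0, 4)                     # (n_layers, 2, n_heads, prefix_len, head_dim)
        # Expand over the batch; each layer gets a (key, value) prefix that is
        # concatenated with the frozen model's own keys/values.
        return out.unsqueeze(2).expand(-1, -1, batch_size, -1, -1, -1)
```

After training, the MLP can be discarded and only the resulting per-layer prefix key/value tensors need to be stored for each task.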
Although Prefix Tuning looks convenient, it also has two significant disadvantages: the prefix is difficult to optimize, and model performance does not increase monotonically with the number of prefix parameters; moreover, reserving part of the sequence length for the prefix reduces the effective input length available to the downstream task.
Prompt Tuning
Full fine-tuning of large models trains a separate model for each task, which is expensive to train and deploy. At the same time, discrete prompts (manually designing prompt text and prepending it to the input) are costly to produce and do not work very well. Prompt Tuning instead learns prompts by updating their parameters through back-propagation rather than by manual design; it freezes the original model weights and trains only the prompt parameters. After training, the same model can be used for multi-task inference.
Model tuning requires making a task-specific copy of the entire pre-trained model for each downstream task, and inference must be performed in separate batches. Prompt Tuning only requires storing a small task-specific prompt for each task and enables mixed-task inference with the original pre-trained model.
Prompt Tuning can be viewed as a simplified version of Prefix Tuning. It defines its own prompt for each task and concatenates the prompt with the input data, but the prompt tokens are added only at the input layer, and no MLP is needed to work around training difficulties.
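A minimal sketch of this input-layer scheme (assuming a frozen Hugging Face backbone; the class name, prompt length, and initialization are illustrative assumptions) prepends a block of learnable embeddings to the token embeddings, and the result is then fed to the model via `inputs_embeds`:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings prepended at the input layer only (Prompt Tuning sketch)."""

    def __init__(self, word_embeddings: nn.Embedding, prompt_len: int = 20):
        super().__init__()
        self.word_embeddings = word_embeddings  # frozen embedding table of the backbone
        # One common choice: initialize the soft prompt from real vocabulary embeddings.
        init_ids = torch.randint(0, word_embeddings.num_embeddings, (prompt_len,))
        with torch.no_grad():
            init = word_embeddings(init_ids).clone()
        self.prompt = nn.Parameter(init)

    def forward(self, input_ids):
        tok = self.word_embeddings(input_ids)                           # (batch, seq, dim)
        prompt = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1)   # (batch, prompt_len, dim)
        return torch.cat([prompt, tok], dim=1)
```

The attention mask must be extended by `prompt_len` positions accordingly, and only `self.prompt` is passed to the optimizer; the backbone stays frozen.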
Experiments show that as the number of parameters of the pre-trained model increases, Prompt Tuning approaches the results of full-parameter fine-tuning. The paper also proposes Prompt Ensembling, i.e., training several different prompts for the same task simultaneously within one batch (asking the same question in several different ways); this is equivalent to training several models, but at a much lower cost than conventional model ensembling. The Prompt Tuning paper further studies how the initialization and length of the prompt tokens affect performance. Ablations show that initializing the prompt from class-label embeddings works better than random initialization or initialization from sampled vocabulary embeddings, although this gap vanishes as the model scale increases. A prompt length of around 20 tokens already performs well (going beyond 20 brings no significant improvement), and this gap likewise shrinks with model scale (for very large models, even a very short prompt has little impact on performance).
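The class-label initialization mentioned above can be sketched as follows (the label words, `tokenizer`, `word_embeddings`, and `soft_prompt` objects continue the earlier sketch and are assumptions, not the paper's code):

```python
# Initialize the first prompt positions from the embeddings of the class-label tokens.
label_ids = tokenizer(
    "positive negative", add_special_tokens=False, return_tensors="pt"
).input_ids[0]
with torch.no_grad():
    label_embeds = word_embeddings(label_ids)
    n = min(label_embeds.size(0), soft_prompt.prompt.size(0))
    soft_prompt.prompt.data[:n] = label_embeds[:n]
```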
