


Improving the Transformer's core attention mechanism lets small models rival models twice their size!
In a high-scoring ICML 2024 paper, the Caiyun Technology team presents the DCFormer framework, which replaces the Transformer's core component, the multi-head attention module (MHA), with the proposed Dynamically Composable Multi-Head Attention (DCMHA).
DCMHA removes the fixed binding between the search-and-select circuit and the transformation circuit of each MHA attention head, allowing them to be combined dynamically based on the input, which fundamentally improves the model's expressive power.
Roughly speaking, where each layer originally had a fixed set of H attention heads, it can now dynamically combine up to H×H attention heads with almost the same parameter count and compute.
DCMHA is a plug-and-play replacement for MHA in any Transformer architecture, yielding a new general, efficient, and scalable architecture: DCFormer.
This work was jointly completed by researchers from Beijing University of Posts and Telecommunications and AI startup Caiyun Technology.
DCPythia-6.9B, a model the researchers built on DCFormer, outperforms the open source Pythia-12B in both pre-training perplexity and downstream task evaluation.
DCFormer models perform on par with Transformer models that require 1.7 to 2 times the compute.
What are the limitations of the multi-head attention module?
The scaling laws of large models tell us that as computing power grows, models get larger, are trained on more data, and perform better and better. Although no one can say how high the ceiling of this path is or whether it leads to AGI, it is currently the most common approach.
Beyond this, another question is worth asking: most current large models are based on the Transformer and are built by stacking Transformer blocks like building blocks. How much room for improvement is there in the Transformer block itself?
This is the basic question to be answered in model structure research, and it is also the starting point of the DCFormer work jointly completed by Caiyun Technology and Beijing University of Posts and Telecommunications.
In the Transformer's multi-head attention module (MHA), the attention heads work completely independently of one another.
This design has been very successful in practice because it is simple and easy to implement. However, it also has drawbacks: the attention score matrices are low-rank, which weakens expressive power, and the functions of different attention heads are repetitive and redundant, which wastes parameters and computing resources. For this reason, some research in recent years has tried to introduce some form of interaction between attention heads.
According to Transformer circuits theory, the behavior of each attention head in MHA is described by four weight matrices, WQ, WK, WV, and WO (where WO is obtained by slicing the output projection matrix of MHA).
Of these, WQWK is called the QK circuit (or search-and-select circuit), which determines which token or tokens in the context the current token attends to, for example:
WOWV is called the OV circuit (or projection-and-transformation circuit), which determines what information is retrieved from the attended token (or which of its attributes are projected) and written into the residual stream at the current position to predict the next token. For example:
The researchers noticed that searching (where to fetch from) and transforming (what to fetch) are two independent things that should be specified separately and combined freely on demand (just as in an SQL query, where the selection conditions after WHERE and the attribute projection after SELECT are written separately). MHA, however, forces them to be bundled together in the QK and OV of a single attention head, which limits flexibility and expressive power.
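The bundling described above can be made concrete with a minimal NumPy sketch of standard MHA. It is an illustration only, not the authors' code: the function names, tensor shapes, and the omission of a causal mask are assumptions made for brevity. Each head h uses its own QK circuit to decide where to look and is hard-wired to its own OV circuit for what to write.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mha(X, WQ, WK, WV, WO):
    """Standard multi-head attention (no causal mask, for brevity).

    X: (T, D); WQ, WK, WV: (H, D, d); WO: (H, d, D).
    Head h always pairs its own QK circuit with its own OV circuit.
    """
    H, D, d = WQ.shape
    out = np.zeros_like(X)
    for h in range(H):
        # QK circuit: where should each position look?
        scores = (X @ WQ[h]) @ (X @ WK[h]).T / np.sqrt(d)
        weights = softmax(scores, axis=-1)          # (T, T)
        # OV circuit: what is read from the attended positions and written back.
        out += weights @ (X @ WV[h]) @ WO[h]
    return out

rng = np.random.default_rng(0)
T, D, H, d = 5, 16, 4, 4
X = rng.normal(size=(T, D))
WQ, WK, WV = (rng.normal(size=(H, D, d)) for _ in range(3))
WO = rng.normal(size=(H, d, D))
Y = mha(X, WQ, WK, WV, WO)
print(Y.shape)  # (5, 16)
```

Summing the per-head outputs is equivalent to the usual "concatenate heads, then apply one output projection" formulation, with WO[h] being the slice of that projection belonging to head h.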
For example, suppose a model has attention heads A, B, and C whose QK and OV circuits can handle the examples above. If the input is then replaced with:
the QK and OV circuits of the existing attention heads need to be cross-combined, and the model may fail to adapt (as verified on a synthetic test set that the researchers constructed systematically).
What does dynamically composable multi-head attention look like?
With this as a starting point, the research team introduced a compose operation into MHA:
As shown in the figure below, DCMHA is obtained:
△Figure 1. Overall structure of DCMHA
The attention score matrix AS and the attention weight matrix AW, computed from QWQ and KWK, are each linearly mapped along the num_heads dimension to obtain a new matrix A' before being multiplied with VWV. Different linear mapping matrices (composition maps) achieve the effects of different attention head combinations.
For example, in Figure 2(c), the QK circuits of heads 3 and 7 are combined with the OV circuit of head 1 to form a "new" attention head.
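A minimal NumPy sketch of such a composition map, using the heads 3, 7, and 1 from the example. The helper name `compose_static`, the 0-based head indices, the equal 0.5 weights, and the random attention values are assumptions for illustration; only the idea of a linear map along the head dimension comes from the text.

```python
import numpy as np

def compose_static(A, W):
    """Linearly mix attention matrices along the head dimension.

    A: (H, T, S) per-head attention scores or weights.
    W: (H, H) composition map; W[h_in, h_out] is how much of input
       head h_in flows into composed head h_out.
    """
    return np.einsum("hts,hg->gts", A, W)

H, T = 8, 3
rng = np.random.default_rng(0)
A = rng.random(size=(H, T, T))

W = np.eye(H)      # the identity map reproduces plain MHA
W[:, 1] = 0.0      # rewire composed head 1 ...
W[3, 1] = 0.5      # ... to look where head 3 looks
W[7, 1] = 0.5      # ... and where head 7 looks
A_new = compose_static(A, W)

# A_new[1] is later multiplied with head 1's values, so the QK circuits of
# heads 3 and 7 end up paired with the OV circuit of head 1.
print(np.allclose(A_new[1], 0.5 * A[3] + 0.5 * A[7]))  # True
```

All other heads are left unchanged here because their columns in W are still those of the identity matrix.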
To maximize expressive power, the researchers want the mapping matrix to be generated dynamically from the input, that is, to determine dynamically how the attention heads are combined.
However, they need to generate not a single mapping matrix but one for every pair of a query Qi and a key Kj in the sequence. Generating all of these matrices directly would incur unacceptable computational overhead and memory usage.
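A quick back-of-the-envelope calculation shows why a separate H×H matrix per query-key pair is impractical. The sequence length and head count below are illustrative assumptions, not figures from the paper.

```python
def full_map_elements(seq_len, num_heads):
    """Element count when every (query, key) pair gets its own H x H map."""
    return seq_len * seq_len * num_heads * num_heads

T, H = 2048, 32  # illustrative values only
n = full_map_elements(T, H)
print(n)                     # 4294967296
print(n * 2 / 2**30, "GiB")  # 8.0 GiB at 2 bytes per element
```

That is the cost for a single layer and a single sequence, and it is H times larger than the attention matrices themselves (T × T × H elements).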
To this end, they further decompose the mapping matrix into the sum of an input-independent static matrix Wb, a low-rank matrix w1w2, and a diagonal matrix Diag(wg). These are responsible, respectively, for the basic combination, for dynamic combination between attention heads in a limited number of ways (that is, with rank R), and for dynamic gating of each head itself (see Figure 2(d) and Figure 3(b)). The latter two matrices are generated dynamically from the Q matrix and the K matrix.
This reduces the compute and parameter overhead to an almost negligible level without sacrificing quality (see the complexity analysis in the paper for details). Combined with implementation-level optimizations in JAX and PyTorch, DCFormer can train and run inference efficiently.
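The decomposition can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function `dynamic_compose`, the dictionary `P` of projection matrices, the use of plain linear projections to generate the low-rank factors, and the tanh gate are all hypothetical choices. What the sketch takes from the text is the structure: a static map Wb, low-rank terms of rank R, and per-head gates, with the dynamic parts generated from the query side and the key side.

```python
import numpy as np

def dynamic_compose(A, Q, K, Wb, P):
    """Compose attention matrices across heads with input-dependent weights.

    A: (H, T, S) scores or weights; Q: (T, D) query-side inputs;
    K: (S, D) key-side inputs; Wb: (H, H) static map.
    P: dict of hypothetical projections that generate the dynamic weights:
       P["q1"], P["q2"], P["k1"], P["k2"]: (D, H, R); P["qg"], P["kg"]: (D, H).
    """
    out = np.einsum("hts,hg->gts", A, Wb)           # static base combination
    # Query-wise low-rank branch: H heads -> R channels -> H heads.
    q1 = np.einsum("td,dhr->thr", Q, P["q1"])
    q2 = np.einsum("td,dhr->thr", Q, P["q2"])
    low = np.einsum("hts,thr->trs", A, q1)
    out += np.einsum("trs,tgr->gts", low, q2)
    # Key-wise low-rank branch: same idea, generated per key position.
    k1 = np.einsum("sd,dhr->shr", K, P["k1"])
    k2 = np.einsum("sd,dhr->shr", K, P["k2"])
    low = np.einsum("hts,shr->trs", A, k1)
    out += np.einsum("trs,sgr->gts", low, k2)
    # Dynamic per-head gates: the diagonal term of the decomposition.
    qg = np.tanh(Q @ P["qg"])                       # (T, H)
    kg = np.tanh(K @ P["kg"])                       # (S, H)
    out += A * qg.T[:, :, None] + A * kg.T[:, None, :]
    return out

rng = np.random.default_rng(0)
H, T, D, R = 4, 6, 8, 2
A = rng.random(size=(H, T, T))
X = rng.normal(size=(T, D))
P = {name: 0.1 * rng.normal(size=(D, H, R)) for name in ("q1", "q2", "k1", "k2")}
P["qg"] = 0.1 * rng.normal(size=(D, H))
P["kg"] = 0.1 * rng.normal(size=(D, H))
A_new = dynamic_compose(A, X, X, np.eye(H), P)
print(A_new.shape)  # (4, 6, 6)
```

The point of the decomposition is that no H×H matrix is ever materialized per query-key pair: the dynamic weights are generated once per position, with on the order of H×R values each, and then applied to the attention tensor.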
A key measure of a model architecture is how efficiently it converts compute into intelligence (or its performance-to-compute ratio), that is, the improvement in model performance gained per unit of compute invested: spending less compute to get a better model.
From the scaling law curves in Figure 4 and Figure 5 (in logarithmic coordinates, the loss of each model architecture plots as an approximately straight line against compute; the lower the loss, the better the model), it can be seen that DCFormer matches a Transformer model trained with 1.7 to 2 times the compute, that is, the efficiency of converting compute into intelligence improves by 1.7 to 2 times.
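How a "times the compute" figure can be read off two such straight lines is sketched below. The power-law form and all coefficients are hypothetical values chosen for illustration, not fits from the paper.

```python
def loss(compute, a, b):
    """Power-law fit L = a * C**b with b < 0: a straight line in log-log space."""
    return a * compute ** b

def compute_to_reach(target_loss, a, b):
    """Invert the fit: the compute needed to reach a given loss."""
    return (target_loss / a) ** (1.0 / b)

# Hypothetical fit coefficients for a baseline and an improved architecture.
base = dict(a=10.0, b=-0.05)
new = dict(a=9.7, b=-0.05)

C = 1e20  # compute budget given to the improved architecture
ratio = compute_to_reach(loss(C, **new), **base) / C
print(round(ratio, 2))  # about 1.84
```

With parallel lines (equal exponents), the ratio is the same at every compute budget, so a small constant vertical gap in log-log space translates into a constant compute multiplier.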
Downstream task evaluation
The research team trained two models, DCPythia-2.8B and DCPythia-6.9B, evaluated them on mainstream NLP downstream tasks, and compared them with the open source Pythia models of the same scale (training used the same hyperparameter settings as Pythia).
△Table 1. Performance of DCFormer and Pythia in downstream tasks
As Table 1 shows, DCPythia-2.8B and DCPythia-6.9B not only have lower ppl on the Pile validation set but also significantly outperform Pythia on most downstream tasks. DCPythia-6.9B even surpasses Pythia-12B in both ppl and average downstream accuracy.
DCFormer++2.8B improves further on DCPythia-2.8B, verifying the effectiveness of combining DCMHA with the Llama architecture.
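For readers unfamiliar with the ppl metric used above: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch with made-up token probabilities:

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has a perplexity
# of (approximately, up to floating point) 4.
print(perplexity([math.log(0.25)] * 10))
```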
Training and inference speed
Although DCMHA introduces additional training and inference overhead, Table 2 shows that DCFormer++ trains at 74.5%-89.2% of the speed of Transformer++ and runs inference at 81.1%-89.7% of its speed, and the additional overhead gradually shrinks as the number of model parameters grows.
△Table 2. Comparison of training and inference speeds between Transformer++ and DCFormer++
The training speeds were compared on a TPU v3 pod with a sequence length of 2048 and a batch_size of 1k; the inference speeds were evaluated on an A100 80G GPU with an input length of 1024 and a generation length of 128.
Ablation experiment
The results are as follows:
△Table 3. Ablation experiment of DCMHA
The following points can be seen from Table 3:
- Although adding static combination weights can reduce ppl, introducing dynamic combination weights can further reduce ppl, which illustrates the necessity of dynamic combination.
- Low-rank dynamic combination performs better than dynamic gating.
- The ppl obtained by using only query-wise or key-wise dynamic combination is very similar, and the gap with DCFormer++ is very small.
- Doing attention head combination after softmax is more effective than doing it before softmax, probably because the probability after softmax can more directly affect the output.
- The rank of the dynamic combination weights does not need to be set very high, which also indicates that the combination weights are low-rank.
In addition, the researchers also further reduced training and inference overhead by increasing the proportion of local attention layers and only using query-wise dynamic combination. See Table 10 of the paper for details.
In general, the research team has two conclusions.
About dynamic weights: Recent SSM and linear attention/RNN work such as Mamba, GLA, RWKV6, and HGRN has caught up with Transformer++ by introducing dynamic (input-dependent) weights. DCFormer, by dynamically combining attention heads, shows that when softmax attention is used, introducing dynamic weights can also greatly improve on Transformer++.
About model architecture innovation: This work shows that if there is an "ideal model architecture" with the ultimate efficiency in converting compute into intelligence, then the current Transformer architecture, although very powerful, is probably still far from that ideal and has vast room for improvement. Therefore, besides achieving results through brute-force scaling of compute and data, innovation in model architecture also has great potential.
The research team also stated that Caiyun Technology will be the first to apply DCFormer in its own products: Caiyun Weather, Caiyun Xiaoyi, and Caiyun Xiaomeng.
For more research details, please refer to the original paper.
ICML2024 paper link: https://icml.cc/virtual/2024/poster/34047.
Arxiv paper link: https://arxiv.org/abs/2405.08553.
Code link: https://github.com/Caiyun-AI/DCFormer.