ICML 2024 high-scoring paper! A modified attention mechanism lets small models take on models twice their size

WBOY
2024-06-10 20:18:19

By improving attention, the Transformer's core mechanism, small models can compete with models twice their size!

In a high-scoring ICML 2024 paper, the Caiyun Technology team built the DCFormer framework, which replaces the multi-head attention module (MHA), a core Transformer component, with the proposed dynamically composable multi-head attention (DCMHA).

DCMHA removes the fixed binding between the search-and-selection circuit and the transformation circuit of each MHA attention head, allowing them to be combined dynamically based on the input, which fundamentally improves the expressive power of the model.

Roughly speaking, where each layer used to have a fixed set of H attention heads, it can now dynamically combine up to H×H attention heads with almost the same parameter count and compute.

DCMHA is a plug-and-play replacement for MHA in any Transformer architecture, yielding DCFormer, a new architecture that is general, efficient, and scalable.

This work was jointly completed by researchers from Beijing University of Posts and Telecommunications and AI startup Caiyun Technology.

The model DCPythia-6.9B built by the researchers based on DCFormer is better than the open source Pythia-12B in terms of pre-training perplexity and downstream task evaluation.

DCFormer models match the performance of Transformer models that require 1.7 to 2 times as much compute.

What are the limitations of the multi-head attention module?

The scaling law of large models tells us that as compute grows, models get larger and are trained on more data, and their performance keeps improving. Although no one can say for sure how high the ceiling of this route is or whether it can reach AGI, it is indeed the most common approach at present.

But beyond this, another question is also worth considering: most current large models are based on the Transformer and are assembled from Transformer blocks stacked like building blocks. How much room for improvement is there in the Transformer itself as a building block?

This is the basic question to be answered in model structure research, and it is also the starting point of the DCFormer work jointly completed by Caiyun Technology and Beijing University of Posts and Telecommunications.

In the Transformer's multi-head attention module (MHA), the attention heads work completely independently of one another.

This design has been very successful in practice because it is simple and easy to implement. However, it also has drawbacks: the attention score matrices are low-rank, which weakens expressive power, and attention heads often duplicate one another's function, which wastes parameters and compute. For this reason, several research works in recent years have tried to introduce some form of interaction between attention heads.

According to the Transformer circuits framework, the behavior of each attention head in MHA is described by four weight matrices, W_Q, W_K, W_V and W_O (W_O is obtained by slicing MHA's output projection matrix).

Of these, W_Q W_K is called the QK circuit (or search-and-selection circuit). It determines which token or tokens in the context the current token attends to. For example:

[Figure: example of a QK circuit]

W_O W_V is called the OV circuit (or projection-and-transformation circuit). It determines what information is retrieved from the attended token (that is, which attributes are projected) and written into the residual stream at the current position, from which the next token is then predicted. For example:

[Figure: example of an OV circuit]
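In formulas, a single head i can be sketched as follows, using the row-vector convention that appears later in this article (QW_Q, KW_K, VW_V). This is the standard Transformer formulation rather than something stated in the article: the symbols X (token representations, one row per position), d_k (the per-head query/key dimension) and the scaling by the square root of d_k are assumptions, and the causal mask is omitted. With row vectors the OV product is written W_V W_O, which corresponds to W_O W_V in the column-vector convention.

```latex
\begin{aligned}
A^{(i)} &= \operatorname{softmax}\!\left(
            \frac{X\,W_Q^{(i)}\,{W_K^{(i)}}^{\top} X^{\top}}{\sqrt{d_k}}\right)
        && \text{QK circuit: where to look} \\
\mathrm{head}_i(X) &= A^{(i)}\,X\,W_V^{(i)}\,W_O^{(i)}
        && \text{OV circuit: what to write} \\
\mathrm{MHA}(X) &= \sum_{i=1}^{H} \mathrm{head}_i(X)
\end{aligned}
```

Because the same index i appears in the first two lines, each attention pattern A^(i) is tied to exactly one OV circuit.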

The researchers noticed that search (where to get it from) and transformation (what to get) are inherently two independent things, which should be specified separately and combined freely on demand (just as in an SQL query, where the selection conditions after WHERE and the attribute projection after SELECT are written separately). MHA, however, forces them to be "bundled" into the QK and OV circuits of a single attention head, which limits flexibility and expressive power.

For example, suppose a model has attention heads A, B, and C whose QK and OV circuits can complete the examples above. If the input is replaced with:

[Figure: an example that requires cross-combining heads]

the model needs to cross-combine the QK and OV circuits of its existing attention heads, and it may fail to "turn the corner" (as verified on a synthetic test set systematically constructed by the researchers).

What does dynamically composed multi-head attention look like?

With this as the starting point, the research team introduced a compose operation into MHA:

[Figure: definition of the compose operation]

The result is DCMHA, shown in Figure 1:

△Figure 1. Overall structure of DCMHA

Before the multiplication with VW_V, the attention score matrix A_S and the attention weight matrix A_W computed from QW_Q and KW_K are each linearly mapped along the num_heads dimension to obtain a new matrix A'. Different linear mapping matrices (composition maps) produce different combinations of attention heads.

For example, in Figure 2(c), the QK circuits of heads 3 and 7 are combined with the OV circuit of head 1 to form a "new" attention head.
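As a rough illustration of such a composition map (a minimal sketch in plain Python, not the authors' implementation), the snippet below mixes per-head attention matrices along the head dimension with a static H×H matrix. The function name `compose_heads`, the toy 2×2 attention patterns, and the 0.5/0.5 mixing weights are all assumptions made for this example. Row 0 of `comp_map` (head 1) takes its attention pattern from heads 3 and 7, so head 1's OV circuit ends up applied to a blend of the patterns found by the QK circuits of heads 3 and 7.

```python
def compose_heads(attn, comp_map):
    """Linearly mix attention matrices along the head dimension.

    attn:     list of H matrices (each T x S), one attention pattern per head
    comp_map: H x H composition map; comp_map[h][g] is the weight with which
              head g's attention pattern contributes to output head h
    """
    num_heads = len(attn)
    rows, cols = len(attn[0]), len(attn[0][0])
    return [
        [
            [sum(comp_map[h][g] * attn[g][i][j] for g in range(num_heads))
             for j in range(cols)]
            for i in range(rows)
        ]
        for h in range(num_heads)
    ]


H = 8
# Toy attention weights for 2 query positions x 2 key positions (rows sum to 1).
uniform = [[0.5, 0.5], [0.5, 0.5]]
attn = [[row[:] for row in uniform] for _ in range(H)]
attn[2] = [[1.0, 0.0], [0.5, 0.5]]  # head 3 (index 2)
attn[6] = [[1.0, 0.0], [0.0, 1.0]]  # head 7 (index 6)

# Start from the identity map (every head keeps its own pattern) ...
comp_map = [[1.0 if h == g else 0.0 for g in range(H)] for h in range(H)]
# ... then let head 1 (index 0) use an equal blend of heads 3 and 7 instead.
comp_map[0] = [0.0] * H
comp_map[0][2] = 0.5
comp_map[0][6] = 0.5

composed = compose_heads(attn, comp_map)
print(composed[0])  # the pattern that head 1's value/output projections now see
```

All other heads are left unchanged by the identity rows. In DCMHA the map is not fixed like this but depends on the input, which is what the following paragraphs describe.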

△Figure 2. Simplified typical composition map functions of 8 attention heads; light colors represent large values

To maximize expressive power, the researchers want the mapping matrix to be generated dynamically from the input, that is, to determine dynamically how the attention heads are combined.

However, they would need to generate not a single mapping matrix but one for every pair of a query Q_i at a source position and a key K_j at a destination position in the sequence. Generating all of these matrices directly would incur unacceptable compute and memory overhead.

To this end, they further decompose the mapping matrix into the sum of an input-independent static matrix W_b, a low-rank matrix w_1 w_2 and a diagonal matrix Diag(w_g). These are responsible, respectively, for the basic combination, the dynamic combination of attention heads in a limited number of ways (that is, with rank R), and the dynamic gating of each head itself (see Figure 2(d) and Figure 3(b)). The latter two matrices are generated dynamically from the Q and K matrices.

This reduces the compute and parameter overhead to an almost negligible level without sacrificing effectiveness (see the complexity analysis in the paper for details). Combined with implementation-level optimizations in JAX and PyTorch, DCFormer can be trained and run for inference efficiently.
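To see why this decomposition is cheap, the sketch below applies the three terms to one vector of per-head attention values without ever building the H×H mapping matrix, and checks the result against a dense reference. It is a minimal illustration under stated assumptions, not the paper's code: the function names, the shapes (w1 and w2 as H×R factors) and all numbers are hypothetical, and the way DCMHA actually generates w1, w2 and w_g from Q and K is not modeled here (they are simply given as inputs).

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def compose_factored(a, w_b, w1, w2, w_g):
    """Apply (W_b + w1 w2^T + Diag(w_g)) to `a` without building that matrix.

    a:      length-H vector, one (query, key) pair's attention value per head
    w_b:    H x H static matrix (input-independent)
    w1, w2: H x R factors of the low-rank term (input-dependent in DCMHA)
    w_g:    length-H gate, the diagonal of Diag(w_g) (input-dependent in DCMHA)
    """
    num_heads, rank = len(a), len(w1[0])
    base = matvec(w_b, a)  # static combination
    # Project from H heads down to R dimensions, then expand back to H.
    hidden = [sum(w2[h][r] * a[h] for h in range(num_heads)) for r in range(rank)]
    low_rank = [sum(w1[h][r] * hidden[r] for r in range(rank))
                for h in range(num_heads)]
    gated = [g * x for g, x in zip(w_g, a)]  # per-head gating
    return [b + l + g for b, l, g in zip(base, low_rank, gated)]


def compose_dense(a, w_b, w1, w2, w_g):
    """Reference version that materializes the full H x H mapping matrix."""
    num_heads, rank = len(a), len(w1[0])
    full = [
        [w_b[h][g]
         + sum(w1[h][r] * w2[g][r] for r in range(rank))
         + (w_g[h] if h == g else 0.0)
         for g in range(num_heads)]
        for h in range(num_heads)
    ]
    return matvec(full, a)


# Toy values with H = 4 heads and rank R = 1 (all numbers are made up).
a = [0.5, 0.25, 0.125, 0.125]
w_b = [[1.0 if h == g else 0.0 for g in range(4)] for h in range(4)]
w1 = [[1.0], [0.0], [0.5], [0.0]]
w2 = [[0.0], [2.0], [0.0], [1.0]]
w_g = [0.5, -0.5, 0.0, 1.0]

factored = compose_factored(a, w_b, w1, w2, w_g)
dense = compose_dense(a, w_b, w1, w2, w_g)
assert all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(factored, dense))
```

In the factored form, the input-dependent part costs on the order of H·R multiplications per query-key pair instead of the H·H needed to build a fresh matrix for every pair, and the static W_b is shared across all pairs.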

△Figure 3. Computation of Compose

Scaling behavior

To evaluate the quality of an architecture, the core metric the researchers focus on is the efficiency of converting compute into intelligence (or the performance-to-compute ratio), that is, the model performance improvement gained per unit of compute invested: spending less compute to get a better model.

The scaling-law curves in Figure 4 and Figure 5 (in logarithmic coordinates, the loss of each model architecture is approximately a straight line as a function of compute, and the lower the loss, the better the model) show that DCFormer matches a Transformer model trained with 1.7 to 2 times the compute, that is, the compute-to-intelligence conversion rate improves by a factor of 1.7 to 2.
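The "times the compute" figure can be read off two such straight lines as follows. This is a small sketch that assumes each curve is a power law L(C) = a·C^(-b), which is what a straight line in log-log coordinates means. The coefficients are made up for illustration and are not the paper's fitted values, and the function names are hypothetical.

```python
import math


def loss(a, b, compute):
    """Power-law fit L(C) = a * C**(-b): a straight line in log-log coordinates."""
    return a * compute ** (-b)


def equivalent_compute_ratio(baseline, improved, compute):
    """How many times more compute the baseline needs to match the loss that
    the improved architecture reaches at `compute`.

    baseline, improved: (a, b) coefficient pairs of the power-law fits
    """
    target = loss(*improved, compute)
    a_base, b_base = baseline
    # Invert target = a_base * C**(-b_base) to find the baseline's compute C.
    needed = (a_base / target) ** (1.0 / b_base)
    return needed / compute


# Made-up coefficients, NOT the paper's fits: same slope, lower intercept.
baseline_fit = (10.0, 0.5)
improved_fit = (10.0 / math.sqrt(2.0), 0.5)

ratio = equivalent_compute_ratio(baseline_fit, improved_fit, compute=100.0)
print(f"baseline needs about {ratio:.2f}x the compute")
```

When the two lines have the same slope, as in this toy case, the ratio is the same at every compute budget; when the slopes differ, it changes with scale.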

△Figure 4. Scaling behavior of Transformer and DCFormer
△Figure 5. Scaling behavior of Pythia and DCPythia

How should this improvement be understood?

Since the Transformer was introduced in 2017, GLU MLP and rotary position embedding (RoPE) are two of the few architectural changes that, from the perspective of improving the performance-to-compute ratio, have been proven universally effective and widely adopted in practice.

The architecture that adds these two improvements to the original Transformer is also called Transformer++, and the strongest open-source models such as Llama and Mistral all use it. Whether the base architecture is Transformer or Transformer++, DCMHA brings significant improvements.

At the 1.4B model scale, the improvement from DCMHA is greater than the combined improvement of the two Transformer++ changes, and it scales better (compare the blue-green and black lines in Figure 4: the improvement from DCMHA decays more slowly as compute increases; also compare Figures 4 and 5).

It can be said that DCFormer takes Transformer's capabilities to a new level.

Downstream task evaluation

The research team trained two models, DCPythia-2.8B and DCPythia-6.9B, evaluated them on mainstream NLP downstream tasks, and compared them with the open-source Pythia models of the same size (training used the same hyperparameter settings as Pythia).

△Table 1. Performance of DCFormer and Pythia in downstream tasks

As Table 1 shows, DCPythia-2.8B and DCPythia-6.9B not only achieve lower perplexity (ppl) on the Pile validation set but also significantly outperform Pythia on most downstream tasks. DCPythia-6.9B even beats Pythia-12B in both ppl and average downstream accuracy.

DCFormer++2.8B improves further on DCPythia-2.8B, verifying the effectiveness of combining DCMHA with the Llama architecture.

Training and inference speed

Although DCMHA introduces additional training and inference overhead, Table 2 shows that DCFormer++ trains at 74.5% to 89.2% of the speed of Transformer++ and runs inference at 81.1% to 89.7% of its speed, and the additional overhead gradually shrinks as the number of model parameters grows.

△Table 2. Comparison of training and inference speeds between Transformer++ and DCFormer++

Training speed was compared on a TPU v3 pod with a sequence length of 2048 and a batch_size of 1k; inference speed was evaluated on an A100 80G GPU with an input length of 1024 and a generation length of 128.

Ablation experiment

The results are as follows:

△Table 3. Ablation experiment of DCMHA

The following points can be seen from Table 3:

  • Although adding static combination weights can reduce ppl, introducing dynamic combination weights can further reduce ppl, which illustrates the necessity of dynamic combination.
  • Low-rank dynamic combination performs better than dynamic gating.
  • The ppl obtained by using only query-wise or key-wise dynamic combination is very similar, and the gap with DCFormer++ is very small.
  • Doing attention head combination after softmax is more effective than doing it before softmax, probably because the probability after softmax can more directly affect the output.
  • The rank of the dynamic combination weight does not need to be set too large, which also illustrates the low rank of the combination weight.

In addition, the researchers also further reduced training and inference overhead by increasing the proportion of local attention layers and only using query-wise dynamic combination. See Table 10 of the paper for details.

In general, the research team has two conclusions.

About dynamic weights: recent SSM and linear attention/RNN work such as Mamba, GLA, RWKV6 and HGRN has caught up with Transformer++ by introducing dynamic (input-dependent) weights. DCFormer, by dynamically combining attention heads, shows that introducing dynamic weights can also greatly improve on Transformer++ when softmax attention is used.

About model architecture innovation: this work suggests that if there is an "ideal model architecture" with the ultimate efficiency in converting compute into intelligence, then the current Transformer architecture, powerful as it is, is probably still far from that ideal, with vast room for improvement. Therefore, beyond the brute-force route of achieving miracles by stacking compute and data, innovation in model architecture also has great potential.

The research team also stated that Caiyun Technology will be the first to apply DCFormer in its products Caiyun Weather, Caiyun Xiaoyi, and Caiyun Xiaomeng.

For more research details, please refer to the original paper.

ICML2024 paper link: https://icml.cc/virtual/2024/poster/34047.
Arxiv paper link: https://arxiv.org/abs/2405.08553.
Code link: https://github.com/Caiyun-AI/DCFormer.
