
RNN efficiency on par with the Transformer: Google's new architecture arrives in two back-to-back releases, stronger than Mamba at the same scale

王林 (original) · 2024-08-05 14:20:15

Last December, the new Mamba architecture set the AI community abuzz and mounted a challenge to the long-dominant Transformer. Now, Google DeepMind's release of Hawk and Griffin gives the AI community a new option.


This time, Google DeepMind has made a new move in foundation models.

Recurrent neural networks (RNNs) played a central role in the early days of deep learning and natural language processing research, and delivered practical results in many applications, including Google's first end-to-end machine translation system. In recent years, however, deep learning and NLP have been dominated by the Transformer architecture, which interleaves multi-layer perceptrons (MLPs) with multi-head attention (MHA).

In practice, Transformers have outperformed RNNs and are also very effective at exploiting modern hardware. Transformer-based large language models trained on massive datasets collected from the web have been remarkably successful.

Despite these great successes, the Transformer architecture still has shortcomings. For example, because of the quadratic complexity of global attention, Transformers are hard to scale efficiently to long sequences. Moreover, the key-value (KV) cache grows linearly with sequence length, which slows Transformers down at inference time. Recurrent language models offer an alternative: they compress the entire sequence into a fixed-size hidden state that is updated iteratively. But to replace Transformers, new RNN models must not only show comparable performance at scale, they must also achieve similar hardware efficiency.
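To make the memory argument concrete, here is a minimal back-of-the-envelope sketch (not from the paper; the layer count, head dimension, state width, and fp16 byte size are illustrative assumptions) contrasting how an attention KV cache grows with sequence length while a recurrent state stays fixed:

```python
# Toy memory comparison: attention KV cache vs. a fixed-size recurrent state.
# All sizes (layers, heads, dimensions, fp16 bytes) are illustrative assumptions.
BYTES_PER_VALUE = 2          # fp16
N_LAYERS = 32                # assumed decoder depth
KV_HEADS = 1                 # multi-query attention keeps a single K/V head
HEAD_DIM = 128               # assumed head dimension
STATE_DIM = 4096             # assumed fixed recurrent state width per layer

def kv_cache_bytes(seq_len: int) -> int:
    """The KV cache stores one key and one value vector per token, per layer."""
    return seq_len * N_LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_VALUE

def recurrent_state_bytes() -> int:
    """A recurrent model keeps one fixed-size state per layer, independent of length."""
    return N_LAYERS * STATE_DIM * BYTES_PER_VALUE

for seq_len in (1_024, 8_192, 65_536):
    print(f"seq_len={seq_len:>6}: KV cache ~ {kv_cache_bytes(seq_len) / 2**20:7.1f} MiB, "
          f"recurrent state ~ {recurrent_state_bytes() / 2**20:.2f} MiB")
```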

In a recent paper, Google DeepMind researchers propose the RG-LRU layer, a new gated linear recurrent layer, and design a new recurrent block around it to replace multi-query attention (MQA).

They use this recurrent block to build two new models: Hawk, which mixes MLPs with recurrent blocks, and Griffin, which mixes MLPs with recurrent blocks and local attention.


  • Paper title: Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  • Paper link: https://arxiv.org/pdf/2402.19427

The researchers report that Hawk and Griffin exhibit power-law scaling between held-out loss and training FLOPs, up to 7B parameters, as previously observed for Transformers. Among them, Griffin achieves slightly lower held-out loss than a strong Transformer baseline at every model size.

The researchers trained Hawk and Griffin on 300B tokens across a range of model sizes. The results show that Hawk-3B outperforms Mamba-3B on downstream tasks even though it was trained on only half as many tokens. Griffin-7B and Griffin-14B perform comparably to Llama-2 despite being trained on only about 1/7 as many tokens.


In addition, Hawk and Griffin achieve training efficiency comparable to Transformers on TPU-v3. Because diagonal RNN layers are memory-bound, the researchers use a kernel for the RG-LRU layer to achieve this.

At inference time as well, both Hawk and Griffin achieve higher throughput than the MQA Transformer, and lower latency when sampling long sequences. Griffin performs better than Transformers when evaluated on sequences longer than those seen during training, and it can efficiently learn copying and retrieval tasks from training data. However, when pretrained models are evaluated on copying and exact-retrieval tasks without fine-tuning, Hawk and Griffin perform worse than Transformers.

Co-author and DeepMind research scientist Aleksandar Botev says that Griffin, the model that combines gated linear recurrences with local attention, retains all the efficiency advantages of RNNs and the expressive power of Transformers, and can be scaled up to 14B parameters.

Griffin Model Architecture

All Griffin models contain the following components: (i) a residual block, (ii) an MLP block, and (iii) a temporal mixing block. Components (i) and (ii) are the same across all models, but there are three kinds of temporal mixing block: global multi-query attention (MQA), local (sliding-window) MQA, and the recurrent block proposed in this paper. As part of the recurrent block, the researchers use the Real-Gated Linear Recurrent Unit (RG-LRU), a new recurrent layer inspired by linear recurrent units.
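To give a flavour of the recurrence, below is a simplified NumPy sketch of the RG-LRU update as described in the paper: a recurrence gate and an input gate are computed from the input, the per-channel recurrence weight a is raised to the power c·r_t, and the hidden state is a gated, normalised mix of the previous state and the gated input. The weight shapes, the initialisation, and the direct use of a**(c*r) (the actual kernel works in log-space for numerical stability) are simplifications.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rg_lru(x, W_a, b_a, W_x, b_x, Lambda, c=8.0):
    """Simplified RG-LRU recurrence over a sequence x of shape (T, D).

    r_t = sigmoid(W_a x_t + b_a)              # recurrence gate
    i_t = sigmoid(W_x x_t + b_x)              # input gate
    a_t = a ** (c * r_t), with a = sigmoid(Lambda)
    h_t = a_t * h_{t-1} + sqrt(1 - a_t**2) * (i_t * x_t)
    """
    T, D = x.shape
    a = sigmoid(Lambda)                        # per-channel recurrence weight in (0, 1)
    h = np.zeros(D)
    out = np.zeros_like(x)
    for t in range(T):
        r = sigmoid(x[t] @ W_a + b_a)
        i = sigmoid(x[t] @ W_x + b_x)
        a_t = a ** (c * r)
        h = a_t * h + np.sqrt(1.0 - a_t**2) * (i * x[t])
        out[t] = h
    return out

# Tiny smoke test with random weights (shapes are illustrative).
rng = np.random.default_rng(0)
T, D = 6, 4
x = rng.normal(size=(T, D)).astype(np.float32)
W_a, W_x = rng.normal(scale=0.1, size=(D, D)), rng.normal(scale=0.1, size=(D, D))
b_a, b_x = np.zeros(D), np.zeros(D)
Lambda = rng.normal(size=D)
print(rg_lru(x, W_a, b_a, W_x, b_x, Lambda).shape)   # (6, 4)
```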

As shown in Figure 2(a), the residual block defines the global structure of the Griffin model and is inspired by pre-norm Transformers. After embedding the input sequence, we pass it through N such blocks (N denoting the model depth) and then apply RMSNorm to produce the final activations. To compute token probabilities, a final linear layer followed by a softmax is applied. The weights of this layer are shared with the input embedding layer.

[Figure 2: Griffin model architecture]
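A minimal sketch of the output head described above, with the final linear layer sharing weights with the input embedding (the vocabulary size, model width, and the omission of the learned RMSNorm scale are illustrative simplifications):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale, for brevity."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab, d_model, seq_len = 100, 16, 5
embed = rng.normal(scale=0.02, size=(vocab, d_model))   # shared input/output embedding

token_ids = rng.integers(0, vocab, size=seq_len)
h = embed[token_ids]              # embed the input sequence
# ... h would then pass through N residual blocks ...
h = rms_norm(h)                   # final RMSNorm
logits = h @ embed.T              # output layer shares weights with the embedding
probs = softmax(logits)
print(probs.shape)                # (5, 100)
```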

Recurrent models with scaling efficiency comparable to Transformers

Scaling studies provide important insights into how to tune model hyperparameters and how model behavior changes as models are scaled up.

The researchers defined the models evaluated in this study, provided scaling curves up to and beyond 7B parameters, and evaluated model performance on downstream tasks.

They considered 3 model families: (1) MQA-Transformer baseline; (2) Hawk: a pure RNN model; (3) Griffin: a hybrid model that mixes recurrent blocks with local attention. Key model hyperparameters for models of various sizes are defined in Appendix C.

The Hawk architecture uses the same residual pattern and MLP block as the Transformer baseline, but it uses a recurrent block with an RG-LRU layer as the temporal mixing block instead of MQA. The researchers expanded the width of the recurrent block by a factor of roughly 4/3 (i.e., D_RNN ≈ 4D/3) so that it roughly matches the parameter count of an MHA block when both use the same model dimension D.
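A rough parameter count illustrates where the 4/3 factor comes from, under the assumption (ours, for illustration) that the recurrent block's dense parameters are dominated by three projections of width D_RNN, while an MHA block has four D×D projections:

```python
# Back-of-the-envelope check of the ~4/3 width expansion (illustrative assumption:
# count only the dense projections, ignoring the conv and the diagonal RG-LRU weights).
D = 3072                          # example model dimension
D_rnn = 4 * D // 3                # expanded recurrent width

mha_params = 4 * D * D            # Q, K, V and output projections of an MHA block
recurrent_params = 3 * D * D_rnn  # two input projections + one output projection

print(f"MHA block:       {mha_params:,} parameters")
print(f"Recurrent block: {recurrent_params:,} parameters "
      f"(ratio {recurrent_params / mha_params:.3f})")
```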

Griffin. The main advantage of recurrent blocks compared to global attention is that they use a fixed state size to summarize sequences, whereas MQA's KV cache size grows proportionally to the sequence length. Local attention has the same properties, and mixing recurrent blocks with local attention preserves this advantage. The researchers found this combination to be extremely efficient because local attention can accurately model the recent past, while recurrent layers can convey information over long sequences.

Griffin uses the same residual pattern and MLP blocks as the Transformer baseline. But unlike the MQA Transformer baseline and the Hawk model, Griffin uses a mixture of recurrent blocks and MQA blocks. Specifically, it adopts a layered structure that alternates two residual blocks using recurrent blocks with one residual block using local (MQA) attention, as sketched below. Unless otherwise stated, the local attention window size is fixed at 1024 tokens.
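The resulting layer pattern can be written down as a small sketch (the block labels are illustrative, not identifiers from any released code):

```python
def griffin_layer_pattern(depth: int):
    """Alternate two recurrent residual blocks with one local-attention block.

    Block names are illustrative labels, not API identifiers.
    """
    kinds = ["recurrent", "recurrent", "local_mqa(window=1024)"]
    return [kinds[i % 3] for i in range(depth)]

print(griffin_layer_pattern(9))
# ['recurrent', 'recurrent', 'local_mqa(window=1024)',
#  'recurrent', 'recurrent', 'local_mqa(window=1024)',
#  'recurrent', 'recurrent', 'local_mqa(window=1024)']
```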

The main scaling results are shown in Figure 1(a). All three model families were trained at model sizes ranging from 100 million to 7 billion parameters, with an additional 14-billion-parameter version of Griffin. The evaluation results on downstream tasks are shown in Table 1:

[Table 1: downstream evaluation results]

Both Hawk and Griffin perform very well. The table above reports character-normalized accuracy for MMLU, HellaSwag, PIQA, ARC-E, and ARC-C, and absolute accuracy with partial scoring for WinoGrande. As model size increases, Hawk's performance improves significantly; Hawk-3B outperforms Mamba-3B on downstream tasks even though it was trained on only half as many tokens. Griffin-3B performs significantly better than Mamba-3B, and Griffin-7B and Griffin-14B perform comparably to Llama-2 even though they were trained on nearly 7x fewer tokens. Hawk is comparable to the MQA Transformer baseline, while Griffin outperforms it.

Training recurrent models efficiently on device

When developing and scaling the models, the researchers faced two major engineering challenges. First, how to efficiently shard the models across multiple devices. Second, how to implement linear recurrences efficiently so as to maximize TPU training efficiency. The paper discusses both challenges and then provides an empirical comparison of the training speeds of Griffin and the MQA baseline.

The researchers compared training speeds across different model sizes and sequence lengths to study the computational advantages of their models during training. For each model size, the total number of tokens per batch is kept fixed, which means that as the sequence length increases, the number of sequences decreases proportionally.
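Concretely, with an illustrative fixed token budget per batch, doubling the sequence length halves the number of sequences:

```python
TOKENS_PER_BATCH = 2**21          # illustrative fixed token budget per batch

for seq_len in (1024, 2048, 4096, 8192):
    n_sequences = TOKENS_PER_BATCH // seq_len
    print(f"sequence length {seq_len:>4} -> {n_sequences:>4} sequences per batch")
```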

Figure 3 plots the running time of the Griffin model relative to the MQA baseline model at a sequence length of 2048.

[Figure 3: training time relative to the MQA baseline]

Inference speed

LLM inference consists of two stages. The "prefill" stage receives and processes the prompt; this step effectively performs a forward pass over the model. Because the prompt can be processed in parallel across the whole sequence, most model operations in this stage are compute-bound. We therefore expect the relative speeds of Transformers and recurrent models during prefill to be similar to their relative speeds during training, discussed earlier.

Prefill is followed by the decoding stage, in which tokens are sampled autoregressively from the model. As shown below, recurrent models have lower latency and higher throughput during decoding, especially at longer sequence lengths, where the key-value (KV) cache used by attention becomes large.

There are two main metrics to consider when evaluating inference speed. The first is latency, which measures the time needed to generate a specified number of tokens at a given batch size. The second is throughput, which measures the maximum number of tokens per second that can be generated when sampling a specified number of tokens on a single device. Since throughput equals the number of tokens sampled multiplied by the batch size and divided by the latency, throughput can be increased either by reducing latency or by reducing memory use so that a larger batch fits on the device. Latency matters for real-time applications that require fast responses. Throughput is also worth considering because it tells us the maximum number of tokens that can be sampled from a given model in a given time. This property is attractive for other language applications, such as reinforcement learning from human feedback (RLHF) or scoring language-model outputs (as done in AlphaCode), where being able to emit a large number of tokens in a given time is valuable.
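As a quick illustration of how the two metrics relate (all numbers below are made up), throughput follows directly from the latency and batch size:

```python
def throughput_tokens_per_sec(batch_size: int, tokens_sampled: int, latency_sec: float) -> float:
    """Throughput = tokens sampled * batch size / time taken to sample them."""
    return tokens_sampled * batch_size / latency_sec

# Hypothetical numbers: the same number of sampled tokens at two different batch sizes.
print(throughput_tokens_per_sec(batch_size=16, tokens_sampled=1024, latency_sec=8.0))   # 2048.0
print(throughput_tokens_per_sec(batch_size=64, tokens_sampled=1024, latency_sec=10.0))  # 6553.6
```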

Here, the researchers study inference for models with 1B parameters. As a baseline, they compare against an MQA Transformer, which is significantly faster at inference than the standard MHA Transformer commonly used in the literature. The models compared are: (i) the MQA Transformer, (ii) Hawk, and (iii) Griffin. For each model, latency and throughput are reported.

As shown in Figure 4, the researchers compared the latency of the models at a batch size of 16, with an empty prefill and with a prefill of 4096 tokens.

[Figure 4: decoding latency comparison]

Figure 1(b) compares the maximum throughput (tokens/second) of the same models when sampling 512, 1024, 2048, and 4096 tokens after an empty prompt.

Long context modeling

The paper also explores how effectively Hawk and Griffin use longer contexts to improve next-token prediction, and investigates their ability to extrapolate at inference time. It also examines Griffin's performance on tasks that require copying and retrieval, both for models trained on such tasks and when these abilities are tested with pretrained language models.

From the plot on the left of Figure 5, we can observe that, up to a certain maximum length, both Hawk and Griffin improve next-token prediction with longer contexts, and overall they can extrapolate to sequences at least 4x longer than those seen during training. Griffin in particular extrapolates very well, even when RoPE is used in its local attention layers.

[Figure 5: next-token prediction over longer contexts]

As shown in Figure 6, all three models can solve the selective copying task perfectly. Comparing learning speed on this task, Hawk is significantly slower than the Transformer, which is similar to the observations of Jelassi et al. (2024), who found that Mamba learned significantly more slowly on a similar task. Interestingly, even though Griffin uses only local attention, its learning speed is barely affected and is on par with the Transformer's.

[Figure 6: learning curves on the selective copying task]
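For readers unfamiliar with the task, here is a hedged sketch of what a selective-copying example could look like: a few data tokens are scattered among noise tokens, and the model must reproduce only the data tokens, in order. The vocabulary, lengths, and exact format below are our own illustrative choices, not necessarily the paper's setup.

```python
import numpy as np

def make_selective_copy_example(seq_len=16, n_data_tokens=4, vocab_size=8, rng=None):
    """Scatter a few 'data' tokens among noise; the target is the data tokens in order.

    Token 0 is the noise/blank symbol; data tokens are drawn from 1..vocab_size-1.
    This format is an illustrative reconstruction, not the paper's exact setup.
    """
    rng = rng or np.random.default_rng(0)
    inputs = np.zeros(seq_len, dtype=int)                        # noise everywhere
    positions = np.sort(rng.choice(seq_len, n_data_tokens, replace=False))
    data = rng.integers(1, vocab_size, size=n_data_tokens)
    inputs[positions] = data                                     # drop data tokens in at random spots
    target = data                                                # model must copy them back in order
    return inputs, target

x, y = make_selective_copy_example()
print("input :", x)
print("target:", y)
```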

For more details, please read the original paper.
