
RNN model challenges Transformer hegemony! At 1% of the cost, performance comparable to Mistral-7B, with support for 100+ languages, the most in the world

WBOY
2024-02-19 21:30:39

While large models keep rolling out, the Transformer's dominance is being challenged again and again.

Recently, RWKV released the Eagle 7B model, based on the latest RWKV-v5 architecture.

Eagle 7B excels in multilingual benchmarks and is on par with top models in English tests.

At the same time, Eagle 7B uses an RNN architecture: compared with Transformer models of the same size, its inference cost is 10 to 100+ times lower, making it arguably the most environmentally friendly 7B model in the world.

Since the RWKV-v5 paper may not be released until next month, here is the original RWKV paper, which describes the first non-Transformer architecture to be scaled to tens of billions of parameters.


Paper address: https://arxiv.org/pdf/2305.13048.pdf

This work was accepted by EMNLP 2023. The authors come from top universities, research institutions and technology companies around the world.

The official Eagle 7B artwork shows an eagle soaring over Transformers.


Eagle 7B

Trained on 1.1T (trillion) tokens of data spanning more than 100 languages, Eagle 7B ranks first by average score on the multilingual benchmarks shown below.

The benchmarks include xLAMBADA, xStoryCloze, xWinograd, and xCopa, covering 23 languages and testing commonsense reasoning in each of them.

Eagle 7B took first place in three of the four; in the remaining one it ranked second behind Mistral-7B, whose training data volume is far larger than Eagle's.

[Figure: average scores on the multilingual benchmarks]

The English evaluation below consists of 12 separate benchmarks covering commonsense reasoning and world knowledge.

In the English tests, Eagle 7B, trained on about 1T tokens, comes close to Falcon (1.5T), LLaMA2 (2T), and Mistral (>2T), and is on par with MPT-7B, which was likewise trained on roughly 1T tokens.

[Figure: results across the 12 English benchmarks]

And in both evaluations, the new v5 architecture is a huge overall leap over the previous v4.

Eagle 7B is currently hosted by the Linux Foundation and licensed under Apache 2.0, allowing unrestricted personal or commercial use.

Multi-language support

As mentioned earlier, Eagle 7B's training data covers more than 100 languages, yet the four multilingual benchmarks used above only cover 23 of them.


So although it took first place overall, Eagle 7B is arguably shortchanged: the benchmarks cannot directly evaluate its performance in the 70-plus other languages it was trained on.

That extra training cost does nothing for its rankings; had the team focused on English alone, the scores would likely be better than they are now.

So why did RWKV take this path? The team's answer:

Building inclusive AI for everyone in this world, not just the English-speaking world.

Among the extensive feedback the RWKV model has received, the most common criticisms are:

Multilingual training hurts the model's English evaluation scores and slows down the development of linear Transformers;

It is unfair to compare the multilingual performance of a multilingual model against purely English models.

The team said, "In most cases, we agree with these opinions,"

"But we have no plans to change that, because we are building artificial intelligence for the world - and it's not just an English-speaking world."


In 2023, only about 17% of the world's population spoke English (roughly 1.3 billion people); by supporting the world's top 25 languages, however, the model can cover approximately 4 billion people, about 50% of the global population.

The team hopes that future AI can help everyone, for example by running cheaply on low-end hardware and by supporting more languages.

The team will gradually expand the multilingual dataset to support a wider range of languages, slowly extending coverage to 100% of the world's regions and ensuring that no language is left behind.

An architecture that scales with data

A phenomenon worth noting emerged during training:

As the amount of training data increases, the model's performance gradually improves. At around 300B tokens of training data, the model performs similarly to Pythia-6.9B, which was itself trained on 300B tokens.


This mirrors an earlier experiment on the RWKV-v4 architecture: given the same amount of training data, the performance of a linear transformer like RWKV is similar to that of a Transformer.

So one cannot help but ask: if that is indeed the case, is data more important to a model's performance than the exact architecture?


We know that the compute and memory cost of Transformer-class models grows quadratically with sequence length, while the computational cost of the RWKV architecture grows only linearly with the number of tokens.
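
As a back-of-the-envelope way to state that difference (our notation, not the article's), with sequence length T and model width d, the per-sequence cost scales roughly as:

```latex
% Rough per-sequence scaling; T = sequence length, d = model width.
\text{self-attention: } O(T^{2}\, d) \qquad\qquad \text{RWKV: } O(T\, d)
```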

Perhaps we should pursue more efficient and scalable architectures that increase accessibility, lower the cost of AI for everyone, and reduce environmental impact.

RWKV

The RWKV architecture is an RNN with GPT-level LLM performance that can also be trained in parallel like a Transformer.

RWKV combines the advantages of RNNs and Transformers: excellent performance, fast inference, fast training, low VRAM usage, "unlimited" context length, and free sentence embeddings. RWKV does not use the attention mechanism.

The figure below compares the computational costs of RWKV and Transformer models:

[Figure: computational cost comparison between RWKV and Transformer models]

To address the Transformer's time and space complexity, researchers have proposed a variety of architectures:

[Figure: overview of architectures proposed to reduce Transformer complexity]

The RWKV architecture consists of a stack of residual blocks; each residual block is made up of a time-mixing sub-block with a recurrent structure and a channel-mixing sub-block.

The figure below shows the RWKV block elements on the left, the RWKV residual block on the right, and the final head used for language modeling.

[Figure: RWKV block elements (left), RWKV residual block (right), and the final language-modeling head]
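
To illustrate that layout, here is a minimal structural sketch in PyTorch (our own simplified code, not the official RWKV implementation); the internals of the two sub-blocks are summarized by the equations and the decoding sketch further below, so placeholder modules stand in for them here:

```python
# Minimal structural sketch of the RWKV stacking described above.
# The time-mixing and channel-mixing internals are abstracted away
# (any nn.Module with matching shapes will do); names and wiring
# details are illustrative assumptions, not the official code.
import torch
import torch.nn as nn

class RWKVBlock(nn.Module):
    def __init__(self, dim: int, time_mix: nn.Module, channel_mix: nn.Module):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.time_mix = time_mix        # recurrent, attention-like sub-block
        self.channel_mix = channel_mix  # position-wise, feed-forward-like sub-block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.time_mix(self.ln1(x))      # residual around time mixing
        x = x + self.channel_mix(self.ln2(x))   # residual around channel mixing
        return x

class RWKVLM(nn.Module):
    """Embedding -> N stacked residual blocks -> LayerNorm -> LM head."""
    def __init__(self, vocab: int, dim: int, n_layers: int):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.blocks = nn.Sequential(*[
            RWKVBlock(dim, nn.Linear(dim, dim), nn.Linear(dim, dim))  # placeholder sub-blocks
            for _ in range(n_layers)
        ])
        self.ln_out = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, vocab, bias=False)  # language-modeling head

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.ln_out(self.blocks(self.emb(tokens))))

# Usage: logits for a batch of token ids.
model = RWKVLM(vocab=50277, dim=512, n_layers=4)
logits = model(torch.randint(0, 50277, (1, 16)))  # shape (batch, seq, vocab)
```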

The recurrence can be expressed as a linear interpolation between the current input and the input of the previous time step (a "token shift", shown in the figure below), and this interpolation can be adjusted independently for each linear projection of the input embedding.

A vector that treats the current token separately is also introduced here to compensate for potential degradation.

[Figure: token-shift linear interpolation in the RWKV recurrence]
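
For reference, and paraphrasing the RWKV (v4) paper cited above rather than the still-unpublished v5 paper, the two sub-blocks can be written roughly as follows, where the mu are the token-shift interpolation weights, w a learned per-channel decay, u the per-channel bonus that treats the current token separately, and sigma the sigmoid gate:

```latex
% Time-mixing sub-block (RWKV-v4 formulation, paraphrased from the paper):
\begin{aligned}
r_t &= W_r \,(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1}) \\
k_t &= W_k \,(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1}) \\
v_t &= W_v \,(\mu_v \odot x_t + (1-\mu_v) \odot x_{t-1}) \\
wkv_t &= \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i} \odot v_i \;+\; e^{\,u + k_t} \odot v_t}
              {\sum_{i=1}^{t-1} e^{-(t-1-i)\,w + k_i} \;+\; e^{\,u + k_t}} \\
o_t &= W_o \,(\sigma(r_t) \odot wkv_t)
\end{aligned}

% Channel-mixing sub-block (same token-shift idea, squared-ReLU feed-forward):
\begin{aligned}
r_t &= W_r \,(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1}) \\
k_t &= W_k \,(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1}) \\
o_t &= \sigma(r_t) \odot \big(W_v \,\max(k_t, 0)^2\big)
\end{aligned}
```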

RWKV can be efficiently parallelized (as matrix multiplications) in what the authors call time-parallel mode.

In a recurrent network, the output of the previous step is normally used as the input of the current step. This is especially true in autoregressive decoding for language models, where each token must be computed before the next one is fed in, letting RWKV exploit its RNN-like structure in what is called time-sequential mode.

In this case, RWKV can be conveniently formulated recursively for decoding during inference, exploiting the fact that each output token depends only on the latest state, whose size is constant regardless of sequence length.

RWKV then acts as an RNN decoder, keeping speed and memory footprint constant with respect to sequence length, which allows longer sequences to be processed more efficiently.

In contrast, self-attention's KV cache keeps growing with sequence length, so efficiency drops and memory usage and latency rise as the sequence gets longer.
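
To make the constant-state claim concrete, here is a minimal sketch (our own simplified code, following the paper's recurrent WKV formulation but omitting the numerical-stability rescaling used in real implementations): the whole history is folded into two per-channel accumulators, so the decoder state stays O(d) no matter how many tokens have been generated, whereas a self-attention KV cache keeps O(T*d) entries.

```python
# Simplified recurrent WKV update for one decoding step (time-mixing core only).
# Illustrative sketch; real RWKV code rescales the exponentials for stability.
import numpy as np

def wkv_step(k, v, num, den, w, u):
    """One decoding step.
    k, v : current token's key/value vectors, shape (d,)
    num, den : running weighted sums over the past (the constant-size state), shape (d,)
    w : per-channel decay (> 0); u : per-channel bonus for the current token
    Returns the wkv output and the updated state.
    """
    # Weighted average of past values plus the "bonus" term for the current token.
    wkv = (num + np.exp(u + k) * v) / (den + np.exp(u + k))
    # Fold the current token into the state, decaying the past by exp(-w).
    num = np.exp(-w) * num + np.exp(k) * v
    den = np.exp(-w) * den + np.exp(k)
    return wkv, num, den

d = 8
num, den = np.zeros(d), np.zeros(d)
w, u = np.full(d, 0.5), np.zeros(d)
for t in range(1000):                    # 1000 tokens, yet the state stays shape (d,)
    k, v = np.random.randn(d), np.random.randn(d)
    out, num, den = wkv_step(k, v, num, den, w, u)
# By contrast, a Transformer's KV cache after 1000 tokens would hold
# 1000 keys and 1000 values per layer and per head.
```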

Reference:

https://www.php.cn/link/fda2217a3921c464be73975603df7510

