


RNN model challenges Transformer hegemony! 1% of the cost with performance comparable to Mistral-7B, and support for 100+ languages, the most in the world
As the large-model race intensifies, the Transformer's dominance is being challenged on several fronts.
Recently, RWKV released the Eagle 7B model, based on the latest RWKV-v5 architecture.
Eagle 7B excels in multilingual benchmarks and is on par with top models in English tests.
At the same time, Eagle 7B uses an RNN architecture: compared with Transformer models of the same size, its inference cost is 10-100x lower, arguably making it the most environmentally friendly 7B model in the world.
Since the RWKV-v5 paper may not be released until next month, here is the original RWKV paper, which describes the first non-Transformer architecture to be scaled to tens of billions of parameters.
Paper address: https://arxiv.org/pdf/2305.13048.pdf
This work was accepted by EMNLP 2023. The authors come from top universities, research institutions and technology companies around the world.
Below is the official Eagle 7B artwork, showing an eagle soaring over Transformers.
Eagle 7B
Trained on 1.1T (trillion) tokens spanning more than 100 languages, Eagle 7B ranked first by average score on the multilingual benchmarks below.
The benchmarks include xLAMBADA, xStoryCloze, xWinograd, and xCopa, covering 23 languages and testing commonsense reasoning in each of them.
Eagle 7B took first place on three of the four; on the remaining one it finished second to Mistral-7B, which was trained on far more data.
The English evaluation covers 12 separate benchmarks spanning commonsense reasoning and world knowledge.
On these English tests, Eagle 7B approaches the level of Falcon (1.5T), LLaMA2 (2T), and Mistral (>2T), and is comparable to MPT-7B, which was also trained on roughly 1T tokens.
In both evaluations, the new v5 architecture is a large overall leap over the previous v4.
Eagle 7B is currently hosted by the Linux Foundation and is licensed under the Apache 2.0 license for unrestricted personal or commercial use.
Multilingual support
As mentioned earlier, Eagle 7B's training data spans more than 100 languages, yet the four multilingual benchmarks used above cover only 23 of them.
So although it took first place, Eagle 7B is arguably short-changed: those benchmarks cannot directly measure its performance in the 70-plus other languages it was trained on.
The extra training cost does nothing for its ranking; had the team focused on English alone, the scores would likely be even better than they are now.
So why did RWKV do it anyway? The team's answer:
Building inclusive AI for everyone in this world, not just English speakers.
Among the large amount of feedback on the RWKV model, the most common comments are:
The multilingual approach hurts the model's English evaluation scores and slows down the development of linear Transformers;
It is unfair to compare the multilingual performance of a multilingual model against English-only models.
The team's response: "In most cases, we agree with these opinions."
"But we have no plans to change that, because we are building artificial intelligence for the world, and it is not just an English-speaking world."
As of 2023, only about 17% of the world's population speaks English (roughly 1.3 billion people); by supporting the world's top 25 languages, the model can cover approximately 4 billion people, about 50% of the global population.
The team hopes that future AI can help everyone, for example by running models cheaply on low-end hardware and by supporting more languages.
The team will gradually expand the multilingual dataset to support a wider range of languages, slowly extending coverage to 100% of the world's regions and ensuring that no language is left out.
Dataset and scalable architecture
During training, one phenomenon is worth noting:
As the amount of training data grows, the model's performance improves steadily; at roughly 300B tokens, the model performs similarly to pythia-6.9b, which was itself trained on about 300B tokens.
This mirrors an earlier experiment on the RWKV-v4 architecture: given the same amount of training data, a linear-attention model like RWKV performs about as well as a Transformer.
So we cannot help but ask: if that is indeed the case, does data matter more to a model's performance than the exact architecture?
We know that the compute and memory cost of Transformer-class models grows quadratically with sequence length, whereas the computational cost of the RWKV architecture grows only linearly with the number of tokens.
Perhaps we should pursue more efficient and scalable architectures that increase accessibility, lower the cost of AI for everyone, and reduce environmental impact.
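As a rough, hand-wavy illustration of that difference, here is a back-of-the-envelope sketch in Python; the hidden size is an assumed 7B-class value and only the token-mixing term is counted, so the numbers are indicative rather than measured costs.

```python
# Rough comparison of how per-sequence token-mixing cost scales with context
# length. The constants are illustrative only and ignore the projection
# layers, which grow linearly for both architectures.

def attention_cost(seq_len: int, d_model: int) -> int:
    """Self-attention score/mix cost grows quadratically: O(T^2 * d)."""
    return seq_len * seq_len * d_model

def rwkv_cost(seq_len: int, d_model: int) -> int:
    """RWKV-style recurrent mixing grows linearly: O(T * d)."""
    return seq_len * d_model

d = 4096  # hidden size of a 7B-class model (illustrative assumption)
for T in (1_024, 8_192, 65_536):
    ratio = attention_cost(T, d) / rwkv_cost(T, d)
    print(f"T={T:>6}: attention / RWKV token-mixing cost ratio = {ratio:,.0f}x")
```

The ratio is simply the sequence length, which is why the gap widens so quickly for long contexts.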
RWKV
The RWKV architecture is an RNN that delivers GPT-level LLM performance while remaining trainable in parallel like a Transformer.
RWKV combines the advantages of RNNs and Transformers: strong performance, fast inference, fast training, low VRAM usage, "unlimited" context length, and free sentence embeddings. RWKV does not use an attention mechanism.
The paper compares the computational cost of RWKV with that of Transformer models.
To address the Transformer's time and space complexity problems, researchers have also proposed a variety of alternative architectures.
The RWKV architecture consists of a stack of residual blocks, each composed of a time-mixing sub-block with a recurrent structure and a channel-mixing sub-block.
The paper's architecture figure shows the RWKV block elements on the left, the full RWKV residual block on the right, together with the final head used for language modeling.
The recurrence can be expressed as a linear interpolation between the current input and the input at the previous time step, and this interpolation can be adjusted independently for each linear projection of the input embedding.
A vector that handles the current token separately is also introduced to compensate for potential degradation.
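To make the block structure concrete, below is a minimal NumPy sketch of one RWKV (v4-style) residual block, following the formulas in the RWKV paper. The interpolation between the current and previous input described above appears here as token_shift, and u is the extra vector that treats the current token separately. Parameter names, shapes, and the dictionary layout are illustrative assumptions, and the numerical stabilization used in the official implementation is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_shift(x, mu):
    """Linear interpolation between the current input x_t and the previous
    input x_{t-1}, with a learned per-channel mixing factor mu."""
    x_prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])
    return mu * x + (1.0 - mu) * x_prev

def time_mixing(x, p):
    """Time-mixing sub-block: a recurrent WKV aggregation replaces attention.
    Naive recurrence, without the log-space stabilization of the real code."""
    T, d = x.shape
    r = token_shift(x, p["mu_r"]) @ p["W_r"]
    k = token_shift(x, p["mu_k"]) @ p["W_k"]
    v = token_shift(x, p["mu_v"]) @ p["W_v"]
    decay = np.exp(-np.exp(p["w"]))      # per-channel decay of past history
    num = np.zeros(d)                    # running weighted sum of past values
    den = np.zeros(d)                    # running sum of past weights
    out = np.zeros_like(x)
    for t in range(T):
        # the current token gets its own bonus weight exp(u + k_t)
        cur = np.exp(p["u"] + k[t])
        out[t] = sigmoid(r[t]) * (num + cur * v[t]) / (den + cur)
        num = decay * num + np.exp(k[t]) * v[t]
        den = decay * den + np.exp(k[t])
    return out @ p["W_o"]

def channel_mixing(x, p):
    """Channel-mixing sub-block: a gated feed-forward layer with squared ReLU."""
    r = token_shift(x, p["mu_r"]) @ p["W_r"]
    k = token_shift(x, p["mu_k"]) @ p["W_k"]
    return sigmoid(r) * (np.maximum(k, 0.0) ** 2 @ p["W_v"])

def rwkv_block(x, time_p, chan_p):
    """One residual block: time mixing, then channel mixing, each residual.
    Layer norms and the embedding/output head are omitted for brevity."""
    x = x + time_mixing(x, time_p)
    x = x + channel_mixing(x, chan_p)
    return x
```

A full model stacks many such blocks between an input embedding and an output projection; because the time-mixing loop carries only the fixed-size (num, den) accumulators, the same block supports both parallel training and RNN-style decoding.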
RWKV can be efficiently parallelized with matrix multiplications in what the authors call time-parallel mode.
In a recurrent network, the output of the previous time step is normally used as the input to the current step. This is especially evident in autoregressive decoding, where each token must be computed before it can be fed into the next step, which lets RWKV exploit its RNN-like structure in what is called time-sequential mode.
In this mode, RWKV can be conveniently formulated as a recurrence for decoding at inference time, exploiting the fact that each output token depends only on the latest state, whose size is constant regardless of sequence length.
RWKV then behaves as an RNN decoder, with constant speed and memory footprint with respect to sequence length, so longer sequences can be processed more efficiently.
In contrast, self-attention's KV cache keeps growing with the sequence length, so efficiency drops while memory usage and latency rise as the sequence gets longer.
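Here is a toy comparison of the two memory behaviors. The dimensions are assumed 7B-class values rather than the exact Eagle 7B state layout, and the script only counts bytes; no model forward pass is involved.

```python
# Toy memory comparison: fixed-size recurrent state vs. growing KV cache.
# Dimensions are illustrative for a 7B-class model, not exact Eagle 7B values.

d_model, n_layers, bytes_per_val = 4096, 32, 2  # fp16

def rwkv_state_mb() -> float:
    # A handful of d_model-sized vectors per layer (previous activations plus
    # the WKV accumulators); the size does not depend on tokens seen so far.
    vectors_per_layer = 4
    return n_layers * vectors_per_layer * d_model * bytes_per_val / 2**20

def kv_cache_mb(tokens_seen: int) -> float:
    # One key vector and one value vector per token, per layer.
    return n_layers * tokens_seen * 2 * d_model * bytes_per_val / 2**20

for t in (1_024, 4_096, 16_384, 65_536):
    print(f"{t:>6} tokens: RWKV state = {rwkv_state_mb():5.1f} MB, "
          f"KV cache = {kv_cache_mb(t):8.1f} MB")
```

Under these assumptions the recurrent state stays around a megabyte no matter how long the sequence gets, while the KV cache grows linearly into the gigabyte range.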
