Home > Article > Technology peripherals > Introducing RWKV: The rise of linear Transformers and exploring alternatives
Here is a summary of some of my thoughts on the RWKV podcast: https://www.php. cn/link/9bde76f262285bb1eaeb7b40c758b53e
Why is the importance of alternatives so prominent?
With the artificial intelligence revolution in 2023, the Transformer architecture is currently at its peak. However, in the rush to adopt the successful Transformer architecture, it is easy to overlook the alternatives that can be learned from.
#As engineers, we should not take a one-size-fits-all approach and use the same solution to every problem. We should weigh the pros and cons in every situation; otherwise being trapped within the limitations of a particular platform while feeling "satisfied" by not knowing there are alternatives could turn development back to pre-liberation overnight
This problem is not unique to the field of artificial intelligence, but a historical pattern that has been repeated from ancient times to the present.
. In this story, various database management systems, such as Oracle, MySQL, and SQL Server, compete fiercely for market share and technical advantages. These competitions are not only reflected in performance and functionality, but also involve many aspects such as business strategy, marketing and user satisfaction. These database management systems are constantly introducing new features and improvements to attract more users and businesses to choose their products. A page in the history of the SQL war, which has witnessed the development and transformation of the database management system industry, and also provided us with valuable experience and lessons
Recently A notable example in software development is the NoSQL trend that emerged when SQL servers began to be physically constrained. Startups around the world are turning to NoSQL for "scale" reasons, even though they are nowhere near those scales # However, over time, as With the advent of eventual consistency and the management overhead of NoSQL, and the huge leap in hardware capabilities in terms of SSD speed and capacity, SQL servers have seen a comeback recently due to their simplicity of use and are now available in over 90% of startups Sufficient scalability
SQL and NoSQL are two different database technologies. SQL is the abbreviation of Structured Query Language, which is mainly used to process structured data. NoSQL refers to a non-relational database, suitable for processing unstructured or semi-structured data. While some people think that SQL is better than NoSQL, or vice versa, in reality it just means that each technology has its own pros, cons, and use cases. In some cases, SQL may be better suited for processing complex relational data, while NoSQL is better suited for processing large-scale unstructured data. However, this does not mean that only one technology can be chosen. In fact, many applications and systems use hybrid solutions of SQL and NoSQL in practice. Depending on the specific needs and data type, the most appropriate technology can be selected to solve the problem. Therefore, it is important to understand the characteristics and applicable scenarios of each technology and make an informed choice based on the specific situation. Both SQL and NoSQL have their own unique learning points and preferred use cases that can be learned from and cross-pollinated among similar technologies
Since O(N^ per token used/generated 2) The secondary calculation cost caused by the increase. This makes context sizes larger than 100,000 very expensive, affecting inference and training.
The context size limits the Attention mechanism, severely limiting "intelligent agent" use cases (such as smol-dev) and forcing a solution to the problem. Larger contexts require fewer workarounds.
So, how do we solve this problem?
##Introducing RWKV: a linear T
ransformer###### /Modern Large RNN#####################RWKV and Microsoft RetNet are the first in a new category called "Linear Transformers"##### ############# It directly addresses the above three limitations by supporting: ############As we continue to scale our AI models to 100k contexts and beyond size, the quadratic computational cost starts to grow exponentially.
However, linear Transformers did not abandon the recurrent neural network architecture and solve its bottlenecks, which forced their replacement.
#However, the redesigned RNN has learned the scalable lessons of Transformer, allowing RNN to work similarly to Transformer and eliminating these bottlenecks.
In terms of training speed, using Transformers brings them back into play - allowing them to run efficiently at O(N) cost while scaling in training More than 1 billion parameters while maintaining similar performance levels.
Chart: Linear Transformer computation cost linearly scaling per token versus exponential growth of the transformer
When you apply a square ratio to linear scaling, you get over 10x growth at 2k token count, at Obtained more than 100x growth at 100k token length
At 14B parameters, RWKV is the largest open source linear Transformer, comparable to GPT NeoX and other similar datasets (such as the Pile) is comparable.
The performance of the RWKV model is comparable to existing transformer models of similar size, Various benchmarks show
But in simpler terms, what does this mean?
advantage
Disadvantages
So while RWKV has not yet reached the 60B parameter scale of LLaMA2, with the right support and resources it has the potential to do so at lower cost and in a wider range of environments, especially as models tend to be smaller , more efficient case
Consider this if your use case is important for efficiency. However, this is not the final solution – the key lies in healthy alternatives
We should consider learning other Alternatives and their benefits
Diffusion model: Slower to train with text, but extremely resilient to multi-epoch training. Finding out why can help alleviate the token crisis.
Generative Adversarial Networks/Agents: Techniques can be used to train the desired training set to a specific target without a data set, even if it is based on Text model.
##Original title: Introducing RWKV: The Rise of Linear Transformers and Exploring Alternatives , Author: picocreator
##https://www.php.cn/link/b433da1b32b5ca96c0ba7fcb9edba97d
The above is the detailed content of Introducing RWKV: The rise of linear Transformers and exploring alternatives. For more information, please follow other related articles on the PHP Chinese website!