Transformers have revolutionized artificial intelligence, delivering strong performance in natural language processing (NLP), computer vision, and multi-modal data integration. Their attention mechanisms excel at identifying patterns within data, which makes them well suited to complex tasks. However, scaling transformer models quickly is hindered by the high computational cost of their traditional structure. As these models grow, they demand significant hardware resources and training time, and both rise steeply with model size.
The primary obstacle in scaling transformers lies in the fixed number of parameters in their linear projection layers. This static structure prevents the model from growing without being retrained entirely, which becomes far more expensive as model sizes increase. Traditional models typically require full retraining whenever the architecture changes, for example when channel dimensions are increased.
Consequently, the computational cost for these expansions grows impractically high, and the approach lacks flexibility. The inability to add new parameters dynamically stifles growth, rendering these models less adaptable to evolving AI applications and more costly in terms of time and resources.
Historically, approaches to managing model scalability included duplicating weights or restructuring models with methods like Net2Net, which expands layers by duplicating neurons. However, these approaches often disrupt the balance of pre-trained models, resulting in slower convergence and additional training complexity.
While these methods have made incremental progress, they still struggle to preserve model integrity during scaling. Transformers rely heavily on static linear projections, which makes parameter expansion expensive and inflexible. Traditional models like GPT and other large transformers are often retrained from scratch, incurring high computational costs at each new scaling stage.
Now, researchers at the Max Planck Institute, Google, and Peking University have developed a new architecture called Tokenformer that fundamentally reimagines transformers by treating model parameters as tokens, allowing for dynamic interactions between tokens and parameters.
In this framework, Tokenformer introduces a novel component called the token-parameter attention (Pattention) layer, which facilitates incremental scaling. The model can add new parameter tokens without retraining, drastically reducing training costs.
By representing input tokens and parameters within the same framework, Tokenformer allows for flexible scaling, providing researchers with a more efficient, resource-conscious model architecture that retains scalability and high performance.
Tokenformer’s Pattention layer uses input tokens as queries, while model parameters serve as keys and values. This differs from the standard transformer approach, which relies solely on linear projections.
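The idea of input tokens attending to parameter tokens can be illustrated with a small sketch. This is not the authors' implementation: the function name `pattention`, the toy vectors, and the use of a plain softmax to normalize the scores are all assumptions made for illustration, and the normalization used in the actual model may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pattention(inputs, key_params, value_params):
    """Attend from input tokens (queries) to parameter tokens (keys/values).

    inputs:       n_tokens vectors of length d_in
    key_params:   n_params vectors of length d_in  (learnable in a real model)
    value_params: n_params vectors of length d_out (learnable in a real model)
    returns:      n_tokens vectors of length d_out
    """
    d_out = len(value_params[0])
    outputs = []
    for x in inputs:
        # Similarity between this input token and every key parameter token.
        scores = [sum(xi * ki for xi, ki in zip(x, k)) for k in key_params]
        # Assumption: plain softmax; the real layer may normalize differently.
        weights = softmax(scores)
        # Weighted sum of the value parameter tokens.
        out = [sum(w * v[j] for w, v in zip(weights, value_params))
               for j in range(d_out)]
        outputs.append(out)
    return outputs

# Toy example: 2 input tokens (d_in = 2), 3 parameter tokens, d_out = 2.
inputs = [[1.0, 0.0], [0.0, 1.0]]
key_params = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
value_params = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs = pattention(inputs, key_params, value_params)
```

In this sketch the output width is set by the value parameter tokens, not by the number of parameter tokens, which is what lets the parameter set grow without changing the layer's interface.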
The model’s scaling is achieved by adding new key-value parameter pairs, keeping input and output dimensions constant, and avoiding full retraining. Tokenformer’s architecture is designed to be modular, enabling researchers to expand the model seamlessly by incorporating additional tokens.
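A minimal sketch of this growth step, under stated assumptions, is shown here. The helper names `pattention` and `grow` and all numeric values are hypothetical, and the plain softmax is an assumption. With a plain softmax the outputs change once new pairs are added, so this sketch only demonstrates that the input and output dimensions stay constant; how new pairs are initialized so that learned behavior is preserved is not described in this article.

```python
import math

def pattention(inputs, key_params, value_params):
    """Softmax attention from input tokens to parameter tokens (illustrative)."""
    d_out = len(value_params[0])
    outputs = []
    for x in inputs:
        scores = [sum(xi * ki for xi, ki in zip(x, k)) for k in key_params]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, value_params))
                        for j in range(d_out)])
    return outputs

def grow(key_params, value_params, new_keys, new_values):
    """Append new key-value parameter pairs; existing pairs are reused as-is."""
    return key_params + new_keys, value_params + new_values

def shape(matrix):
    """Return (rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))

inputs = [[1.0, 0.0], [0.0, 1.0]]                 # 2 tokens, d_in = 2
keys = [[1.0, 0.0], [0.0, 1.0]]                   # 2 parameter tokens
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]       # d_out = 3
small_out = pattention(inputs, keys, values)

# Scale up: 2 -> 4 parameter tokens, with d_in and d_out untouched.
big_keys, big_values = grow(
    keys, values,
    new_keys=[[0.5, 0.5], [-0.5, 0.5]],
    new_values=[[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]],
)
big_out = pattention(inputs, big_keys, big_values)
```

Here `shape(small_out)` and `shape(big_out)` are both `(2, 3)`, and the first two entries of `big_keys` are the original key tokens, which mirrors the reuse of pre-trained parameters.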
This incremental scaling capability supports the efficient reuse of pre-trained weights while enabling rapid adaptation for new datasets or larger model sizes without disrupting learned information.
The performance benefits of Tokenformer are notable, as the model significantly reduces computational costs while maintaining accuracy. For instance, Tokenformer scaled from 124 million to 1.4 billion parameters at only half the training cost that traditional transformers typically require.
In one experiment, the model achieved a test perplexity of 11.77 for a 1.4 billion parameter configuration, nearly matching the 11.63 perplexity of a similarly sized transformer trained from scratch.
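To put the two perplexity figures in perspective, the short calculation below uses the standard definition of perplexity as the exponential of the average per-token cross-entropy loss. The variable names are illustrative, and the only inputs are the two numbers quoted above.

```python
import math

tokenformer_ppl = 11.77   # incrementally scaled 1.4B Tokenformer (from the text)
transformer_ppl = 11.63   # transformer trained from scratch (from the text)

# Relative difference in perplexity.
relative_gap = (tokenformer_ppl - transformer_ppl) / transformer_ppl

# Perplexity = exp(cross-entropy), so the loss gap is the difference of logs.
loss_gap_nats = math.log(tokenformer_ppl) - math.log(transformer_ppl)

print(f"relative perplexity gap: {relative_gap:.2%}")
print(f"cross-entropy gap: {loss_gap_nats:.4f} nats per token")
```

The relative gap comes out at roughly 1.2%, which is the sense in which the incrementally scaled model nearly matches the one trained from scratch.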
This efficiency means Tokenformer can achieve high performance across multiple domains, including language and visual modeling tasks, at a fraction of the resource expenditure of traditional models.
Tokenformer offers several key takeaways for advancing AI research and improving transformer-based models. These include:
Treating parameters as tokens enables incremental model scaling without retraining.
The token-parameter attention layer facilitates efficient parameter expansion.
Modular architecture supports seamless model growth by incorporating additional tokens.
The model achieves high performance across diverse domains with minimal resource expenditure.
In conclusion, Tokenformer offers a transformative approach to scaling transformer-based models. By treating parameters as tokens, the architecture achieves scalability and resource efficiency, reducing costs while preserving model performance across tasks.
This flexibility represents a breakthrough in transformer design, providing a model that can adapt to the demands of advancing AI applications without retraining. Tokenformer’s architecture holds promise for future AI research, offering a pathway to develop large-scale models sustainably and efficiently.