As everyone keeps upgrading and iterating their own large models, the ability of LLMs (large language models) to handle long context windows has become an important evaluation metric.
For example, the star model GPT-4 supports 32k tokens, equivalent to about 50 pages of text; Anthropic, founded by former OpenAI members, has pushed Claude's token capacity to 100k, roughly 75,000 words, about enough to summarize the first Harry Potter book in one click.
In its latest research, Microsoft has now scaled the Transformer directly to 1 billion tokens. This opens up new possibilities for modeling very long sequences, such as treating an entire corpus or even the entire Internet as a single sequence.
For comparison, the average person can read 100,000 tokens in about 5 hours, and may need even longer to digest, remember, and analyze that information. Claude does it in under a minute. Scaled to the sequence length in Microsoft's research, the gap becomes staggering.
- Paper address: https://arxiv.org/pdf/2307.02486.pdf
- Project address: https://github.com/microsoft/unilm/tree/master
Specifically, the study proposes LONGNET, a Transformer variant that can scale the sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. The paper also proposes dilated attention, which expands the model's receptive field exponentially with distance.
LONGNET has the following advantages:
1) It has linear computational complexity;
2) It can serve as a distributed trainer for extremely long sequences;
3) Its dilated attention is a drop-in replacement for standard attention and integrates seamlessly with existing Transformer-based optimizations.
Experimental results show that LONGNET exhibits strong performance in both long sequence modeling and general language tasks.
On the research motivation, the paper notes that scaling neural networks has become a trend in recent years, and many well-performing architectures have been studied. Among the dimensions being scaled, the sequence length should ideally be unlimited. Reality is usually the opposite, so breaking the sequence-length limit would bring significant advantages:
- First, it provides the model with large-capacity memory and a large receptive field, enabling it to interact effectively with humans and the world.
- Second, a longer context contains more complex causal relationships and reasoning paths that the model can exploit in the training data; shorter dependencies, by contrast, introduce more spurious correlations, which hurts generalization.
- Third, longer sequence lengths allow the model to explore longer contexts, and extremely long contexts can also help the model alleviate catastrophic forgetting.
However, the main challenge in scaling sequence length is striking the right balance between computational complexity and model expressiveness.
For example, RNN-style models are one way to increase sequence length, but their sequential nature limits parallelization during training, which is crucial for long-sequence modeling.
Recently, state space models have become attractive for sequence modeling: they can run as a CNN during training and be converted into an efficient RNN at test time. However, such models do not perform as well as Transformers at regular lengths.
Another way to extend the sequence length is to reduce the complexity of the Transformer, namely the quadratic complexity of self-attention. Several efficient Transformer variants have been proposed, including low-rank attention, kernel-based methods, downsampling methods, and retrieval-based methods. However, none of these approaches has scaled the Transformer to 1 billion tokens (see Figure 1).
The following table compares the computational complexity of the different methods, where N is the sequence length and d is the hidden dimension.
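In rough terms, recurrent models scale as O(N·d^2), vanilla attention as O(N^2·d), sparse attention as O(N·√N·d), and dilated attention (LONGNET) as O(N·d).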
Method
The study's solution, LONGNET, successfully scales the sequence length to 1 billion tokens. Specifically, it proposes a new component called dilated attention and uses it to replace the attention mechanism of the vanilla Transformer. The general design principle is that the attention allocated between two tokens decays exponentially as their distance grows. The study shows that this design achieves linear computational complexity and a logarithmic dependency distance between tokens, resolving the conflict between limited attention resources and the need to access every token.
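To make this concrete, here is a minimal, single-head PyTorch sketch of one (segment length, dilation rate) pair of dilated attention. It is only an illustration under simplifying assumptions, not the paper's implementation, which vectorizes across segments, applies causal masking, and mixes several segment/dilation pairs with geometrically increasing sizes.

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_len, dilation):
    """Sketch of dilated attention for a single (segment_len, dilation) pair.

    q, k, v: (seq_len, d) tensors. Assumes seq_len is divisible by segment_len
    and segment_len by dilation; padding, causal masking, and multi-head logic
    are omitted for brevity.
    """
    seq_len, d = q.shape
    out = torch.zeros_like(q)
    for start in range(0, seq_len, segment_len):
        # keep every `dilation`-th position inside this segment
        idx = torch.arange(start, start + segment_len, dilation)
        qs, ks, vs = q[idx], k[idx], v[idx]
        attn = F.softmax(qs @ ks.transpose(0, 1) / d ** 0.5, dim=-1)
        out[idx] = attn @ vs  # scatter the sparse outputs back in place
    return out
```

Each segment of length w contributes only about (w/r)^2 attention scores and there are N/w segments, so a single pair costs on the order of N·w·d/r^2; with the geometric schedule of (w, r) pairs described in the paper, the total remains linear in N.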
In the implementation, LONGNET can be converted into a dense Transformer, so it seamlessly supports existing Transformer-specific optimizations (such as kernel fusion, quantization, and distributed training). Taking advantage of the linear complexity, LONGNET can be trained in parallel across nodes, using a distributed algorithm to break through compute and memory constraints.
In the end, the study effectively scaled the sequence length to 1B tokens with nearly constant runtime, as shown in the figure below; in contrast, the runtime of the vanilla Transformer suffers from quadratic complexity.
The study further introduces a multi-head dilated attention mechanism. As shown in Figure 3 below, different heads perform different computations by sparsifying different parts of the query-key-value pairs.
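In the paper, the j-th head shifts the selected positions within each segment by an offset that depends on the head index, so the heads jointly cover complementary subsets of each segment. A hypothetical helper in that spirit, building on the sketch above:

```python
import torch

def head_indices(start, segment_len, dilation, head_id):
    # each head uses a different offset (head_id modulo the dilation rate),
    # so across heads every position within the segment gets attended to
    offset = head_id % dilation
    return torch.arange(start + offset, start + segment_len, dilation)
```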
Distributed training
Although dilated attention greatly reduces the computational complexity to O(N·d), compute and memory limits still make it infeasible to scale the sequence length to the millions on a single GPU. There are distributed training algorithms for large-scale models, such as model parallelism [SPP+19], sequence parallelism [LXLY21, KCL+22], and pipeline parallelism [HCB+19], but these methods are not sufficient for LONGNET, especially when the sequence dimension is extremely large.
The study exploits LONGNET's linear computational complexity to distribute training over the sequence dimension. Figure 4 below shows the distributed algorithm on two GPUs, which can be scaled further to an arbitrary number of devices.
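The following is a conceptual sequence-parallel sketch rather than the paper's exact algorithm: each rank holds an equally long, contiguous slice of the sequence, projects it locally, and only the sparsified keys and values are all-gathered, so the communication volume does not grow with the full sequence length. It assumes torch.distributed is already initialized.

```python
import torch
import torch.distributed as dist

def sequence_parallel_dilated_attention(x_local, w_q, w_k, w_v, dilation):
    """x_local: this rank's (local_len, d) slice of the sequence.
    Covers a single dilated part whose segment spans all devices;
    causal masking and the mixture of segment sizes are omitted."""
    q, k, v = x_local @ w_q, x_local @ w_k, x_local @ w_v
    d = q.shape[-1]

    # sparsify locally before any communication
    idx = torch.arange(0, x_local.shape[0], dilation)
    q_s, k_s, v_s = q[idx], k[idx], v[idx]

    # gather only the small sparsified keys/values from all ranks
    world = dist.get_world_size()
    k_list = [torch.empty_like(k_s) for _ in range(world)]
    v_list = [torch.empty_like(v_s) for _ in range(world)]
    dist.all_gather(k_list, k_s)
    dist.all_gather(v_list, v_s)
    k_all, v_all = torch.cat(k_list), torch.cat(v_list)

    # local (sparsified) queries attend to the gathered global keys/values
    attn = torch.softmax(q_s @ k_all.transpose(0, 1) / d ** 0.5, dim=-1)
    out = torch.zeros_like(q)
    out[idx] = attn @ v_all
    return out
```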
Experiment
The study compares LONGNET with the vanilla Transformer and sparse Transformers. The architectures differ only in the attention layer; all other layers remain the same. The researchers scaled the sequence length of these models from 2K to 32K while shrinking the batch size so that the number of tokens per batch stays constant.
Table 2 summarizes the results of these models on the Stack dataset. The study uses perplexity as the evaluation metric. The models were tested with different sequence lengths, ranging from 2k to 32k. When the input length exceeds the maximum length a model supports, the study applies blockwise causal attention (BCA) [SDP+22], a state-of-the-art extrapolation method for language model inference.
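For reference, perplexity here is the standard exponentiated average negative log-likelihood over the evaluated tokens, so lower is better:

$$\mathrm{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log p(x_t \mid x_{<t})\right)$$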
In addition, the study removes absolute position encoding. First, the results show that increasing the sequence length during training generally yields a better language model. Second, sequence-length extrapolation at inference time does not work when the length is much larger than what the model supports. Finally, LONGNET consistently outperforms the baseline models, demonstrating its effectiveness in language modeling.
Scaling curves of sequence length
Figure 6 plots the sequence-length scaling curves of the vanilla Transformer and LONGNET. The study estimates the computational cost by counting the total FLOPs of matrix multiplications. The results show that both the vanilla Transformer and LONGNET benefit from larger context lengths during training, but LONGNET extends the context length more efficiently, achieving lower test loss with less computation. This demonstrates the advantage of longer training inputs over extrapolation and shows that LONGNET is a more efficient way to extend the context length in language models, because it learns longer dependencies more efficiently.
Scaling up model size
An important property of large language models is that the loss scales as a power law with the amount of computation. To verify whether LONGNET still follows a similar scaling law, the study trained a series of models of different sizes, from 125 million to 2.7 billion parameters. The 2.7B model was trained on 300B tokens, while the remaining models used roughly 400B tokens. Figure 7(a) plots LONGNET's scaling curve with respect to compute, with perplexity computed on the same test set. The results show that LONGNET still follows the power law, which means a dense Transformer is not a prerequisite for scaling up language models; moreover, LONGNET achieves both scalability and efficiency.
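As a point of reference (this particular form is a common convention, not something quoted from the paper), compute scaling laws are usually written as

$$L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C},$$

where C is the training compute and C_c, α_C are fitted constants; on a log-log plot this is the straight line the LONGNET curve is shown to follow.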
Long-context prompting
Prompting is an important way to steer a language model and provide it with additional information. The study experimentally validates whether LONGNET can benefit from a longer context window for prompting: a prefix is kept as the prompt, and the perplexity of the suffix is measured. The prompt is gradually extended from 2K to 32K. For a fair comparison, the suffix length is kept constant while the prefix length is increased up to the model's maximum length. Figure 7(b) reports the results on the test set: the test loss of LONGNET gradually decreases as the context window grows, demonstrating LONGNET's superiority in fully exploiting long context to improve the language model.
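A hypothetical sketch of that evaluation loop (the function and method names here are placeholders for illustration, not taken from the released code):

```python
def long_prompt_eval(model, tokens, suffix_len, prefix_lens=(2048, 4096, 8192, 16384, 32768)):
    """Keep the suffix fixed, grow the prefix (prompt), and measure suffix perplexity."""
    results = {}
    suffix = tokens[-suffix_len:]
    for plen in prefix_lens:
        prefix = tokens[-(plen + suffix_len):-suffix_len]
        # `model.perplexity` is a stand-in for scoring `suffix` conditioned on `prefix`
        results[plen] = model.perplexity(context=prefix, target=suffix)
    return results
```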