


More than a month ago, OpenAI's GPT-4 came out. Alongside its impressive visual demonstrations, it brought an important update: it handles a context of 8K tokens by default and up to 32K tokens (roughly 50 pages of text). This means we can give GPT-4 much longer input than before, which greatly broadens its application scenarios and lets it better handle long conversations, long texts, and file search and analysis.
However, this record was quickly broken: CoLT5 from Google Research extended the context length the model can handle to 64K tokens.
Such a breakthrough is not easy, because all models built on the Transformer architecture face the same problem: processing long documents is computationally very expensive. The cost of attention grows quadratically with input length, which makes it increasingly difficult to apply large models to longer inputs.
Even so, researchers keep making progress in this direction. A few days ago, a study from DeepPavlov (an open-source conversational AI technology stack) and other institutions showed that, by using an architecture called Recurrent Memory Transformer (RMT), they could increase the effective context length of a BERT model to 2 million tokens (roughly 3,200 pages of text by OpenAI's way of counting) while maintaining high memory retrieval accuracy. (Note: Recurrent Memory Transformer was proposed by Aydar Bulatov et al. in a NeurIPS 2022 paper.) The new method allows local and global information to be stored and processed, and lets information flow between segments of the input sequence through recurrence.
The authors state that, using the simple token-based memory mechanism introduced by Bulatov et al. in "Recurrent Memory Transformer", they can combine RMT with pre-trained Transformer models such as BERT and, on an Nvidia GTX 1080 Ti GPU, run full-attention, full-precision operations on sequences of more than 1 million tokens.
Paper address: https://arxiv.org/pdf/2304.11062.pdf
However, some have pointed out that this is not a real "free lunch": the improvement in the paper comes at the cost of "longer inference time and a substantial decrease in quality". It is therefore not yet a revolution, but it may become the basis for the next paradigm (in which token sequences may be unboundedly long).
Recurrent Memory Transformer
This study takes the Recurrent Memory Transformer (RMT) proposed by Bulatov et al. in 2022 and turns it into a plug-and-play method. The main mechanism is shown in the figure below:
A long input is split into multiple segments, and memory vectors are prepended to the first segment's embeddings and processed together with the segment tokens. For encoder-only models such as BERT, the memory is added only once, at the beginning of the segment, unlike in (Bulatov et al., 2022), where the decoder-only model splits the memory into read and write parts. For time step τ and segment $H^0_\tau$, the recurrent step is performed as follows:
$$\tilde{H}^0_\tau = [H^{mem}_\tau \circ H^0_\tau], \quad \bar{H}^N_\tau = \mathrm{Transformer}(\tilde{H}^0_\tau), \quad [\bar{H}^{mem}_\tau \circ H^N_\tau] := \bar{H}^N_\tau$$
Here, N is the number of Transformer layers. After the forward pass, $\bar{H}^{mem}_\tau$ contains the updated memory tokens for segment τ.
The segments of the input sequence are processed in order. To enable the recurrent connection, the study passes the outputs of the memory tokens from the current segment to the input of the next segment:
$$H^{mem}_{\tau+1} := \bar{H}^{mem}_\tau, \quad \tilde{H}^0_{\tau+1} = [H^{mem}_{\tau+1} \circ H^0_{\tau+1}]$$
Both memory and recurrence in RMT are based only on global memory tokens. This leaves the backbone Transformer unchanged, which makes RMT's memory augmentation compatible with any Transformer model.
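The recurrence described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_backbone` is a made-up stand-in for the unchanged Transformer, the function names are hypothetical, and plain Python lists replace real embeddings. It only shows how memory rows are prepended to each segment and how their outputs become the memory for the next segment.

```python
def toy_backbone(rows):
    """Hypothetical stand-in for the unchanged Transformer backbone.

    Averages every row with the column-wise mean of all rows, so each
    position (including the memory slots) sees a summary of the whole input.
    """
    dim = len(rows[0])
    mean = [sum(row[j] for row in rows) / len(rows) for j in range(dim)]
    return [[0.5 * (row[j] + mean[j]) for j in range(dim)] for row in rows]


def rmt_forward(segments, memory):
    """Process segments in order, carrying the memory rows between them."""
    num_mem = len(memory)
    outputs = []
    for segment in segments:
        joined = memory + segment          # memory rows come first
        hidden = toy_backbone(joined)      # backbone is used as-is
        memory = hidden[:num_mem]          # updated memory for the next segment
        outputs.append(hidden[num_mem:])   # per-token outputs of this segment
    return outputs, memory


# Three toy segments of 4 "tokens" each, with 2-dimensional embeddings.
segments = [[[1.0, 0.0]] * 4, [[0.0, 1.0]] * 4, [[1.0, 1.0]] * 4]
memory0 = [[0.0, 0.0], [0.0, 0.0]]
outputs, final_memory = rmt_forward(segments, memory0)
```

Because only the input and output rows are touched, the backbone function could be swapped for any model that maps a sequence of vectors to a sequence of the same length.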
Computational efficiency
This study estimates the FLOPs required for RMT and Transformer models of different sizes and sequence lengths.
As shown in Figure 3 below, with a fixed segment length RMT scales linearly for any model size. Linear scaling is achieved by dividing the input sequence into segments and computing the full attention matrix only within segment boundaries.
Because the FFN layers are computationally heavy, larger Transformer models tend to show slower quadratic growth with sequence length. For very long sequences (more than 32,000 tokens), however, they fall back to quadratic scaling. For sequences with more than one segment (more than 512 tokens in this study), RMT requires fewer FLOPs than non-recurrent models and can reduce the number of FLOPs by up to 295 times. RMT gives a larger relative reduction in FLOPs for smaller models, but in absolute terms the 29x reduction for the OPT-175B model is highly significant.
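To see why restricting attention to segments gives linear scaling, the following rough sketch counts only the attention-matrix work of a single layer. It is a back-of-the-envelope count under stated assumptions, not the FLOP estimator used in the study: the function names, the hidden size of 768, and the omission of FFN and projection costs are all choices made for illustration, so it will not reproduce the 295x or 29x figures.

```python
def attention_cost(seq_len, hidden=768):
    """Rough multiply-add count for one layer's attention matrix.

    Assumes the score computation and the weighted sum over values
    each cost about seq_len * seq_len * hidden operations.
    """
    return 2 * seq_len * seq_len * hidden


def segmented_attention_cost(seq_len, segment_len=499, extra_tokens=13, hidden=768):
    """Same count when attention is computed only inside each segment.

    extra_tokens stands for the special and memory tokens added to every
    segment (3 + 10 in the BERT setup reported in this study).
    """
    num_segments = -(-seq_len // segment_len)  # integer ceiling division
    return num_segments * attention_cost(segment_len + extra_tokens, hidden)


# Doubling the input quadruples the full-attention count
# but only doubles the segmented count.
n = 499 * 64
ratio = attention_cost(n) / segmented_attention_cost(n)
```

The per-segment cost is a constant once the segment length is fixed, so the total grows with the number of segments rather than with the square of the sequence length.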
Memory Task
To test memory abilities, the study constructed synthetic datasets that required memorization of simple facts and basic reasoning. Task input consists of one or more facts and a question that can only be answered using all the facts. To increase the difficulty of the task, the study also added natural language text unrelated to the question or answer to act as noise, so the model was tasked with separating facts from irrelevant text and using the facts to answer the question.
Fact memory
The first task tests RMT's ability to write information into memory and keep it there for a long time, as shown at the top of Figure 4 below. In the simplest case, the fact is at the beginning of the input and the question is always at the end. The amount of irrelevant text between the question and the answer is gradually increased until the entire input no longer fits into a single model input.
Fact detection and memory
Fact detection increases the difficulty of the task by moving a fact to a random position in the input, as shown in the middle of Figure 4 above. This requires the model to first distinguish the fact from irrelevant text, write the fact into memory, and then use it to answer the question at the end.
Using memorized facts to reason
Another memory operation is reasoning with memorized facts and the current context. To evaluate this, the researchers used a more complex task in which two facts are generated and placed within the input sequence, as shown at the bottom of Figure 4 above. The question at the end of the sequence is phrased so that any of the facts may be needed to answer it correctly.
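The structure of these synthetic inputs can be sketched as follows. This is an illustrative assumption about how such a sample might be assembled, not the study's data pipeline: `build_sample`, its parameters, and the example sentences are all made up for the sketch.

```python
import random


def build_sample(fact, question, noise_sentences, num_noise,
                 fact_position="start", seed=0):
    """Assemble one input: a fact hidden in noise, with the question last.

    fact_position="start" mimics the fact memory task;
    fact_position="random" mimics the fact detection and memory task.
    """
    rng = random.Random(seed)
    noise = [rng.choice(noise_sentences) for _ in range(num_noise)]
    if fact_position == "start":
        index = 0
    else:
        index = rng.randint(0, len(noise))  # any slot, including the ends
    body = noise[:index] + [fact] + noise[index:]
    return " ".join(body + [question])


# Made-up example sentences, for illustration only.
fact = "Daniel went to the kitchen."
question = "Where is Daniel?"
noise_pool = ["The weather was mild that year.", "Trains ran late on Sundays."]

memory_sample = build_sample(fact, question, noise_pool, num_noise=5)
detection_sample = build_sample(fact, question, noise_pool, num_noise=5,
                                fact_position="random", seed=1)
```

Increasing `num_noise` lengthens the input without changing the answer, which is how the task can be stretched past a single model input.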
Experimental results
The researchers trained and evaluated the models on 4 to 8 NVIDIA 1080 Ti GPUs. For longer sequences, they used a single 40GB NVIDIA A100 to speed up evaluation.
Curriculum learning
The researchers observed that using a training schedule significantly improves the accuracy and stability of the solution. RMT is first trained on a shorter version of the task; once training converges, the task length is increased by adding another segment. Curriculum learning continues until the required input length is reached.
In the experiments, the researchers start with sequences that fit into a single segment. The actual segment size is 499 tokens, because 3 BERT special tokens and 10 memory placeholders are reserved out of the model's 512-token input (499 + 3 + 10 = 512). They note that after training on a shorter task, RMT solves the longer version more easily, converging to a perfect solution in fewer training steps.
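The segment arithmetic and the staged schedule can be sketched as below. The constants come from the text; everything else is a hypothetical illustration: `split_into_segments`, `curriculum`, and the `train_stage` callback (assumed to train until convergence and return the number of steps it took) are not part of the study's code.

```python
SEGMENT_TOKENS = 499   # content tokens per segment
SPECIAL_TOKENS = 3     # BERT special tokens
MEMORY_TOKENS = 10     # memory placeholders
MODEL_INPUT = SEGMENT_TOKENS + SPECIAL_TOKENS + MEMORY_TOKENS  # 512


def split_into_segments(token_ids, segment_tokens=SEGMENT_TOKENS):
    """Cut a long token list into consecutive chunks of segment_tokens."""
    return [token_ids[i:i + segment_tokens]
            for i in range(0, len(token_ids), segment_tokens)]


def curriculum(max_segments, train_stage):
    """Train on 1, 2, ..., max_segments segments, one stage at a time.

    train_stage(num_segments) is a caller-supplied function that is assumed
    to return only once training on that task length has converged.
    """
    history = []
    for num_segments in range(1, max_segments + 1):
        steps = train_stage(num_segments)
        history.append((num_segments, steps))
    return history


segments = split_into_segments(list(range(1000)))
history = curriculum(3, lambda num_segments: 100)  # dummy stage for illustration
```

Each stage starts from the weights of the previous one in a real setup; the dummy callback here only records the order in which task lengths are visited.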
Extrapolation ability
What is the generalization ability of RMT to different sequence lengths? To answer this question, the researchers evaluated models trained on different numbers of segments to solve longer tasks, as shown in Figure 5 below.
They observed that models tend to perform better on shorter tasks. The only exception is the single-segment reasoning task, which becomes very difficult to solve once the model has been trained on longer sequences. One possible explanation is that, because the task size exceeds one segment, the model no longer "expects" the question in the first segment, which leads to a drop in quality.
Interestingly, RMT's ability to generalize to longer sequences also emerges as the number of training segments grows. After training on 5 or more segments, RMT generalizes almost perfectly to tasks twice as long.
To test the limits of generalization, the researchers increased the validation task size to 4096 segments, or 2,043,904 tokens (as shown in Figure 1 above). RMT held up surprisingly well on such long sequences. The detection and memory task was the easiest, and the reasoning task the most complex.
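The token count follows directly from the segment size. A quick check, assuming the 499 content tokens per segment stated earlier (the variable names are only for illustration):

```python
SEGMENT_TOKENS = 499          # content tokens per segment, from the text
num_segments = 4096           # validation task size, from the text

total_tokens = num_segments * SEGMENT_TOKENS
assert total_tokens == 2_043_904
```

The figure counts content tokens only; the special and memory tokens added to every segment are not included in it.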
Attention pattern of memory operations
In Figure 6 below, examining RMT's attention on specific segments, the researchers observed that memory operations correspond to specific patterns in attention. Furthermore, the high extrapolation performance on extremely long sequences in Section 5.2 demonstrates the effectiveness of the learned memory operations, even when they are applied thousands of times.
Please refer to the original paper for more technical and experimental details.
