


The Transformer has become a popular choice for a wide range of machine learning tasks and has delivered strong results, so what else can it do? Researchers with plenty of imagination want to use it to design programmable computers!
The authors of this paper are from Princeton University and the University of Wisconsin. The paper, titled "Looped Transformers as Programmable Computers," explores how to use a Transformer to implement a general-purpose computer.
Specifically, the authors propose a framework for using transformer networks as general-purpose computers by programming them with specific weights and placing them in loops. In this framework, the input sequence acts as a punch card, consisting of instructions and memory for data reading/writing.
The authors demonstrate that a constant number of encoder layers can simulate basic computational blocks. Using these building blocks, they simulate a small instruction-set computer. This allows them to map iterative algorithms to programs that can be executed by a looped, 13-layer Transformer. They show how this Transformer, guided entirely by its input, can simulate a basic calculator, a basic linear algebra library, and an in-context learning algorithm that uses backpropagation. This work highlights the versatility of attention mechanisms and demonstrates that even shallow Transformers can execute full-blown general-purpose programs.
Paper Overview
The Transformer (TF) has become a popular choice for a wide range of machine learning tasks and has achieved state-of-the-art results on many problems in fields such as natural language processing and computer vision. A key reason for the Transformer's success is its ability to capture higher-order relationships and long-range dependencies through the attention mechanism. This allows TF to model contextual information and makes it especially effective in tasks such as machine translation and language modeling, where it consistently outperforms other methods.
Language models with hundreds of billions of parameters, such as GPT-3 (175B parameters) and PaLM (540B parameters), have achieved state-of-the-art performance on many natural language processing tasks. Interestingly, some of these large language models (LLMs) can also perform in-context learning (ICL), adapting to and performing specific tasks on the fly based on a short prompt and a few examples. The ICL capabilities of LLMs are available without any task-specific training and allow these large models to efficiently perform new tasks without updating their weights.
Surprisingly, through ICL, LLMs can carry out algorithmic tasks and reasoning, as demonstrated by Nye et al. [2021], Wei et al. [2022c], Lewkowycz et al. [2022], Wei et al. [2022b], Zhou et al. [2022], and others. For example, the work of Zhou et al. [2022] shows that an LLM can successfully perform addition on unseen examples when prompted with a multi-digit addition algorithm and a few addition examples. These results show that, at inference time, an LLM can apply algorithmic principles and execute pre-instructed commands on given inputs, as if interpreting natural language as code.
There is also evidence that Transformers can simulate Turing machines, either with recursive links between attention layers or with sufficient depth [Pérez et al. [2021], Pérez et al. [2019], Wei et al. [2022a]]. This demonstrates the potential of Transformer networks to precisely follow algorithmic instructions specified by the input. However, these constructions are relatively general and do not offer much insight into how to build a Transformer that performs a specific algorithmic task.
More specialized designs, however, allow TF to execute more advanced programs. For example, Weiss et al. [2021] designed a computational model and a programming language, the Restricted Access Sequence Processing Language (RASP), that maps simple selection and aggregation commands onto indexed input tokens. The language can express a variety of interesting algorithms, such as counting tokens, sorting, building histograms, and recognizing Dyck-k languages. RASP programs can then be mapped onto Transformer networks, whose size typically scales with the size of the program.
Another line of work has shown how to choose Transformer weights so that the model acts as an optimization algorithm for learning linear regression models on the fly, performing implicit training steps at inference time when given training data as input. These methods typically require a number of layers proportional to the number of iterations of the learning algorithm, and they are restricted to a single loss function and model family.
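To make the flavor of these constructions concrete, the sketch below shows the kind of update such a construction bakes into each layer: one gradient-descent step on a least-squares loss, which prior work shows a Transformer layer can emulate for in-context linear regression. The function name and step size are illustrative assumptions, not the cited papers' constructions.

```python
import numpy as np

def gd_step(w, X, y, lr=0.01):
    """One gradient-descent step on the least-squares loss 0.5 * ||Xw - y||^2.
    In the constructions cited above, each layer emulates roughly one such
    step, which is why depth grows with the number of iterations."""
    grad = X.T @ (X @ w - y)   # gradient of the squared-error loss
    return w - lr * grad
```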
The ability to program Transformer models to simulate the abstract computation of Turing machines, specialized commands for languages like RASP, and specific algorithms for ICL highlights the potential of Transformer networks as versatile programmable computers.
The authors' research aims to explore this promising direction and reveal how attention mechanisms can simulate a general-purpose computer inspired by instruction set architectures.
Transformers as Programmable Computers
In this paper, the authors show that Transformer networks can simulate complex algorithms and programs when they are hard-coded with specific weights and placed in a loop. The authors accomplish this by reverse-engineering attention to simulate basic computational blocks, such as editing operations on the input sequence, nonlinear functions, function calls, program counters, and conditional branches. The paper demonstrates the importance of using a single loop, or recursion, to feed the Transformer's output sequence back into its input, thereby avoiding the need for a deep model.
Paper address: https://arxiv.org/pdf/2301.13196.pdf
The authors achieve this by designing a Transformer that can execute a program written in a generalized version of a single instruction called SUBLEQ(A, B, C), short for "SUBtract and branch if Less than or EQual to zero." SUBLEQ is a single-instruction language that defines a one-instruction set computer (OISC). A SUBLEQ instruction takes three memory-address operands: during execution, the value at address A is subtracted from the value at address B and the result is stored back at B; if the result is less than or equal to zero, execution jumps to address C, otherwise it continues with the next instruction. This single instruction is enough to define a general-purpose computer.
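For readers unfamiliar with SUBLEQ, the minimal interpreter below captures the instruction semantics just described. It is only a reference for the semantics, not the paper's Transformer construction; the halting convention (the program counter running past the last line) is an assumption for illustration.

```python
def run_subleq(mem, program, max_steps=10_000):
    """mem: list of integers; program: list of (a, b, c) address triples.
    Each step executes mem[b] -= mem[a] and jumps to instruction c if the
    result is <= 0, otherwise falls through to the next instruction."""
    pc = 0
    for _ in range(max_steps):
        if pc < 0 or pc >= len(program):   # program counter out of range: halt
            break
        a, b, c = program[pc]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 1
    return mem
```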
The authors construct an explicit Transformer that executes SUBLEQ-like programs; they call their variant FLEQ, a more flexible single instruction that applies a function f_m to its memory operands and branches on the result (see the paper for the exact instruction format), where f_m can be selected from a set of functions (matrix multiplication, nonlinear functions, polynomials, etc.) that can be hard-coded into the network. The depth of a looped Transformer that executes a FLEQ program does not depend on the depth or number of lines of the program, but only on the depth needed to implement a single FLEQ instruction, which is constant. This is achieved by running the Transformer in a loop over the input sequence, much like how a CPU operates.
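The sketch below conveys the idea in conventional code: the fixed subtraction of SUBLEQ is replaced by a function f_m chosen from a hard-coded library, while the flow control stays the same. The operand layout is an assumption for illustration (scalar version; in the paper operands can also be matrices and vectors), not the paper's exact FLEQ format.

```python
def run_fleq(mem, program, functions, max_steps=10_000):
    """program: list of (m, a, b, dest, c) tuples -- function index, two input
    addresses, an output address, and a branch target (layout assumed here).
    functions: library of hard-coded operations, e.g. multiply, a nonlinearity."""
    pc = 0
    for _ in range(max_steps):
        if pc < 0 or pc >= len(program):
            break
        m, a, b, dest, c = program[pc]
        mem[dest] = functions[m](mem[a], mem[b])   # generalized "subtract"
        pc = c if mem[dest] <= 0 else pc + 1       # same flow control as SUBLEQ
    return mem
```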
The authors use this framework to demonstrate the ability to simulate a variety of functions at inference time, including a basic calculator, a basic linear algebra library (matrix transpose, multiplication, inversion, and power iteration), and in-context learning that implements backpropagation on an implicit fully connected network. The input sequence, or prompt, acts like a punch card: it contains the instructions the Transformer needs to execute, while also providing space for storing and manipulating the variables used in the program. The Transformer networks used to execute these procedures all have depth at most 13, and the weight matrices for all of these models are provided. The following theorem summarizes the authors' main findings:
Theorem 1: There exists a looped Transformer with fewer than 13 layers that can simulate a general-purpose computer (Section 5 of the paper), a basic calculator (Section 7), numerical linear algebra methods such as approximate matrix inversion and power iteration (Section 8), and in-context learning algorithms based on neural networks, such as SGD (Section 9).
Figure 1: Schematic of the looped Transformer architecture, in which the input sequence contains the stored instructions, a memory section for reading/writing data, and a scratchpad for intermediate results. The input is processed by the network and the output is used as the new input, allowing the network to iteratively update an implicit state and perform complex computations.
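The outer loop in Figure 1 is simple to state in code: the same fixed-weight Transformer is applied again and again, with its output sequence fed back in as the next input, so all state lives in the sequence itself. The function names below are placeholders for illustration, not the authors' API.

```python
def run_looped_transformer(transformer, X, num_iterations):
    """transformer: any fixed-weight sequence-to-sequence function.
    X: the input sequence holding scratchpad, memory, and instructions."""
    for _ in range(num_iterations):
        X = transformer(X)   # one pass of the loop ~ one step of the program
    return X
```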
The authors' research emphasizes the flexibility of the attention mechanism and the importance of a single loop, which together make it possible to design models that simulate complex iterative algorithms and execute general programs. It further demonstrates the ability of Transformer models to efficiently perform complex mathematical and algorithmic tasks. It is conceivable that modern Transformers, such as GPT-3, use similar internal subroutines when performing various tasks. To some extent, these model-specific techniques or algorithmic capabilities can be elicited when the model is given in-context examples and explanations, much like function calls. However, this assumption should be treated with caution, because the way the authors design their construction bears no resemblance to how real-world language models are trained.
The authors hope that their study will encourage further research into the potential of attention mechanisms and the ability of language models to execute algorithmic instructions. The proposed design can help determine the minimum Transformer network size required to perform a specific algorithmic task. The authors also hope that their findings will inspire methods for enhancing the training of language models by leveraging smaller, reverse-engineered Transformer networks that accomplish specific algorithmic tasks.
Building a Transformer module for general computing
To build a general computing framework out of Transformer networks, specialized computational blocks are required, which are then assembled to create the desired functionality. The following focuses on the various operations a Transformer layer can perform. These operations provide the basis for building more complex routines and algorithms, and they are designed to be interoperable with one another, exploiting attention's ability to perform a variety of tasks, such as generating approximate permutation matrices and approximating general functions via sigmoids.
Figure 2: Schematic of the three Transformer blocks used as building blocks for implementing a small instruction-set computer. These blocks handle editing operations on the input sequence (such as moving or copying from one block to another), tracking the program counter, and jumping the program counter when a specified condition is met.
Position encoding, program counter and data pointer
A Transformer is often required to run iterative algorithms or execute a sequence of commands. To achieve this, the authors use a program counter that iterates over the commands. The counter contains the encoding of the location where the next command is stored. In addition, a command may have data pointers pointing to the data locations it needs to read from or write to. Both the program counter and the data pointers use the same positional encoding scheme described in the paper.
The authors' positional encoding scheme can also be used to point to specific data locations for reading or writing, as discussed in the next section. This is achieved by using the same binary vectors as position encodings for both the program counter and the data pointers. This technique of pointing to specific data locations enables the Transformer to efficiently read and write data while executing the algorithm or sequence of commands it was built to implement.
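The sketch below illustrates the pointer idea with binary ±1 position codes: a pointer is simply a copy of the code of the location it targets, the inner product between pointer and codes peaks at the matching column, and a sharp softmax over those scores acts approximately like a one-hot read. The code construction and softmax temperature here are assumptions for illustration, not the paper's exact weights.

```python
import numpy as np

def binary_codes(num_positions, num_bits):
    # One +/-1 column per position: bit b of position p, mapped to {+1, -1}.
    return np.array([[1 if (p >> b) & 1 else -1 for p in range(num_positions)]
                     for b in range(num_bits)])

num_positions, num_bits = 8, 3
P = binary_codes(num_positions, num_bits)       # (num_bits, num_positions)
values = np.arange(10.0, 10.0 + num_positions)  # stand-in "memory" contents

pointer = P[:, 5]             # data pointer to position 5
scores = pointer @ P          # inner products, maximal at column 5
weights = np.exp(10.0 * scores)   # sharp softmax ~ one-hot selection
weights /= weights.sum()
print(values @ weights)       # ~ 15.0, i.e. it "reads" mem[5]
```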
Read/write: copying data and instructions to and from the scratchpad
Figure 3: Schematic of the read operation. The arrow shows a block of commands being copied from the part of the input allocated to instructions into the scratchpad. An instruction is a set of pointers. Positional encodings and counters are used to track what is copied where.
The following lemma states that the command pointed to by the program counter, or the data at a location specified in the current command, can be copied to the scratchpad for further computation. The program counter is usually placed directly below the contents of the scratchpad, but its position can be chosen arbitrarily; keeping it in one place throughout the computation helps keep the construction well organized.
The next lemma explains that a vector v stored in a scratchpad can be copied to a specified location in memory, as specified by the scratchpad itself. This allows data to be transferred from the scratchpad to a specific location in memory for further use or storage.
Figure 4: Schematic diagram of write operation. The arrow shows that the block of data is being copied from the scratchpad to the specified location in the input section allocated to memory. Positional encoding is used to track target locations and ensure data is written to the correct memory location.
Conditional branch
In this part, the author implements a conditional branch instruction that evaluates a condition and sets the program counter to a specified location when the condition is true, or increments the program counter by 1 when the condition is false.
The command takes the form: if mem[a] ≤ 0, then goto i, where mem[a] is the value at some location in the memory part of the input sequence. The command has two parts: evaluating the inequality and modifying the program counter.
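Since a Transformer has no native control flow, the branch has to be expressed as straight-line arithmetic. The sketch below shows one simple way to do this, selecting between the two candidate program-counter values with a 0/1 flag; it is in the spirit of, though not identical to, the paper's construction, which approximates the comparison with sigmoid-like units and selects between position encodings.

```python
def next_pc(pc, mem, a, i):
    """Return the next program counter for 'if mem[a] <= 0 goto i'."""
    flag = 1.0 if mem[a] <= 0 else 0.0             # the paper approximates this test smoothly
    return int(flag * i + (1.0 - flag) * (pc + 1))  # blend of the two possible targets
```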
Simulating a Universal Single Instruction Set Computer
SUBLEQ Transformer
Back in 1988, Mavaddat and Parhami demonstrated that there exists a single instruction such that any computer program can be converted into a program consisting of instantiations of that one instruction. One variant of this instruction is SUBLEQ, which can access different registers or memory locations.
The way SUBLEQ works is simple: it accesses two registers in memory, takes the difference of their contents, and stores the result back into one of the registers; then, if the result is less than or equal to zero, it jumps to a different, predefined line of code, otherwise it continues with the next instruction. A computer built to execute SUBLEQ programs is called a one-instruction set computer and is a general-purpose computer, i.e., it is Turing complete if it has access to unbounded memory.
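As a small worked example of this universality, addition can be synthesized from SUBLEQ alone. The snippet below uses the run_subleq sketch given earlier; the memory layout is a hypothetical example, with every branch target set to the next line so control flow stays linear.

```python
# mem[0] = x, mem[1] = y, mem[2] = 0 (a scratch cell "Z")
mem = [5, 7, 0]
program = [
    (0, 2, 1),  # Z = Z - x  -> -x
    (2, 1, 2),  # y = y - Z  -> y + x
    (2, 2, 3),  # Z = Z - Z  -> 0; pc then runs past the program and halts
]
print(run_subleq(mem, program))  # -> [5, 12, 0], i.e. y now holds x + y
```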
The following describes the construction of a looped Transformer that can execute programs written in this single-instruction language. The Transformer keeps track of the lines of code, the memory locations, and the program counter, using the memory part of the input as memory registers and the command part as lines of code/instructions. The scratchpad is used to record the intermediate results and pointers involved in each instruction's read, write, and conditional-branch operations.
Figure 5: Graphical representation of the implementation of an OISC instruction block. The first two blocks copy the data/commands to the scratchpad, the next blocks perform the subtraction and store the result, and the last one executes the if-goto command that completes the instruction.
FLEQ: A more flexible Attention-based computer
In this section, the authors introduce FLEQ, a generalization of SUBLEQ that defines a more flexible reduced-instruction-set computer. This instruction is based on a more advanced version of SUBLEQ and allows multiple functions to be implemented within the same Transformer network. The authors use the term FLEQ to refer to the instruction, the language, and the attention-based computer it defines.
FLEQ is designed so that complex algorithms such as matrix multiplication, square-root computation, activation functions, and so on can be implemented by providing functions that are more general than simple subtraction.
Attention-based computer execution cycle. On each iteration of the loop Transformer, an instruction is fetched from the instruction set in the input based on the program counter. The instruction is then copied to the scratchpad. Depending on the function to be implemented, different function block locations are used to locally record the results of the function. Once the result is calculated, it is copied back to the specified memory location provided by the instruction.
The execution cycle is similar to that of the one-instruction set computer (OISC) from the previous section; the main difference is that, for each instruction, one can choose from a list of pre-selected functions whose inputs can be arbitrary arrays, such as matrices, vectors, and scalars.
The format of the input sequence. As shown in Figure 6, the input X to the looped Transformer, which can execute a program composed of a series of FLEQ instructions, consists of three parts: scratchpad, memory, and instructions.
Format of Transformer-based function block. Each function block is located in the lower left portion of input X, as shown in Figure 6.
Figure 6: Structure of input X for executing FLEQ command
Please refer to the original paper for more technical details and examples.