


In 2017, the Google Brain team introduced the Transformer architecture in the paper "Attention Is All You Need". The work has been a resounding success and Transformer has since become one of the most popular models in NLP: it is widely used across language tasks and has achieved many SOTA results.
Not only that: having led the way in NLP, Transformer quickly swept across fields such as computer vision (CV) and speech, achieving strong results in tasks such as image classification, object detection, and speech recognition.
Paper address: https://arxiv.org/pdf/1706.03762.pdf
Since its introduction, Transformer has become the core module of many models. Familiar models such as BERT and T5 are all built on Transformer. Even ChatGPT, which has recently become so popular, relies on Transformer, for which Google has already been granted a patent.
Source: https://patentimages.storage.googleapis.com/05/e8/f1/cd8eed389b7687/US10452978.pdf
In addition, the GPT (Generative Pre-trained Transformer) series of models released by OpenAI carries Transformer in its very name, which shows that Transformer is the core of the GPT family.
At the same time, OpenAI co-founder Ilya Sutskever recently recalled, when talking about Transformer, that the day after the paper was released he could not wait to switch his previous research to Transformer, and GPT followed from there. The importance of Transformer is self-evident.
Over the past 6 years, models based on Transformer have continued to grow and develop. Now, however, someone has discovered an error in the original Transformer paper.
Transformer architecture diagram and code are "inconsistent"

The person who discovered the error is Sebastian Raschka, a well-known machine learning and AI researcher and the chief AI educator at the startup Lightning AI. He pointed out that the architecture diagram in the original Transformer paper is incorrect: it places layer normalization (LN) between the residual blocks, which is inconsistent with the code.
The Transformer architecture diagram from the original paper is shown on the left; on the right is the Post-LN Transformer layer (from the paper "On Layer Normalization in the Transformer Architecture" [1]).
The inconsistent part of the code is shown below. Line 82 sets the execution order with layer_postprocess_sequence="dan", meaning the post-processing applies dropout, residual_add, and layer_norm in that order. If the Add & Norm box in the middle-left of the diagram is read as "Add sits above Norm", i.e. norm first and then add, then the code is indeed inconsistent with the diagram.
Code address:
https://github.com/tensorflow/tensor2tensor/commit/f5c9b17e617ea9179b7d84d36b1e8162cb369f25#diff-76e2b94ef16871bdbf46bf04dfe7f1477bafb884748f08197c9cf1b10a4dd78e…
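To make the "dan" ordering concrete, here is a minimal PyTorch sketch of one self-attention sub-layer whose post-processing runs dropout, then the residual add, then layer norm. This is not the tensor2tensor code itself; the module names, dimensions, and dropout rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PostLNSelfAttention(nn.Module):
    """One self-attention sub-layer wired in the Post-LN order that the
    "dan" setting describes: dropout -> residual add -> layer norm.
    Hyperparameters are illustrative, not taken from the paper."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)               # sub-layer output
        return self.norm(x + self.dropout(attn_out))   # d (dropout) -> a (add) -> n (norm)


# Quick shape check on random data.
if __name__ == "__main__":
    block = PostLNSelfAttention()
    x = torch.randn(2, 10, 512)   # (batch, sequence, d_model)
    print(block(x).shape)         # torch.Size([2, 10, 512])
```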
Sebastian went on to note that the paper "On Layer Normalization in the Transformer Architecture" argues that Pre-LN performs better and can solve the gradient problem. This is what many or most architectures do in practice, but it can lead to representation collapse.
Better gradients can be achieved when layer normalization is placed in the residual connection before the attention and fully connected layers.
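For comparison, here is a Pre-LN version of the same sub-layer (again a sketch with assumed dimensions, using the same imports as above): the layer norm moves inside the residual branch, before attention, and the skip path carries the raw input through unchanged.

```python
import torch
import torch.nn as nn

class PreLNSelfAttention(nn.Module):
    """The same sub-layer with Pre-LN wiring: layer norm is applied
    before attention, inside the residual branch, so the identity
    (skip) path is left untouched. Dimensions are illustrative."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                    # norm first, inside the branch
        attn_out, _ = self.attn(h, h, h)    # sub-layer on the normalized input
        return x + self.dropout(attn_out)   # residual add outside the norm
```

Because the identity path never passes through a normalization layer, gradients can flow straight from the output back to early layers, which is the property credited with making Pre-LN easier to train.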
So while the debate between Post-LN and Pre-LN continues, another paper, "ResiDual: Transformer with Dual Residual Connections" [2], combines the two approaches.
Regarding Sebastian's discovery, some commenters observed that papers inconsistent with their code or results are common. Most such discrepancies are honest mistakes, though sometimes they are odd. Considering the popularity of the Transformer paper, this inconsistency should have been mentioned a thousand times already.
Sebastian replied that, to be fair, the "most original" code was indeed consistent with the architecture diagram, but the code version submitted in 2017 was modified while the architecture diagram was not updated, which is really confusing.
As one netizen put it: "The worst thing about reading code is that you often find small changes like this and don't know whether they were intentional. You can't even test it, because you don't have enough compute to train the model."
What will Google do next: update the code, or the architecture diagram? We will wait and see!
