
A flaw in the pioneering Transformer paper? The figure doesn't match the code, and the mysterious bug left netizens dumbfounded

WBOY | 2023-05-11

Today, the AI community was stunned by a startling discovery.

Netizens found that the architecture diagram in "Attention Is All You Need", the foundational NLP work from Google Brain that introduced the Transformer architecture, is inconsistent with the code.


Paper address: https://arxiv.org/abs/1706.03762

Since its introduction in 2017, the Transformer has become a cornerstone of the AI field; it is also the real power behind the wildly popular ChatGPT.

In 2019, Google also applied for a patent specifically for it.


Tracing back to the source, the endless stream of GPT models (Generative Pre-trained Transformers) all originate from this 2017 paper.

According to Google Scholar, so far, this foundational work has been cited more than 70,000 times.


So, is the foundation of ChatGPT not so solid after all?

Is the architecture diagram in the "original" paper actually wrong?

Sebastian Raschka, a machine learning researcher at Lightning AI, discovered that the Transformer diagram in the paper is wrong.


In the circled area of the figure, the LayerNorm comes after the attention and fully connected layers. Placing layer normalization between the residual blocks in this way results in large expected gradients for the parameters near the output layer.

Also, this is inconsistent with the code.


Code address: https://github.com/tensorflow/tensor2tensor/commit/f5c9b17e617ea9179b7d84d36b1e8162cb369f25#diff-76e2b94ef16871bdbf46bf04dfe7f1477bafb884748f08197c9cf1b10a4dd78e

However, some netizens pointed out that Noam Shazeer corrected the code a few weeks later.


Subsequently, Sebastian noted that in the paper "On Layer Normalization in the Transformer Architecture", Pre-LN performs better and can resolve the gradient problem.


This (Pre-LN) is what many, if not most, architectures adopt in practice, but it can lead to representation collapse.

Better gradients are achieved when layer normalization is placed inside the residual branch, before the attention and fully connected layers, as the sketch below illustrates.
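To make the difference concrete, here is a minimal PyTorch sketch of the two placements under discussion: a Post-LN block as drawn in the paper's figure, and a Pre-LN block as in the later code. This is an illustrative toy, not the tensor2tensor implementation; the module names, dimensions, and the use of nn.MultiheadAttention are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm is applied AFTER each residual addition (as in the paper's figure)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])  # residual first, then normalize
        x = self.ln2(x + self.ff(x))
        return x

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied INSIDE the residual branch, before attention / feed-forward."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]            # normalize first, then add the residual
        x = x + self.ff(self.ln2(x))
        return x

# Both blocks map (batch, seq_len, d_model) -> (batch, seq_len, d_model):
x = torch.randn(2, 16, 512)
print(PostLNBlock()(x).shape, PreLNBlock()(x).shape)
```

The only difference between the two classes is where the LayerNorm sits relative to the residual addition, which is exactly the discrepancy between the figure and the updated code.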


Sebastian pointed out that although the discussion about using Post-LN or Pre-LN is still ongoing, there is also a new paper proposing to combine the two.


Paper address: https://arxiv.org/abs/2304.14802

In this dual-residual Transformer, the problems of representation collapse and vanishing gradients are both addressed.
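For intuition only, here is a rough Python sketch of the dual-residual idea: one stream is updated Post-LN style, a second stream accumulates the raw sub-layer outputs as in Pre-LN, and the two are combined at the output. This is a simplified reading of the approach, not the ResiDual authors' code; the sub-layers are stand-in feed-forward modules, and all names, dimensions, and the final combination rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualResidualStack(nn.Module):
    """Toy dual-residual stack. Each 'sublayer' is a plain feed-forward module
    standing in for the attention / feed-forward sub-layers of a real Transformer."""
    def __init__(self, d_model=512, d_ff=2048, n_sublayers=6):
        super().__init__()
        self.sublayers = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_sublayers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_sublayers)])
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        post_stream = x   # Post-LN stream: normalized after every residual addition
        dual_stream = x   # Pre-LN-style stream: accumulates raw sub-layer outputs
        for sublayer, norm in zip(self.sublayers, self.norms):
            out = sublayer(post_stream)
            post_stream = norm(post_stream + out)
            dual_stream = dual_stream + out
        # Combine both streams at the output (simplified; see the paper for the exact scheme).
        return post_stream + self.final_norm(dual_stream)

x = torch.randn(2, 16, 512)
print(DualResidualStack()(x).shape)  # torch.Size([2, 16, 512])
```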


Hot discussion among netizens

Regarding this questionable point in the paper, one netizen asked: isn't there already something in the middle? Haven't Pre-LN and Post-LN both been studied?

Sebastian replied that he found it a bit odd too. Perhaps the second LN refers to the final output layer rather than to each Transformer block, but he was not sure about that either.


Some netizens commented: "We often come across papers whose figures do not match the code or the results. Most of these are honest mistakes, but sometimes it is quite strange. This paper has been circulating for so long; why has this question never been raised before? It is really odd."


Sebastian said that, to be fair, the original code was consistent with the figure, but the code was modified in 2017 without the figure being updated, which is what makes this confusing.


Another netizen noted that NormFormer already demonstrates a less complex architecture, and that its results were recently confirmed by his team. Surprisingly, the ResiDual paper does not mention NormFormer anywhere.


So, does the paper really have a flaw, or is this just a mix-up? Let us wait and see what happens next.
