
Pixel Transformers (PiTs) Challenge the Need for Locality Bias in Vision Models

Meta AI and researchers from the University of Amsterdam have demonstrated that transformers, a popular neural network architecture, can operate directly on individual pixels of an image, without relying on the locality inductive bias present in most modern computer vision models.

Their study, titled "An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels," challenges the long-held belief that locality – the notion that neighboring pixels are more related than distant ones – is a fundamental requirement for vision tasks.

Traditionally, computer vision architectures like Convolutional Neural Networks (ConvNets) and Vision Transformers (ViTs) have built in a locality bias through techniques such as convolutional kernels, pooling operations, and patchification, all of which assume that neighboring pixels are more closely related than distant ones.
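
For intuition, here is a minimal PyTorch sketch of the patchification step (illustrative code, not from the paper); fusing each 16x16 neighborhood into a single token is precisely where the locality assumption gets baked in:

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split a batch of images into non-overlapping patch tokens (ViT-style).

    Fusing each patch_size x patch_size neighborhood into one token is
    exactly where the locality assumption enters: nearby pixels are merged
    before self-attention ever sees them.
    """
    b, c, h, w = images.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    patches = images.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()
    return patches.view(b, -1, c * p * p)              # (B, num_patches, p*p*C)

x = torch.randn(1, 3, 224, 224)
print(patchify(x).shape)  # torch.Size([1, 196, 768])
```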

In contrast, the researchers introduced Pixel Transformers (PiTs), which treat each pixel as an individual token and discard any assumption about the 2D grid structure of images. Surprisingly, PiTs achieved strong results across a variety of vision tasks.
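
A per-pixel tokenizer, by comparison, is almost trivially simple. The sketch below (again illustrative, not the authors' implementation) turns every pixel into a token whose features are just its channel values, leaving positional information to be supplied separately:

```python
import torch

def pixels_to_tokens(images: torch.Tensor) -> torch.Tensor:
    """Treat every pixel as its own token (the PiT input representation).

    No patch grouping is applied, so no 2D-grid assumption is baked into
    the token set; positional information has to be supplied separately,
    e.g. via learned position embeddings.
    """
    b, c, h, w = images.shape
    return images.flatten(2).transpose(1, 2)  # (B, C, H, W) -> (B, H*W, C)

x = torch.randn(1, 3, 28, 28)
print(pixels_to_tokens(x).shape)  # torch.Size([1, 784, 3]): one token per pixel
```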

For instance, when PiTs were applied to image generation tasks using latent token spaces from VQGAN, they outperformed their locality-biased counterparts on quality metrics like Fréchet Inception Distance (FID) and Inception Score (IS).

While PiTs, which operate along the lines of Perceiver IO transformers, can be computationally expensive because each image yields a far longer token sequence, they challenge the need for locality bias in vision models. As techniques for handling long sequences mature, PiTs may become more practical.
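
The cost is easy to quantify with back-of-the-envelope arithmetic (my numbers for a standard input size, not figures from the paper):

```python
# Token counts for a standard 224x224 RGB image:
patch_tokens = (224 // 16) ** 2   # ViT with 16x16 patches -> 196 tokens
pixel_tokens = 224 * 224          # one token per pixel    -> 50,176 tokens
print(pixel_tokens / patch_tokens)         # 256x longer sequence
# Self-attention scales quadratically with sequence length, so the
# per-layer attention cost grows by roughly:
print((pixel_tokens / patch_tokens) ** 2)  # ~65,536x
```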

The study ultimately highlights the potential benefits of reducing inductive biases in neural architectures, which could lead to more versatile and capable systems for diverse vision tasks and data modalities.

News source: https://www.kdj.com/cryptocurrencies-news/articles/pixel-transformers-pits-challenge-locality-bias-vision-models.html
