
Five promising AI models for image translation

Apr 23, 2023, 10:55 AM
Tags: AI, Neural Networks, graphic design

Image-to-image translation

According to the definition provided by Solanki, Nayyar, and Naved in their paper, image-to-image translation is the process of converting an image from one domain to another, with the goal of learning a mapping between input images and output images.

In other words, we want the model to transform an image a into another image b by learning a mapping function f.


Some people may wonder what these models are for and what relevance they have in the world of artificial intelligence. The applications are many, and they are not limited to art or graphic design. For example, being able to take an image and convert it into another image to create synthetic data (such as a segmented image) is very useful for training self-driving car models. Another tested application is map design, where a model can perform both transformations (satellite view to map and vice versa). Image translation can also be applied to architecture, with models making recommendations on how to complete unfinished projects.

One of the most compelling applications of image translation is transforming a simple drawing into a beautiful landscape or painting.

5 Most Promising AI Models for Image Translation

Over the past few years, several methods have been developed to solve the problem of image-to-image translation by leveraging generative models. The most commonly used methods are based on the following architectures:

  • Generative Adversarial Network (GAN)
  • Variational Autoencoder (VAE)
  • Diffusion models
  • Transformers

Pix2Pix

Pix2Pix is a conditional GAN-based model. This means its architecture is composed of a generator network (G) and a discriminator (D). Both networks are trained in an adversarial game, where G's goal is to generate new images that resemble the dataset, and D has to decide whether an image is generated (fake) or comes from the dataset (real).

The main differences between Pix2Pix and other GAN models are: (1) the generator takes an image as input to start the generation process, while an ordinary GAN uses random noise; (2) Pix2Pix is a fully supervised model, which means the dataset consists of pairs of images from the two domains.

The architecture described in the paper uses a U-Net for the generator and a Markovian discriminator, or patch discriminator (PatchGAN), for the discriminator (a code sketch follows the list):

  • U-Net: composed of two modules (downsampling and upsampling). The input image is reduced through convolutional layers to a set of smaller images (called feature maps), which are then upsampled via transposed convolutions until the original input dimensions are reached. Skip connections link corresponding downsampling and upsampling layers.
  • Patch discriminator: a convolutional network whose output is a matrix in which each element is the evaluation of one part (patch) of the image. The training objective also includes the L1 distance between the generated and real images, which pushes the generator to learn the correct mapping for a given input image. It is called Markovian because it relies on the assumption that pixels in different patches are independent of one another.
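
To make the patch idea concrete, here is a minimal PyTorch sketch of a PatchGAN-style discriminator together with the combined cGAN + L1 generator objective. This is an illustration of the technique, not the authors' original implementation; the layer widths and the lambda = 100 weight are conventional Pix2Pix choices.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN: outputs a matrix of scores, one per image patch."""
    def __init__(self, in_channels=6):  # input and target images concatenated (3 + 3)
        super().__init__()
        def block(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 4, stride=stride, padding=1),
                nn.BatchNorm2d(cout),
                nn.LeakyReLU(0.2, inplace=True),
            )
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            block(64, 128, 2),
            block(128, 256, 2),
            block(256, 512, 1),
            nn.Conv2d(512, 1, 4, stride=1, padding=1),  # (B, 1, H', W') patch scores
        )

    def forward(self, src, img):
        # Conditional GAN: the discriminator also sees the input image.
        return self.net(torch.cat([src, img], dim=1))

bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()

def generator_loss(D, src, fake, real, lambda_l1=100.0):
    """Adversarial loss on every patch, plus L1 distance to the real image."""
    pred = D(src, fake)
    return bce(pred, torch.ones_like(pred)) + lambda_l1 * l1(fake, real)
```

Because each output score depends only on one local region, the discriminator judges texture realism patch by patch, while the L1 term ties the global structure of the output to the ground truth.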


Pix2Pix results

Unsupervised Image-to-Image Translation (UNIT)

In Pix2Pix, the training process is fully supervised (i.e., we need pairs of corresponding images). The purpose of the UNIT method is to learn a function that maps image A to image B without training on paired images.

The model starts from the assumption that the two domains (A and B) share a common latent space (Z). Intuitively, we can think of this latent space as an intermediate stage between image domains A and B. So, using the painting-to-image example, we can use the same latent code to go backward to the painting or forward to the stunning photo-like image (see the figure below).

In the figure: (a) the shared latent space; (b) the UNIT architecture, where X1 and X2 are images from the two domains, E1 and E2 are encoders, G1 and G2 are generators, and D1 and D2 are discriminators. Dashed lines represent shared layers between the networks.

The UNIT model is built as a pair of VAE-GANs (see the figure above), where the last layers of the encoders (E1, E2) and the first layers of the generators (G1, G2) are shared.
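
A minimal sketch of this weight-sharing idea, assuming toy convolutional encoders and generators (the VAE sampling step and the adversarial losses are omitted): the shared layers are literally the same module objects in both domain networks, which is what couples the two domains to the common latent space Z.

```python
import torch
import torch.nn as nn

# The last encoder layer and the first generator layer are single shared modules.
shared_enc = nn.Conv2d(128, 256, 3, padding=1)
shared_gen = nn.ConvTranspose2d(256, 128, 3, padding=1)

def make_encoder(shared):
    return nn.Sequential(
        nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        shared,                                   # E1 and E2 share this layer
    )

def make_generator(shared):
    return nn.Sequential(
        shared,                                   # G1 and G2 share this layer
        nn.ReLU(),
        nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
    )

E1, E2 = make_encoder(shared_enc), make_encoder(shared_enc)
G1, G2 = make_generator(shared_gen), make_generator(shared_gen)

# Translate domain A -> B: encode with E1 into Z, decode with G2.
x_a = torch.randn(1, 3, 64, 64)
x_ab = G2(E1(x_a))
```

Gradient updates flowing through either domain also update the shared layers, so both encoders are pushed to map corresponding images to the same latent code.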


UNIT results

Palette

Palette is a conditional diffusion model developed by a Google Research team in Canada. The model is trained to perform four different image-to-image translation tasks and produces high-quality results:

(i) Colorization: adding color to grayscale images

(ii) Inpainting: filling user-specified image regions with realistic content

(iii) Uncropping: extending the image beyond its original frame

(iv) JPEG restoration: recovering corrupted JPEG images

In the paper, the authors explore the difference between a multi-task general model and multiple specialized models, all trained for one million iterations. The model's architecture is based on the class-conditional U-Net of Dhariwal and Nichol (2021), trained with a batch size of 1,024 images for 1M steps. The noise schedule is pre-tuned as a hyperparameter, with different schedules used for training and inference.
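
The conditioning mechanism can be sketched in a few lines: the source image is concatenated channel-wise with the noised target, and the network is trained to predict the added noise. The tiny conv net below stands in for the class-conditional U-Net (the timestep embedding is omitted), and the linear noise schedule is an illustrative assumption, since the paper treats the schedule as a tuned hyperparameter.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # illustrative linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction

# Stand-in for the class-conditional U-Net of Dhariwal and Nichol (2021).
denoiser = nn.Sequential(
    nn.Conv2d(6, 64, 3, padding=1), nn.SiLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)

def training_step(src, target):
    """src: conditioning image (e.g. the grayscale input); target: ground truth."""
    b = src.shape[0]
    t = torch.randint(0, T, (b,))                      # random timestep per sample
    a = alpha_bar[t].view(b, 1, 1, 1)
    eps = torch.randn_like(target)
    x_t = a.sqrt() * target + (1 - a).sqrt() * eps     # forward noising of the target
    eps_pred = denoiser(torch.cat([src, x_t], dim=1))  # condition by concatenation
    return ((eps_pred - eps) ** 2).mean()              # noise-prediction loss

loss = training_step(torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32))
```

At inference time the same network is applied step by step, starting from pure noise concatenated with the source image, to produce the colorized, inpainted, uncropped, or restored output.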


Palette results

Vision Transformers (ViT)

Note that although the following two models are not specifically designed for image translation, they are a clear step forward in bringing powerful models such as Transformers into the field of computer vision.

The Vision Transformer (ViT) is a modification of the Transformer architecture (Vaswani et al., 2017) developed for image classification. The model takes an image as input and outputs the probability that it belongs to each of the defined classes.

The main problem is that Transformers are designed to take one-dimensional sequences as input, not two-dimensional matrices. To deal with this, the authors propose splitting the image into small patches, treating the image as a sequence (a sentence, in NLP terms) and the patches as tokens (words).

To briefly summarize, we can divide the whole process into three stages (a code sketch follows the list):

1) Embedding: split the image into patches and flatten them → apply a linear projection → prepend a class token (this token acts as a summary of the image and is used for classification) → add position embeddings.

2) Transformer encoder blocks: feed the embedded patches through a series of Transformer encoder blocks, where the attention mechanism learns which parts of the image to focus on.

3) Classification MLP head: pass the class token through an MLP head, which outputs the final probability that the image belongs to each class.
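
Stage 1 can be written compactly in PyTorch. The patch size of 16 and embedding width of 768 follow the common ViT-Base configuration, and the stride-16 convolution is the standard shortcut for "split, flatten, and linearly project each patch".

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """ViT stage 1: patchify -> linear projection -> class token -> position embedding."""
    def __init__(self, img_size=224, patch_size=16, dim=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A convolution with kernel = stride = patch_size applies the same
        # linear projection to every flattened patch.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)     # (B, n_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)                  # prepend the class token
        return x + self.pos_embed                       # add position information

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # -> (1, 197, 768)
# Stages 2-3: feed `tokens` through nn.TransformerEncoder blocks, then pass
# tokens[:, 0] (the class token) through an MLP head to get per-class logits.
```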

Advantage of using ViT: permutation invariance. Unlike a CNN, the Transformer is not affected by translation (changes in the position of elements) within the image.

Disadvantage: training requires a large amount of labeled data (at least 14M images).

TransGAN

TransGAN is a Transformer-based GAN model designed for image generation that does not use any convolutional layers. Instead, the generator and discriminator are composed of a series of Transformers connected by upsampling and downsampling blocks.

The generator's forward pass takes a one-dimensional array of random noise samples and passes it through an MLP. Intuitively, we can think of the array as a sentence and the pixel values as words (note that a 64-element array can be reshaped into an 8✕8 single-channel image). Next, the authors apply a series of Transformer blocks, each followed by an upsampling layer that doubles the size of the array (image), as in the snippet below.
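
The reshape-and-upsample step can be illustrated directly; nearest-neighbor interpolation is used here as a simple stand-in for the paper's upsampling block.

```python
import torch
import torch.nn as nn

B, C = 1, 256
tokens = torch.randn(B, 64, C)   # 64 tokens = an 8x8 "image", as described above

# View the token sequence as a 2D grid, double its side, flatten back to tokens.
grid = tokens.transpose(1, 2).reshape(B, C, 8, 8)         # (B, C, 8, 8)
grid = nn.Upsample(scale_factor=2, mode="nearest")(grid)  # (B, C, 16, 16)
tokens_up = grid.flatten(2).transpose(1, 2)               # (B, 256, C): 4x more tokens
```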

A key feature of TransGAN is grid self-attention. At high-resolution stages (i.e., very long arrays, such as 32✕32 = 1024 tokens), applying a Transformer leads to an explosion in the cost of the self-attention mechanism, since every one of the 1024 tokens must be compared with all the others. Therefore, instead of computing the correspondence between a given token and all other tokens, grid self-attention partitions the full-dimensional feature map into several non-overlapping grids and computes token interactions within each local grid.
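
A sketch of the grid idea, assuming a square token map and PyTorch's built-in multi-head attention: tokens are regrouped into non-overlapping g✕g grids and attention is computed only inside each grid, so the cost per grid is (g²)² rather than N² over the whole map.

```python
import torch
import torch.nn as nn

def grid_self_attention(tokens, grid_size, attn):
    """tokens: (B, H*W, C) for a square HxW map; attn: nn.MultiheadAttention
    built with batch_first=True. Attention is restricted to local grids."""
    B, N, C = tokens.shape
    H = W = int(N ** 0.5)
    g = grid_size
    x = tokens.reshape(B, H // g, g, W // g, g, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, g * g, C)  # one row per grid
    out, _ = attn(x, x, x)                                 # local self-attention
    out = out.reshape(B, H // g, W // g, g, g, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 32 * 32, 64)                     # 1024 tokens at the 32x32 stage
y = grid_self_attention(x, grid_size=8, attn=attn)  # 16 grids of 64 tokens each
```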

The discriminator architecture is very similar to the ViT described earlier.


TransGAN results on different datasets

