Five promising AI models for image translation
Image-to-image translation
According to the definition provided by Solanki, Nayyar, and Naved in their paper, image-to-image translation is the process of converting images from one domain to another, with the goal of learning a mapping between input images and output images.
In other words, we want the model to transform an image a into another image b by learning a mapping function f.
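As a trivial illustration of this idea (not any of the models discussed below), the sketch assumes paired training data and a tiny convolutional network standing in for f; all layer sizes and the L1 objective are arbitrary choices made for the example.

```python
# Toy sketch: learn a mapping f from domain-A images to domain-B images.
# Sizes, architecture, and loss are illustrative assumptions only.
import torch
import torch.nn as nn

class ToyTranslator(nn.Module):
    def __init__(self):
        super().__init__()
        # A tiny fully convolutional "f": image a -> image b
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, a):
        return self.net(a)

f = ToyTranslator()
opt = torch.optim.Adam(f.parameters(), lr=1e-3)
a = torch.rand(8, 3, 64, 64)   # images from domain A (dummy data)
b = torch.rand(8, 3, 64, 64)   # paired images from domain B (dummy data)
loss = nn.functional.l1_loss(f(a), b)   # push f(a) towards the paired b
loss.backward()
opt.step()
```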
Some people may wonder what use these models are and what relevance they have in the world of artificial intelligence. The applications are many, and they are not limited to art or graphic design. Being able to take an image and convert it into another image to create synthetic data (such as a segmented image) is very useful for training self-driving car models. Another tested application is map design, where a model can perform the transformation in both directions (satellite view to map and vice versa). Image translation can also be applied to architecture, with models making recommendations on how to complete unfinished projects.
One of the most compelling applications of image translation is transforming a simple drawing into a beautiful landscape or painting.
5 Most Promising AI Models for Image Translation
Over the past few years, several methods have been developed to solve the problem of image-to-image translation by leveraging generative models. The most commonly used methods are based on the following architectures:
- Generative Adversarial Network (GAN)
- Variational Autoencoder (VAE)
- Diffusion Models
- Transformers
Pix2Pix
Pix2Pix is a conditional GAN-based model. This means that its architecture is composed of a Generator network (G) and a Discriminator network (D). Both networks are trained in an adversarial game, where G's goal is to generate new images that resemble the dataset, and D has to decide whether an image is generated (fake) or comes from the dataset (real).
The main differences between Pix2Pix and other GAN models are: (1) the Generator takes an image as input to start the generation process, whereas ordinary GANs use random noise; (2) Pix2Pix is a fully supervised model, which means the dataset consists of pairs of images from the two domains.
The architecture described in the paper is defined by a U-Net for the generator and a Markovian Discriminator or Patch Discriminator for the discriminator:
- U-Net: composed of two modules (downsampling and upsampling). The input image is reduced to a set of smaller feature maps using convolutional layers, which are then upsampled via transposed convolutions until the original input dimensions are reached. Skip connections link the downsampling and upsampling paths.
- Patch Discriminator: a convolutional network whose output is a matrix, where each element is the evaluation of one part (patch) of the image. The training objective also includes the L1 distance between the generated and real images, to ensure the generator learns the correct mapping for a given input image. It is called Markovian because it relies on the assumption that pixels from different patches are independent. (A minimal sketch of how these pieces fit together follows the list.)
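The toy sketch below illustrates the combination: a generator that takes an image (not noise) as input, a patch-based discriminator, and a generator loss combining an adversarial term with the L1 distance to the paired target. The layer sizes, the loss weight, and the stand-in generator are illustrative assumptions, not the paper's exact configuration.

```python
# Pix2Pix-style setup (illustrative): PatchGAN discriminator + adversarial/L1 loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """Outputs one real/fake logit per image patch instead of a single scalar."""
    def __init__(self, in_ch=6):  # input image and target/generated image concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

def generator_loss(D, G, x, y, lam=100.0):
    """Adversarial term + lambda * L1 distance to the paired target."""
    fake = G(x)
    patch_logits = D(x, fake)
    adv = F.binary_cross_entropy_with_logits(
        patch_logits, torch.ones_like(patch_logits))
    return adv + lam * F.l1_loss(fake, y)

# Stand-in for the U-Net generator, just to exercise the loss:
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())
D = PatchDiscriminator()
x = torch.rand(2, 3, 64, 64)   # input-domain image
y = torch.rand(2, 3, 64, 64)   # paired target-domain image
print(generator_loss(D, G, x, y))
```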
Pix2Pix results
Unsupervised Image to Image Translation (UNIT)
In Pix2Pix, the training process is fully supervised (i.e., we need pairs of images as input). The purpose of the UNIT method is to learn a function that maps image A to image B without training on paired images.
The model starts by assuming that the two domains (A and B) share a common latent space (Z). Intuitively, we can think of this latent space as an intermediate stage between image domains A and B. Using the painting-to-photo example, the same latent code can be decoded backwards into the painting or forwards into the photorealistic image (see Figure X).
In the figure: (a) the shared latent space. (b) The UNIT architecture: X1, X2 are images, E1, E2 encoders, G1, G2 generators, and D1, D2 discriminators. Dashed lines represent shared layers between networks.
The UNIT model is built on a pair of VAE-GAN architectures (see figure above), where the last layers of the encoders (E1, E2) and the first layers of the generators (G1, G2) are shared.
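The sketch below shows only the shared-latent translation path (encode with one domain's encoder, decode with the other domain's generator). The VAE reparameterisation, adversarial and cycle losses, and the actual weight sharing between networks are omitted, and all layer sizes are assumptions for illustration.

```python
# Shared-latent-space intuition behind UNIT (heavily simplified sketch).
import torch
import torch.nn as nn

def make_encoder():
    return nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())

def make_generator():
    return nn.Sequential(nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

E1, E2 = make_encoder(), make_encoder()     # encoders for domain A and domain B
G1, G2 = make_generator(), make_generator() # generators (decoders) into each domain

def translate_a_to_b(x_a):
    z = E1(x_a)          # map a domain-A image into the shared latent space Z
    return G2(z)         # decode the same latent code with domain-B's generator

def translate_b_to_a(x_b):
    z = E2(x_b)
    return G1(z)

x_a = torch.rand(1, 3, 64, 64)
x_b_fake = translate_a_to_b(x_a)   # A -> B translation through the shared code
# In the full model, the high-level layers of E1/E2 and G1/G2 share weights.
```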
UNIT results
Palette
Palette is a conditional diffusion model developed by a Google Research team in Canada. The model is trained to perform four different image-to-image tasks and produces high-quality results:
(i) Colorization: adding color to grayscale images
(ii) Inpainting: filling user-specified image regions with realistic content
(iii) Uncropping: extending the image frame
(iv) JPEG restoration: recovering corrupted JPEG images
In the paper, the authors compare a multi-task general model with multiple task-specialized models, both trained for one million iterations. The architecture is based on the class-conditional U-Net model of Dhariwal and Nichol (2021), trained with a batch size of 1024 images for 1M steps. The noise schedule is treated as a hyperparameter, with different schedules used for training and inference.
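As a hedged illustration of the general mechanism (not Palette's actual implementation), the sketch below shows one training step of a toy image-conditional diffusion model for colorization: the grayscale input is concatenated channel-wise with a noised color target, and a small network is trained to predict the added noise. The tiny network, the linear noise schedule, and all tensor sizes are assumptions, and time-step conditioning is omitted for brevity.

```python
# One training step of a toy image-conditional diffusion model (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # simple linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)

# Denoiser stand-in: input = conditioning image (1 ch) + noisy target (3 ch).
# A real model would be a large U-Net also conditioned on the timestep t.
denoiser = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

def training_step(x_cond, x_target):
    t = torch.randint(0, T, (x_target.shape[0],))      # random timestep per sample
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x_target)
    x_noisy = a_bar.sqrt() * x_target + (1 - a_bar).sqrt() * noise
    pred_noise = denoiser(torch.cat([x_cond, x_noisy], dim=1))
    return F.mse_loss(pred_noise, noise)                # learn to predict the noise

gray = torch.rand(4, 1, 32, 32)    # conditioning: grayscale input
color = torch.rand(4, 3, 32, 32)   # target: color image
loss = training_step(gray, color)
```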
Palette results
Vision Transformers (ViT)
Please note that although the following two models were not specifically designed for image-to-image translation, they represent a clear step forward in bringing powerful models such as Transformers into the field of computer vision.
The Vision Transformer (ViT) is a modification of the Transformer architecture (Vaswani et al., 2017) developed for image classification. The model takes an image as input and outputs the probability of it belonging to each of the defined classes.
The main problem is that Transformers are designed to take one-dimensional sequences as input, not two-dimensional matrices. To get around this, the authors propose splitting the image into small patches, treating the image as a sequence (or a sentence in NLP) and the patches as tokens (or words).
To briefly summarize, we can divide the whole process into 3 stages:
1) Embedding: split the image into patches and flatten them → apply a linear projection → prepend a class token (this token acts as an image summary used for classification) → add position embeddings.
2) Transformer encoder blocks: feed the embedded patches through a series of Transformer encoder blocks. The attention mechanism learns which parts of the image to focus on.
3) Classification MLP head: pass the class token through an MLP head, which outputs the final probability that the image belongs to each class. (A minimal code sketch of these three stages follows the list.)
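As a rough illustration (not the original implementation), the sketch below wires the three stages together with generic PyTorch building blocks; the patch size, embedding dimension, depth, and number of classes are arbitrary assumptions.

```python
# Minimal ViT-style classifier sketch (toy sizes; the real ViT is much larger).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=32, patch=8, dim=64, depth=2, heads=4, n_classes=10):
        super().__init__()
        n_patches = (img // patch) ** 2
        # 1) Embedding: patchify + flatten + linear projection
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        # 2) Transformer encoder blocks
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # 3) Classification MLP head applied to the class token
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        tokens = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_emb
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])  # class token summarises the image

logits = TinyViT()(torch.rand(2, 3, 32, 32))  # (2, 10) class scores
```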
Advantages of using ViT: permutation invariance. Unlike CNNs, the Transformer is not affected by translation (changes in the position of elements) within the image.
Disadvantages: a large amount of labeled data is required for training (at least 14M images).
TransGAN
TransGAN is a Transformer-based GAN model designed for image generation that does not use any convolutional layers. Instead, the generator and discriminator are composed of a series of Transformer blocks connected by upsampling and downsampling blocks.
The forward pass of the generator takes a one-dimensional array of random noise samples and passes it through an MLP. Intuitively, we can think of the array as a sentence and the pixel values as words (note that an array of 64 elements can be reshaped into an 8✕8, single-channel image). Next, the authors apply a series of Transformer blocks, each followed by an upsampling layer that doubles the size of the array (image).
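To make the tokens-as-pixels intuition concrete, here is a hedged sketch of such a forward pass using generic PyTorch layers. The noise size, token dimension, nearest-neighbour upsampling, and single shared Transformer block are simplifications rather than the paper's exact design.

```python
# Rough sketch of a TransGAN-style generator forward pass (toy sizes).
import torch
import torch.nn as nn

dim, h = 64, 8                      # token dimension and initial 8x8 "image"
mlp = nn.Linear(128, h * h * dim)   # noise vector -> 8*8 tokens of size dim
block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

def upsample_tokens(tokens, h):
    """Double the spatial resolution by treating tokens as an h x h feature map."""
    b, n, c = tokens.shape
    fmap = tokens.transpose(1, 2).reshape(b, c, h, h)
    fmap = nn.functional.interpolate(fmap, scale_factor=2, mode="nearest")
    return fmap.flatten(2).transpose(1, 2), 2 * h

z = torch.randn(1, 128)                   # random noise "sentence"
tokens = mlp(z).reshape(1, h * h, dim)    # 64 tokens = an 8x8 image
tokens = block(tokens)                    # transformer stage at 8x8
tokens, h = upsample_tokens(tokens, h)    # now 16x16 = 256 tokens
tokens = block(tokens)                    # next transformer stage
to_rgb = nn.Conv2d(dim, 3, 1)             # project tokens to RGB pixels
img = to_rgb(tokens.transpose(1, 2).reshape(1, dim, h, h))
```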
A key feature of TransGAN is grid self-attention. For high-resolution images (i.e., very long arrays, e.g., 32✕32 = 1024 tokens), applying a Transformer can lead to an explosion in the cost of the self-attention mechanism, since every token in the 1024-element array must be compared with every other token. Therefore, instead of computing the correspondence between a given token and all other tokens, grid self-attention partitions the full-size feature map into several non-overlapping grids and computes token interactions within each local grid.
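A rough sketch of this partitioning idea is shown below, using a generic multi-head attention layer; the grid size, feature dimensions, and the absence of the paper's other components (e.g., positional encodings) are assumptions made for illustration.

```python
# Grid self-attention idea: run attention independently inside non-overlapping grids.
import torch
import torch.nn as nn

def grid_self_attention(fmap, attn, grid=8):
    """fmap: (B, C, H, W); attention is computed independently per grid cell."""
    b, c, h, w = fmap.shape
    # Split the map into (H/grid) x (W/grid) non-overlapping cells of grid x grid tokens
    cells = fmap.unfold(2, grid, grid).unfold(3, grid, grid)   # (B, C, nh, nw, g, g)
    nh, nw = cells.shape[2], cells.shape[3]
    cells = cells.permute(0, 2, 3, 4, 5, 1).reshape(b * nh * nw, grid * grid, c)
    out, _ = attn(cells, cells, cells)                         # local attention per cell
    out = out.reshape(b, nh, nw, grid, grid, c).permute(0, 5, 1, 3, 2, 4)
    return out.reshape(b, c, h, w)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.rand(1, 64, 32, 32)          # 32x32 feature map = 1024 tokens
y = grid_self_attention(x, attn)       # attention only inside 8x8 grids
```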
The discriminator architecture is very similar to the ViT cited earlier.
TransGAN results on different datasets