Five promising AI models for image translation
According to the definition given by Solanki, Nayyar, and Naved in their paper, image-to-image translation is the process of converting images from one domain to another, with the goal of learning a mapping between input images and output images.
In other words, we want the model to transform an image a into another image b by learning a mapping function f.
Some people may wonder what these models are good for and how relevant they are in the world of artificial intelligence. There are many applications, and they are not limited to art or graphic design. For example, being able to take an image and convert it into another image to create synthetic data (such as a segmented image) is very useful for training self-driving car models. Another tested application is map design, where the model is able to perform the transformation in both directions (satellite view to map and vice versa). Image-to-image translation can also be applied to architecture, with models making recommendations on how to complete unfinished projects.
One of the most compelling applications of image translation is to transform a simple drawing into a beautiful landscape or painting.
Over the past few years, several methods have been developed to solve the image-to-image translation problem by leveraging generative models. The most commonly used methods are based on the following architectures:
Pix2Pix is a model based on a conditional GAN. Its architecture therefore consists of a generator network (G) and a discriminator network (D). The two networks are trained in an adversarial game, where G's goal is to generate new images that resemble those in the dataset, and D has to decide whether an image is generated (fake) or comes from the dataset (real).
The main differences between Pix2Pix and other GAN models are: (1) the generator takes an image as input to start the generation process, whereas ordinary GANs start from random noise; (2) Pix2Pix is a fully supervised model, which means that the dataset consists of pairs of images from the two domains.
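The adversarial game can be made concrete with a small numerical sketch. The snippet below is an illustration rather than the authors' code: it takes discriminator outputs as plain probabilities and computes binary cross-entropy losses for D and G. The L1 reconstruction term with a weight of 100 follows the objective reported in the Pix2Pix paper; the function names and the halving of the discriminator loss are assumptions made for this example.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for a single probability and a 0/1 target."""
    eps = 1e-12  # clamp to avoid log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake):
    """d_real: D's outputs on real (input, target) pairs.
    d_fake: D's outputs on (input, generated) pairs."""
    real_term = mean(bce(p, 1.0) for p in d_real)  # real pairs should score 1
    fake_term = mean(bce(p, 0.0) for p in d_fake)  # generated pairs should score 0
    return 0.5 * (real_term + fake_term)  # halving is an assumed convention


def generator_loss(d_fake, generated, target, l1_weight=100.0):
    """G wants D to score its output as real and to stay close to the paired target."""
    adversarial = mean(bce(p, 1.0) for p in d_fake)
    l1 = mean(abs(g - t) for g, t in zip(generated, target))
    return adversarial + l1_weight * l1
```

Because the dataset is paired, the generator can be penalized directly for deviating from the known target image, which is what the L1 term does here.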
The architecture described in the paper uses a U-Net for the generator and a Markovian discriminator (also called a patch discriminator, or PatchGAN) for the discriminator.
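A patch discriminator judges local image patches instead of the whole image, and the patch size is simply the receptive field of one output unit. The sketch below computes that receptive field from a list of (kernel size, stride) pairs. The layer list is an assumption: it is the configuration commonly used in implementations of the 70x70 PatchGAN (4x4 convolutions, three with stride 2 followed by two with stride 1), and the names are made up for this example.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output.

    Walks backward from a single output unit and grows the region of the
    input that it can see.
    """
    size = 1
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


# Assumed layer configuration of the 70x70 PatchGAN discriminator.
PATCHGAN_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

print(receptive_field(PATCHGAN_70))  # 70
```

Each value in the discriminator's output map therefore rates one 70x70 region of the input as real or fake.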
Pix2Pix results
In Pix2Pix, the training process is fully supervised (i.e. we need paired input images). The purpose of the UNIT method is to learn a function that maps image A to image B without training on paired images.
The model starts from the assumption that the two domains (A and B) share a common latent space (Z). Intuitively, we can think of this latent space as an intermediate stage between image domains A and B. So, using the painting-to-image example, the same latent code can be decoded backward into a painting or forward into a stunning image (see Figure X).
In the figure: (a) shared latent space. (b) UNIT architecture: X1 is an image, E1 and E2 are encoders, G1 and G2 generators, D1 and D2 discriminators. Dashed lines represent layers shared between networks.
The UNIT model is built from a pair of VAE-GANs (see above), in which the last layer of the encoders (E1, E2) and the first layer of the generators (G1, G2) are shared.
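The weight sharing can be illustrated with a deliberately tiny toy. In the sketch below every "layer" is just a multiplication by a constant, which is an assumption made purely for readability; the real model uses convolutional layers and a VAE-GAN training objective. What the example does show is the wiring: both encoders end in the same layer object, both generators start with the same layer object, and translation from domain 1 to domain 2 is G2 applied to E1's latent code.

```python
class Scale:
    """Toy stand-in for a network layer: multiplies every value by a weight."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, values):
        return [self.weight * v for v in values]


class Branch:
    """A sequence of layers applied one after another."""

    def __init__(self, layers):
        self.layers = layers

    def __call__(self, values):
        for layer in self.layers:
            values = layer(values)
        return values


shared_encoder_layer = Scale(0.5)    # last encoder layer, shared by E1 and E2
shared_generator_layer = Scale(2.0)  # first generator layer, shared by G1 and G2

E1 = Branch([Scale(1.0), shared_encoder_layer])
E2 = Branch([Scale(4.0), shared_encoder_layer])
G1 = Branch([shared_generator_layer, Scale(1.0)])
G2 = Branch([shared_generator_layer, Scale(0.25)])


def translate_1_to_2(x1):
    """Encode a domain-1 sample into the shared latent space, decode as domain 2."""
    return G2(E1(x1))


print(translate_1_to_2([4.0, 8.0]))  # [1.0, 2.0]
```

Because the shared layers are the same objects, anything learned about the latent space through one domain is automatically available to the other.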
UNIT results
Palette is a conditional diffusion model developed by a Google Research team in Canada. The model is trained to perform four different image translation tasks and produces high-quality results:
(i) Colorization: Adding color to grayscale images
(ii) Inpainting: Filling the user-specified image area with realistic content
(iii) Uncropping: Enlarging the image frame
(iv) JPEG Recovery: Recovering damaged JPEG images
In the paper, the authors compare a multi-task generalist model with multiple specialized models, each trained for one million iterations. The architecture is based on the class-conditional U-Net of Dhariwal and Nichol (2021), trained with a batch size of 1024 images for 1M steps. The noise schedules are treated as hyperparameters and tuned, with different schedules used for training and inference.
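To show what a noise schedule is, here is a minimal sketch of the forward (noising) process of a diffusion model. The linear schedule and its endpoints are DDPM-style defaults chosen for illustration and are not the values used by Palette; all function names are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (assumed DDPM-style values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal left at each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Sample a noisy version of the pixels at the noise level given by alpha_bar."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
slightly_noisy = add_noise([0.5, -0.5], alpha_bars[0], rng)   # almost the original
mostly_noise = add_noise([0.5, -0.5], alpha_bars[-1], rng)    # almost pure noise
```

Changing the schedule changes how quickly the signal is destroyed, which is why it makes sense to tune it and to use a different one at inference time.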
Palette results
Note that although the following two models are not specifically designed for image translation, they are a clear step forward in bringing powerful models such as Transformers into the field of computer vision.
Vision Transformer (ViT) is a modification of the Transformer architecture (Vaswani et al., 2017) developed for image classification. The model takes an image as input and outputs the probability that it belongs to each defined class.
The main problem is that Transformers are designed to take one-dimensional sequences as input, not two-dimensional matrices. To handle this, the authors propose splitting the image into small patches, treating the image as a sequence (a sentence in NLP) and the patches as tokens (words).
To briefly summarize, we can divide the whole process into 3 stages:
1) Embedding: split the image into patches and flatten them → apply a linear transformation → add a class token (this token acts as a summary of the image that is used for classification) → add position embeddings
2) Transformer encoder blocks: pass the embedded patches through a series of Transformer encoder blocks. The attention mechanism learns which parts of the image to focus on.
3) Classification MLP head: pass the class token through the MLP head, which outputs the final probability that the image belongs to each class.
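The first part of the embedding stage can be sketched in a few lines. The example below only shows how an image becomes a sequence of flattened patches with a class token in front; the linear projection and the position embeddings are left out, and the all-zero class token is a placeholder for what is a learned vector in a real model. The function name is hypothetical.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


# A 4x4 single-channel image whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, 2)          # 4 patches of 4 values each
CLASS_TOKEN = [0, 0, 0, 0]           # placeholder; learned in a real model
sequence = [CLASS_TOKEN] + tokens    # what the encoder blocks would receive

print(tokens[0])      # [0, 1, 4, 5]
print(len(sequence))  # 5
```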
Advantages of ViT: permutation invariance. Compared to a CNN, the Transformer is less affected by translation (a change in the position of elements) in the image.
Disadvantages: a large amount of labeled data is required for training (at least 14M images).
TransGAN is a Transformer-based GAN model designed for image generation that does not use any convolutional layers. Instead, the generator and discriminator are composed of a series of Transformer blocks connected by upsampling and downsampling blocks.
The forward pass of the generator takes a one-dimensional array of random noise samples and passes it through an MLP. Intuitively, we can think of the array as a sentence and the pixel values as words (note that an array of 64 elements can be reshaped into an 8✕8 image with 1 channel). Next, the authors apply a series of Transformer blocks, each followed by an upsampling layer that doubles the size of the array (image).
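The reshaping and size-doubling steps can be sketched as follows. Nearest-neighbour repetition is used here only because it is the simplest way to double a grid; it is a stand-in and not necessarily the upsampling operator used in TransGAN, and the Transformer blocks between the upsampling steps are omitted. Function names are hypothetical.

```python
def reshape_to_grid(values, side):
    """View a flat sequence of side*side tokens as a side x side grid."""
    if len(values) != side * side:
        raise ValueError("sequence length must equal side * side")
    return [values[r * side:(r + 1) * side] for r in range(side)]


def upsample_2x(grid):
    """Double height and width by repeating every value (nearest neighbour)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


tokens = list(range(64))            # stand-in for the MLP output
grid = reshape_to_grid(tokens, 8)   # 8 x 8 "image" with one channel
bigger = upsample_2x(grid)          # 16 x 16

print(len(grid), len(grid[0]))      # 8 8
print(len(bigger), len(bigger[0]))  # 16 16
```

Each doubling multiplies the number of tokens by four, which is what makes attention expensive at higher resolutions.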
A key feature of TransGAN is grid self-attention. At higher resolutions (i.e. very long sequences, 32✕32 = 1024 tokens), applying the Transformer makes the cost of self-attention explode, because each of the 1024 tokens has to be compared with every other token, so the cost grows quadratically with sequence length. Therefore, instead of computing the correspondence between a given token and all other tokens, grid self-attention divides the full-size feature map into several non-overlapping grids and computes the token interactions only within each local grid.
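A quick count shows how much work grid self-attention saves. The sketch below counts token-to-token comparisons for full attention and for attention restricted to non-overlapping grids; the grid size of 16 is an assumption chosen for the example, and the function names are hypothetical.

```python
def full_attention_pairs(side):
    """Every token attends to every token of a side x side feature map."""
    tokens = side * side
    return tokens * tokens


def grid_attention_pairs(side, grid_side):
    """Tokens attend only to tokens inside their own grid_side x grid_side grid."""
    if side % grid_side:
        raise ValueError("feature map must divide evenly into grids")
    num_grids = (side // grid_side) ** 2
    tokens_per_grid = grid_side * grid_side
    return num_grids * tokens_per_grid ** 2


print(full_attention_pairs(32))      # 1048576
print(grid_attention_pairs(32, 16))  # 262144
```

With four 16✕16 grids the number of comparisons drops by a factor of four, and the saving grows as the grids get smaller relative to the feature map.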
The discriminator architecture is very similar to the ViT described earlier.
TransGAN results on different datasets