
Multimodal image synthesis and editing are so popular that the Max Planck Institute, Nanyang Technological University and others have published a detailed review

PHPz (forwarded)
2023-04-09 22:31:01

DALL-E 2, recently released by OpenAI, and Imagen, released by Google, have achieved striking text-to-image generation results, attracting widespread attention and spawning many interesting applications. Text-to-image generation is a typical task in the field of multimodal image synthesis and editing. Recently, researchers from the Max Planck Institute, Nanyang Technological University, and other institutions surveyed and analyzed in detail the current state of research and the future development of this broad field.

  • Paper address: https://arxiv.org/pdf/2112.13592.pdf
  • Project address: https://github.com/fnzhan/MISE

In Chapter 1, the review describes the significance and overall development of multimodal image synthesis and editing tasks, as well as the paper's contributions and overall structure.

In Chapter 2, the review organizes the field by the data modality that guides image synthesis and editing. It covers the commonly used visual guidance (such as semantic maps, keypoint maps, and edge maps), text guidance, audio guidance, and scene graph guidance, together with the corresponding methods for processing each modality and a unified representation framework.

In Chapter 3, the paper classifies current methods by the model framework used for image synthesis and editing: GAN-based methods, autoregressive methods, diffusion model methods, and neural radiance field (NeRF) methods.

Since GAN-based methods generally rely on either conditional GANs or the inversion of unconditional GANs, the paper further divides this category into intra-modal conditions (e.g., semantic maps and edge maps), cross-modal conditions (e.g., text and speech), and GAN inversion (a unified treatment across modalities), and describes each in detail.
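
To make the idea of GAN inversion concrete, here is a minimal sketch that is not taken from the review. It assumes a hypothetical frozen "generator" that is just a small linear map (a real GAN generator is a deep network), and recovers the latent code whose output matches a target by gradient descent on the reconstruction error. The names `generator`, `invert`, `WEIGHTS`, and `BIAS` are illustrative assumptions.

```python
# Toy GAN inversion: find a latent code z such that G(z) matches a target output.
# ASSUMPTION: the generator is a fixed linear map standing in for a pre-trained GAN.

def generator(z, weights, bias):
    """Hypothetical frozen generator: one output value per weight row."""
    return [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(weights, bias)]

def invert(target, weights, bias, steps=500, lr=0.05):
    """Gradient descent on z to minimize ||G(z) - target||^2."""
    z = [0.0] * len(weights[0])
    for _ in range(steps):
        out = generator(z, weights, bias)
        residual = [o - t for o, t in zip(out, target)]
        # d/dz_k of sum_i residual_i^2 = 2 * sum_i residual_i * weights[i][k]
        grad = [2 * sum(r * row[k] for r, row in zip(residual, weights))
                for k in range(len(z))]
        z = [zk - lr * g for zk, g in zip(z, grad)]
    return z

WEIGHTS = [[1.0, 0.5], [-0.5, 1.0], [0.25, 0.25]]  # illustrative values
BIAS = [0.1, -0.2, 0.0]

true_z = [0.8, -0.3]
target = generator(true_z, WEIGHTS, BIAS)   # the "image" to invert
recovered_z = invert(target, WEIGHTS, BIAS)
print(recovered_z)
```

Once a latent code has been recovered this way, editing amounts to moving that code in latent space and regenerating, which is why inversion can serve guidance from different modalities with one pre-trained generator.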

Compared with GAN-based methods, autoregressive methods handle multimodal data more naturally and can take advantage of the currently popular Transformer architecture. Autoregressive methods generally first learn a vector-quantized encoder that represents an image discretely as a sequence of tokens, and then model the distribution of those tokens autoregressively. Since data such as text and speech can also be represented as tokens and used as conditions for autoregressive modeling, a variety of multimodal image synthesis and editing tasks can be unified in a single framework.
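
The two stages described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method of any particular paper: the codebook and features are made-up 2D vectors, the text token ids are arbitrary, and `uniform_model` is a placeholder for a trained Transformer.

```python
import math

def quantize(features, codebook):
    """Stage 1: map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

def image_log_prob(image_tokens, condition_tokens, next_token_probs):
    """Stage 2: chain rule, summing log p(token_i | condition, tokens before i)."""
    total = 0.0
    for i, tok in enumerate(image_tokens):
        probs = next_token_probs(condition_tokens, image_tokens[:i])
        total += math.log(probs[tok])
    return total

# ASSUMPTION: a 4-entry codebook of 2D vectors; real codebooks are far larger.
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def uniform_model(condition_tokens, previous_image_tokens):
    """Placeholder for a Transformer: ignores its context, uniform over the codebook."""
    return [1.0 / len(CODEBOOK)] * len(CODEBOOK)

features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.9]]  # made-up patch features
image_tokens = quantize(features, CODEBOOK)
text_tokens = [7, 2, 5]  # hypothetical token ids for a text prompt
log_p = image_log_prob(image_tokens, text_tokens, uniform_model)
print(image_tokens, log_p)
```

Because the condition is just another token sequence passed to the model, swapping text for speech or another modality leaves the autoregressive part unchanged.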

Recently, the popular diffusion models have also been widely applied to multimodal synthesis and editing tasks. For example, the striking DALL-E 2 and Imagen are both built on diffusion models. Compared with GANs, diffusion models have some desirable properties, such as a stationary training objective and easy scalability. The paper classifies and analyzes existing methods in detail, dividing them into conditional diffusion models and pre-trained diffusion models.
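
One way to see why the training objective is fixed rather than adversarial is the widely used noise-prediction formulation (as in DDPM), sketched below. The review summary does not spell out this formula, so treat the sketch as an assumption about the common setup; the function names and the value of `alpha_bar_t` are illustrative.

```python
import math
import random

def add_noise(x0, alpha_bar_t, rng):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true noise and the model's prediction."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
x0 = [0.5, -1.0, 0.25]      # made-up "clean" sample
alpha_bar_t = 0.36          # illustrative cumulative noise-schedule value
x_t, eps = add_noise(x0, alpha_bar_t, rng)

# A model that always predicts zero noise versus a (hypothetical) perfect model.
loss_zero_guess = noise_prediction_loss(eps, [0.0] * len(eps))
loss_perfect = noise_prediction_loss(eps, eps)
print(loss_zero_guess, loss_perfect)
```

The regression target `eps` does not depend on a second network being trained at the same time, in contrast to a GAN, where the generator's loss shifts as the discriminator changes.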

The methods above mainly focus on multimodal synthesis and editing of 2D images. With the recent rapid development of Neural Radiance Fields (NeRF), 3D-aware multimodal synthesis and editing has attracted growing attention. It is a more challenging task because multi-view consistency must be taken into account. The paper classifies and summarizes existing work under three approaches: per-scene optimization NeRF, generative NeRF, and NeRF inversion.
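
For readers unfamiliar with NeRF, the sketch below shows the standard volume rendering quadrature used to turn densities and colors sampled along one camera ray into a pixel color. It is a simplified illustration: in an actual NeRF the densities and colors come from a learned network, whereas here they are made-up constants, and `render_ray` is an illustrative name.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite samples along one ray: weight_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    transmittance = 1.0          # T_i: fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, weights

# ASSUMPTION: three samples with made-up densities and RGB colors.
densities = [0.0, 5.0, 50.0]
colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
deltas = [0.1, 0.1, 0.1]     # spacing between consecutive samples

pixel, weights = render_ray(densities, colors, deltas)
print(pixel, weights)
```

Because every viewpoint is rendered from the same underlying density and color field, edits made to that field appear consistently across views, which is the multi-view consistency requirement mentioned above.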

The review then compares and discusses these four families of methods. Overall, current state-of-the-art approaches tend to be built on autoregressive and diffusion models rather than GANs. The application of NeRF to multimodal synthesis and editing tasks opens a new avenue for research in this field.

In Chapter 4, the review gathers popular datasets in the field of multimodal synthesis and editing together with their modality annotations, and quantitatively compares current methods on a typical task for each modality (semantic image synthesis, text-to-image synthesis, and audio-guided image editing).

In Chapter 5, the review discusses and analyzes the current challenges and future directions of the field, including large-scale multimodal datasets, accurate and reliable evaluation metrics, efficient network architectures, and the development of 3D-aware methods.

In Chapters 6 and 7, the review elaborates on the potential social impact of this field and summarizes the content and contributions of the article respectively.

Statement:
This article is reproduced from 51cto.com. In case of infringement, please contact admin@php.cn to request removal.