
The unified AIGC model is here! The team founded by Huang Xutao, a leading figure in computer vision, proposes "Versatile Diffusion"

王林 | 2023-04-11 19:30:16

Recent advances in diffusion models have set impressive milestones in many generative tasks. Eye-catching works such as DALL·E 2, Imagen, and Stable Diffusion (SD) have aroused great interest in both academia and industry.

However, although these models perform amazingly well, they are essentially focused on a single type of task, such as generating images from given text. For a different type of task, a dedicated model usually has to be trained separately or built from scratch.

So can we build an "all-round" diffusion model on top of previous work and unify AIGC models? Some researchers are exploring this direction and have already made progress.

A joint team from the University of Illinois at Urbana-Champaign and the University of Texas at Austin is trying to extend the existing single-stream diffusion model into a multi-stream network called Versatile Diffusion (VD), the first unified multi-stream multi-modal diffusion framework and a step towards general-purpose generative AI.


Paper address: https://arxiv.org/abs/2211.08332

In addition to ordinary text-to-image generation, Versatile Diffusion can also generate similar images from an input image (image variation), generate text from an image, generate similar text from input text, perform semantically disentangled image editing, generate video from image and text, edit image content through the latent space, and more.

Future versions will also support more modalities such as speech, music, video, and 3D.

According to the paper, VD and its underlying framework have been shown to have the following advantages:

a) It can handle all supported sub-tasks with competitive, high-quality results.

b) It supports new extensions and applications, such as the separation of image style and semantics, and dual image-text guided generation.

c) These experiments and applications provide richer semantic insight into the generated outputs.


For training data, VD uses Laion2B-en with custom data filters as its main dataset.

First Explorations

One of the exciting findings about VD is that it can semantically enhance or reduce image styles without further supervision.

This phenomenon inspired the authors to explore a completely new field in which style and semantics can be separated for images of arbitrary style and arbitrary content.

The authors state that they are the first team to: a) interpret the semantics and style of natural images without domain restrictions, and b) decompose the latent space of a diffusion model into semantic and stylistic components.

In the image below, the authors first generate variants of the input image and then manipulate them with a semantic (left) or stylistic (right) focus.

[Figure: image variants manipulated with a semantic (left) or stylistic (right) focus]

Since VD supports both image-to-text and text-to-image, the authors tried, for the first time, to edit images from the perspective of text prompts by following these steps: a) convert the image to text; b) edit the text; c) convert the text back to an image.

In the experiments, the authors removed described content from an image and then added new content using this image-text-image (I2T2I) paradigm. Unlike inpainting or other image-editing methods that require object locations as input, VD's I2T2I does not require masks, because it automatically locates and replaces objects according to the instruction.

However, the I2T2I output image is not pixel-consistent with the input image, a consequence of performing image-to-text semantic extraction followed by text-to-image content creation.
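To make the I2T2I loop concrete, here is a minimal sketch of the three steps strung together. The functions vd_image_to_text and vd_text_to_image are hypothetical stand-ins for a VD implementation's image-captioning and text-to-image calls, and the string-based prompt edit is only for illustration.

```python
# Hypothetical sketch of the image-text-image (I2T2I) editing loop described above.
# vd_image_to_text / vd_text_to_image are placeholder names, not functions from the VD repo.

def i2t2i_edit(image, remove: str = "", add: str = ""):
    # a) Convert the input image into a text prompt (image-to-text flow).
    prompt = vd_image_to_text(image)

    # b) Edit the prompt: subtract unwanted content, then append new content.
    if remove:
        prompt = prompt.replace(remove, "").strip(" ,")
    if add:
        prompt = f"{prompt}, {add}"

    # c) Convert the edited prompt back into an image (text-to-image flow).
    return vd_text_to_image(prompt)
```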

In the demonstration below, the input image is first translated into a prompt, the prompt is then edited by subtraction (red box) and addition (green box), and finally the edited prompt is translated back into an image.

[Figure: I2T2I editing example with prompt subtraction (red box) and addition (green box)]

In addition, they are also the first team to explore generating similar text from a given text (text variation).


Network Framework

Specifically, the VD framework proposed in the paper is a multi-stream network that takes various types of data as input and as context.

The VD multi-stream multi-modal diffusion framework inherits the advantages of LDM/SD: an interpretable latent space, a modular structure, and low computational cost.

VD can jointly train multiple streams, each representing a cross-modal task. Its core design is a grouping, sharing, and switching protocol for the diffuser layers inside the network, which adapts the framework to all supported tasks and beyond.


The diffuser layers are divided into three groups: global layers, data layers, and context layers. The global layers are the time-embedding layers, the data layers are the residual blocks, and the context layers are the cross-attention layers.

This grouping reflects the layers' functions. When handling multiple tasks, the global layers are shared across all tasks, while the data layers and context layers contain multiple streams; each stream can be shared or switched depending on the current data type and context type.

For example, when processing a text-to-image request, the diffuser uses the image data stream and the text context stream; when handling an image-variation task, it uses the image data stream and the image context stream.

A single VD flow contains a VAE, a diffuser, and a context encoder, and handles one task under one data type (such as image) and one context type (such as text), for example text-to-image.
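As a rough illustration of this grouping-and-switching idea, the sketch below wires up one diffuser block with a shared time-embedding (global) layer and per-modality data and context streams that are selected per task. The dimensions, the stand-in residual MLPs, and the use of nn.MultiheadAttention are simplifications for readability, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiFlowDiffuserBlock(nn.Module):
    """Sketch of one diffuser block: a shared global (time-embedding) layer plus
    per-modality data and context streams that are switched depending on the task."""

    def __init__(self, dim=320, ctx_dim=768, time_dim=1280):
        super().__init__()
        # Global layer: time-embedding projection, shared by all tasks.
        self.time_proj = nn.Linear(time_dim, dim)
        # Data layers: one stream per data modality (simple residual MLPs here;
        # the real model uses ResBlocks for images and fully connected blocks for text).
        self.data_layers = nn.ModuleDict({
            name: nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.SiLU())
            for name in ("image", "text")
        })
        # Context layers: one cross-attention stream per context modality.
        self.context_layers = nn.ModuleDict({
            name: nn.MultiheadAttention(dim, num_heads=8, kdim=ctx_dim, vdim=ctx_dim,
                                        batch_first=True)
            for name in ("image", "text")
        })

    def forward(self, h, t_emb, context, data_type="image", context_type="text"):
        h = h + self.time_proj(t_emb)                  # shared global layer
        h = h + self.data_layers[data_type](h)         # switched data stream
        attn, _ = self.context_layers[context_type](h, context, context)
        return h + attn                                # switched context stream

# A text-to-image step uses the image data stream and the text context stream.
block = MultiFlowDiffuserBlock()
h = torch.randn(1, 64, 320)       # flattened latent features (batch, tokens, dim)
t = torch.randn(1, 1, 1280)       # time embedding
ctx = torch.randn(1, 77, 768)     # text context embeddings (e.g., from CLIP)
out = block(h, t, ctx, data_type="image", context_type="text")
```

An image-variation step would instead pass context_type="image" with CLIP image embeddings as the context.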

The multi-stream structure of Versatile Diffusion is shown in the figure below:

[Figure: the multi-stream structure of Versatile Diffusion]

Based on Versatile Diffusion, the researchers further propose a general multi-stream multi-modal framework consisting of a VAE, a context encoder, and a diffuser with three layer groups (global, data, and context layers).

Diffuser:

VD uses the widely adopted cross-attention UNet as the main architecture of the diffuser network and divides its layers into global, data, and context layers. The data and context layers each contain two streams to support image and text.

For the image data stream, VD follows LDM and uses residual blocks (ResBlocks), whose spatial dimensions gradually decrease while the number of channels gradually increases.

For the text data stream, a new fully connected residual block (FCResBlock) is used to expand the 768-dimensional text latent vector into 320×4 hidden features, following a similar channel-increasing paradigm; like a normal ResBlock, it reuses GroupNorm, SiLU, and skip connections.

[Figure: structure of the FCResBlock]

As shown in the figure above, an FCResBlock contains two sets of fully connected layers (FC), group normalization (GN), and sigmoid linear units (SiLU), where x is the input text latent code, t is the input time embedding, and hi are the intermediate features.
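Based on that description, a minimal PyTorch sketch of an FCResBlock could look like the following; the hidden width of 320×4, the GroupNorm group count, the skip projection, and the point at which t is injected are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class FCResBlock(nn.Module):
    """Fully connected residual block sketch: two FC + GroupNorm + SiLU stages,
    a time-embedding injection, and a skip connection (layout assumed)."""

    def __init__(self, in_dim=768, hidden_dim=320 * 4, time_dim=1280):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.norm1 = nn.GroupNorm(32, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.norm2 = nn.GroupNorm(32, hidden_dim)
        self.act = nn.SiLU()
        self.time_proj = nn.Linear(time_dim, hidden_dim)
        self.skip = nn.Linear(in_dim, hidden_dim)   # match widths for the residual add

    def forward(self, x, t):
        # x: (batch, 768) text latent code; t: (batch, 1280) time embedding.
        h = self.act(self.norm1(self.fc1(x)))       # first FC + GN + SiLU
        h = h + self.time_proj(t)                   # inject the time embedding
        h = self.act(self.norm2(self.fc2(h)))       # second FC + GN + SiLU
        return h + self.skip(x)                     # skip connection

x = torch.randn(4, 768)     # text latents from the text VAE
t = torch.randn(4, 1280)    # time embeddings
h = FCResBlock()(x, t)      # -> (4, 1280) intermediate features
```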

For the context group, cross-attention layers are used for both the image and text context streams, in which the context embeddings manipulate the data features through projection layers, dot products, and sigmoids.
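A bare-bones cross-attention context layer in this spirit is sketched below. It uses the standard scaled dot-product with a softmax normalization; the paragraph above mentions sigmoids, so VD's exact formulation may differ, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CrossAttentionContext(nn.Module):
    """Context-layer sketch: context embeddings modulate data features via
    projection layers and query-key dot products (softmax assumed here)."""

    def __init__(self, data_dim=320, ctx_dim=768, inner_dim=320):
        super().__init__()
        self.to_q = nn.Linear(data_dim, inner_dim, bias=False)  # projects data features
        self.to_k = nn.Linear(ctx_dim, inner_dim, bias=False)   # projects context embeddings
        self.to_v = nn.Linear(ctx_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, data_dim)
        self.scale = inner_dim ** -0.5

    def forward(self, data_feats, context):
        # data_feats: (B, N, data_dim) latent features; context: (B, M, ctx_dim) embeddings.
        q, k, v = self.to_q(data_feats), self.to_k(context), self.to_v(context)
        weights = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # dot products
        return data_feats + self.to_out(weights @ v)                         # residual update
```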

Variational Autoencoder (VAE):

VD adopts the autoencoder-KL from the earlier Latent Diffusion Model (LDM) as the image-data VAE and Optimus as the text-data VAE. Optimus consists of a BERT text encoder and a GPT-2 text decoder and can bidirectionally convert sentences to and from 768-dimensional, normally distributed latent vectors.

At the same time, Optimus exhibits the desirable VAE property of a reconstructable and interpretable text latent space. It was therefore chosen as the text VAE because it fits the prerequisites of a multi-stream multi-modal framework well.
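The per-modality VAE choice can be summarized as a simple dispatch, sketched below; ImageAutoencoderKL and OptimusTextVAE are hypothetical wrapper names standing in for LDM's autoencoder-KL and the Optimus BERT-encoder/GPT-2-decoder pair, not real imports.

```python
class MultiModalVAE:
    """Routes each data type to its own VAE, as described above: an
    autoencoder-KL for images (as in LDM/SD) and Optimus for text."""

    def __init__(self, image_vae, text_vae):
        # image_vae: e.g. an ImageAutoencoderKL wrapper (hypothetical name);
        # text_vae:  e.g. an OptimusTextVAE wrapper producing 768-d latent vectors.
        self.vaes = {"image": image_vae, "text": text_vae}

    def encode(self, x, data_type):
        # Images map to a spatial latent map; text maps to a 768-dimensional,
        # normally distributed latent vector.
        return self.vaes[data_type].encode(x)

    def decode(self, z, data_type):
        return self.vaes[data_type].decode(z)
```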

Context Encoder:

VD uses the CLIP text and image encoders as its context encoders. Unlike LDM and SD, which use only raw text embeddings as the context input, VD uses the normalized and projected embeddings that minimize the CLIP contrastive loss between text and images.

Experiments show that closer embedding spaces between context types help the model converge faster and perform better. A similar conclusion holds for DALL·E 2, which fine-tunes its text-to-image model with an additional projection layer to minimize the gap between text and image embeddings for image variation.
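The sketch below shows this distinction in practice using Hugging Face's CLIP classes: the raw per-token hidden states that SD conditions on versus the projected, L2-normalized features on which CLIP's contrastive loss is defined. The checkpoint name is only an example, and whether VD keeps per-token or pooled projected features is not spelled out here.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

# Example CLIP checkpoint; VD's exact CLIP variant may differ.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

inputs = tok(["a photo of a red fox"], padding=True, return_tensors="pt")
with torch.no_grad():
    # Raw per-token hidden states: what LDM/SD condition on.
    raw_hidden = clip.text_model(**inputs).last_hidden_state            # (1, seq_len, 768)
    # Projected, normalized embedding: the space where the CLIP contrastive
    # loss aligns text and image features.
    projected = F.normalize(clip.get_text_features(**inputs), dim=-1)   # (1, 768)
```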

Performance

The authors used earlier single-task models as baselines and compared VD's results against them: SD v1.4 is the text-to-image baseline, SD-variation is used for image variation, and BLIP is used for image-to-text.

Meanwhile, the authors also make a qualitative comparison between different VD variants, where VD-DC and VD-official are used for text-to-image, and all three VD variants are used for image variation.

The image samples of SD and VD are generated with controlled random seeds so that quality can be checked more fairly.
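Fixing the seed for such a side-by-side check might look like the following with the diffusers Stable Diffusion pipeline; the VD side is omitted since VD is not part of diffusers, and the checkpoint and prompt are only examples.

```python
import torch
from diffusers import StableDiffusionPipeline

# SD v1.4 as the text-to-image baseline (example checkpoint).
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

prompt = "a heavy spring rain over a mountain village"
# A fixed-seed generator makes the sample reproducible, so baseline and VD
# outputs can be compared under the same initial noise.
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(prompt, generator=generator).images[0]
image.save("sd_seed42.png")
```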

[Figure: text-to-image performance comparison]

Although DALL·E 2 and Imagen also achieved state-of-the-art results on these tasks, the authors skip comparisons with them because neither public code nor training details are available.

The results show that the multi-flow structure and multi-task training help VD capture contextual semantics and generate output more accurately, completing all sub-tasks well.

[Figure: image-variation performance comparison]

In addition, the image captions generated by VD contain some creative wording, whereas BLIP's generations are very short and lack descriptive detail.

[Figure: image-to-text performance comparison]

Demonstrations

[Figure: text-to-image results]

[Figure: image-variation results]

[Figure: image variants with a semantic focus]

[Figure: dual-guided generation results]

Summary

  • The authors introduce Versatile Diffusion (VD), a multi-stream multi-modal diffusion network that handles text, images, and their variations within a unified model. Based on VD, they further introduce a general multi-stream multi-modal framework that can cover new tasks and domains.
  • Through experiments, the authors find that VD produces high-quality output on all supported tasks: its text-to-image and image-variation results better capture the semantics of the context, and its image-to-text results are creative and descriptive.
  • Given the multi-stream multi-modal nature of VD, the authors introduce novel extensions and applications that may further benefit downstream users of this technology.

Team Introduction

The IFP group at the University of Illinois at Urbana-Champaign was founded by Professor Huang Xutao (Thomas S. Huang) in the 1980s, originally as the Image Formation and Processing group at the Beckman Institute for Advanced Science and Technology.


Over the years, IFP has been committed to research and innovation beyond images, including image and video coding, multi-modal human-computer interaction, multimedia annotation and search, computer vision and pattern recognition, machine learning, big data, deep learning, and high-performance computing.

IFP's current research direction is to solve multi-modal information-processing problems by combining big data, deep learning, and high-performance computing.

In addition, IFP has won several best-paper awards at top artificial intelligence conferences and has won many international competitions, including the first NIST TRECVID, the first ImageNet Challenge, and the first AI City Challenge.

Interestingly, because Professor Huang began teaching as early as the 1960s at MIT, the "members" of the IFP group have come to include his friends, his students, his students' students, and even his students' students' students.
