The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed "Almighty Diffusion"
Recent advances in diffusion models have set impressive milestones in many generative tasks. Attractive works such as DALL·E 2, Imagen, and Stable Diffusion (SD) have aroused great interest in both academia and industry.
However, although these models perform amazingly, they basically focus on a single type of task, such as generating images from given text. Different types of tasks often require dedicated training or building a new model from scratch.
So can we build an "all-round" diffusion model on top of previous models and unify the AIGC landscape? Some researchers are exploring this direction and have already made progress.
A joint team from the University of Illinois at Urbana-Champaign and the University of Texas at Austin is trying to extend the existing single-stream diffusion pipeline into a multi-stream network called Versatile Diffusion (VD), the first unified multi-stream multi-modal diffusion framework and a step towards general generative artificial intelligence.
Paper address: https://arxiv.org/abs/2211.08332
In addition to ordinary text-to-image generation, Versatile Diffusion can also generate similar images from an input image, generate text from an input image, generate similar text from input text, perform semantically disentangled image editing, generate video from an image and text, edit image content based on the latent space, and more.
Future versions will also support more modalities such as speech, music, video, and 3D.
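A minimal usage sketch of several of these modes follows, assuming the Hugging Face diffusers port of the model (checkpoint shi-labs/versatile-diffusion). The pipeline class, method names, and the local file name come from that port or are placeholders, not from the paper, and may differ across diffusers versions.

```python
# Minimal sketch, assuming the Hugging Face diffusers port of Versatile Diffusion
# (checkpoint "shi-labs/versatile-diffusion"); API names may differ across versions.
import torch
from diffusers import VersatileDiffusionPipeline
from PIL import Image

pipe = VersatileDiffusionPipeline.from_pretrained(
    "shi-labs/versatile-diffusion", torch_dtype=torch.float16
).to("cuda")

prompt = "a red sports car by the sea"
init_image = Image.open("car.jpg").convert("RGB")   # placeholder local image

# Text-to-image
out_t2i = pipe.text_to_image(prompt).images[0]

# Image variation: generate an image similar to the input image
out_var = pipe.image_variation(init_image).images[0]

# Dual-guided generation: condition on the image and the text at the same time
out_dual = pipe.dual_guided(
    prompt=prompt, image=init_image, text_to_image_strength=0.75
).images[0]
out_dual.save("dual_guided.png")
```

In the paper's terminology, each of these calls exercises a different flow of the same multi-stream network.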
According to the paper, VD and its underlying framework have been shown to have the following advantages:
a) It can handle all supported subtasks with competitive, high-quality results.
b) It supports new extensions and applications, such as separating image style from semantics and dual-guided (image plus text) generation.
c) These experiments and applications provide richer semantic insight into the generated output.
For training data, VD uses Laion2B-en with custom data filters as its main dataset.
First exploration: disentangling style and semantics

One of the exciting findings of VD is that it can semantically enhance or reduce image style without further supervision.
This phenomenon inspired the authors to explore a completely new area, in which the separation of style and semantics can be performed on images with arbitrary styles and arbitrary content.
The authors state that they are the first team to explore: a) interpreting the semantics and style of natural images without domain specification; b) semantic and stylistic decomposition of a diffusion model's latent space.
In the figure below, the authors first generate variants of the input image and then manipulate them with a semantic (left) or stylistic (right) focus.
Since VD supports both image-to-text and text-to-image, the author team tried, for the first time, editing images from the perspective of the text prompt by following these steps: a) convert the image to text, b) edit the text, c) convert the text back to an image.
In the experiments, the authors removed described content from the image and then added new content using this image-text-image (I2T2I) paradigm. Unlike inpainting or other image editing methods that require object locations as input, VD's I2T2I does not require masks, because it locates and replaces objects automatically according to the instruction.
However, the output image of I2T2I is not pixel-aligned with the input image, which is a consequence of image-to-text semantic extraction followed by text-to-image content creation.
In the display below, the input image is first translated into a prompt; the prompt is then edited by subtraction (red box) and addition (green box); finally, the edited prompt is translated back into an image.
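The I2T2I loop itself is easy to express in code. The sketch below is illustrative only: vd_image_to_text and vd_text_to_image are hypothetical stand-ins for VD's image-captioning and text-to-image flows, and the string edit mimics the subtraction/addition step described above.

```python
from PIL import Image

# Hypothetical stand-ins for VD's image-to-text and text-to-image flows;
# the real model calls are not shown in the article.
def vd_image_to_text(image: Image.Image) -> str:
    raise NotImplementedError("replace with VD's image-captioning flow")

def vd_text_to_image(prompt: str) -> Image.Image:
    raise NotImplementedError("replace with VD's text-to-image flow")

def i2t2i_edit(image: Image.Image, remove: str, add: str) -> Image.Image:
    """Image -> text -> edited text -> image, as described above."""
    prompt = vd_image_to_text(image)              # a) convert the image to text
    prompt = prompt.replace(remove, add).strip()  # b) edit the text (subtract / add content)
    return vd_text_to_image(prompt)               # c) convert the text back to an image

# Example: swap the described object without providing any mask.
# edited = i2t2i_edit(Image.open("beach.jpg"), remove="a dog", add="a red umbrella")
```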
In addition, they are also the first team to explore generating similar text based on given text.
Specifically, the VD framework proposed in the paper is a multi-stream network that takes various types of data as input and as context.
The VD multi-stream multi-modal diffusion framework inherits the advantages of LDM/SD: an interpretable latent space, a modal structure, and low computational cost.
VD can jointly train multiple flows, each representing a cross-modal task. Its core design lies in the grouping, sharing, and switching protocols applied to the diffuser layers within the network, adapting the framework to all supported tasks and beyond.
The diffuser layers are divided into three groups: global layers, data layers, and context layers. The global layers are the time-embedding layers, the data layers are the residual blocks, and the context layers are the cross-attention layers.
This grouping corresponds to each layer's function. When handling multiple tasks, the global layers are shared across all tasks, while the data layers and context layers contain multiple streams. Each stream can be shared or switched depending on the current data and context types.
For example, when processing a text-to-image request, the diffuser uses the image data layers and the text context layers; when handling an image-variation task, it uses the image data layers and the image context layers.
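As an illustration of this switching protocol, the sketch below routes a single denoising step through shared global layers and per-task data/context streams; the stream table and the dummy layers are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of VD's grouping/sharing/switching
# protocol: global layers are shared, data/context layers are picked per task.

LAYER_STREAMS = {
    # task: (data-layer stream, context-layer stream)
    "text_to_image":   ("image", "text"),
    "image_variation": ("image", "image"),
    "image_to_text":   ("text",  "image"),
    "text_variation":  ("text",  "text"),
}

# Dummy stand-ins for the real layers, just to make the routing runnable.
layers = {
    "global":  lambda x, t: x + t,                                  # shared time-embedding layer
    "data":    {"image": lambda h, t: h, "text": lambda h, t: h},   # ResBlock / FCResBlock streams
    "context": {"image": lambda h, c: h, "text": lambda h, c: h},   # cross-attention streams
}

def forward_diffuser(task, x, t, context):
    data_stream, ctx_stream = LAYER_STREAMS[task]
    h = layers["global"](x, t)                     # shared across all tasks
    h = layers["data"][data_stream](h, t)          # image or text data layer
    h = layers["context"][ctx_stream](h, context)  # image or text context layer
    return h

print(forward_diffuser("text_to_image", 1.0, 0.5, "a photo of a cat"))
```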
A single VD flow contains a VAE, a diffuser, and a context encoder, and handles one task (such as text-to-image) under one data type (such as image) and one context type (such as text).
The multi-stream structure of Versatile Diffusion is shown in the figure below:
Based on Versatile Diffusion, the researchers further propose a general multi-stream multi-modal framework, which includes a VAE, a context encoder, and a diffuser containing three layer groups (i.e. global, data, and context layers).
Diffuser:
VD uses the widely adopted cross-attention UNet as the main architecture of the diffuser network and divides its layers into global layers, data layers, and context layers. The data and context layers each have two streams to support images and text.
For the image data stream, VD follows LDM and uses residual blocks (ResBlocks), whose spatial dimensions gradually decrease while the number of channels gradually increases.
For the text data stream, a new fully connected residual block (FCResBlock) is used to expand the 768-dimensional text latent vector into 320*4 hidden features; it follows a similar channel-increasing paradigm and reuses GroupNorm, SiLU, and skip connections, just like a normal ResBlock.
As shown in the figure above, the FCResBlock contains two sets of fully connected layers (FC), group normalization (GN), and sigmoid linear units (SiLU). x is the input text latent code, t is the input time embedding, and h_i denotes the intermediate features.
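A PyTorch sketch of such a block, reconstructed from this description, might look as follows; the exact wiring (where the time embedding t is injected, the FC/GN/SiLU ordering, and the time-embedding dimension) is an assumption rather than the authors' code.

```python
import torch
import torch.nn as nn

class FCResBlock(nn.Module):
    """Sketch of a fully connected residual block for the text data stream,
    reconstructed from the paper's description; dimensions and wiring are assumptions."""
    def __init__(self, in_dim=768, hidden_dim=320 * 4, time_dim=320):
        super().__init__()
        self.norm1 = nn.GroupNorm(32, in_dim)
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.time_proj = nn.Linear(time_dim, hidden_dim)   # inject the time embedding t
        self.norm2 = nn.GroupNorm(32, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.skip = nn.Linear(in_dim, hidden_dim)          # skip connection when dims differ
        self.act = nn.SiLU()

    def forward(self, x, t):
        h = self.fc1(self.act(self.norm1(x)))              # first GN + SiLU + FC set
        h = h + self.time_proj(t)                          # add the time embedding
        h = self.fc2(self.act(self.norm2(h)))              # second GN + SiLU + FC set
        return h + self.skip(x)                            # residual connection

x = torch.randn(4, 768)   # text latent code
t = torch.randn(4, 320)   # time embedding
print(FCResBlock()(x, t).shape)   # torch.Size([4, 1280])
```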
For the context group, cross-attention layers are used for both the image and text context streams, where the context embedding acts on the data features through projection layers, dot products, and softmax.
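For reference, a context layer of this standard cross-attention form can be sketched as below; the dimensions (a 768-dimensional context and 320-dimensional data features) are assumptions, and the block is not taken from the authors' code.

```python
import torch
import torch.nn as nn

class ContextCrossAttention(nn.Module):
    """Sketch of a context layer: data features attend to projected context
    embeddings (CLIP text or image features). Dimensions are assumptions."""
    def __init__(self, data_dim=320, ctx_dim=768, inner_dim=320):
        super().__init__()
        self.to_q = nn.Linear(data_dim, inner_dim, bias=False)   # queries from data features
        self.to_k = nn.Linear(ctx_dim, inner_dim, bias=False)    # keys from context embedding
        self.to_v = nn.Linear(ctx_dim, inner_dim, bias=False)    # values from context embedding
        self.to_out = nn.Linear(inner_dim, data_dim)
        self.scale = inner_dim ** -0.5

    def forward(self, x, context):
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return x + self.to_out(attn @ v)   # residual connection

x = torch.randn(2, 64, 320)     # data features (e.g. flattened image tokens)
ctx = torch.randn(2, 77, 768)   # context embeddings
print(ContextCrossAttention()(x, ctx).shape)   # torch.Size([2, 64, 320])
```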
Variational Autoencoder (VAE):
VD adopts the autoencoder-KL from the earlier Latent Diffusion Model (LDM) as the image data VAE, and Optimus as the text data VAE. Optimus consists of a BERT text encoder and a GPT-2 text decoder, and can bidirectionally convert sentences to and from 768-dimensional normally distributed latent vectors.
At the same time, Optimus shows satisfactory VAE characteristics with its reconstructible and interpretable text latent space. It was therefore chosen as the text VAE because it fits well with the prerequisites of a multi-stream multi-modal framework.
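To make the text-VAE interface concrete, here is a sketch of the encoder side only of an Optimus-style text VAE using Hugging Face transformers: BERT pools the sentence and a linear head predicts a 768-dimensional Gaussian latent. The pooling choice and the head are assumptions, and the GPT-2 decoder side (latent-conditioned generation) is omitted.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextVAEEncoderSketch(nn.Module):
    """Hedged sketch of an Optimus-style text VAE *encoder*: BERT pools a sentence,
    a linear head predicts a 768-d Gaussian latent. The GPT-2 decoder is omitted."""
    def __init__(self, latent_dim=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-cased")
        self.to_mu_logvar = nn.Linear(self.bert.config.hidden_size, 2 * latent_dim)

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids, attention_mask=attention_mask).pooler_output
        mu, logvar = self.to_mu_logvar(pooled).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return z, mu, logvar

tok = BertTokenizer.from_pretrained("bert-base-cased")
batch = tok(["a dog running on the beach"], return_tensors="pt")
z, mu, logvar = TextVAEEncoderSketch()(batch["input_ids"], batch["attention_mask"])
print(z.shape)   # torch.Size([1, 768])
```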
Context Encoder:
VD uses the CLIP text and image encoders as context encoders. Unlike LDM and SD, which use only raw text embeddings as context input, VD uses the normalized and projected embeddings that minimize CLIP's contrastive loss between text and images.
Experiments show that closer embedding spaces across context types help the model converge faster and perform better. A similar conclusion can be found in DALL·E 2, which fine-tunes its text-to-image model with an additional projection layer to minimize the gap between text and image embeddings for image variation.
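The distinction between raw and normalized, projected CLIP embeddings can be made concrete with a few lines of transformers code. This is a generic CLIP example (openai/clip-vit-large-patch14) used only to illustrate the shared embedding space, not VD's training code, and the image file name is a placeholder.

```python
# Sketch: CLIP's projected + L2-normalized embeddings put text and images in a
# shared space, which is the kind of context embedding VD uses (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("cat.jpg").convert("RGB")        # placeholder local image
inputs = processor(text=["a photo of a cat"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Normalize so that text and image embeddings live on the same unit sphere,
# i.e. the space in which the CLIP contrastive loss was minimized.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print((text_emb @ image_emb.T).item())   # cosine similarity between the two contexts
```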
The authors used earlier single-task models as baselines and compared VD's results against them: SD v1.4 as the text-to-image baseline, SD-variation for image variation, and BLIP for image-to-text.
Meanwhile, the authors also conduct qualitative comparisons among different VD models, where VD-DC and VD-official are used for text-to-image, and all three models are used for image variation.
The image samples of SD and VD are generated with controlled random seeds to better check the quality.
Text to Image Performance
Although DALL·E 2 and Imagen also achieve state-of-the-art results on these tasks, the authors skip comparisons with them since no public code or training details are available.
The results show that the multi-stream structure and multi-task training help VD capture contextual semantics and generate output more accurately, completing all subtasks excellently.
Image Variation Performance
Also, the image captions generated by VD contain some creative words. In comparison, BLIP's generations are very short and lack detailed description.
Image to Text Performance
Text to image
Image variation
Semantics-focused image variation
Dual-guided generation
The IFP group at the University of Illinois at Urbana-Champaign was founded by Professor Huang Xutao (Thomas S. Huang) in the 1980s, originally as the Image Formation and Processing group of the Beckman Institute for Advanced Science and Technology.
Over the years, IFP has been committed to research and innovation beyond images, including image and video coding, multi-modal human-computer interaction, multimedia annotation and search, computer vision and pattern recognition, machine learning, big data, deep learning, and high-performance computing.
The current research direction of IFP is to solve the problem of multi-modal information processing by collaboratively combining big data, deep learning and high-performance computing.
In addition, IFP has won several best paper awards at top artificial intelligence conferences and has won many international competitions, including the first NIST TRECVID, the first ImageNet Challenge, and the first AI City Challenge.
Interestingly, since Professor Huang began teaching at MIT in the 1960s, the "members" of the IFP group have come to include friends, students, students' students, and even students of students' students.