


Recent advances in diffusion models have set impressive milestones across many generative tasks, and high-profile works such as DALL·E 2, Imagen, and Stable Diffusion (SD) have aroused great interest in both academia and industry.
However, although these models perform amazingly, each is essentially focused on a single type of task, such as generating images from given text. Supporting a different type of task usually requires either dedicated retraining or building a new model from scratch.
So can we build an "all-round" diffusion model on top of these previous models and unify AIGC models? Some researchers are exploring this direction and have already made progress.
A joint team from the University of Illinois Urbana-Champaign and the University of Texas at Austin is trying to extend the existing single-stream diffusion pipeline into a multi-stream network, called Versatile Diffusion (VD), the first unified multi-stream multi-modal diffusion framework and a step toward general generative artificial intelligence.
Paper address: https://arxiv.org/abs/2211.08332
In addition to the ordinary text-to-image function, Versatile Diffusion can also generate similar images from an input image, generate text from an input image, generate similar text from input text, perform semantically disentangled image editing, generate videos from images and text, edit image content based on the latent space, and more.
Future versions will also support more modes such as voice, music, video and 3D.
According to the paper, it has been proven that VD and its basic framework have the following advantages:
a) It can process all subtasks with competitive, high quality.
b) It supports new extensions and applications, such as the separation of image style and semantics, dual-guided image-text generation, etc.
c) These experiments and applications provide richer semantic insights into the generated output.
In terms of training data set, VD uses Laion2B-en with custom data filters as the main data set.
One of the exciting findings of this first exploration of VD is that it can semantically enhance or reduce image style without further supervision.
This phenomenon inspired the authors to explore a completely new field in which the separation of style and semantics can be performed on images of arbitrary style and arbitrary content.
The authors state that they are the first to explore: a) interpreting the semantics and style of natural images without domain specification; b) semantic and stylistic decomposition in the latent space of diffusion models.
In the image below, the author first generates variants of the input image and then operates on them with a semantic (left) or stylistic (right) focus.
Since VD supports both image-to-text and text-to-image, the team tried, for the first time, editing images from the perspective of text prompts by following these steps: a) convert the image to text, b) edit the text, c) convert the text back to an image.
In the experiments, the authors removed described content from an image and then added new content using this image-text-image (I2T2I) paradigm. Unlike inpainting or other image-editing methods that require object locations as input, VD's I2T2I requires no mask, because it automatically locates and replaces objects on command.
However, the output image of I2T2I is not pixel-aligned with the input image, a consequence of image-to-text semantic extraction followed by text-to-image content creation.
In the demonstration below, the input image is first translated into a prompt, the prompt is then edited by subtraction (red box) and addition (green box), and finally the edited prompt is translated back into an image.
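The three I2T2I steps can be sketched as a small pipeline. The model calls below are placeholders (VD's actual image-to-text and text-to-image passes are neural networks); only the prompt-editing step in the middle is concrete, and the caption string is an invented example.

```python
# Sketch of the I2T2I editing loop. image_to_text / text_to_image stand in for
# VD's two generation directions; here they are simple placeholders so that the
# prompt-editing step can be shown end to end.

def image_to_text(image):
    return "a red bicycle leaning against a brick wall"   # placeholder caption

def text_to_image(prompt):
    return f"<image rendered from: {prompt}>"             # placeholder render

def edit_prompt(prompt, remove="", add=""):
    """Subtraction (red box) drops a phrase; addition (green box) appends one."""
    edited = prompt.replace(remove, "") if remove else prompt
    edited = " ".join(edited.split())                      # tidy stray spaces
    if add:
        edited = f"{edited}, {add}"
    return edited

prompt = image_to_text("input.png")                        # a) image -> text
edited = edit_prompt(prompt, remove="a red bicycle",       # b) edit the text
                     add="a wooden bench")
result = text_to_image(edited)                             # c) text -> image
print(edited)  # leaning against a brick wall, a wooden bench
```

Because the edit happens in prompt space, no mask or object location is needed, which matches the behavior described above.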
In addition, they are also the first team to explore generating similar text from given text.
Network Framework
Specifically, the VD framework proposed in the paper is a multi-stream network that takes various types of data as input and as context.
The VD multi-stream multi-modal diffusion framework inherits the advantages of LDM/SD, with interpretable latent space, modal structure and low computational cost.
VD can jointly train multiple streams, each representing a cross-modal task. Its core design is the grouping, sharing, and switching protocol for the diffuser layers within the network, which adapts the framework to all supported tasks and beyond.
The diffuser layers are divided into three groups: global layers, data layers, and context layers. The global layers are the time-embedding layers, the data layers are the residual blocks, and the context layers are the cross-attention layers.
This grouping corresponds to each layer's function. When handling multiple tasks, the global layers are shared among all tasks, while the data and context layers contain multiple streams. Each stream can be shared or switched depending on the current data and context types.
For example, when processing a text-to-image request, the diffuser uses the image data layers and the text context layers; when handling an image-variation task, it uses the image data layers and the image context layers.
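The routing just described can be sketched as a simple lookup. This is a hypothetical protocol, not the authors' code: the layer names and the routing table are illustrative, and only the grouping idea (global shared, data and context switched per task) comes from the paper.

```python
# Hypothetical sketch of VD's layer grouping and stream-switching protocol.
# Layer names below are illustrative placeholders.

GLOBAL_LAYERS = ["time_embedding"]           # shared by every task

DATA_LAYERS = {                              # residual blocks, one stream per data type
    "image": "image_resblocks",
    "text": "text_fc_resblocks",
}

CONTEXT_LAYERS = {                           # cross-attention, one stream per context type
    "image": "image_cross_attention",
    "text": "text_cross_attention",
}

def route(data_type, context_type):
    """Pick the layer streams the diffuser activates for one task."""
    return {
        "global": GLOBAL_LAYERS,
        "data": DATA_LAYERS[data_type],
        "context": CONTEXT_LAYERS[context_type],
    }

# Text-to-image: image data stream + text context stream
print(route("image", "text"))
# Image variation: image data stream + image context stream
print(route("image", "image"))
```

The same lookup extends to new modalities by adding entries to the two tables, which is the sense in which the protocol adapts "to all supported tasks and beyond".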
A single VD flow contains a VAE, a diffuser, and a context encoder, and processes one task (such as text-to-image) under one data type (such as image) and one context type (such as text).
The multi-stream structure of Versatile Diffusion is shown in the figure below:
Based on Versatile Diffusion, the researchers further propose a general multi-stream multi-modal framework, which includes the VAEs, the context encoders, and a diffuser containing three groups of layers (i.e., global, data, and context layers).
Diffuser:
VD uses the widely adopted cross-attention UNet as the main architecture of the diffuser network, dividing its layers into global, data, and context layers. The data and context layers each contain two streams to support images and text.
For the image data stream, VD follows LDM and uses residual blocks (ResBlocks), whose spatial dimensions gradually decrease while the number of channels gradually increases.
For the text data stream, a new fully connected residual block (FCResBlock) expands the 768-dimensional text latent vector into 320×4 hidden features, following a similar channel-growth paradigm and reusing GroupNorm, SiLU, and skip connections, just like a normal ResBlock.
As shown in the figure above, FCResBlock contains two sets of fully connected layers (FC), group normalization (GN), and sigmoid linear units (SiLU); x is the input text latent code, t is the input time embedding, and hi are the intermediate features.
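A minimal FCResBlock sketch, based on the description above, might look as follows. The dimensions (768 → 320×4) come from the text; the exact placement of the time-embedding injection, its width, and the skip projection are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class FCResBlock(nn.Module):
    """Sketch of the fully connected residual block described above.

    Two FC + GroupNorm + SiLU stages with a time-embedding injection and a
    skip connection; wiring details are assumptions, not the paper's code.
    """
    def __init__(self, in_dim=768, hidden_dim=320 * 4, time_dim=320 * 4, groups=32):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.gn1 = nn.GroupNorm(groups, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.gn2 = nn.GroupNorm(groups, hidden_dim)
        self.act = nn.SiLU()
        self.time_proj = nn.Linear(time_dim, hidden_dim)  # injects time embedding t
        self.skip = nn.Linear(in_dim, hidden_dim)         # match dims for the residual

    def forward(self, x, t):
        h = self.act(self.gn1(self.fc1(x)))   # first FC + GN + SiLU set
        h = h + self.time_proj(t)             # add time embedding, ResBlock-style
        h = self.act(self.gn2(self.fc2(h)))   # second FC + GN + SiLU set
        return h + self.skip(x)               # skip connection

x = torch.randn(2, 768)        # text latent code
t = torch.randn(2, 320 * 4)    # time embedding
out = FCResBlock()(x, t)
print(out.shape)  # torch.Size([2, 1280])
```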
For the context group, cross-attention layers are used for both the image and text context streams, in which the context embeddings manipulate the data features through projection layers, dot products, and sigmoids.
Variational Autoencoder (VAE):
VD adopts the autoencoder-KL from the earlier Latent Diffusion Model (LDM) as the image data VAE, and Optimus as the text data VAE. Optimus consists of a BERT text encoder and a GPT-2 text decoder and can bidirectionally convert sentences into 768-dimensional, normally distributed latent vectors.
At the same time, Optimus shows satisfactory VAE properties with its reconstructable and interpretable text latent space. Optimus was therefore chosen as the text VAE because it fits well the prerequisites of a multi-stream multi-modal framework.
Context Encoder:
VD uses the CLIP text and image encoders as context encoders. Unlike LDM and SD, which use only raw text embeddings as context input, VD uses the normalized and projected embeddings that minimize CLIP's contrastive loss between text and images.
Experiments show that a closer embedding space between context types helps the model converge faster and perform better. A similar conclusion also holds for DALL·E 2, which fine-tunes its text-to-image model with an additional projection layer to minimize the gap between text and image embeddings for image variation.
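The idea can be illustrated with a small sketch: project the embeddings and L2-normalize them, the same form CLIP uses for its contrastive loss, so that text and image context embeddings become directly comparable via dot products. The projection below uses random placeholder weights and made-up dimensions, not trained CLIP weights.

```python
import torch
import torch.nn.functional as F

# Placeholder projection: in CLIP this layer is trained with the contrastive
# loss; here it is randomly initialized just to show the embedding form.
proj = torch.nn.Linear(1024, 768, bias=False)

def to_context(embedding):
    """Project and L2-normalize an encoder output into a context embedding."""
    return F.normalize(proj(embedding), dim=-1)   # unit-norm vectors

text_emb = to_context(torch.randn(4, 1024))       # stand-in text encoder output
image_emb = to_context(torch.randn(4, 1024))      # stand-in image encoder output

# On unit-norm vectors the dot product equals cosine similarity, so text and
# image contexts live in one comparable space.
sim = text_emb @ image_emb.T
print(text_emb.norm(dim=-1))   # all approximately 1.0
```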
Performance
The authors used earlier single-task models as baselines and compared VD's results against them: SDv1.4 serves as the text-to-image baseline, SD-variation for image variation, and BLIP for image-to-text.
Meanwhile, the authors also qualitatively compare the different VD models, with VD-DC and VD-official used for text-to-image and all three models used for image variation.
The image samples of SD and VD are generated with controlled random seeds to better check the quality.
Text to Image Performance
Although DALL·E 2 and Imagen have also achieved state-of-the-art results on these tasks, the authors skip comparisons with them because neither has released public code or training details.
The results show that the multi-stream structure and multi-task training help VD capture contextual semantics and generate output more accurately, completing all subtasks excellently.
Image-Variation Performance
Meanwhile, the image captions generated by VD also contain some creative words; in comparison, BLIP's generations are very short and lack descriptive detail.
Image to text performance
Effect display
Text-to-Image
Image variations
Semantics-Focused Image Variants
Dual-Guided Generation
Summary
- The authors introduce Versatile Diffusion (VD), a multi-stream multi-modal diffusion network that handles text, images, and variations in a unified model. Based on VD, the authors further introduce a general multi-stream multi-modal framework that can cover new tasks and domains.
- Through experiments, the authors found that VD produces high-quality output on all supported tasks: VD's text-to-image and image-variation results better capture the semantics in context, and its image-to-text results are creative and illustrative.
- Given the multi-stream multi-modal nature of VD, the authors introduce novel extensions and applications that may further benefit downstream users working on this technology.
Team Introduction
The IFP group at the University of Illinois Urbana-Champaign was founded by Professor Thomas S. Huang (Huang Xutao) in the 1980s, originally as the Image Formation and Processing group of the Beckman Institute for Advanced Science and Technology.
Over the years, IFP has been committed to research and innovation beyond images, including image and video coding, multi-modal human-computer interaction, multimedia annotation and search, computer vision and pattern recognition, machine learning, big data, deep learning, and high-performance computing.
The current research direction of IFP is to solve the problem of multi-modal information processing by collaboratively combining big data, deep learning and high-performance computing.
In addition, IFP has won several best-paper awards at top artificial-intelligence conferences and many international competitions, including the first NIST TRECVID evaluation, the first ImageNet Challenge, and the first AI City Challenge.
Interestingly, since Professor Huang began teaching at MIT in the 1960s, the "members" of the IFP group have come to include friends, students, students' students, and even students of students' students.
The above is the detailed content of "The AIGC unified model is here! The team founded by Huang Xutao, a leader in the CV industry, proposed 'Almighty Diffusion'".
