


As one of Sora's core technologies, DiT (Diffusion Transformer) uses a Transformer backbone to scale diffusion generative models to larger sizes, achieving outstanding image generation quality.
However, larger model sizes cause training costs to skyrocket.
At ICCV 2023, a research team led by Yan Shuicheng and Cheng Mingming, from Sea AI Lab, Nankai University, and the Kunlun Wanwei 2050 Research Institute, proposed a new model called Masked Diffusion Transformer (MDT). MDT uses mask-modeling techniques to speed up Diffusion Transformer training by learning semantic representations, and it achieves SoTA results in image generation. This innovation brings a new breakthrough to image generation models and gives researchers a more efficient training method. By combining expertise from different fields, the team arrived at a solution that both increases training speed and improves generation quality, contributing innovative ideas to the field and offering useful inspiration for future research and practice.
Paper address: https://arxiv.org/abs/2303.14389
GitHub address: https://github.com/sail-sg/MDT
Recently, Masked Diffusion Transformer V2 has again set a new SoTA, speeding up training by more than 10x over DiT and reaching an FID of 1.58 on the ImageNet benchmark.
The latest version of the paper and the code are open source.
Background
Although diffusion models represented by DiT have achieved significant success in image generation, researchers have found that they often struggle to efficiently learn the semantic relationships among the parts of objects in an image, and this limitation leads to low convergence efficiency during training.
(Figure: images generated by DiT at different training steps)
For example, as shown in the figure above, by the 50k training step DiT has learned to generate the dog's fur texture; by the 200k step it learns to generate one of the dog's eyes and the mouth, but misses the other eye.
Even at the 300k step, the relative position of the dog's two ears generated by DiT is still not quite right.
This learning process reveals that the diffusion model fails to efficiently learn the semantic relationships among the parts of an object in an image, instead learning the semantic information of each part independently.
The researchers speculate that this happens because diffusion models learn the distribution of real image data by minimizing a per-pixel prediction loss; this objective ignores the relative semantic relationships among an object's parts, which slows the model's convergence.
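This per-pixel (or per-latent-element) objective can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the function name and the plain-list representation are assumptions, chosen only to show that the loss is a sum of independent per-element terms and encodes no relation between different parts of the image.

```python
def diffusion_loss(pred_noise, true_noise):
    """Mean-squared error between predicted and true noise.

    Each list element stands for one pixel/latent value. Because the
    loss sums independent per-element terms, it models no relationship
    between different parts of the image.
    """
    assert len(pred_noise) == len(true_noise)
    n = len(pred_noise)
    return sum((p - t) ** 2 for p, t in zip(pred_noise, true_noise)) / n
```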
Method: Masked Diffusion Transformer
Inspired by these observations, the researchers proposed the Masked Diffusion Transformer (MDT) to improve the training efficiency and generation quality of diffusion models.
MDT proposes a mask-modeling representation-learning strategy designed for the Diffusion Transformer that explicitly strengthens its learning of contextual semantic information and of the semantic relationships among objects in an image.
(Figure: overview of the MDT training pipeline)
As shown in the figure above, MDT introduces a mask-modeling learning strategy while maintaining the diffusion training process. By masking some of the noised image tokens, MDT uses an asymmetric Diffusion Transformer architecture to predict the masked tokens from the unmasked noised tokens, carrying out mask modeling and diffusion training simultaneously.
During inference, MDT still follows the standard diffusion generation process. This design gives the Diffusion Transformer both the semantic expressiveness brought by mask-modeling representation learning and the diffusion model's ability to generate image detail.
Specifically, MDT maps images into latent space with a VAE encoder and operates in latent space to save computation.
During training, MDT first masks out some of the noised image tokens and sends the remaining tokens to the asymmetric Diffusion Transformer, which predicts all of the denoised image tokens.
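The masking step can be sketched roughly as below. This is a stdlib-only illustrative sketch under assumed names (it is not the released MDT code): a fraction of the noised latent tokens is dropped at random, only the kept tokens would pass through the encoder, and the model is then asked to predict all tokens.

```python
import random

def mask_tokens(tokens, mask_ratio, seed=0):
    """Randomly partition noised latent tokens into kept and masked sets.

    Returns (kept_tokens, kept_indices, masked_indices); only the kept
    tokens would be fed to the asymmetric Diffusion Transformer encoder.
    """
    rng = random.Random(seed)
    n = len(tokens)
    n_masked = int(n * mask_ratio)
    order = list(range(n))
    rng.shuffle(order)
    masked_indices = sorted(order[:n_masked])
    kept_indices = sorted(order[n_masked:])
    kept_tokens = [tokens[i] for i in kept_indices]
    return kept_tokens, kept_indices, masked_indices
```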
Asymmetric Diffusion Transformer Architecture
(Figure: the asymmetric Diffusion Transformer architecture of MDT and MDTv2)
As shown in the figure above, MDTv2 further optimizes the joint learning of diffusion and mask modeling by introducing a more efficient macro network structure designed for the masked diffusion process.
This includes integrating U-Net-style long skip-connections in the encoder and dense input-shortcuts in the decoder.
Among these, the dense input-shortcut adds noise to the masked tokens and sends them to the decoder, retaining the noise information corresponding to the masked tokens and thereby facilitating training of the diffusion process.
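One possible reading of the dense input-shortcut is sketched below. This is a hedged, stdlib-only illustration: the function name and the additive combination of a shared mask embedding with the noised token are my assumptions for exposition, not the authors' implementation.

```python
def decoder_input(encoder_out, noised_tokens, kept_indices, masked_indices, mask_embed):
    """Assemble the decoder input sequence with a dense input-shortcut.

    Kept positions take the encoder outputs; masked positions take a
    shared mask embedding plus the original noised token (the shortcut),
    so the noise information the diffusion loss needs is preserved there.
    """
    seq = [None] * len(noised_tokens)
    for pos, idx in enumerate(kept_indices):
        seq[idx] = encoder_out[pos]
    for idx in masked_indices:
        seq[idx] = mask_embed + noised_tokens[idx]
    return seq
```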
In addition, MDT introduces improved training strategies, including the faster Adan optimizer, timestep-dependent loss weights, and an enlarged mask ratio, to further accelerate training of the masked diffusion model.
Experimental results
ImageNet 256 benchmark generation quality comparison
(Table: generation quality of MDT vs. DiT on the ImageNet 256 benchmark)
The table above compares MDT and DiT at different model sizes on the ImageNet 256 benchmark.
Clearly, MDT achieves better FID scores at lower training cost across all model sizes.
The parameter count and inference cost of MDT are essentially the same as DiT's because, as noted above, MDT still uses the standard diffusion process at inference.
For the largest XL model, MDTv2-XL/2 trained for 400k steps significantly outperforms DiT-XL/2 trained for 7000k steps, with an FID improvement of 1.92. Under this setting, the results show that MDT trains about 18x faster than DiT.
For small models, MDTv2-S/2 still significantly outperforms DiT-S/2 with far fewer training steps. For example, at the same 400k training steps, MDTv2 reaches an FID of 39.50, far ahead of DiT's 68.40.
More importantly, this result also beats the larger DiT-B/2 at 400k training steps (39.50 vs. 43.47).
ImageNet 256 benchmark CFG generation quality comparison
(Table: generation quality under classifier-free guidance on the ImageNet 256 benchmark)
The table above compares the image generation performance of MDT and existing methods under classifier-free guidance.
MDT surpasses the previous SOTA, DiT, and other methods with an FID of 1.79. MDTv2 further improves performance, pushing the SOTA FID for image generation to a new low of 1.58 with fewer training steps.
Similar to DiT, we did not observe saturation of the model's FID score as training continued.
MDT sets a new SoTA on the Papers with Code leaderboard.
Convergence speed comparison
(Figure: FID vs. training steps and training time for DiT-S/2, MDT-S/2, and MDTv2-S/2)
The figure above compares the FID of the DiT-S/2 baseline, MDT-S/2, and MDTv2-S/2 at different training steps/training times on 8×A100 GPUs under the ImageNet 256 benchmark.
Thanks to its stronger contextual learning ability, MDT surpasses DiT in both performance and convergence speed, and MDTv2's training converges more than 10x faster than DiT's.
MDT is about 3x faster than DiT in terms of training steps and training time; MDTv2 improves training speed by a further ~5x over MDT.
For example, MDTv2-S/2 reaches better performance after only 13 hours (15k steps) than DiT-S/2 does after about 100 hours (1500k steps), revealing that contextual representation learning is crucial for faster generative learning in diffusion models.
Summary & Discussion
By introducing an MAE-like mask-modeling representation-learning scheme into the diffusion training process, MDT can use the contextual information of image objects to reconstruct the complete information of a partially masked input image, learning the correlations among the semantic parts of an image and thereby improving both the quality and the learning speed of image generation.
The researchers believe that enhancing semantic understanding of the physical world through visual representation learning can improve a generative model's ability to simulate the physical world. This aligns with Sora's vision of building a physical-world simulator through generative models. Hopefully this work will inspire more research on unifying representation learning and generative learning.
Reference:
https://arxiv.org/abs/2303.14389


