The technology behind Sora's explosive popularity: a survey of the latest directions in diffusion models

WBOY

Feb 22, 2024 pm 01:19 PM

Deep generative models have made significant progress toward giving machines something like human imagination. These models can produce realistic samples, and diffusion models in particular perform well across many domains. Diffusion models address limitations of other model families: the difficulty of aligning posterior distributions in variational autoencoders (VAEs), the training instability of generative adversarial networks (GANs), the computational cost of energy-based models (EBMs), and the architectural constraints of normalizing flows (NFs). As a result, diffusion models have attracted wide attention in fields such as computer vision and natural language processing.

A diffusion model consists of two processes: the forward process and the reverse process. The forward process transforms the data into a simple prior distribution, while the reverse process inverts this transformation, using a trained neural network to simulate a differential equation that generates data. Compared with other models, diffusion models offer a more stable training objective and better generation quality.
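As a minimal sketch of the forward process, the snippet below uses the DDPM-style closed form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, which jumps to any noise level in one step. The linear beta schedule, the number of steps, and all function names are illustrative assumptions, not details taken from the survey.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an (assumed) linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in closed form and return the noise that was used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
x0 = rng.uniform(-1.0, 1.0, size=(4, 8))            # toy "data"
x_mid, _ = forward_noise(x0, 250, alpha_bars, rng)  # partly noised
x_end, _ = forward_noise(x0, 999, alpha_bars, rng)  # almost pure Gaussian noise
```

At the last step almost no signal remains, which is why the prior can be treated as a standard Gaussian when sampling starts.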

However, sampling from a diffusion model requires many iterative inference steps, each involving a network evaluation. The process faces challenges such as instability, high-dimensional computational cost, and complex likelihood optimization. To address these, researchers have proposed various solutions, such as improved ODE/SDE solvers and model distillation strategies to accelerate sampling, as well as new forward processes to improve stability and reduce dimensionality.

Recently, researchers from The Chinese University of Hong Kong, Westlake University, MIT, and Zhejiang Lab published a review paper titled "A Survey on Generative Diffusion Models" in IEEE TKDE. The survey discusses recent advances in diffusion models from four angles: sampling acceleration, process design, likelihood optimization, and distribution bridging. It also takes an in-depth look at the success of diffusion models in application areas such as image synthesis, video generation, 3D modeling, medical analysis, and text generation. These application cases demonstrate the practicality and potential of diffusion models in the real world.

Paper address: https://arxiv.org/pdf/2209.02646.pdf
Project address: https://github.com/chq1155/A-Survey-on-Generative-Diffusion-Model?tab=readme-ov-file

Algorithm Improvement

Sampling acceleration

In diffusion models, one of the key techniques for improving sampling speed is knowledge distillation. This process extracts knowledge from a large, complex model and transfers it to a smaller, more efficient one. For example, knowledge distillation can simplify the model's sampling trajectory so that each step approximates the target distribution more efficiently. Salimans et al. adopted an approach based on ordinary differential equations (ODEs) to optimize these trajectories, while other researchers developed techniques that estimate clean data directly from the noisy sample at time step T, shortening the sampling process.

Improving the training scheme is another way to boost sampling efficiency. Some research focuses on learning new diffusion schemes in which the data is no longer simply perturbed with Gaussian noise but is mapped to a latent space through more complex methods. Some of these methods focus on optimizing the reverse decoding process, for example by adjusting the encoding depth, while others explore new noise scale designs in which the amount of added noise is no longer fixed but becomes a parameter learned during training.

Besides training new models for efficiency, some techniques accelerate sampling from diffusion models that are already pre-trained. ODE acceleration is one such technique: it describes the diffusion process with an ODE so that sampling can proceed faster. For example, DDIM samples using an ODE formulation, and subsequent work such as PNDM and EDM introduced more efficient ODE solvers to further improve sampling speed.
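The deterministic DDIM update can be sketched as follows. This is a minimal illustration, not any paper's reference implementation: the noise schedule, the ten-step subsequence, and the closed-form `oracle_eps` (valid only for a toy data distribution concentrated on a single point) are all assumptions that stand in for a trained network.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    # Clean sample implied by the predicted noise.
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Re-noise that estimate to the next, less noisy level along the same noise direction.
    return np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps_pred

def ddim_sample(x_T, eps_model, alpha_bars, timesteps):
    """Run DDIM over a decreasing subsequence of the training timesteps."""
    x = x_T
    for i, t in enumerate(timesteps):
        is_last = i + 1 == len(timesteps)
        abar_prev = 1.0 if is_last else alpha_bars[timesteps[i + 1]]
        x = ddim_step(x, eps_model(x, t), alpha_bars[t], abar_prev)
    return x

# Toy setup (assumption): the data distribution is a single point, so the
# optimal noise predictor is known in closed form and needs no training.
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0_true = np.array([1.5, -0.5])

def oracle_eps(x_t, t):
    return (x_t - np.sqrt(alpha_bars[t]) * x0_true) / np.sqrt(1.0 - alpha_bars[t])

rng = np.random.default_rng(0)
timesteps = list(range(999, -1, -111))    # 10 of the 1000 steps
sample = ddim_sample(rng.standard_normal(2), oracle_eps, alpha_bars, timesteps)
```

Because the update is deterministic, steps can be skipped without retraining; with a real network the quality then depends on how well the solver follows the underlying ODE.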

Researchers have also proposed analytical methods to speed up sampling. These methods seek an analytical solution that can recover clean data directly from noisy data without iteration. They include Analytic-DPM and its improved successor, which provide a fast and accurate sampling strategy.

Diffusion process design

Latent-space diffusion models such as LSGM and INDM combine a diffusion model with a VAE or a normalizing flow. The encoder-decoder and the diffusion model are optimized jointly through a shared weighted denoising score matching loss, so that optimizing the ELBO or the log-likelihood yields a latent space that is easy to learn and easy to sample from. For example, Stable Diffusion first uses a VAE to learn a latent space and then trains a diffusion model in that space that accepts text input. DVDP dynamically adjusts the orthogonal components of pixel space during image perturbation.

To make generative models more efficient and more robust, researchers have explored new forward process designs. The Poisson flow generative model (PFGM) treats data points as electric charges and transports a simple distribution to the data distribution along the electric field lines, which gives more robust reverse sampling than traditional diffusion models. PFGM++ extends this concept to higher-dimensional augmented variables. The critically damped Langevin diffusion model of Dockhorn et al. introduces velocity variables from Hamiltonian dynamics, which simplifies learning because only the score function of the conditional velocity distribution has to be learned.
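To make the "data as charges" picture concrete, here is a rough sketch of an empirical field computed from a finite set of data points. It assumes the data are placed on the z = 0 hyperplane of a space with one extra dimension and omits normalization constants; the function name and the toy data are hypothetical, and the real method trains a network to approximate a normalized version of this field.

```python
import numpy as np

def poisson_field(x_aug, charges_aug, tiny=1e-12):
    """Empirical field at x_aug (shape (D,)) from equal charges at charges_aug (shape (n, D))."""
    diff = x_aug[None, :] - charges_aug                  # vectors pointing away from each charge
    dist = np.linalg.norm(diff, axis=1, keepdims=True)
    dim = x_aug.shape[0]
    # In D dimensions the field of a point charge decays as 1 / r^(D - 1),
    # so each contribution is diff / r^D.
    return (diff / (dist ** dim + tiny)).mean(axis=0)

rng = np.random.default_rng(0)
data = rng.normal(size=(256, 2))                         # toy 2-D data
charges = np.hstack([data, np.zeros((256, 1))])          # lifted onto the z = 0 plane
query = np.array([0.0, 0.0, 5.0])
field = poisson_field(query, charges)

# Sanity check with one charge at the origin: the field is radial with magnitude 1 / r^2.
single = poisson_field(np.array([0.0, 0.0, 2.0]), np.zeros((1, 3)))
```

Generation would follow the field lines backward, from points far from the plane toward z = 0, where the data sit.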

For diffusion models on discrete-space data (such as text and categorical data), D3PM defines a forward process in discrete space. Building on this method, research has been extended to language text generation, graph segmentation, and lossless compression. For multimodal tasks, vector-quantized data are converted into discrete codes, which has shown superior results. Data that live on Riemannian manifolds, as in robotics and protein modeling, require diffusion sampling to be formulated on the manifold itself. Combinations of graph neural networks and diffusion theory, such as EDP-GNN and GraphGDP, process graph data in a way that captures permutation invariance.
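A discrete forward process can be sketched with a transition matrix instead of Gaussian noise. The example below assumes the uniform transition kernel, one of several kernels used for discrete diffusion; the schedule values and function names are illustrative only.

```python
import numpy as np

def uniform_transition(num_classes, beta):
    """One-step matrix Q with Q[i, j] = q(x_t = j | x_{t-1} = i)."""
    identity = np.eye(num_classes)
    uniform = np.full((num_classes, num_classes), 1.0 / num_classes)
    # Keep the current category with probability 1 - beta, otherwise resample uniformly.
    return (1.0 - beta) * identity + beta * uniform

def forward_marginal(x0_onehot, betas):
    """q(x_t | x_0) as a probability vector, via the product of one-step matrices."""
    num_classes = x0_onehot.shape[0]
    q_bar = np.eye(num_classes)
    for beta in betas:
        q_bar = q_bar @ uniform_transition(num_classes, beta)
    return x0_onehot @ q_bar

x0 = np.eye(5)[2]                         # category 2 out of 5, one-hot
betas = np.linspace(0.01, 0.2, 100)       # assumed schedule
early = forward_marginal(x0, betas[:5])   # still concentrated on category 2
late = forward_marginal(x0, betas)        # close to the uniform distribution
```

The uniform distribution plays the role that the Gaussian prior plays in the continuous case.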

Likelihood Optimization

Although diffusion models are trained by optimizing the ELBO, likelihood optimization remains a challenge, especially for continuous-time diffusion models. Methods such as ScoreFlow and variational diffusion models (VDM) establish a connection between maximum likelihood estimation (MLE) training and the denoising score matching (DSM) objective, in which Girsanov's theorem plays a key role. Improved DDPM (improved denoising diffusion probabilistic models) proposes a hybrid learning objective that combines the variational lower bound with DSM, together with a simple reparameterization technique.
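The DSM objective mentioned here is closely tied to the familiar noise-prediction loss. As a small numerical check, under the assumption of a Gaussian perturbation kernel q(x_t | x_0) = N(sqrt(abar) x_0, (1 - abar) I) and a variance weighting chosen for this illustration, the two losses coincide when the score is parameterized as -eps_pred / sqrt(1 - abar). The random `eps_pred` merely stands in for a network output.

```python
import numpy as np

def dsm_loss(score_pred, x_t, x0, abar):
    """Denoising score matching loss for q(x_t | x_0) = N(sqrt(abar) x_0, (1 - abar) I)."""
    var = 1.0 - abar
    target = -(x_t - np.sqrt(abar) * x0) / var         # gradient of log q(x_t | x_0) in x_t
    return var * np.mean((score_pred - target) ** 2)   # weighted by the variance

def noise_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return np.mean((eps_pred - eps) ** 2)

rng = np.random.default_rng(0)
abar = 0.3
x0 = rng.normal(size=(16, 4))
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

eps_pred = rng.standard_normal(x0.shape)       # stand-in for a network output
score_pred = -eps_pred / np.sqrt(1.0 - abar)   # score implied by that noise prediction

loss_a = dsm_loss(score_pred, x_t, x0, abar)
loss_b = noise_loss(eps_pred, eps)
```

Different choices of weighting across noise levels are what separate a plain noise-prediction loss from a likelihood-oriented objective.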

Distribution bridging

Diffusion models perform well at transforming a Gaussian distribution into a complex distribution, but connecting two arbitrary distributions remains challenging. α-blending methods build a deterministic bridge by iteratively blending and deblending samples. Rectified flow adds extra steps that straighten the bridge path. Other approaches connect two distributions through an ODE, and methods that use a Schrödinger bridge or a Gaussian distribution as an intermediate junction are also under investigation.
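A deterministic bridge between two distributions can be sketched with straight-line interpolation between paired samples, the construction that flow-style bridging methods build on. In the toy example below the pairing x1 = 2 * x0 + 3 and the closed-form `oracle_velocity` are assumptions chosen so that no training is needed; in practice a network would regress the velocity x1 - x0 from (x_t, t).

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point at time t on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return (1.0 - t) * x0 + t * x1

def euler_transport(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

def oracle_velocity(x_t, t):
    """Exact velocity for the toy pairing x1 = 2 * x0 + 3 (assumption)."""
    x0 = (x_t - 3.0 * t) / (1.0 + t)    # invert x_t = (1 + t) * x0 + 3 * t
    return x0 + 3.0                     # equals x1 - x0

rng = np.random.default_rng(0)
source = rng.standard_normal(1000)
moved = euler_transport(source, oracle_velocity, num_steps=2)
halfway = interpolate(source, 2.0 * source + 3.0, 0.5)
```

Because the paths here are exactly straight, two Euler steps already land on the target; straighter bridge paths are what make few-step sampling possible.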

Application fields

Image generation

Diffusion models have been very successful in image generation. They can generate ordinary images and also complete complex tasks such as text-to-image synthesis. Models such as Imagen, Stable Diffusion, and DALL-E 2 demonstrate strong capability here. They use a diffusion model architecture combined with cross-attention layers to integrate text information into the generated image. Beyond generating new images, these models can edit images without retraining. Editing is achieved by manipulating the cross-attention layers (keys, values, and attention matrices): for example, image elements can be changed by adjusting feature maps, and new concepts can be added by introducing new text embeddings. Some work ensures that the model attends to every keyword in the prompt during generation, so that the image accurately reflects the description. Diffusion models can also handle image-based conditional inputs, such as source images, depth maps, or human skeletons, by encoding these features and integrating them to guide generation. Some studies add encoded source-image features to the first layer of the model to achieve image-to-image editing, an approach that also applies when depth maps, edge maps, or skeletons serve as conditions.
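The cross-attention mechanism described above can be sketched in a few lines: image features supply the queries, and text embeddings supply the keys and values. The shapes, the random projection matrices, and the single-head layout below are illustrative assumptions rather than the configuration of any particular model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens attend to text tokens."""
    q = image_tokens @ w_q                     # (n_img, d)
    k = text_tokens @ w_k                      # (n_txt, d)
    v = text_tokens @ w_v                      # (n_txt, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (n_img, n_txt)
    return attn @ v, attn

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16, 8))        # 16 spatial positions, 8 channels
text_tokens = rng.normal(size=(5, 6))          # 5 prompt tokens, 6-dim embeddings
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(6, 4))
w_v = rng.normal(size=(6, 4))
out, attn = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

Each row of `attn` says how strongly one image position attends to each prompt token, which is why editing methods can change an image by modifying the keys, values, or attention maps.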

3D generation

There are two main approaches to 3D generation with diffusion models. The first trains models directly on 3D data and has been applied effectively to a variety of 3D representations such as NeRF, point clouds, and voxels. For example, researchers have shown how to directly generate point clouds of 3D objects. To improve sampling efficiency, some studies introduce a hybrid point-voxel representation or use a synthesized image as an additional condition for point cloud generation. Other studies apply diffusion models to NeRF representations of 3D objects, training view-conditioned diffusion models to synthesize novel views and optimize the NeRF representation. The second approach uses the prior knowledge in 2D diffusion models to generate 3D content. For example, the DreamFusion project uses a score distillation sampling objective to distill a NeRF from a pretrained text-to-image model, optimizing it by gradient descent so that its rendered images achieve a low loss. This process has since been extended to speed up generation.
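The score distillation idea can be illustrated with a deliberately tiny stand-in. In the sketch below the "renderer" is the identity (the optimized parameters are the image itself) and `toy_eps_model` is a hypothetical noise predictor that believes the clean image equals a fixed target; a real pipeline would render a NeRF and query a pretrained text-to-image model instead. The weight, learning rate, and noise-level range are arbitrary assumptions.

```python
import numpy as np

def sds_gradient(image, eps_model, abar, rng, weight=1.0):
    """Score-distillation-style gradient with respect to the rendered image."""
    eps = rng.standard_normal(image.shape)
    x_t = np.sqrt(abar) * image + np.sqrt(1.0 - abar) * eps   # noise the rendering
    # Predicted noise minus injected noise; the network's Jacobian is skipped.
    return weight * (eps_model(x_t, abar) - eps)

target = np.full((8, 8), 0.5)   # what the toy "prior" considers a clean image

def toy_eps_model(x_t, abar):
    """Hypothetical noise predictor that assumes the clean image is `target`."""
    return (x_t - np.sqrt(abar) * target) / np.sqrt(1.0 - abar)

rng = np.random.default_rng(0)
image = np.zeros((8, 8))        # parameters being optimized (identity renderer)
start_err = np.abs(image - target).mean()
for _ in range(200):
    abar = rng.uniform(0.2, 0.8)                  # random noise level per step
    image -= 0.05 * sds_gradient(image, toy_eps_model, abar, rng)
end_err = np.abs(image - target).mean()
```

Gradient descent pulls the rendering toward what the frozen prior considers likely; with a real renderer the same gradient would be backpropagated into the 3D parameters.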

Video generation

Video diffusion models extend 2D image diffusion models: they generate video sequences by adding a temporal dimension. The basic idea is to add temporal layers to the existing 2D architecture to model continuity and dependencies between video frames. Related work such as Make-A-Video and AnimateDiff shows how video diffusion models can generate dynamic content. More specifically, the RaMViD model uses a 3D convolutional neural network to extend the image diffusion model to video and develops a series of video-specific conditioning techniques.
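One way to picture a temporal layer is attention that runs along the frame axis separately for each spatial position. The sketch below is a simplified assumption: it omits learned projections, and the `gate` parameter is a hypothetical stand-in for an initialization that leaves the pretrained image features unchanged at the start of training.

```python
import numpy as np

def temporal_attention(frames, gate=1.0):
    """frames: (T, N, C) array of T frames, N spatial tokens, C channels."""
    x = np.swapaxes(frames, 0, 1)                               # (N, T, C)
    scores = x @ np.swapaxes(x, 1, 2) / np.sqrt(x.shape[-1])    # (N, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)        # stable softmax over frames
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    mixed = np.swapaxes(weights @ x, 0, 1)                      # back to (T, N, C)
    # Residual connection: gate = 0 returns the per-frame features untouched.
    return frames + gate * mixed

rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 16, 8))     # 6 frames, 16 positions, 8 channels
out = temporal_attention(frames)
unchanged = temporal_attention(frames, gate=0.0)
```

Spatial layers still process each frame independently; only this extra layer lets information flow between frames.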

Medical Analysis

Diffusion models help address the difficulty of obtaining high-quality datasets in medical analysis, especially in medical imaging. Owing to their strong ability to capture image structure, these models have been successful in improving image resolution, classification, and noise handling. For example, Score-MRI and Diff-MIC use advanced techniques to speed up the reconstruction of MRI images and enable more precise classification. MCG applies manifold correction to CT image super-resolution, improving reconstruction speed and accuracy. For generating rare images, models can convert between different image types through specific techniques. For example, FNDM and DiffuseMorph are used for brain anomaly detection and MR image registration, respectively. Some new methods synthesize training datasets from a small number of high-quality samples; one model used 31,740 samples to synthesize a dataset of 100,000 instances and achieved very low FID scores.

Text generation

Text generation is an important bridge between humans and AI, producing fluent and natural language. Autoregressive language models generate highly coherent text but are slow, whereas diffusion models can generate text quickly but with relatively weaker coherence. The two mainstream approaches are discrete generation and latent generation. Discrete generation relies on advanced techniques and pre-trained models; for example, D3PM and Argmax treat words as categorical vectors, while DiffusionBERT combines diffusion models with language models to improve text generation. Latent generation produces text in the latent space of tokens. For example, models such as Diffusion-LM and GENIE perform well on a variety of tasks, showing the potential of diffusion models in text generation. Diffusion models are expected to improve performance in natural language processing, integrate with large language models, and enable cross-modal generation.

Time series generation

Time series modeling is a key technology for prediction and analysis in finance, climate science, medicine, and other fields. Diffusion models have been applied to time series generation because of their ability to produce high-quality samples. In this field, diffusion models are usually designed to account for the temporal dependence and periodicity of time series data. For example, CSDI (Conditional Score-based Diffusion models for Imputation) is a model that uses a bidirectional convolutional neural network structure to generate or impute time series data points. It excels in medical data generation and environmental data generation. Other models such as DiffSTG and TimeGrad incorporate spatiotemporal convolutional networks to better capture the dynamics of time series and generate more realistic samples. These models gradually recover meaningful time series data from Gaussian noise through self-conditioning guidance.
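Conditional imputation can be sketched as noising only the entries to be filled in while the observed entries are passed through as conditioning. The helper name, the mask layout, and the fixed noise level below are assumptions made for illustration, not the exact input format of any of the models named above.

```python
import numpy as np

def make_imputation_input(series, observed_mask, abar, rng):
    """Keep observed values as conditioning and replace the rest with noised values."""
    eps = rng.standard_normal(series.shape)
    noisy = np.sqrt(abar) * series + np.sqrt(1.0 - abar) * eps
    model_input = np.where(observed_mask, series, noisy)
    return model_input, eps

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0.0, 6.0, 24))    # toy signal with 24 time steps
observed_mask = rng.random(24) < 0.7          # roughly 70% of the points are observed
model_input, eps = make_imputation_input(series, observed_mask, abar=0.5, rng=rng)
```

A denoising network would then be trained to predict `eps` only at the positions where `observed_mask` is false.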

Audio generation

Audio generation covers application scenarios ranging from speech synthesis to music generation. Because audio data usually contains complex temporal structure and rich spectral information, diffusion models also show potential in this field. For example, WaveGrad and DiffSinger are two diffusion models that use a conditional generation process to produce high-quality audio waveforms. WaveGrad uses the mel spectrogram as its conditional input, while DiffSinger adds musical information such as pitch and tempo on top of this to provide finer stylistic control. In text-to-speech (TTS) applications, Guided-TTS and Diff-TTS combine text encoders with acoustic classifiers to generate speech that both matches the text content and follows a specific voice style. Guided-TTS 2 further demonstrates how to generate speech without an explicit classifier, guiding generation through features learned by the model itself.
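A common way to guide a diffusion model without an explicit classifier is classifier-free guidance, which mixes a conditional and an unconditional noise prediction. The sketch below shows that general rule under one of several scale conventions in use; the text above does not state that any of the named TTS systems uses exactly this form, and the arrays merely stand in for network outputs.

```python
import numpy as np

def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional one, and values above 1 push further toward the condition.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal(128)    # stand-in for the model output without the condition
eps_cond = rng.standard_normal(128)      # stand-in for the model output with the condition
guided = classifier_free_guidance(eps_cond, eps_uncond, guidance_scale=2.0)
```

The same network typically produces both predictions, with the condition randomly dropped during training.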

Molecular Design

In drug design, materials science, and chemical biology, molecular design is an important step in the discovery and synthesis of new compounds. Diffusion models serve here as a powerful tool for efficiently exploring chemical space and generating molecules with specific properties. In unconditional molecule generation, the diffusion model generates molecular structures without relying on any prior knowledge. In cross-modal generation, the model may incorporate specific functional conditions, such as drug efficacy or binding affinity to a target protein, to generate molecules with desired properties. Sequence-based methods may use the protein sequence to guide molecule generation, while structure-based methods may use the three-dimensional structure of the protein. Such structural information can serve as prior knowledge in molecular docking or antibody design, improving the quality of the generated molecules.

Graph generation

Graph generation with diffusion models aims to better understand and simulate real-world network structures and propagation processes. This approach helps researchers mine patterns and interactions in complex systems and predict possible outcomes. Applications include social networks, biological network analysis, and the creation of graph datasets. Traditional methods rely on generating adjacency matrices or node features, but they scale poorly and have limited practicality. Modern graph generation techniques therefore prefer to generate graphs based on specific conditions. For example, the PCFI model uses part of the graph's features and shortest-path predictions to guide the generation process; EDGE and DiffFormer use node degree and energy constraints, respectively, to optimize generation; D4Explainer explores different possibilities of the graph by combining distribution and counterfactual losses. These methods improve the accuracy and practicality of graph generation.

Conclusion and Outlook

Challenges under data limitations

In addition to slow inference speed, diffusion models often encounter difficulties in identifying patterns and regularities from low-quality data, causing them to fail to generalize to new scenarios or data sets. Additionally, computational challenges arise when dealing with large-scale datasets, such as extended training times, excessive memory usage, or inability to converge to desired states, thereby limiting model size and complexity. What’s more, biased or uneven data sampling can limit a model’s ability to generate outputs that are adaptable to different domains or populations.

Controllable distribution-based generation

Improving a model's ability to understand specific distributions and generate samples within them is critical to achieving better generalization with limited data. By focusing on identifying patterns and correlations in the data, the model can generate samples that closely match the training data and meet specific requirements. This requires efficient data sampling and utilization techniques, as well as optimization of model parameters and structure. Ultimately, this enhanced understanding allows for more controlled and precise generation, thereby improving generalization performance.

Advanced multimodal generation utilizing large language models

One future direction for diffusion models is advancing multimodal generation by integrating large language models (LLMs). This integration enables the model to generate outputs that combine text, images, and other modalities. Incorporating LLMs enhances the model's understanding of the interactions between modalities and makes the generated outputs more diverse and realistic. Furthermore, LLMs significantly improve the efficiency of prompt-based generation by effectively leveraging the connections between text and other modalities. LLMs also act as catalysts that improve the generation capability of diffusion models and expand the range of modalities and fields they can cover.

Integration with the field of machine learning

Integrating diffusion models with traditional machine learning theory provides new opportunities to improve performance on a variety of tasks. Semi-supervised learning is particularly valuable for addressing inherent challenges of diffusion models, such as generalization problems, and for enabling efficient conditional generation when data is limited. By leveraging unlabeled data, it enhances the generalization capabilities of diffusion models and achieves strong performance when generating samples under specific conditions.

In addition, reinforcement learning plays a crucial role by using fine-tuning algorithms to provide targeted guidance during the model's sampling process. This guidance ensures focused exploration and promotes controlled generation. Reinforcement learning can also be enriched with additional feedback, further improving the model's ability to perform controllable conditional generation.

Algorithm improvement method (Appendix)

[Figure: overview of algorithm improvement methods]

Field application method (Appendix)

[Figure: overview of field application methods]

Statement

This article is reproduced from 机器之心. In case of any infringement, please contact admin@php.cn to have it removed.
