


Fudan University and Huawei Noah's Ark Lab propose the VidRD framework for iterative high-quality video generation
Researchers from Fudan University and Huawei's Noah's Ark Laboratory have proposed VidRD (Reuse and Diffuse), an iterative approach to generating high-quality videos built on the latent diffusion model (LDM). The approach targets both the quality and the length of generated videos, aiming at high-quality, controllable generation of long video sequences. It effectively reduces jitter between generated frames and offers strong research and practical value to the currently very active AIGC community.
The latent diffusion model (LDM) is a generative model built on a denoising autoencoder: it produces high-quality samples from randomly initialized noise by removing the noise step by step. However, because of computational and memory constraints during both training and inference, a single LDM can usually generate only a very limited number of video frames. Although existing work tries to use a separate prediction model to generate more frames, this incurs additional training cost and introduces frame-level jitter.
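To make the denoising loop concrete, below is a minimal sketch of LDM-style ancestral sampling in latent space followed by VAE decoding. The `unet` and `vae_decoder` callables and the linear noise schedule are illustrative stand-ins, not the paper's implementation.

```python
import torch

# Minimal sketch of latent-diffusion sampling (DDPM-style ancestral
# sampling). `unet` and `vae_decoder` are stand-ins for the pretrained
# denoiser and VAE decoder; the schedule is illustrative.
def sample_latent_diffusion(unet, vae_decoder, text_emb,
                            latent_shape=(1, 4, 64, 64), steps=50):
    betas = torch.linspace(1e-4, 0.02, steps)       # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(latent_shape)                   # start from pure noise
    for t in reversed(range(steps)):
        eps = unet(z, torch.tensor([t]), text_emb)  # predict the added noise
        # Posterior mean: strip the predicted noise component.
        z = (z - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:                                   # re-inject noise except at t=0
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return vae_decoder(z)                           # map latents back to pixels
```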
Inspired by the remarkable success of latent diffusion models in image synthesis, the paper proposes a framework called "Reuse and Diffuse", abbreviated VidRD. Starting from the small number of frames an LDM has already produced, the framework generates additional frames and thereby iteratively synthesizes longer, higher-quality, and more diverse video content. VidRD initializes from a pre-trained image LDM for efficient training and performs denoising with a U-Net augmented with temporal layers.
- Paper title: Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
- Paper address: https://arxiv.org/abs/2309.03549
- Project homepage: https://anonymous0x233.github.io/ReuseAndDiffuse/
The main contributions of this paper are as follows:
- To generate smoother videos, the paper proposes an iterative text-to-video generation method built on a temporally-aware LDM. At each round, the method reuses the latent features of already-generated frames and follows the previous diffusion process to produce additional frames.
- The paper designs a data processing pipeline for constructing high-quality text-video datasets. For existing action recognition datasets, it uses multi-modal large language models to produce textual descriptions of the videos. For image data, it applies random scaling and translation to synthesize additional video training samples.
- On the UCF-101 dataset, the paper reports both quantitative metrics (FVD and IS) and qualitative visualizations. Both sets of results show that the VidRD model outperforms existing methods.
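For reference, IS over generated frames can be computed with off-the-shelf tooling such as torchmetrics; this is only an illustration, not the paper's evaluation code, and FVD additionally requires a pretrained video (I3D) backbone that is omitted here.

```python
import torch
from torchmetrics.image.inception import InceptionScore

# Illustrative IS computation over generated frames (uint8, NCHW).
# The random tensor stands in for decoded video frames.
inception = InceptionScore()
frames = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
inception.update(frames)
is_mean, is_std = inception.compute()
print(f"IS: {is_mean.item():.2f} +/- {is_std.item():.2f}")
```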
Method introduction
Figure 1. Schematic diagram of the VidRD video generation framework proposed in this article
The paper argues that starting from a pre-trained image LDM is an efficient and sensible choice for training an LDM for high-quality video synthesis, a view further supported by work such as [1, 2]. Accordingly, the model is built on the pre-trained Stable Diffusion model and inherits its components: a variational autoencoder (VAE) for accurate latent representations and a powerful U-Net denoising network. Figure 1 shows the overall architecture.
A notable feature of the design is the full reuse of pre-trained weights: most network layers, including the VAE components and the up- and down-sampling layers of the U-Net, are initialized from the pre-trained Stable Diffusion weights. This not only significantly speeds up training but also gives the model good stability and reliability from the start. The model can iteratively generate additional frames from an initial clip containing a small number of frames by reusing the original latent features and mimicking the previous diffusion process (see the sketch below). In addition, for the autoencoder that converts between pixel space and latent space, temporally-aware network layers are injected into its decoder and fine-tuned to improve temporal consistency.
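A simplified, hypothetical rendering of this "reuse and diffuse" loop: each round keeps the latents of the last few generated frames, partially re-noises them, and denoises them jointly with fresh noise for the new frames. `denoise_clip` stands in for the temporally-aware U-Net sampler, and the re-noising coefficients are illustrative.

```python
import torch

# Hypothetical sketch of iterative "reuse and diffuse" generation in
# latent space; `denoise_clip` stands in for the full sampling loop of
# the temporally-aware U-Net.
def generate_long_video(denoise_clip, text_emb, rounds=4,
                        frames_per_clip=8, reuse=2, latent_hw=(4, 64, 64)):
    c, h, w = latent_hw
    clips = [denoise_clip(torch.randn(frames_per_clip, c, h, w), text_emb)]
    for _ in range(rounds - 1):
        prev = clips[-1][-reuse:]                     # reuse last latent frames
        new = torch.randn(frames_per_clip - reuse, c, h, w)
        # Re-noise the reused frames to an intermediate noise level
        # (coefficients ~ sqrt(abar), sqrt(1-abar); illustrative values)
        # so they follow the same diffusion trajectory as the new frames.
        noised_prev = 0.5 * prev + 0.866 * torch.randn_like(prev)
        clip = denoise_clip(torch.cat([noised_prev, new]), text_emb)
        clips.append(clip[reuse:])                    # keep only the new frames
    return torch.cat(clips)                           # latent video; decode with the VAE
```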
To ensure continuity between video frames, the model adds 3D Temp-conv and Temp-attn layers. The Temp-conv layers, placed after the 3D ResNet blocks, apply 3D convolutions that capture spatial and temporal correlations and model the dynamics and continuity of video sequences. The Temp-attn structure is similar to self-attention and is used to analyze and understand the relationships between frames in a sequence, allowing the model to align motion information across frames. These parameters are randomly initialized and are designed to give the model an understanding and encoding of temporal structure. The data input is also adapted accordingly to fit the model structure.
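In PyTorch, such temporal layers might look like the sketch below. Shapes, placement, and the residual form are assumptions based on common video-diffusion designs, not the paper's exact code; as noted above, these parameters are randomly initialized.

```python
import torch
import torch.nn as nn

class TempConv(nn.Module):
    """Residual temporal 3D convolution over (B, C, T, H, W) latents.
    A sketch, not the paper's exact layer."""
    def __init__(self, channels):
        super().__init__()
        # Mix information only along the time axis.
        self.conv = nn.Conv3d(channels, channels,
                              kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                  # x: (B, C, T, H, W)
        return x + self.conv(x)

class TempAttn(nn.Module):
    """Self-attention across the time axis, applied independently at
    each spatial location; channels must be divisible by heads."""
    def __init__(self, channels, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)  # every frame attends to all frames
        out = out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
        return x + out
```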
Figure 2. The proposed method for constructing a high-quality text-video training dataset
To train the VidRD model, the paper proposes a method for constructing a large-scale text-video training dataset, shown in Figure 2. The method can handle both text-image data and video data that lacks text descriptions. In addition, to enable high-quality video generation, the paper also attempts to remove watermarks from the training data.
Although high-quality video-description datasets are relatively scarce, a large number of video classification datasets exist. These datasets have rich video content, and each video carries a classification label. Moments-In-Time, Kinetics-700, and VideoLT are three representative large-scale video classification datasets: Kinetics-700 covers 700 human action categories with over 600,000 video clips; Moments-In-Time includes 339 action categories with more than one million clips; and VideoLT contains 1,004 categories and 250,000 long, unedited videos.
To make full use of existing video data, the paper automatically annotates these videos in more detail using multi-modal large language models such as BLIP-2 and MiniGPT-4. By targeting keyframes in each video and combining them with the original classification labels, it designs a set of prompts and generates annotations through model question answering. This not only enriches the semantic information of the video data but also produces more comprehensive and detailed descriptions for videos that lack them, yielding richer video-text pairs that improve VidRD's training.
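A possible keyframe-captioning step with BLIP-2 via Hugging Face transformers is sketched below; the checkpoint and the prompt template that folds in the classification label are illustrative choices, not necessarily the paper's.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative BLIP-2 captioning of a video keyframe; the prompt
# template combining the class label is an assumption.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to("cuda")

def caption_keyframe(frame: Image.Image, class_label: str) -> str:
    prompt = (f"Question: Describe this video frame showing "
              f"'{class_label}' in detail. Answer:")
    inputs = processor(images=frame, text=prompt,
                       return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=60)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```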
In addition, to exploit the abundance of existing image data, the paper designs a method for converting images into video clips for training. Concretely, it pans and zooms across different positions of an image at different speeds, giving each image a unique dynamic presentation and simulating the effect of a moving camera filming a still scene. In this way, existing image data can be used effectively for video training.
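One way to realize this pan-and-zoom conversion is sketched below with torchvision; the crop trajectory and zoom factor are assumed parameters, not the paper's exact augmentation.

```python
import torch
from PIL import Image
from torchvision.transforms.functional import resized_crop, to_tensor

# Sketch: turn a still image into a pseudo-video clip by sliding and
# shrinking a crop window, simulating a slow camera pan plus zoom.
def image_to_clip(img: Image.Image, num_frames=16, out_size=256):
    w, h = img.size
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        scale = 1.0 - 0.3 * t                    # zoom in by up to 30%
        cw, ch = int(w * scale), int(h * scale)
        left = int((w - cw) * t)                 # pan from left to right
        top = (h - ch) // 2
        crop = resized_crop(img, top, left, ch, cw, [out_size, out_size])
        frames.append(to_tensor(crop))
    return torch.stack(frames)                   # (T, C, H, W)
```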
Results
The description texts are: "Timelapse at the snow land with aurora in the sky.", "A candle is burning.", "An epic tornado attacking above a glowing city at night.", and "Aerial view of a white sandy beach on the shores of a beautiful sea." More visualizations can be found on the project homepage.
Figure 3. Visual comparison of the generation effect with existing methods
Finally, Figure 3 shows a visual comparison between the results of this paper and those of the existing methods Make-A-Video [3] and Imagen Video [4], demonstrating the better generation quality of the proposed model.