Google is the first to release video generation AIGC, netizens: you can customize your own movies
We know that advances in generative models and multimodal vision-language models have paved the way for large-scale text-to-image models with unprecedented realism and diversity. These models enable new creative workflows, but they are limited to synthesizing new images rather than editing existing ones. To bridge this gap, intuitive text-based editing methods have emerged that can edit both generated and real images while preserving some of their original properties. Similarly, many text-to-video models have been proposed recently, but few methods use them for video editing.
In text-guided video editing, the user provides an input video along with a text prompt describing the desired attributes of the output video, as shown in Figure 1 below. The goal has three aspects: 1) alignment, the edited video should conform to the input text prompt; 2) fidelity, the edited video should retain the content of the original video; 3) quality, the edited video should be of high quality.
As you can see, video editing is more challenging than image editing: it requires synthesizing new motion, not just modifying visual appearance, and temporal consistency must be maintained. Therefore, applying image-level editing methods such as SDEdit or Prompt-to-Prompt frame by frame is not enough to achieve good results.
In a paper recently published on arXiv by Google Research and others, researchers propose a new method, Dreamix, which is inspired by UniTune and applies a text-conditioned video diffusion model (VDM) to video editing.
The text-conditioned VDM maintains high fidelity to the input video through two main ideas. First, instead of using pure noise as the model initialization, it uses a degraded version of the original video, keeping only low-resolution spatio-temporal information by shrinking it and adding noise. Second, fidelity to the original video is further improved by fine-tuning the generative model on the original video.
Fine-tuning ensures that the model knows the high-resolution attributes of the original video. However, naively fine-tuning on the input video leads to relatively low motion editability, because the model learns to prefer the original motion rather than follow the text prompt. The researchers therefore propose a novel mixed fine-tuning approach in which the VDM is also fine-tuned on a set of individual frames of the input video whose temporal ordering is discarded. Mixed fine-tuning significantly improves the quality of motion edits.
Building on their video editing model, the researchers further propose a new image animation framework, shown in Figure 2 below. The framework covers several tasks, such as animating the objects and background in an image and creating dynamic camera motion. They do this via simple image processing operations such as frame replication or geometric image transformations, which produce a crude video; the Dreamix video editor is then used to edit it. In addition, the researchers apply their fine-tuning method to subject-driven video generation, essentially a video version of Dreambooth.
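To make the crude-video construction concrete, here is a minimal PyTorch-style sketch of animating a single image with a fake zoom-in; the frame count, zoom factor, and function name `image_to_rough_video` are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch: replicate an input image across time while a gradual
# center-crop-and-resize fakes camera motion, producing a rough video that
# the Dreamix editor could then refine. Parameters are illustrative.
import torch
import torch.nn.functional as F

def image_to_rough_video(image: torch.Tensor, num_frames: int = 16,
                         max_zoom: float = 1.3) -> torch.Tensor:
    """image: float tensor (C, H, W). Returns (C, T, H, W) with a simple zoom-in."""
    c, h, w = image.shape
    frames = []
    for t in range(num_frames):
        # Zoom factor grows linearly from 1.0 to max_zoom over the clip.
        zoom = 1.0 + (max_zoom - 1.0) * t / max(num_frames - 1, 1)
        ch, cw = int(h / zoom), int(w / zoom)
        top, left = (h - ch) // 2, (w - cw) // 2
        crop = image[:, top:top + ch, left:left + cw]
        # Resize the crop back to the original resolution -> apparent zoom-in.
        frame = F.interpolate(crop.unsqueeze(0), size=(h, w),
                              mode="bilinear", align_corners=False).squeeze(0)
        frames.append(frame)
    return torch.stack(frames, dim=1)  # (C, T, H, W)
```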
In the experiments, the researchers conducted extensive qualitative studies and human evaluation, demonstrating the strong capability of their method; see the animations below for details.
Commenting on this Google work, one netizen said that 3D and motion editing tools might be a hot topic for the next wave of papers.
Someone else said: soon you will be able to make your own movie on a budget; all you need is a green screen and this technology:
This paper proposes a new method for video editing. Specifically:
Text-guided video editing by inverting corrupted videos
The method uses cascaded VDMs (video diffusion models): the input video is first corrupted to a certain degree by downsampling, and then noise is added. The cascaded diffusion models are then applied in the sampling process, conditioned on time t, to upscale the video to the final spatio-temporal resolution.
When corrupting the input video, a downsampling operation is first performed to reach the base model's resolution (16 frames at 24 × 40); Gaussian noise of a chosen variance is then added to further corrupt the input video.
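As a concrete illustration of this corruption step, the sketch below downsamples a video to the stated base-model resolution (16 frames at 24 × 40) and mixes in Gaussian noise; the variance-preserving mixing coefficient `alpha_s` is a placeholder for the noise level at diffusion time s, not a value from the paper.

```python
# Hedged sketch of the corruption step: downsample in time and space, then
# add Gaussian noise so only coarse spatio-temporal information survives.
import torch
import torch.nn.functional as F

def corrupt_video(video: torch.Tensor, frames: int = 16, height: int = 24,
                  width: int = 40, alpha_s: float = 0.5) -> torch.Tensor:
    """video: float tensor (C, T, H, W) in [-1, 1]. Returns the corrupted low-res video."""
    # Downsample to the base model's resolution (16 x 24 x 40).
    low = F.interpolate(video.unsqueeze(0), size=(frames, height, width),
                        mode="trilinear", align_corners=False).squeeze(0)
    # Mix with Gaussian noise, variance-preserving style: z_s = sqrt(a)*x + sqrt(1-a)*eps.
    noise = torch.randn_like(low)
    return alpha_s ** 0.5 * low + (1 - alpha_s) ** 0.5 * noise

# Example: a random stand-in clip of 64 frames at 96 x 160.
z_s = corrupt_video(torch.rand(3, 64, 96, 160) * 2 - 1)
```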
For the video processed as above, the next step is to use the cascaded VDMs to map the corrupted low-resolution video to a high-resolution video aligned with the text. The core idea is that, given a noisy video of very low spatial and temporal resolution, there are many perfectly plausible high-resolution videos consistent with it. The base model starts from the corrupted video, which has the same noise level as the diffusion process at time s; the VDM is then used to reverse the diffusion process down to time 0. Finally, the video is upsampled by the super-resolution models.
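A minimal sketch of this sampling stage is shown below, assuming a hypothetical base denoiser `base_vdm(z, t, prompt)` that predicts the clean low-resolution video and a `super_res(video, prompt)` upsampler standing in for the rest of the cascade; the cosine noise schedule in `renoise` is likewise an assumption, not the paper's schedule.

```python
# Hedged sketch: reverse the diffusion process from time s down to 0 starting
# from the corrupted video, then upsample with a super-resolution model.
import math
import torch

def renoise(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Variance-preserving forward step to noise level t (placeholder cosine schedule).
    alpha_t = torch.cos(t * math.pi / 2) ** 2
    return alpha_t.sqrt() * x0 + (1 - alpha_t).sqrt() * torch.randn_like(x0)

@torch.no_grad()
def edit_from_corruption(z_s, prompt, base_vdm, super_res,
                         s: float = 0.6, num_steps: int = 50):
    """z_s: corrupted low-res video at the noise level of diffusion time s.
    base_vdm / super_res: hypothetical callables for the cascaded models."""
    z = z_s
    # Walk the diffusion time from s back to 0.
    times = torch.linspace(s, 0.0, num_steps + 1)
    for t, t_next in zip(times[:-1], times[1:]):
        x0_pred = base_vdm(z, t, prompt)   # predicted clean low-res video
        z = renoise(x0_pred, t_next)       # re-noise to the next, lower level
    return super_res(z, prompt)            # cascade: upsample space and time
```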
Mixed video-image fine-tuning
Fine-tuning the video diffusion model on the input video alone limits how much object motion can change. Instead, the study uses a mixed objective: in addition to the original objective (lower left), the model is also fine-tuned on an unordered set of frames. This is done through "masked temporal attention", which prevents the temporal attention and temporal convolution layers from being fine-tuned (lower right). This makes it possible to add motion to a static video.
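The following training-loop sketch illustrates the mixed objective under an assumed interface, `model(batch, mask_temporal=...)` returning a diffusion loss; the Dreamix code is not public, so the module boundaries and masking mechanism here are illustrative only.

```python
# Hedged sketch of mixed video-image fine-tuning: alternate between full clips
# (ordinary video diffusion objective) and unordered frames (temporal layers
# masked so only spatial layers adapt).
import torch

def mixed_finetune(model, video_batch, frame_batch, optimizer,
                   num_steps: int = 200, frame_prob: float = 0.5):
    """
    video_batch: (B, C, T, H, W) clips from the input video.
    frame_batch: (B, C, 1, H, W) unordered frames with temporal ordering discarded.
    `model(x, mask_temporal=...)` returning a scalar loss is an assumed interface.
    """
    for step in range(num_steps):
        if torch.rand(()) < frame_prob:
            # Unordered frames: mask temporal attention / temporal convolutions
            # so only spatial layers are updated (keeps motion editable).
            loss = model(frame_batch, mask_temporal=True)
        else:
            # Full clips: ordinary video diffusion objective (keeps fidelity).
            loss = model(video_batch, mask_temporal=False)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```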
Inference
In the application-dependent pre-processing stage (left in the figure below), the method supports multiple applications by converting the input content into a uniform video format. For image-to-video, the input image is replicated and transformed to synthesize a rough video with some camera motion; for subject-driven video generation, the video input is omitted and fine-tuning alone preserves fidelity. This rough video is then edited with the Dreamix video editor (right): as described above, the video is first corrupted by downsampling and adding noise, and the fine-tuned, text-guided video diffusion model is then applied to upscale it to the final spatio-temporal resolution.
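Putting the pieces together, the sketch below wires up the inference path described above; all helpers (`corrupt_video`, `image_to_rough_video`, the fine-tuned VDM interface with `sample` and `denoise_from`) are the hypothetical stand-ins sketched earlier, not the authors' code.

```python
# Hedged end-to-end sketch of the described inference pipeline.
def dreamix_inference(inputs, prompt, mode, finetuned_vdm, super_res,
                      corrupt_video, image_to_rough_video):
    """
    mode: "video" (edit a real video), "image" (animate a single image), or
    "subject" (subject-driven generation, no video input).
    The callables and the finetuned_vdm interface are assumed for illustration.
    """
    # 1) Application-dependent pre-processing -> a rough video (or none).
    if mode == "video":
        rough = inputs
    elif mode == "image":
        rough = image_to_rough_video(inputs)   # replicate frames + fake camera motion
    else:
        # Subject-driven: fine-tuning alone carries the subject; no init video.
        return finetuned_vdm.sample(prompt)
    # 2) Corrupt: downsample and add noise.
    z_s = corrupt_video(rough)
    # 3) Denoise with the fine-tuned, text-conditioned VDM, then upsample.
    low_res = finetuned_vdm.denoise_from(z_s, prompt)
    return super_res(low_res, prompt)
```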
Experimental results
Video editing: In the clip below, Dreamix changes the action to dancing and the appearance from a monkey to a bear, while the basic attributes of the subject in the video remain unchanged:
Image-to-video: When the input is an image, Dreamix can use its video prior to add new moving objects, as below, where a unicorn appears in a foggy forest and the camera zooms in.
Penguins appeared next to the hut:
Subject-driven video generation: Dreamix can also take a collection of images showing the same subject and generate a new video with that subject in motion. The clip below shows a caterpillar wriggling on a leaf:
In addition to the qualitative analysis, the study also ran baseline comparisons, mainly comparing Dreamix with two baseline methods: Imagen Video and Plug-and-Play (PnP). The following table shows the scoring results:
Figure 8 shows a video edited by Dreamix alongside the two baselines: the text-to-video model achieves only low-fidelity edits because it is not conditioned on the original video; PnP preserves the scene but lacks frame-to-frame consistency; Dreamix performs well on all three goals.
Please refer to the original paper for more technical details.