
With just a picture and an action command, Animate124 can easily generate a 3D video

王林 · 2024-01-12 11:12:10
Animate124: easily turn a single image into a 3D video.

Over the past year, DreamFusion has led a new trend: the generation of static 3D objects and scenes, a direction that has attracted widespread attention. Looking back on the year, we have witnessed significant advances in both the quality and the controllability of static 3D generation. The technology started from text-conditioned generation, gradually incorporated single-view images, and then moved on to integrating multiple control signals.

By comparison, dynamic 3D scene generation is still in its infancy. In early 2023, Meta launched MAV3D, marking the first attempt at generating 3D video from text. However, limited by the lack of open-source video generation models, progress in this field has been relatively slow.

Now, however, 3D video generation driven by the combination of an image and text has arrived!

Although text-based 3D video generation can produce diverse content, it remains limited in controlling the details and poses of objects. In static 3D generation, by contrast, 3D objects can be effectively reconstructed from a single input image. Inspired by this, a research team from the National University of Singapore (NUS) and Huawei proposed the Animate124 model, which combines a single image with a corresponding action description to achieve precise control over 3D video generation.


  • Project homepage: https://animate124.github.io/
  • Paper address: https://arxiv.org/abs/2311.14603
  • Code: https://github.com/HeliosZhao/Animate124


Core method

Method summary

Following a static-then-dynamic, coarse-to-fine optimization scheme, the paper divides 3D video generation into three stages: 1) static generation stage: a text-to-image diffusion model and a 3D diffusion model generate a 3D object from the single input image; 2) dynamic coarse generation stage: a text-to-video model optimizes the motion according to the language description; 3) semantic refinement stage: a personalized, fine-tuned ControlNet is additionally used to correct the appearance drift introduced by the language description in the second stage.


Figure 1. Overall framework

Static generation

Following the Magic123 approach, this stage uses Stable Diffusion together with the 3D diffusion model Zero-1-to-3 to generate a static object from the input image.
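Both priors supervise the 3D model through score distillation sampling (SDS). As a sketch, with notation assumed here rather than taken from the paper (θ are the NeRF parameters, x = g(θ, v) a rendering from view v, y the condition, and ε̂_φ the diffusion model's noise prediction), the standard SDS gradient reads:

```latex
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[ w(t)\,
      \big( \hat{\epsilon}_{\phi}(x_{t};\, y,\, t) - \epsilon \big)\,
      \frac{\partial x}{\partial \theta} \right]
```

Following Magic123, the static objective can be read as a weighted sum of this term evaluated under the 2D prior (Stable Diffusion) and the 3D prior (Zero-1-to-3).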

For the view corresponding to the conditioning image, an additional reconstruction loss against the reference image is used for optimization.
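A plausible form of this reference-view loss (symbols assumed: I^r is the reference image, v^r its camera view, M its foreground mask, and ⊙ the element-wise product):

```latex
\mathcal{L}_{\mathrm{rec}}
  = \big\| M \odot \big( g(\theta,\, v^{r}) - I^{r} \big) \big\|_{2}^{2}
```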

Through these two objectives, a static object that is 3D-consistent across views is obtained (this stage is omitted from the framework diagram).


Dynamic coarse generation

This stage mainly uses a text-to-video diffusion model, treating the static 3D object as the initial frame and generating motion from the language description. Specifically, the dynamic 3D model (a dynamic NeRF) renders a multi-frame video at consecutive timestamps; this video is fed into the text-to-video diffusion model, and the SDS distillation loss is used to optimize the dynamic 3D model.

Using only the text-to-video distillation loss causes the 3D model to forget the content of the image, and uniformly random sampling leaves the beginning and end of the video insufficiently trained. The researchers therefore oversample the start and end timestamps and, when the initial frame is sampled, apply additional static supervision to it (the SDS distillation loss of the image-conditioned 3D diffusion model), as sketched below.
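A minimal sketch of what this boundary oversampling could look like; the helper name, probabilities, and clip layout are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def sample_clip_timestamps(num_frames: int, dt: float = 0.04,
                           video_len: float = 1.0,
                           p_boundary: float = 0.3) -> np.ndarray:
    """Sample `num_frames` consecutive timestamps in [0, video_len].

    With probability p_boundary, pin the clip to the start or end of the
    video so that boundary frames are oversampled; otherwise sample the
    clip start uniformly. (Hypothetical values, not the paper's scheme.)
    """
    span = (num_frames - 1) * dt
    u = np.random.rand()
    if u < p_boundary / 2:
        start = 0.0                       # clip begins with the initial frame
    elif u < p_boundary:
        start = video_len - span          # clip ends with the final frame
    else:
        start = np.random.uniform(0.0, video_len - span)
    return start + dt * np.arange(num_frames)

# When start == 0.0, the first rendered frame is the initial frame and can
# additionally receive the static supervision described above.
timestamps = sample_clip_timestamps(num_frames=16)
```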

The loss function at this stage therefore combines the text-to-video distillation loss with this static supervision.
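A plausible form of the combined objective (weights and superscripts assumed; the 3D SDS term applies only to clips whose first timestamp is the initial frame):

```latex
\mathcal{L}_{\mathrm{coarse}}
  = \mathcal{L}_{\mathrm{SDS}}^{\mathrm{T2V}}
  + \lambda_{\mathrm{static}}\, \mathcal{L}_{\mathrm{SDS}}^{\mathrm{3D}}\big|_{t_{1}=0}
```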

Semantic refinement

Even with oversampling of the initial frame and extra supervision on it, the appearance of the object is still pulled by the text during optimization with the text-to-video diffusion model, drifting away from the reference image. This paper therefore proposes a semantic refinement stage that corrects this semantic drift with a personalized model.

Since only a single image is available, the text-to-video model cannot be personalized. Instead, the paper introduces a diffusion model conditioned on both images and text, and personalizes it through fine-tuning. This diffusion model should not change the content or the motion of the original video, only adjust its appearance. The paper therefore adopts the ControlNet-Tile image model, uses the video frames generated in the previous stage as conditions, and optimizes according to the language description. Because ControlNet is built on Stable Diffusion, only Stable Diffusion needs personalized fine-tuning (Textual Inversion) to extract the semantic information of the reference image. After this fine-tuning, the video is treated as a set of individual frames, and ControlNet supervises each frame separately.

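A minimal sketch of such a personalized ControlNet-Tile setup using Hugging Face diffusers. The model IDs are real, but the embedding path, token name, prompt, and the direct img2img call are illustrative assumptions; in the paper, the ControlNet model provides SDS supervision for the 3D representation rather than generating images directly:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

# Tile ControlNet keeps the coarse frame's structure while the prompt
# (with a personalized token) adjusts the appearance.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1e_sd15_tile", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Hypothetical textual-inversion embedding learned from the single reference image.
pipe.load_textual_inversion("./ref_embedding", token="<ref>")

frame = Image.open("coarse_frame.png")  # a frame rendered by the dynamic NeRF

refined = pipe(
    prompt="a photo of <ref>, performing the described action",
    image=frame,          # img2img initialization from the coarse frame
    control_image=frame,  # tile condition: preserve structure and motion
    strength=0.5,
    guidance_scale=10.0,  # normal CFG range, unlike ~100 typical for T2I/T2V SDS
).images[0]
refined.save("refined_frame.png")
```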

In addition, because ControlNet takes the coarse frames as conditions, classifier-free guidance (CFG) can use a normal range (around 10) instead of the very large values (typically 100) required by text-to-image and text-to-video models under SDS. Excessively large CFG causes oversaturated images, so the ControlNet diffusion model alleviates oversaturation and achieves better generation results. The supervision at this stage combines the dynamic-stage loss with the ControlNet supervision.

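A plausible form of the stage-3 objective (the weight symbol is assumed):

```latex
\mathcal{L}_{\mathrm{refine}}
  = \mathcal{L}_{\mathrm{coarse}}
  + \lambda_{\mathrm{cn}}\, \mathcal{L}_{\mathrm{SDS}}^{\mathrm{ControlNet}}
```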

Experimental results

As the first 3D video generation model conditioned on both an image and text, Animate124 is compared with two baseline models as well as MAV3D, and it achieves better results than the other methods.

Comparison of visual results


Figure 2. Comparison of Animate124 with the two baselines


Figure 3.1. Comparison of Animate124 and MAV3D on text-to-3D video generation


Figure 3.2. Comparison of Animate124 and MAV3D on image-to-3D video generation

Comparison of Quantitative Results

The paper measures generation quality with CLIP metrics and human evaluation. The CLIP metrics cover similarity to the text and retrieval accuracy, image quality, similarity to the reference image, and temporal consistency. The human evaluation covers similarity to the text, similarity to the image, video quality, realism of the motion, and motion magnitude; it is reported as the rate at which each individual model is preferred over Animate124 on the corresponding metric.
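As an illustration, text-video similarity in the CLIP metrics can be computed as the average cosine similarity between the prompt embedding and per-frame image embeddings. A sketch with the transformers library (the checkpoint choice, file names, and prompt are assumptions):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_similarity(frames, prompt):
    """Average CLIP cosine similarity between a prompt and rendered frames."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

frames = [Image.open(f"frame_{i:03d}.png") for i in range(16)]  # rendered frames
print(clip_text_similarity(frames, "a panda is dancing"))
```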

Compared with the two baseline models, Animate124 achieves better results in both the CLIP metrics and the human evaluation.


Table 1. Quantitative comparison between Animate124 and two baselines

Summary

Animate124 is the first method to turn any image into a 3D video guided by a text description. It uses multiple diffusion models for supervision and guidance, optimizing a 4D dynamic representation network (a dynamic NeRF) to generate high-quality 3D videos.

