With just a picture and an action command, Animate124 can easily generate a 3D video

Animate124 easily turns a single picture into a 3D video.

Over the past year, DreamFusion set off a new trend: the generation of static 3D objects and scenes, which has attracted widespread attention in the technology community. In that time, static 3D generation has advanced significantly in both quality and controllability. The technology started with text-based generation, then incorporated single-view images, and later integrated multiple control signals.

By comparison, dynamic 3D scene generation is still in its infancy. In early 2023, Meta launched MAV3D, the first attempt at generating 3D video from text. However, limited by the lack of open-source video generation models, progress in this field has been relatively slow.

Now, however, 3D video generation driven by an image combined with text has arrived!

Although text-based 3D video generation is capable of producing diverse content, it still has limitations in controlling the details and poses of objects. In the field of 3D static generation, 3D objects can be effectively reconstructed using a single image as input. Inspired by this, the research team from the National University of Singapore (NUS) and Huawei proposed the Animate124 model. This model combines a single image with a corresponding action description to enable precise control of 3D video generation.


  • Project homepage: https://animate124.github.io/
  • Paper address: https://arxiv.org/abs/2311.14603
  • Code: https://github.com/HeliosZhao/Animate124


Core method

Method summary

Following a static-to-dynamic, rough-to-fine optimization scheme, the paper divides 3D video generation into three stages: 1) Static generation stage: use text-to-image and 3D image-to-image diffusion models to generate a 3D object from a single image; 2) Dynamic rough generation stage: use a text-to-video model to optimize the motion according to the language description; 3) Semantic optimization stage: additionally use a ControlNet with personalized fine-tuning to correct the drift that the language description introduced in the second stage.


Figure 1. Overall framework

Static generation

Following the Magic123 method, the paper uses Stable Diffusion and a 3D diffusion model (Zero-1-to-3) to generate a static object from the image.

For the view corresponding to the conditioning image, an additional loss function is used for optimization.

With these two optimization objectives, a static object that is 3D-consistent across views is obtained (this stage is omitted from the framework diagram).


Dynamic rough generation

This stage mainly uses a text-to-video diffusion model, treating the static 3D object as the initial frame and generating motion according to the language description. Specifically, the dynamic 3D model (dynamic NeRF) renders a multi-frame video with consecutive timestamps, feeds this video into the text-to-video diffusion model, and uses the SDS distillation loss to optimize the dynamic 3D model.
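As a rough illustration of how an SDS distillation loss drives optimization, the sketch below computes an SDS-style gradient for a tiny "rendered frame" and applies gradient descent. It is a toy under stated assumptions, not the paper's implementation: `predict_noise` stands in for the text-to-video diffusion model, the weighting `w = 1 - alpha_bar` is one common choice, and the pixels are updated directly instead of backpropagating through a dynamic NeRF renderer.

```python
import math
import random

def sds_gradient(x, predict_noise, alpha_bar, rng):
    """Return one SDS-style gradient estimate w.r.t. the rendered pixels x."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]            # noise added to the render
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_noisy = [a * xi + s * ei for xi, ei in zip(x, eps)]
    eps_hat = predict_noise(x_noisy, alpha_bar)       # diffusion model's guess
    w = 1.0 - alpha_bar                               # assumed weighting w(t)
    return [w * (eh - e) for eh, e in zip(eps_hat, eps)]

def make_toy_denoiser(target):
    """Hypothetical stand-in: the ideal noise predictor when all data equals target."""
    def predict_noise(x_noisy, alpha_bar):
        a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        return [(xn - a * t) / s for xn, t in zip(x_noisy, target)]
    return predict_noise

target = [0.2, 0.8, 0.5]          # what the toy "diffusion prior" prefers
x = [0.0, 0.0, 0.0]               # initial render
rng = random.Random(0)
denoiser = make_toy_denoiser(target)

for _ in range(200):
    alpha_bar = rng.uniform(0.1, 0.9)                 # random noise level
    grad = sds_gradient(x, denoiser, alpha_bar, rng)
    x = [xi - 0.1 * g for xi, g in zip(x, grad)]      # gradient descent step
```

In this toy setting the render `x` is pulled toward `target`, which mirrors how the diffusion prior pulls the rendered video toward content that matches the text.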

Using only the text-to-video distillation loss causes the 3D model to forget the content of the reference image, and random sampling leaves the beginning and end of the video insufficiently trained. The authors therefore oversample the start and end timestamps. In addition, when the initial frame is sampled, the static loss functions are also used for optimization (the SDS distillation loss of the 3D image-to-image model).
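The timestamp oversampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_timestamp`, the probability `p_endpoint`, and the normalization of time to [0, 1] are all assumptions made for the example.

```python
import random

def sample_timestamp(rng, p_endpoint=0.3):
    """Sample a normalized timestamp in [0, 1], oversampling both endpoints.

    p_endpoint is a hypothetical probability of forcing the sample to be
    exactly the first (0.0) or the last (1.0) timestamp.
    """
    u = rng.random()
    if u < p_endpoint / 2:
        return 0.0           # initial frame: static supervision can be added here
    if u < p_endpoint:
        return 1.0           # final frame
    return rng.random()      # otherwise sample uniformly over the video

rng = random.Random(0)
samples = [sample_timestamp(rng) for _ in range(10_000)]
first_frame_share = sum(t == 0.0 for t in samples) / len(samples)
last_frame_share = sum(t == 1.0 for t in samples) / len(samples)
```

With purely uniform sampling the exact endpoints would almost never be drawn; here each endpoint receives roughly `p_endpoint / 2` of the training steps.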

The loss function at this stage therefore combines the text-to-video SDS distillation loss with the additional static supervision on the initial frame.
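The way these terms combine can be sketched as a simple conditional sum. The function name, the argument layout, and the weight `lambda_static` are hypothetical; the article does not give the actual weights or the exact form of each term.

```python
def dynamic_stage_loss(video_sds, static_losses, is_initial_frame, lambda_static=1.0):
    """Total loss for one training step of the dynamic stage (illustrative).

    video_sds        -- SDS loss from the text-to-video diffusion model
    static_losses    -- static-stage loss terms evaluated on the first frame
    is_initial_frame -- True when the sampled frames include the initial frame
    lambda_static    -- hypothetical weight; the article gives no value
    """
    total = video_sds
    if is_initial_frame:
        total += lambda_static * sum(static_losses)
    return total

loss_mid = dynamic_stage_loss(0.5, [0.25, 0.125], is_initial_frame=False)   # 0.5
loss_first = dynamic_stage_loss(0.5, [0.25, 0.125], is_initial_frame=True)  # 0.875
```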

Semantic optimization

Even with oversampling of the initial frame and additional supervision on it, the object's appearance is still influenced by the text when optimizing with the text-to-video diffusion model, so it drifts away from the reference image. The paper therefore proposes a semantic optimization stage that reduces this semantic drift with a personalized model.

Since there is only a single image, the text-to-video model cannot be personalized. The paper instead introduces a diffusion model conditioned on both image and text, and applies personalized fine-tuning to it. This diffusion model should not change the content or motion of the original video; it should only adjust the appearance. The paper therefore adopts the ControlNet-Tile model, uses the video frames generated in the previous stage as the condition, and optimizes according to the language description. ControlNet is built on Stable Diffusion, so only Stable Diffusion needs personalized fine-tuning (Textual Inversion) to extract the semantic information in the reference image. After fine-tuning, the video is treated as a set of individual frames, and ControlNet supervises each frame separately.


In addition, because ControlNet is conditioned on rough images, classifier-free guidance (CFG) can use a normal scale (around 10) instead of the very large value (usually 100) needed with text-to-image and text-to-video models. An excessively large CFG scale causes oversaturated images, so using the ControlNet diffusion model alleviates oversaturation and yields better results. The supervision at this stage combines the dynamic-stage loss with the ControlNet supervision.
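The effect of the guidance scale can be seen from the standard classifier-free guidance formula, which extrapolates from the unconditional noise prediction toward the conditional one. The sketch below uses made-up two-element noise predictions; only the formula itself is standard, and the numbers are purely illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions with CFG."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # hypothetical unconditional prediction
eps_cond = [0.25, 0.5]    # hypothetical text-conditioned prediction

guided_10 = classifier_free_guidance(eps_uncond, eps_cond, scale=10)    # [2.5, -4.0]
guided_100 = classifier_free_guidance(eps_uncond, eps_cond, scale=100)  # [25.0, -49.0]
```

With the larger scale, the guided prediction is pushed ten times further from the unconditional one, which is consistent with the oversaturation the article attributes to very large CFG values.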


Experimental results

Because Animate124 is the first 3D video generation model driven by both an image and text, the paper compares it with two baseline models and with MAV3D. Animate124 achieves better results than the other methods.

Comparison of visual results


Figure 2. Comparison of Animate124 with two baselines


Figure 3.1. Comparison of Animate124 and MAV3D on text-to-3D video generation


Figure 3.2. Comparison of Animate124 and MAV3D on image-to-3D video generation

Comparison of Quantitative Results

The paper evaluates generation quality with CLIP-based metrics and human evaluation. The CLIP metrics include similarity to the text and retrieval accuracy, similarity to the image, and temporal consistency. The human evaluation criteria include similarity to the text, similarity to the image, video quality, realism of motion, and amount of motion. Human evaluation is reported as the ratio of how often a given model is selected relative to Animate124 on the corresponding criterion.
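Embedding-based metrics of this kind are typically built from cosine similarity. The sketch below is only an assumption about their general shape, since the article does not define them: toy 2-D vectors stand in for real CLIP embeddings, and `temporal_consistency` and `mean_similarity` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def temporal_consistency(frame_embeddings):
    """Mean cosine similarity between each pair of consecutive frames."""
    pairs = zip(frame_embeddings, frame_embeddings[1:])
    sims = [cosine(a, b) for a, b in pairs]
    return sum(sims) / len(sims)

def mean_similarity(frame_embeddings, reference_embedding):
    """Mean cosine similarity of all frames to a text or image embedding."""
    sims = [cosine(f, reference_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)

frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy frame embeddings
text = [1.0, 0.0]                               # toy text embedding

consistency = temporal_consistency(frames)      # 0.5
text_similarity = mean_similarity(frames, text)
```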

Compared with the two baseline models, Animate124 achieves better results in both the CLIP metrics and human evaluation.


Table 1. Quantitative comparison between Animate124 and two baselines

Summary

Animate124 is the first method to turn any picture into a 3D video based on a text description. It uses multiple diffusion models for supervision and guidance, optimizing a 4D dynamic representation network to generate high-quality 3D videos.


This article is reproduced from 机器之心.