
One photo generates a video: opening the mouth, nodding, and emotions like joy, anger, and sorrow can all be controlled by typing.

王林 (forwarded) · 2023-12-03 11:17:21

Recently, a study from Microsoft showed just how flexibly a photo can be "Photoshopped" into a moving, talking video.

In this study, you only need to give the AI a single photo, and it can generate a video of the person in it, with the character's expressions and movements controlled through text. For example, if the command you give is "open your mouth," the character in the video will actually open their mouth.


If the command you give is "sad," she will make a sad expression and matching head movements.


When the command "surprise" is given, the avatar's forehead lines are squeezed together.


In addition, you can provide an audio clip so that the avatar's lip shapes and movements are synchronized with the voice, or provide a live video for the avatar to imitate.

If you have more customized editing needs for the avatar's movements, such as making it nod, turn, or tilt its head, this technology supports that as well.


The research is called GAIA (Generative AI for Avatar), and its demo has begun to spread on social media. Many people admire the results and hope to use it to "resurrect" deceased loved ones.


But some people worry that as such technology keeps evolving, it will become harder to tell real online videos from fake ones, or that it will be exploited by criminals for fraud. It seems anti-fraud measures will have to keep upgrading as well.


What innovations does GAIA have?

Zero-shot talking avatar generation aims to synthesize natural videos from speech, ensuring that the generated lip shapes, expressions, and head poses are consistent with the speech content. Previous research usually required training or tuning a dedicated model for each avatar, or relied on template videos at inference time to achieve high-quality results. More recently, researchers have focused on designing and improving zero-shot methods that only need a single portrait image of the target avatar as an appearance reference. However, these methods typically rely on domain priors such as warping-based motion representations or the 3D Morphable Model (3DMM) to reduce the difficulty of the task. Such heuristics, while effective, may limit diversity and lead to unnatural results. Therefore, learning directly from the data distribution is the focus of future research.

In this paper, researchers from Microsoft propose GAIA (Generative AI for Avatar), which synthesizes natural talking avatar videos from speech and a single portrait image, eliminating domain priors from the generation process.


Project address: https://microsoft.github.io/GAIA/

Paper link: https://arxiv.org/pdf/2311.15230.pdf

GAIA is built on two key insights:

  1. Speech drives only the motion of the avatar, while the background and the avatar's appearance remain unchanged throughout the video. Inspired by this, the method separates the motion and the appearance of each frame: the appearance is shared across frames, while the motion is unique to each frame. To predict motion from speech, the motion sequence is encoded into a sequence of motion latents, and a diffusion model conditioned on the input speech is used to predict that latent sequence (a minimal sketch of this factorization follows the list);
  2. There is enormous diversity in expressions and head poses when a person speaks a given piece of content, which calls for a large-scale and diverse dataset. The study therefore collected a high-quality talking avatar dataset consisting of 16K unique speakers of different ages, genders, skin types, and speaking styles, so that the generated results are natural and diverse.
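
To make the first insight concrete, here is a minimal sketch (an illustration, not the authors' code) of the factorization it describes: one appearance latent is shared by every frame of a clip, while each frame gets its own motion latent, and the two are combined per frame. All dimensions are assumptions chosen for readability; in GAIA the motion latents are produced by a diffusion model conditioned on speech rather than constructed directly like this.

```python
# Illustrative only: shared appearance latent vs. per-frame motion latents.
import torch

T = 75                                  # frames in a clip (assumed)
appearance = torch.randn(256)           # one latent shared by the whole clip
motion_seq = torch.randn(T, 128)        # one motion latent per frame

# Each generated frame is conditioned on (its own motion, the shared appearance):
per_frame = torch.cat([motion_seq, appearance.expand(T, -1)], dim=-1)
print(per_frame.shape)                  # torch.Size([75, 384])
```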

Based on these two insights, the paper proposes the GAIA framework, which consists of a variational autoencoder (VAE, the orange module) and a diffusion model (the blue and green modules).

[Figure: overview of the GAIA framework]

The main function of the VAE is to disentangle motion and appearance. It consists of two encoders (a motion encoder and an appearance encoder) and a decoder. During training, the input to the motion encoder is the facial landmarks of the current frame, while the input to the appearance encoder is a randomly sampled frame from the current video clip.

The outputs of these two encoders are then combined, and the model is optimized to reconstruct the current frame. Once the VAE is trained, the motion latents (i.e., the outputs of the motion encoder) are obtained for all training data.
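
Below is a hedged sketch of that VAE training step, assuming toy architectures: the motion encoder sees the current frame's facial landmarks, the appearance encoder sees another frame sampled from the same clip, and the decoder must reconstruct the current frame. The module designs, sizes, and the plain L1 loss are placeholder assumptions; the real network architectures and the VAE regularization term are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEncoder(nn.Module):
    def __init__(self, n_landmarks=68, latent=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(n_landmarks * 2, 256),
                                 nn.ReLU(), nn.Linear(256, latent))
    def forward(self, landmarks):                 # (B, 68, 2) facial landmarks
        return self.net(landmarks)                # (B, 128) motion latent

class AppearanceEncoder(nn.Module):
    def __init__(self, latent=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(64, latent))
    def forward(self, frame):                     # (B, 3, H, W) RGB frame
        return self.net(frame)                    # (B, 256) appearance latent

class Decoder(nn.Module):
    def __init__(self, motion=128, appearance=256, out_hw=64):
        super().__init__()
        self.out_hw = out_hw
        self.net = nn.Sequential(nn.Linear(motion + appearance, 1024), nn.ReLU(),
                                 nn.Linear(1024, 3 * out_hw * out_hw))
    def forward(self, m, a):
        x = self.net(torch.cat([m, a], dim=-1))
        return x.view(-1, 3, self.out_hw, self.out_hw)

# One training step: reconstruct the current frame from
# (landmarks of the current frame, appearance of a random frame in the same clip).
motion_enc, app_enc, dec = MotionEncoder(), AppearanceEncoder(), Decoder()
current_frame = torch.rand(4, 3, 64, 64)
random_frame  = torch.rand(4, 3, 64, 64)          # sampled from the same clip
landmarks     = torch.rand(4, 68, 2)              # landmarks of the current frame

recon = dec(motion_enc(landmarks), app_enc(random_frame))
loss = F.l1_loss(recon, current_frame)            # reconstruction objective
loss.backward()
```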

Next, a diffusion model is trained to predict the motion latent sequence conditioned on the speech and one randomly sampled frame of the video clip, which provides appearance information for the generation process.
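
The following is a hedged, DDPM-style sketch of such a training step. The denoiser, feature sizes, timestep embedding, and noise schedule are placeholder assumptions, not GAIA's actual design; the sketch only shows a motion-latent sequence being noised and a network learning to predict that noise, conditioned on per-frame speech features and a latent from one conditioning frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Denoiser(nn.Module):
    def __init__(self, motion=128, speech=80, cond_frame=256, hidden=256):
        super().__init__()
        self.proj = nn.Linear(motion + speech + cond_frame + 1, hidden)
        self.out = nn.Linear(hidden, motion)
    def forward(self, noisy_motion, speech_feats, frame_cond, t):
        # noisy_motion: (B, T, 128), speech_feats: (B, T, 80), frame_cond: (B, 256)
        B, T, _ = noisy_motion.shape
        t_emb = t.float().view(B, 1, 1).expand(B, T, 1) / 1000.0   # crude timestep embedding
        frame = frame_cond.unsqueeze(1).expand(B, T, -1)
        h = torch.relu(self.proj(torch.cat([noisy_motion, speech_feats, frame, t_emb], -1)))
        return self.out(h)                        # predicted noise, (B, T, 128)

timesteps = 1000
betas = torch.linspace(1e-4, 0.02, timesteps)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, motion_latents, speech_feats, frame_cond):
    B = motion_latents.size(0)
    t = torch.randint(0, timesteps, (B,))
    noise = torch.randn_like(motion_latents)
    a = alphas_bar[t].view(B, 1, 1)
    noisy = a.sqrt() * motion_latents + (1 - a).sqrt() * noise   # forward diffusion
    pred = model(noisy, speech_feats, frame_cond, t)
    return F.mse_loss(pred, noise)                # standard epsilon-prediction loss

loss = training_step(Denoiser(), torch.randn(2, 75, 128),
                     torch.randn(2, 75, 80), torch.randn(2, 256))
loss.backward()
```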

At inference time, given a reference portrait image of the target avatar, the diffusion model takes the image and the input speech sequence as conditions and generates a motion latent sequence that conforms to the speech content. The generated motion latent sequence and the reference portrait image are then passed through the VAE decoder to synthesize the talking video output.
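
To show how the pieces fit together at inference time, here is a high-level sketch (an assumption about the flow, not the released pipeline): the reference portrait is encoded once, a motion-latent sequence is sampled from the speech-conditioned diffusion model, and the decoder renders one frame per motion latent. The three callables stand in for the trained modules from the sketches above; dummy lambdas are wired in so the snippet runs.

```python
import torch

def generate_talking_video(portrait, speech_feats, appearance_encoder,
                           sample_motion_latents, decoder):
    """portrait: (3, H, W); speech_feats: (T, speech_dim)."""
    appearance = appearance_encoder(portrait.unsqueeze(0))          # (1, 256), shared
    motion_seq = sample_motion_latents(speech_feats.unsqueeze(0),   # (1, T, 128),
                                       appearance)                  # via reverse diffusion
    frames = [decoder(motion_seq[:, t], appearance)                 # (1, 3, H', W') per frame
              for t in range(motion_seq.size(1))]
    return torch.stack(frames, dim=1)                               # (1, T, 3, H', W')

# Example wiring with dummy stand-ins (shapes only, no trained weights):
dummy_app = lambda img: torch.randn(1, 256)
dummy_sample = lambda sp, app: torch.randn(1, sp.size(1), 128)
dummy_dec = lambda m, a: torch.rand(1, 3, 64, 64)
video = generate_talking_video(torch.rand(3, 64, 64), torch.randn(75, 80),
                               dummy_app, dummy_sample, dummy_dec)
print(video.shape)   # torch.Size([1, 75, 3, 64, 64])
```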

In terms of data, the study collected datasets from different sources, including the High-Definition Talking Face Dataset (HDTF) and the Casual Conversations datasets v1 & v2 (CC v1 & v2). In addition to these three datasets, the study also collected a large-scale internal talking avatar dataset containing 7K hours of video and 8K speaker IDs. A statistical overview of the datasets is shown in Table 1.

[Table 1: statistics of the datasets]

To learn the desired information, the paper proposes several automatic filtering strategies to ensure the quality of the training data (a toy version of such filters is sketched after the list):

  1. To make lip movements visible, the avatar should face roughly toward the camera;
  2. To ensure stability, facial movements in the video should be smooth and should not shake rapidly;
  3. To filter out extreme cases where lip movements are inconsistent with the speech, frames in which the avatar wears a mask or remains silent should be removed.
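
As a toy illustration of how such filters might be expressed, the sketch below scores a clip with three heuristics over per-frame head yaw, facial landmarks, and audio energy. The feature names and thresholds are assumptions for illustration, not the paper's actual filtering pipeline (for instance, detecting a mask would need a dedicated classifier rather than an audio check).

```python
# Toy clip filter: frontal pose, smooth motion, and non-silent audio.
import numpy as np

def keep_clip(yaw_deg, landmarks, rms_audio,
              max_yaw=30.0, max_jitter=5.0, min_rms=0.01):
    """yaw_deg: (T,) head yaw per frame; landmarks: (T, 68, 2); rms_audio: (T,)."""
    frontal = np.abs(yaw_deg).max() < max_yaw              # 1. face roughly toward camera
    jitter = np.abs(np.diff(landmarks, axis=0)).mean(axis=(1, 2))
    smooth = jitter.max() < max_jitter                     # 2. no rapid frame-to-frame shake
    speaking = rms_audio.mean() > min_rms                  # 3. drop silent clips
    return bool(frontal and smooth and speaking)

T = 75
base = np.random.rand(68, 2) * 100                         # a static face layout
landmarks = base + np.cumsum(np.random.randn(T, 68, 2) * 0.1, axis=0)  # slow drift
print(keep_clip(np.random.uniform(-10, 10, T), landmarks,
                np.abs(np.random.randn(T)) * 0.1))         # True for this smooth clip
```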

The VAE and diffusion models are trained on the filtered data. From the experimental results, the paper draws three key conclusions:

  1. GAIA can generate zero-shot talking avatars that are superior in naturalness, diversity, lip-sync quality, and visual quality. In the researchers' subjective evaluation, GAIA significantly surpassed all baseline methods;
  2. Models were trained at sizes ranging from 150M to 2B parameters, and the results show that GAIA is scalable, since larger models produce better results;
  3. GAIA is a general and flexible framework that enables different applications, including controllable talking avatar generation and text-instructed avatar generation.

How well does GAIA perform?

In the experiments, the study compared GAIA with three strong baselines: FOMM, HeadGAN, and Face-vid2vid. The results are shown in Table 2: GAIA's VAE achieves consistent improvements over the previous video-driven baselines, demonstrating that GAIA successfully disentangles appearance and motion representations.

[Table 2: comparison with video-driven baselines]

Speech-driven results. Speech-driven talking avatar generation is achieved by predicting motion from speech. Table 3 and Figure 2 provide quantitative and qualitative comparisons between GAIA and the MakeItTalk, Audio2Head, and SadTalker methods.

The data make it clear that GAIA far outperforms all baseline methods in the subjective evaluation. More specifically, as shown in Figure 2, the baselines' results are usually highly dependent on the reference image, even when it has closed eyes or an unusual head pose; in contrast, GAIA is robust to a wide variety of reference images and produces results with higher naturalness, better lip synchronization, better visual quality, and greater motion diversity.


According to Table 3, the best MSI score indicates that the videos generated by GAIA have excellent motion stability. The Sync-D score of 8.528 is close to that of real video (8.548), indicating that the generated videos have excellent lip synchronization. GAIA achieves FID scores comparable to the baselines, which may be affected by differing head poses; indeed, the study found that a model trained without diffusion achieved a better FID score, as detailed in Table 6.
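
For reference, the FID mentioned above is the standard Fréchet distance between feature statistics of real and generated frames. Below is a minimal numpy sketch of that formula, assuming per-frame features (e.g., Inception activations) have already been extracted; it is generic metric code, not the paper's evaluation script.

```python
# Minimal FID sketch from two sets of already-extracted image features.
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    """feats_*: (N, D) arrays of per-frame feature vectors."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):            # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

real = np.random.randn(500, 64)
fake = np.random.randn(500, 64) + 0.1
print(frechet_inception_distance(real, fake))
```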



The above is the detailed content of "One photo generates a video: opening the mouth, nodding, and emotions like joy, anger, and sorrow can all be controlled by typing." For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.