A photo generates a video: opening the mouth, nodding, and emotions such as anger, sorrow, and joy can all be controlled by typing.

Recently, a study from Microsoft showed just how flexible AI-driven video generation has become.

In this study, you only need to give the AI a single photo, and it can generate a video of the person in it, with the character's expressions and movements controlled through text. For example, if the command you give is "open your mouth," the character in the video will actually open their mouth.


If the command you give is "sad," the character will make sad expressions and head movements.


When the command "surprise" is given, the avatar's forehead wrinkles up.


In addition, you can provide a voice recording to synchronize the avatar's mouth shapes and movements with the speech, or provide a real video for the avatar to imitate.

If you have further custom editing needs for the avatar's movements, such as making it nod, turn, or tilt its head, the technology supports that too.
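
GAIA has no public API, but to make these input modalities concrete, here is a minimal hypothetical sketch of what such an interface could look like. Every name in it (`AvatarRequest`, `generate_talking_video`, the field names) is illustrative only, not Microsoft's actual code:

```python
# Hypothetical sketch of a GAIA-style interface; all names are illustrative.
# It only makes the three driving signals (speech, text, video) concrete.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AvatarRequest:
    portrait_path: str                       # single reference photo of the person
    speech_path: Optional[str] = None        # audio to lip-sync to
    text_instruction: Optional[str] = None   # e.g. "open your mouth", "sad", "nod"
    mimic_video_path: Optional[str] = None   # real video whose motion the avatar imitates

def generate_talking_video(req: AvatarRequest) -> str:
    """Placeholder: would return the path of the rendered video."""
    if not (req.speech_path or req.text_instruction or req.mimic_video_path):
        raise ValueError("at least one driving signal is required")
    raise NotImplementedError("rendering backend would go here")

# One photo plus a text command is enough to drive expressions and head motion:
# generate_talking_video(AvatarRequest("me.jpg", text_instruction="surprise"))
```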


The research is called GAIA (Generative AI for Avatar), and its demo has begun to spread on social media. Many people admire the results and hope to use it to "resurrect" the dead.


But some worry that as these technologies keep evolving, it will become harder to tell real online videos from fake ones, or that they will be exploited by criminals for fraud. Anti-fraud measures, it seems, will have to keep being upgraded.


What innovations does GAIA have?

Zero-shot talking avatar generation aims to synthesize natural talking videos from speech, ensuring that the generated mouth shapes, expressions, and head poses are consistent with the speech content. Previous research usually requires training or tuning a specific model for each avatar, or relies on template videos during inference to achieve high-quality results. More recently, researchers have focused on designing and improving zero-shot methods that use only a single portrait image of the target avatar as an appearance reference. However, these methods usually rely on domain priors such as warping-based motion representations and 3D Morphable Models (3DMM) to reduce the difficulty of the task. Such heuristics, while effective, may limit diversity and lead to unnatural results. Learning directly from the data distribution is therefore the focus of future research.

In this paper, researchers from Microsoft propose GAIA (Generative AI for Avatar), which learns to synthesize natural talking avatar videos from speech and a single portrait image, eliminating domain priors from the generation process.


Project page: https://microsoft.github.io/GAIA/

Paper link: https://arxiv.org/pdf/2311.15230.pdf

GAIA is built around two key insights:

  1. Speech drives only the avatar's motion, while the background and the avatar's appearance remain unchanged throughout the video. Inspired by this, the method separates each frame's motion from its appearance: appearance is shared across frames, while motion is unique to each frame. To predict motion from speech, motion sequences are encoded into motion latent sequences, and a diffusion model conditioned on the input speech predicts the latent sequence;
  2. There is huge diversity in the expressions and head poses a person uses when speaking given content, which calls for a large-scale and diverse dataset. The study therefore collected a high-quality talking avatar dataset consisting of 16K unique speakers of different ages, genders, skin types, and speaking styles, making the generated results natural and diverse.

Based on these two insights, the paper proposes the GAIA framework, which consists of a variational autoencoder (VAE, the orange module) and a diffusion model (the blue and green modules).


The VAE's main function is to decompose motion and appearance. It consists of two encoders (a motion encoder and an appearance encoder) and a decoder. During training, the input to the motion encoder is the facial landmarks of the current frame, while the input to the appearance encoder is a randomly sampled frame from the current video clip.

The outputs of the two encoders are then combined to reconstruct the current frame as the optimization target. Once the VAE is trained, the motion latents (i.e., the outputs of the motion encoder) are extracted for all of the training data.
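
A minimal PyTorch sketch of this motion/appearance decomposition follows; the architectures and the 32x32 toy resolution are assumptions (the paper does not publish its exact configuration), and the variational (KL) part of the real VAE is omitted for brevity:

```python
# Minimal sketch of the motion/appearance-disentangling autoencoder described
# above. Architectures and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MotionAppearanceVAE(nn.Module):
    def __init__(self, n_landmarks: int = 68, motion_dim: int = 128, app_dim: int = 256):
        super().__init__()
        # Motion encoder: facial landmarks of the current frame -> motion latent.
        self.motion_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(n_landmarks * 2, 256), nn.ReLU(),
            nn.Linear(256, motion_dim),
        )
        # Appearance encoder: a randomly sampled frame of the same clip -> appearance latent.
        self.appearance_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, app_dim),
        )
        # Decoder: (motion latent, appearance latent) -> reconstructed current frame.
        self.decoder = nn.Sequential(
            nn.Linear(motion_dim + app_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, landmarks: torch.Tensor, ref_frame: torch.Tensor) -> torch.Tensor:
        z_motion = self.motion_encoder(landmarks)   # unique to each frame
        z_app = self.appearance_encoder(ref_frame)  # shared across the clip
        return self.decoder(torch.cat([z_motion, z_app], dim=1))
```

Reconstructing the current frame from its own landmarks plus another frame's appearance forces appearance information through the sampled reference frame and motion information through the landmarks, which is what lets the two latents be reused independently later.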

Next, a diffusion model is trained to predict the motion latent sequence conditioned on the speech and on one randomly sampled frame from the video clip, which provides appearance information for the generation process.
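
This training step can be pictured as a standard DDPM-style noise-prediction loss; the denoiser network, feature shapes, and noise schedule below are assumptions, not the paper's exact choices:

```python
# Sketch of the diffusion stage: a denoiser learns to predict the noise added
# to a clip's motion-latent sequence, conditioned on speech features and on
# features of one randomly sampled frame (the appearance condition).
import math
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, motion_latents, speech_feats, ref_frame_feats, T=1000):
    """motion_latents: (B, L, D) per-frame latents from the trained VAE."""
    b = motion_latents.size(0)
    t = torch.randint(0, T, (b,), device=motion_latents.device)  # random timestep
    noise = torch.randn_like(motion_latents)
    alpha_bar = torch.cos(t.float() / T * math.pi / 2) ** 2      # simple cosine schedule
    alpha_bar = alpha_bar.view(b, 1, 1)
    noisy = alpha_bar.sqrt() * motion_latents + (1 - alpha_bar).sqrt() * noise
    # The denoiser sees the noisy latents plus both conditioning signals.
    pred_noise = denoiser(noisy, t, speech_feats, ref_frame_feats)
    return F.mse_loss(pred_noise, noise)
```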

At inference time, given a reference portrait image of the target avatar, the diffusion model takes the image and the input speech sequence as conditions and generates a motion latent sequence that matches the speech content. The generated motion latent sequence and the reference portrait image are then passed through the VAE decoder to synthesize the talking video output.
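
Putting the pieces together, inference might look like the following sketch, where `sampler` stands in for an iterative DDPM/DDIM-style denoising loop (an assumed component, simplified to a single call here):

```python
# Sketch of inference: the portrait and the speech condition the diffusion
# sampler, which yields a motion-latent sequence; the VAE decoder then
# renders each frame.
import torch

@torch.no_grad()
def generate(vae, denoiser, sampler, portrait, speech_feats, n_frames, motion_dim=128):
    z_app = vae.appearance_encoder(portrait)          # shared appearance latent
    # Denoise from Gaussian noise over the whole sequence, conditioning every
    # step on the speech and on the reference portrait.
    z_motion_seq = sampler(denoiser, shape=(1, n_frames, motion_dim),
                           speech=speech_feats, reference=portrait)
    frames = [vae.decoder(torch.cat([z_motion_seq[:, i], z_app], dim=1))
              for i in range(n_frames)]
    return torch.stack(frames, dim=1)                 # (1, n_frames, 3, H, W)
```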

On the data side, the study gathered datasets from different sources, including the High-Definition Talking Face Dataset (HDTF) and the Casual Conversations datasets v1 & v2 (CC v1 & v2). In addition to these three datasets, the researchers also collected a large-scale internal talking avatar dataset containing 7K hours of video and 8K speaker IDs. A statistical overview of the datasets is given in Table 1.


To learn the desired information from this data, the paper proposes several automatic filtering strategies to ensure the quality of the training data (a toy sketch of these filters follows the list):

  1. To make lip movements visible, the avatar should face toward the camera;
  2. To ensure stability, facial movements in the video should be smooth and free of rapid shaking;
  3. To filter out extreme cases where lip movements are inconsistent with speech, frames in which the avatar wears a mask or remains silent should be removed.
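
These heuristics can be pictured as simple predicates over per-frame metadata; the thresholds and field names below are illustrative assumptions, not the paper's actual values:

```python
# Sketch of the three filters as predicates over per-frame metadata
# (head pose angles, landmark velocity, mask/silence flags).
def keep_clip(frames, max_angle_deg=30.0, max_landmark_velocity=0.05):
    for f in frames:
        # 1. The avatar should face the camera so lip motion stays visible.
        if abs(f["yaw"]) > max_angle_deg or abs(f["pitch"]) > max_angle_deg:
            return False
        # 2. Facial motion should be smooth, with no rapid shaking.
        if f["landmark_velocity"] > max_landmark_velocity:
            return False
        # 3. Drop cases where lips and speech cannot correspond.
        if f["wearing_mask"] or f["is_silent"]:
            return False
    return True

# Usage: filtered = [c for c in clips if keep_clip(c.frames)]
```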

The VAE and the diffusion model are trained on the filtered data, and the experiments lead to three key conclusions:

  1. GAIA can perform zero-shot talking avatar generation with superior naturalness, diversity, lip-sync quality, and visual quality. In the researchers' subjective evaluation, GAIA significantly surpassed all baseline methods;
  2. Models were trained at sizes ranging from 150M to 2B parameters, and the results showed that GAIA is scalable: larger models yield better results;
  3. GAIA is a general and flexible framework that enables different applications, including controllable talking avatar generation and text-instructed avatar generation.

How well does GAIA perform?

In the experiments, the study compared GAIA with three strong baselines: FOMM, HeadGAN, and Face-vid2vid. As shown in Table 2, GAIA's VAE achieves consistent improvements over these previous video-driven baselines, demonstrating that GAIA successfully decomposes appearance and motion representations.


Speech-driven results. Speech-driven talking avatar generation is achieved by predicting motion from speech. Table 3 and Figure 2 provide quantitative and qualitative comparisons of GAIA with the MakeItTalk, Audio2Head, and SadTalker methods.

The data makes clear that GAIA far outperforms all baseline methods in the subjective evaluation. More specifically, as Figure 2 shows, the baselines' results are usually highly dependent on the reference image, even when it has closed eyes or an unusual head pose; GAIA, in contrast, is robust across a wide variety of reference images and produces results with greater naturalness, better lip synchronization, better visual quality, and more motion diversity.


According to Table 3, GAIA's best MSI score indicates that its generated videos have excellent motion stability. Its Sync-D score of 8.528 is close to that of real video (8.548), indicating excellent lip synchronization. GAIA achieved FID scores comparable to the baselines, which may be affected by differing head poses: the study found that a model trained without diffusion achieved a better FID score, as detailed in Table 6.

