


With just a photo and a piece of audio, you can directly generate a video of the character talking!
Recently, researchers from Google released the multi-modal diffusion model VLOGGER, taking us one step closer to virtual digital humans.
Paper address: https://enriccorona.github.io/vlogger/paper.pdf
VLOGGER takes a single input image and, driven by text or audio, generates a video of the person speaking, with very natural mouth shapes, expressions, and body movements.
Let’s look at a few examples first:
If the borrowed voices in the videos feel out of place, here are the same examples with the sound turned off:
It can be seen that the entire generated effect is very elegant and natural.
VLOGGER builds on the recent success of generative diffusion models. It consists of a model that maps humans to 3D motion and a new diffusion-based architecture that augments text-to-image models with temporal and spatial controls.
VLOGGER can generate high-quality videos of variable length, and these videos can be easily controlled with high-level representations of faces and bodies.
For example, we can make the person in the generated video close their mouth:
Or close their eyes:
Compared with previous similar models, VLOGGER does not need to be trained for each individual and does not rely on face detection and cropping. It also covers body movements, the torso, and the background, which together make up a normal representation of a human who can communicate.
AI voice, AI expression, AI action, AI scene: at the outset, the value of human beings is to provide data, but will they still have any value in the future?
On the data side, the researchers collected a new, diverse dataset, MENTOR, which is a full order of magnitude larger than previous comparable datasets. The training set contains 2,200 hours of video and 800,000 different individuals, and the test set contains 120 hours and 4,000 different identities.
The researchers evaluated VLOGGER on three different benchmarks, showing that the model achieves state-of-the-art performance in image quality, identity preservation, and temporal consistency.
VLOGGER
The goal of VLOGGER is to generate a realistic video of variable length that depicts the target person speaking throughout, including head movements and gestures.
Given the single input image shown in column 1 of the figure and a sample audio input, the model produces a series of synthesized images.
This includes generating head movements, gaze, blinks, and lip movements, as well as something previous models could not do: generating upper-body motion and gestures, which is a major advance in audio-driven synthesis.
VLOGGER adopts a two-stage pipeline based on stochastic diffusion models to capture the one-to-many mapping from speech to video.
The first network takes audio waveforms as input to generate body motion controls responsible for gaze, facial expressions and gestures over the length of the target video.
The second network is a temporal image-to-image translation model that extends a large image diffusion model and uses the predicted body controls to generate the corresponding frames. To align this process with a specific identity, the network also receives a reference image of the target person.
VLOGGER uses a statistical 3D body model to condition the video generation process. Given an input image, the predicted shape parameters encode the geometric properties of the target identity.
First, the network M takes the input speech and generates a series of N frames of 3D facial expressions and body poses.
A dense representation of the moving 3D body is then rendered to act as a 2D control during the video generation stage. These control images, along with the input image, serve as input to the temporal diffusion model and the super-resolution modules.
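The data flow described above can be summarized in a short sketch. Everything in it is hypothetical: the function names, the parameter sizes, and the placeholder return values are assumptions made for illustration, not VLOGGER's code or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BodyControl:
    """Per-frame motion parameters (the sizes used below are illustrative assumptions)."""
    expression: List[float]
    pose: List[float]


def predict_motion(audio: List[float], num_frames: int) -> List[BodyControl]:
    """Stage 1 stand-in: audio waveform -> N frames of 3D expression and pose."""
    # A real model would run a diffusion process; this stub just returns zeros.
    return [BodyControl(expression=[0.0] * 4, pose=[0.0] * 6) for _ in range(num_frames)]


def render_controls(controls: List[BodyControl]) -> List[str]:
    """Stand-in for rendering the moving 3D body as dense 2D control images."""
    return [f"control_image_{i}" for i, _ in enumerate(controls)]


def generate_video(reference_image: str, control_images: List[str]) -> List[str]:
    """Stage 2 stand-in: one output frame per control image, tied to the reference image."""
    return [f"frame({reference_image}, {c})" for c in control_images]


audio = [0.0] * 16000  # placeholder waveform
controls = predict_motion(audio, num_frames=8)
video = generate_video("reference.png", render_controls(controls))
```

The point of the sketch is only the ordering: motion is predicted from audio first, and the image network sees that motion as rendered 2D controls together with a single reference image.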
Audio-driven motion generation
The first network of the pipeline is designed to predict motion from the input speech. Input text is also supported: it is converted into a waveform by a text-to-speech model, and the resulting audio is represented as standard Mel-spectrograms.
The network is based on the Transformer architecture and has four multi-head attention layers along the temporal dimension. It includes positional encodings for the frame number and the diffusion step, as well as embedding MLPs for the input audio and the diffusion step.
For each frame, a causal mask makes the model attend only to previous frames. The model is trained on variable-length videos (such as the TalkingHead-1KH dataset) so that it can generate very long sequences.
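A causal mask of this kind can be illustrated with a minimal, dependency-free sketch. It assumes that each frame may attend to itself as well as to earlier frames, which is the usual convention but is not stated in the article; the function names are illustrative.

```python
import math


def causal_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        exps = [math.exp(s) if keep else 0.0 for s, keep in zip(row_scores, row_mask)]
        total = sum(exps)  # never zero: the diagonal is always kept
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 4 for _ in range(4)]  # uniform toy attention scores
attn = masked_softmax(scores, causal_mask(4))
```

With uniform scores, the first frame puts all of its weight on itself, while the last frame spreads its weight evenly over all four frames; no frame ever receives weight from a later one.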
The researchers use the estimated parameters of a statistical 3D human body model to generate intermediate control representations for the synthesized video.
The model takes into account both facial expressions and body movements to generate better expressive and dynamic gestures.
In addition, previous face generation work usually relies on warped images, but this approach has been overlooked in diffusion-based architectures.
The authors propose using warped images to guide the generation process, which makes the network's task easier and helps preserve the identity of the subject.
Generating talking and moving humans
The next goal is to animate the input image of a person so that it follows the previously predicted body and facial movements.
Inspired by ControlNet, the researchers freeze the initially trained model and make a zero-initialized trainable copy of its encoding layers, which takes the input temporal controls.
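The effect of zero initialization can be shown with a toy sketch. The layer and class names here are hypothetical, and the arithmetic is a deliberately simplified stand-in for real network layers: the point is only that a zero-initialized branch leaves the frozen model's output unchanged until training moves its weights away from zero.

```python
def frozen_layer(x):
    """Stand-in for a pretrained layer whose weights are never updated."""
    return [2.0 * v + 1.0 for v in x]


class ControlBranch:
    """Toy trainable branch that sees the input plus the control signal."""

    def __init__(self, size):
        self.weights = [0.0] * size  # zero-initialized output projection

    def __call__(self, x, control):
        hidden = [a + c for a, c in zip(x, control)]
        return [w * h for w, h in zip(self.weights, hidden)]


def controlled_layer(x, control, branch):
    """Frozen output plus the branch's residual contribution."""
    base = frozen_layer(x)
    residual = branch(x, control)
    return [b + r for b, r in zip(base, residual)]


x = [1.0, 2.0]
control = [0.5, -0.5]
branch = ControlBranch(size=2)
before_training = controlled_layer(x, control, branch)

branch.weights = [1.0, 1.0]  # pretend training has updated the projection
after_training = controlled_layer(x, control, branch)
```

Before any update the combined layer reproduces the frozen layer exactly, so adding the control branch cannot degrade the pretrained model at the start of training.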
The authors interleave one-dimensional convolutional layers in the temporal domain. The network is trained on N consecutive frames and their controls, and generates a video of the reference person animated according to the input controls.
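To make the idea of a convolution along the time axis concrete, here is a small pure-Python sketch. The function name, the zero padding at the clip boundaries, and the toy kernel are assumptions; the article does not give the layer's actual configuration.

```python
def temporal_conv1d(frames, kernel):
    """Apply a 1D kernel to each feature channel along the time axis.

    frames: per-frame feature vectors, shape [num_frames][num_features]
    kernel: odd-length list of weights applied across neighbouring frames
    """
    half = len(kernel) // 2
    num_frames = len(frames)
    num_features = len(frames[0])
    out = []
    for t in range(num_frames):
        row = []
        for f in range(num_features):
            acc = 0.0
            for k, w in enumerate(kernel):
                src = t + k - half
                if 0 <= src < num_frames:  # frames outside the clip count as zero
                    acc += w * frames[src][f]
            row.append(acc)
        out.append(row)
    return out


frames = [[0.0], [3.0], [0.0], [0.0]]  # one feature, four frames
unchanged = temporal_conv1d(frames, [0.0, 1.0, 0.0])
mixed = temporal_conv1d(frames, [0.5, 0.0, 0.5])
```

Each output frame mixes information from its temporal neighbours only, which is what lets a per-frame image model produce frames that are consistent over time.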
The model is trained on the MENTOR dataset built by the authors. Because the network receives a sequence of consecutive frames and an arbitrary reference image during training, in theory any video frame can be designated as the reference.
In practice, however, the authors choose to sample references further away from the target clip because closer examples offer less generalization potential.
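One way to picture this sampling choice is the sketch below. The function name and the `min_gap` parameter are hypothetical; the article does not say how far from the target clip the references are drawn.

```python
import random


def sample_reference_index(num_frames, clip_start, clip_len, min_gap, rng=random):
    """Pick a reference frame at least `min_gap` frames from the nearest clip frame."""
    clip_last = clip_start + clip_len - 1  # index of the last frame in the clip
    candidates = [
        i for i in range(num_frames)
        if i <= clip_start - min_gap or i >= clip_last + min_gap
    ]
    if not candidates:
        raise ValueError("video too short for the requested gap")
    return rng.choice(candidates)


rng = random.Random(0)  # seeded for reproducibility
picks = [sample_reference_index(200, 50, 16, 30, rng) for _ in range(100)]
```

For a 200-frame video with a 16-frame target clip starting at frame 50 and a gap of 30, every sampled reference falls at frame 20 or earlier, or at frame 95 or later.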
The network is trained in two stages: it first learns the new control layers on single frames, and then trains on video with the temporal components added. This allows large batches in the first stage and faster learning of the head reenactment task.
The authors use a learning rate of 5e-5 and train the image model for 400k steps with a batch size of 128 in both stages.
Diversity
The following figure shows the diverse distribution of target videos generated from an input image. The rightmost column shows the pixel diversity obtained from the 80 generated videos.
The person's head and body move significantly while the background remains fixed (red means higher diversity of pixel colors), and despite the diversity, all videos look realistic.
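A per-pixel diversity map like the one described can be approximated by measuring how much each pixel varies across samples. The sketch below uses the per-pixel standard deviation on grayscale values; that choice of metric is an assumption, since the article does not say how the diversity was computed.

```python
from statistics import pstdev


def pixel_diversity(samples):
    """Per-pixel standard deviation across samples.

    samples: list of images, each a 2D list [height][width] of grayscale values
    """
    height, width = len(samples[0]), len(samples[0][0])
    return [
        [pstdev([img[y][x] for img in samples]) for x in range(width)]
        for y in range(height)
    ]


# Two tiny 1x2 "frames": the left pixel is static background, the right one changes.
samples = [[[10, 0]], [[10, 4]]]
diversity = pixel_diversity(samples)
```

A pixel that is identical in every sample scores zero, which corresponds to the fixed background, while pixels covering the moving head and body score higher.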
Video Editing
One application of the model is editing existing videos. In this case, VLOGGER takes a video and changes the subject's expression, for example by closing their mouth or eyes.
In practice, the authors take advantage of the flexibility of the diffusion model to inpaint the parts of the image that should change, so that the edited video stays consistent with the original, unchanged pixels.
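The constraint that unchanged pixels must match the original can be sketched as mask-based compositing. This is a simplified illustration with hypothetical names: it shows only the final blend and is not a description of how VLOGGER applies the mask inside the diffusion process.

```python
def composite(original, generated, mask):
    """Keep original pixels where mask is 0 and take generated pixels where mask is 1."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[1, 2], [3, 4]]
generated = [[9, 9], [9, 9]]
mask = [[0, 1], [0, 0]]  # only the top-right pixel (say, the mouth region) may change
edited = composite(original, generated, mask)
```

Only the masked region takes new content; every other pixel is copied from the source frame, which is what keeps the edit consistent with the rest of the video.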
Video Translation
One of the main applications of the model is video translation. In this case, VLOGGER takes an existing video in a specific language and edits the lips and facial areas to align with the new audio (e.g. Spanish).