Uploaders are already having fun with Tencent's open-source "AniPortrait", making photos sing and talk


王林

Apr 07, 2024, 09:01 AM


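The AniPortrait models are open source, and anyone is free to play with them.

"A new productivity tool for Xiaopozhan's Ghost Zone."

A new project recently released by Tencent Open Source earned this comment on Twitter. The project is AniPortrait, which generates high-quality animated portraits from an audio clip and a reference image.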

Without further ado, here is a demo that might earn a warning letter from a lawyer:

[Demo video]

Anime-style images can easily be made to talk, too:

[Demo video]

Just a few days after launch, the project has already been widely praised, and its GitHub star count has passed 2,800.

Let's take a look at what is new in AniPortrait.

Paper title: AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation
Paper address: https://arxiv.org/pdf/2403.17694.pdf
Code address: https://github.com/Zejun-Yang/AniPortrait

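AniPortrait

The AniPortrait framework newly proposed by Tencent consists of two modules: Audio2Lmk and Lmk2Video.

Audio2Lmk extracts, from the audio input, a landmark sequence that captures complex facial expressions and lip movements. Lmk2Video then uses this landmark sequence to generate a high-quality portrait video that is temporally stable and consistent.

Figure 1 gives an overview of the AniPortrait framework.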

Audio2Lmk

Given a sequence of speech clips, the goal here is to predict the corresponding sequence of 3D face meshes and the corresponding sequence of head poses.

The team used a pre-trained wav2vec model to extract audio features. The model generalizes well and accurately recognizes pronunciation and intonation in audio, which is crucial for generating realistic facial animation. Because the resulting speech features are robust, a simple architecture consisting of two fully connected (fc) layers is enough to convert them into 3D face meshes. The team observed that this simple, straightforward design not only ensures accuracy but also makes inference more efficient.
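To make the two-layer mapping concrete, here is a minimal, dependency-free sketch of an audio-to-mesh head. The feature size, hidden size, vertex count, ReLU activation, and random weights are all illustrative assumptions; the article only states that two fc layers map wav2vec features to a 3D mesh.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector stored as a plain list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise ReLU (the activation choice is an assumption)."""
    return [max(0.0, v) for v in x]

def init_linear(rng, in_dim, out_dim, scale=0.1):
    """Create small random weights and zero biases for one fc layer."""
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def audio_to_mesh(features, fc1, fc2, num_vertices):
    """Map per-frame audio features to per-frame 3D vertices with two fc layers."""
    meshes = []
    for frame in features:
        hidden = relu(linear(frame, *fc1))
        flat = linear(hidden, *fc2)  # length is num_vertices * 3
        meshes.append([tuple(flat[3 * i:3 * i + 3]) for i in range(num_vertices)])
    return meshes

# Toy sizes (assumptions): 8-dim vectors stand in for wav2vec features,
# and 5 vertices stand in for the full face mesh.
rng = random.Random(0)
FEATURE_DIM, HIDDEN_DIM, NUM_VERTICES = 8, 16, 5
fc1 = init_linear(rng, FEATURE_DIM, HIDDEN_DIM)
fc2 = init_linear(rng, HIDDEN_DIM, NUM_VERTICES * 3)
features = [[rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)] for _ in range(4)]  # 4 frames
meshes = audio_to_mesh(features, fc1, fc2, NUM_VERTICES)
print(len(meshes), len(meshes[0]), len(meshes[0][0]))  # frames, vertices, coordinates
```

Each audio frame is handled independently here, which is part of why such a head is cheap at inference time.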

For the audio-to-pose task, the team again uses wav2vec as the backbone network, but its weights are not shared with the audio-to-mesh module. The reason is that pose is more closely tied to rhythm and pitch in the audio, whereas the audio-to-mesh task focuses on pronunciation and intonation. To account for the influence of previous states, the team uses a transformer decoder to decode the pose sequence, with a cross-attention mechanism that integrates the audio features into the decoder. Both modules are trained with a simple L1 loss.
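For reference, an L1 loss is the mean absolute difference between prediction and target. The sketch below computes it over a sequence of per-frame vectors; the function name and the toy numbers are illustrative and not taken from the project.

```python
def l1_loss(predicted, target):
    """Mean absolute error between two equally shaped sequences of vectors."""
    if len(predicted) != len(target):
        raise ValueError("sequences must contain the same number of frames")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(predicted, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("frames must have the same dimensionality")
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("cannot compute a loss over empty input")
    return total / count

# Two frames with three values each (made-up numbers).
pred = [[0.0, 0.5, 1.0], [0.0, 0.0, 0.0]]
gt = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.5]]
loss = l1_loss(pred, gt)
print(round(loss, 4))
```

Compared with a squared error, the absolute error penalizes large deviations less sharply, which is one common reason to choose it for regression targets such as vertex coordinates.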

Once the mesh and pose sequences have been obtained, perspective projection converts them into a sequence of 2D face landmarks. These landmarks are the input signal for the next stage.
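The projection step can be illustrated with a pinhole camera model. In this sketch the pose is simplified to a single yaw angle plus a translation, and the focal length and image center are assumed values; the actual pipeline works with a full rotation and its own camera parameters.

```python
import math

def rotate_yaw(point, yaw):
    """Rotate a 3D point around the vertical (y) axis by `yaw` radians."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)

def project(point, focal, center):
    """Pinhole perspective projection of a camera-space point to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal * x / z + center[0], focal * y / z + center[1])

def mesh_to_landmarks(vertices, yaw, translation, focal=512.0, center=(256.0, 256.0)):
    """Apply the head pose to each mesh vertex, then project it to 2D."""
    landmarks = []
    for vertex in vertices:
        x, y, z = rotate_yaw(vertex, yaw)
        cam = (x + translation[0], y + translation[1], z + translation[2])
        landmarks.append(project(cam, focal, center))
    return landmarks

# Three toy vertices, placed one unit in front of the camera (assumed values).
vertices = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
landmarks = mesh_to_landmarks(vertices, yaw=0.0, translation=(0.0, 0.0, 1.0))
print(landmarks)
```

Dividing by depth is what makes the projection perspective rather than orthographic: the same head motion produces larger 2D displacements the closer the face is to the camera.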

Lmk2Video

Given a reference portrait and a face landmark sequence, the team's proposed Lmk2Video creates temporally consistent portrait animations. The animation process has to align the motion with the landmark sequence while keeping the appearance consistent with the reference image. The team's approach is to represent the portrait animation as a sequence of portrait frames.

The network structure of Lmk2Video is inspired by AnimateAnyone. The backbone network is SD1.5 with an integrated temporal motion module, which converts multi-frame noise input into a sequence of video frames.

The team also uses a ReferenceNet, which likewise adopts the SD1.5 structure. Its job is to extract appearance information from the reference image and integrate it into the backbone network. This design keeps the face ID consistent throughout the output video.

Unlike AnimateAnyone, AniPortrait makes the design of PoseGuider more complex. The original version only stacked a few convolutional layers and then fused the landmark features with the latent features at the input layer of the backbone network. The Tencent team found that this rudimentary design could not capture the complex movements of the lips. They therefore adopted ControlNet's multi-scale strategy: landmark features at each scale are integrated into the corresponding blocks of the backbone network. Despite these improvements, the number of parameters in the final model is still quite low.
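The multi-scale idea can be sketched as follows: build landmark features at several resolutions and add each one to the backbone block of matching resolution. In this toy version average pooling stands in for the learned convolutional layers, and single-channel 2D grids stand in for feature maps; both are simplifying assumptions, as are all function names.

```python
def avg_pool2x2(grid):
    """Halve the resolution of a square 2D grid by averaging each 2x2 cell."""
    n = len(grid) // 2
    return [
        [
            (grid[2 * i][2 * j] + grid[2 * i][2 * j + 1]
             + grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(n)
        ]
        for i in range(n)
    ]

def build_pyramid(landmark_map, num_scales):
    """Return the landmark map at `num_scales` successively halved resolutions."""
    pyramid = [landmark_map]
    for _ in range(num_scales - 1):
        pyramid.append(avg_pool2x2(pyramid[-1]))
    return pyramid

def inject(backbone_features, pyramid):
    """Add the landmark feature of matching resolution to each backbone block."""
    fused = []
    for feat in backbone_features:
        match = next(level for level in pyramid if len(level) == len(feat))
        fused.append([[f + m for f, m in zip(f_row, m_row)]
                      for f_row, m_row in zip(feat, match)])
    return fused

# An 8x8 landmark map with a 2x2 patch of ones (a stand-in for rendered lips).
landmark_map = [[0.0] * 8 for _ in range(8)]
for r in (2, 3):
    for c in (2, 3):
        landmark_map[r][c] = 1.0

# Zero-valued backbone features at three resolutions (8x8, 4x4, 2x2).
backbone = [[[0.0] * n for _ in range(n)] for n in (8, 4, 2)]
fused = inject(backbone, build_pyramid(landmark_map, num_scales=3))
print([len(level) for level in fused], fused[1][1][1], fused[2][0][0])
```

Injecting only at the input layer, as in the original PoseGuider described above, would correspond to using just the first level of the pyramid.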

The team introduced one more improvement: the landmarks of the reference image are used as an additional input. A cross-attention module in PoseGuider lets the reference landmarks interact with the target landmarks of each frame. This gives the network extra cues for understanding how facial landmarks relate to appearance, which helps the portrait animation produce more precise motion.
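As a rough illustration of this interaction, the sketch below runs scaled dot-product cross-attention with target-landmark features as queries and reference-landmark features as keys and values. It omits the learned query, key, and value projections of a real attention layer, and the feature vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value pairs."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        out = [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Hypothetical 2-dim features: two target landmark tokens, three reference tokens.
target_feats = [[4.0, 0.0], [0.0, 4.0]]
reference_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
attended, attn = cross_attention(target_feats, reference_feats, reference_feats)
print([[round(w, 3) for w in row] for row in attn])
```

Each target token ends up as a weighted mix of reference tokens, with most weight on the reference token it is most similar to.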

Experiment

Implementation details

The backbone network used in the Audio2Lmk stage is wav2vec 2.0. MediaPipe is used to extract the 3D meshes and 6D poses. The training data for Audio2Mesh comes from Tencent's internal dataset, which contains nearly an hour of high-quality speech data from a single speaker.

To ensure that the 3D meshes extracted by MediaPipe are stable, the performer kept their head still and faced the camera during recording. Audio2Pose is trained on HDTF. All training runs on a single A100 with the Adam optimizer and a learning rate of 1e-5.
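To give a feel for what a learning rate of 1e-5 means under Adam, here is one update step written out by hand. Only the learning rate comes from the article; the beta and epsilon values are commonly used defaults and are assumptions here, as are the toy parameters and gradients.

```python
import math

def adam_step(params, grads, state, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """Perform one Adam update and return the new parameter list."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

params = [0.5, -0.25]   # toy parameters
grads = [2.0, -0.1]     # toy gradients
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
updated = adam_step(params, grads, state)
print(updated)
```

On the first step each parameter moves by roughly the learning rate in the direction opposite to its gradient, regardless of the gradient's magnitude, because Adam normalizes by the root of the squared-gradient average.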

The Lmk2Video process uses a two-step training method.

The first step trains the 2D components of the backbone network, ReferenceNet, and PoseGuider, leaving out the motion module. In the second step, all other components are frozen and only the motion module is trained. Two large-scale, high-quality face video datasets are used for training: VFHQ and CelebV-HQ. All data is passed through MediaPipe to extract 2D face landmarks. To make the network more sensitive to lip movements, the team draws the upper and lower lips in different colors when rendering pose images from the 2D landmarks.
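The lip-coloring trick can be sketched in a few lines. This toy renderer plots individual landmark points on a small RGB grid and picks the color by landmark group; the landmark indices, colors, and canvas size are hypothetical, and a real renderer would draw connected contours rather than single pixels.

```python
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
UPPER_LIP_COLOR, LOWER_LIP_COLOR = (255, 0, 0), (0, 0, 255)  # assumed colors

def render_pose_image(landmarks, upper_lip, lower_lip, size=16):
    """Rasterize normalized (x, y) landmarks, coloring the two lips differently."""
    image = [[BLACK for _ in range(size)] for _ in range(size)]
    for index, (x, y) in enumerate(landmarks):
        col = min(size - 1, max(0, int(x * size)))
        row = min(size - 1, max(0, int(y * size)))
        if index in upper_lip:
            color = UPPER_LIP_COLOR
        elif index in lower_lip:
            color = LOWER_LIP_COLOR
        else:
            color = WHITE
        image[row][col] = color
    return image

# Hypothetical landmark set: index 0 is a nose point, 1-2 upper lip, 3-4 lower lip.
landmarks = [(0.5, 0.25), (0.4, 0.6), (0.6, 0.6), (0.4, 0.7), (0.6, 0.7)]
image = render_pose_image(landmarks, upper_lip={1, 2}, lower_lip={3, 4})
for row in image:
    print("".join("U" if px == UPPER_LIP_COLOR else
                  "L" if px == LOWER_LIP_COLOR else
                  "o" if px == WHITE else "." for px in row))
```

With distinct colors, an open mouth and a closed mouth produce clearly different pose images even when the lip landmarks are only a few pixels apart.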

All images are rescaled to 512x512. The model was trained on 4 A100 GPUs, with each step taking 2 days. The optimizer is AdamW, and the learning rate is fixed at 1e-5.

Experimental results

As shown in Figure 2, the animations produced by the new method are excellent in both quality and realism.

[Figure 2]

In addition, users can edit the intermediate 3D representation and thereby modify the final output. For example, users can extract landmarks from a source and change their ID information to achieve face reenactment, as shown in the following video:

[Demo video]

Please refer to the original paper for more details.


Statement

This article is reproduced from 机器之心. If there is any infringement, please contact admin@php.cn.
