It only takes a few seconds to turn an ID photo into a digital human: Microsoft achieves the first high-quality 3D generative diffusion model, and a single sentence can change the avatar's look
With a 2D ID photo, you can design a 3D game avatar in just a few seconds!
This is the latest achievement of diffusion models in the 3D domain. For example, starting from nothing more than an old photo of the French sculptor Auguste Rodin, the model can "bring" him into a game within minutes:

△A 3D avatar generated by the RODIN model from an old photo of Rodin

The outfit and look can even be modified with a single sentence. Tell the AI to generate Rodin "wearing a red sweater and glasses":

Don't like the slicked-back hair? Switch to a "braided look":

Want to try another hair color? Here is "a trendy person with brown hair", with even the beard color matched:

(The "trendy person" in the AI's eyes is indeed a bit too trendy)

The 3D generative diffusion model behind all this, RODIN (Roll-out Diffusion Network), comes from Microsoft Research Asia; its name is also a tribute to the sculptor Auguste Rodin. RODIN is the first model to automatically generate 3D digital avatars by training a generative diffusion model on 3D data. The paper has been accepted by CVPR 2023.
Let's take a look.

Training the diffusion model directly on 3D data

Previously, models that generated 3D images from 2D inputs were usually obtained by training generative adversarial networks (GANs) or variational autoencoders (VAEs) on 2D data, but the results were often unsatisfactory.

The researchers' analysis is that these methods suffer from a fundamentally ill-posed problem: because a single-view image is geometrically ambiguous, it is hard to learn a reasonable distribution of high-quality 3D avatars from large amounts of 2D data alone, and the generated results suffer.

So this time, they tried training the diffusion model directly on 3D data, which meant solving three main problems:
First, 3D-aware convolution preserves the intrinsic correlations among the three feature planes after dimensionality reduction.
The 2D convolutional neural networks (CNNs) used in conventional 2D diffusion cannot handle triplane feature maps well. Instead of treating them as three independent 2D feature planes, 3D-aware convolution takes the inherently three-dimensional nature of this representation into account: a 2D feature on any one of the three planes is essentially the projection of a line in 3D space, so it is correlated with the projections of that same line onto the other two planes.
To enable this cross-plane communication, the researchers build these 3D correlations into the convolution itself, efficiently synthesizing 3D details with 2D operations; a minimal sketch of the idea follows.
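Here is a hedged PyTorch sketch of one way such a 3D-aware convolution could work. It illustrates the idea rather than the paper's exact operator: the average pooling, channel widths, and kernel size are all assumptions. Before each 2D convolution, every plane is concatenated with features pooled from the other two planes along their shared axis, so the convolution sees the correlated line projections.

```python
import torch
import torch.nn as nn

class TriplaneAwareConv(nn.Module):
    """Sketch of a 3D-aware convolution over triplane features.

    Each plane is augmented with features pooled from its two sibling
    planes along the axis they share, before an ordinary 2D convolution,
    letting information flow across planes.
    """

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # each plane sees its own C channels plus 2*C pooled sibling channels
        self.conv_xy = nn.Conv2d(3 * in_ch, out_ch, 3, padding=1)
        self.conv_xz = nn.Conv2d(3 * in_ch, out_ch, 3, padding=1)
        self.conv_yz = nn.Conv2d(3 * in_ch, out_ch, 3, padding=1)

    def forward(self, xy, xz, yz):
        # xy: (B,C,X,Y)   xz: (B,C,X,Z)   yz: (B,C,Y,Z)
        X, Y, Z = xy.shape[2], xy.shape[3], xz.shape[3]

        # siblings projected onto the xy plane (pool away the Z axis)
        a = xz.mean(3, keepdim=True).expand(-1, -1, -1, Y)
        b = yz.mean(3, keepdim=True).transpose(2, 3).expand(-1, -1, X, -1)
        out_xy = self.conv_xy(torch.cat([xy, a, b], dim=1))

        # siblings projected onto the xz plane (pool away the Y axis)
        c = xy.mean(3, keepdim=True).expand(-1, -1, -1, Z)
        d = yz.mean(2, keepdim=True).expand(-1, -1, X, -1)
        out_xz = self.conv_xz(torch.cat([xz, c, d], dim=1))

        # siblings projected onto the yz plane (pool away the X axis)
        e = xy.mean(2, keepdim=True).transpose(2, 3).expand(-1, -1, -1, Z)
        f = xz.mean(2, keepdim=True).expand(-1, -1, Y, -1)
        out_yz = self.conv_yz(torch.cat([yz, e, f], dim=1))

        return out_xy, out_xz, out_yz
```

For example, `TriplaneAwareConv(32, 32)` maps three 64×64 feature planes to three new planes of the same size while mixing information across them.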
Second, a latent space orchestrates the generation of the triplane 3D representation.
The researchers coordinate feature generation through a latent vector so that it is globally consistent across the entire 3D space, which yields higher-quality avatars and enables semantic editing.
At the same time, an additional image encoder is trained on images from the training dataset to extract semantic latent vectors, which serve as the conditional input to the diffusion model.
In this way, the overall generative network can be viewed as an autoencoder, with the diffusion model acting as the decoder of the latent-space vectors. For semantic editability, the researchers adopted a frozen CLIP image encoder, which shares its latent space with text prompts; a sketch of this conditioning setup follows.
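As a hedged illustration of that last point, extracting such a conditioning vector with Hugging Face's CLIP implementation could look like the code below. Only the use of a frozen CLIP encoder whose image and text embeddings share a space comes from the text; the checkpoint choice, file name, and the `denoiser` interface are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():
    p.requires_grad_(False)  # the CLIP encoder stays frozen

# image conditioning: embed a portrait photo ("portrait.jpg" is hypothetical)
image = Image.open("portrait.jpg")
with torch.no_grad():
    z_img = clip.get_image_features(**processor(images=image, return_tensors="pt"))

# because CLIP's image and text embeddings live in a shared space, a text
# prompt can stand in for the photo when editing the avatar
with torch.no_grad():
    z_txt = clip.get_text_features(**processor(
        text=["a man wearing a red sweater and glasses"],
        return_tensors="pt", padding=True))

# z_img / z_txt (shape (1, 512) for this checkpoint) would then be passed to
# the triplane diffusion model as its conditional input, e.g.
# denoiser(noisy_triplane, t, cond=z_img)   # `denoiser` is hypothetical
```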
Third, hierarchical synthesis generates high-fidelity three-dimensional details.
The researchers first used the diffusion model to generate a low-resolution triplane (64×64), and then generated a high-resolution triplane (256×256) through diffusion upsampling.
This way, the base diffusion model focuses on generating the overall 3D structure, while the subsequent upsampling model focuses on detail generation, as in the sketch below.
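A minimal sketch of this coarse-to-fine sampling, assuming trained denoisers `base_denoiser` and `upsampler`, a `run_diffusion` helper that executes the reverse-diffusion loop, and a triplane stored as three 32-channel planes stacked along the channel axis. The helper names and channel width are assumptions; only the two resolutions come from the text.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_cascaded(base_denoiser, upsampler, run_diffusion, cond):
    # stage 1: the base model generates a coarse 64x64 triplane that fixes
    # the overall 3D structure (3 planes stacked along the channel axis)
    coarse = run_diffusion(base_denoiser, shape=(1, 3 * 32, 64, 64), cond=cond)

    # stage 2: a diffusion upsampler fills in high-frequency detail at
    # 256x256, conditioned on the bilinearly enlarged coarse triplane
    coarse_up = F.interpolate(coarse, scale_factor=4, mode="bilinear",
                              align_corners=False)
    fine = run_diffusion(upsampler, shape=(1, 3 * 32, 256, 256),
                         cond={"latent": cond, "lowres": coarse_up})
    return fine
```

Splitting the work this way lets each stage stay small: the base model never has to model fine texture, and the upsampler never has to invent global structure.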
For the training dataset, the researchers used the open-source 3D rendering software Blender to randomly combine virtual 3D character models hand-crafted by artists with hair, clothing, expressions, and accessories sampled from large asset pools, creating 100,000 synthetic individuals and rendering 300 multi-view images at 256×256 resolution for each individual; a sketch of such a pipeline appears below.
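As a hedged sketch of how such a pipeline could be scripted with Blender's Python API (`bpy`): the asset collection names, variant counts, camera radius, and the camera's Track To constraint are all assumptions for illustration; only the resolution and the 100,000 × 300 figures come from the text.

```python
import math
import random
import bpy

scene = bpy.context.scene
scene.render.resolution_x = 256
scene.render.resolution_y = 256

# hypothetical asset groups with assumed variant counts
ASSET_GROUPS = {"hair": 50, "clothes": 40, "expression": 20, "accessory": 30}

def randomize_avatar():
    # enable exactly one randomly chosen variant per asset group
    # (assumes variants live in collections named e.g. "hair_07")
    for group, n in ASSET_GROUPS.items():
        pick = random.randrange(n)
        for i in range(n):
            coll = bpy.data.collections.get(f"{group}_{i:02d}")
            if coll is not None:
                coll.hide_render = (i != pick)

# assumed: the camera tracks the head via a pre-set Track To constraint
cam = bpy.data.objects["Camera"]

for person in range(100_000):       # 100k synthetic individuals
    randomize_avatar()
    for view in range(300):         # 300 random views per individual
        theta = random.uniform(0, 2 * math.pi)
        phi = random.uniform(math.radians(60), math.radians(120))
        r = 2.5                     # assumed camera distance
        cam.location = (r * math.sin(phi) * math.cos(theta),
                        r * math.sin(phi) * math.sin(theta),
                        r * math.cos(phi))
        scene.render.filepath = f"//renders/{person:06d}/{view:03d}.png"
        bpy.ops.render.render(write_still=True)
```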
For generating 3D avatars from text, the researchers used the portrait subset of the LAION-400M dataset to train a mapping from the input modality into the latent space of the 3D diffusion model, so that the final RODIN model can create a realistic 3D avatar from just a single 2D image or a text description. One plausible form of that mapping is sketched below.
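The article does not specify the form of this mapping; one simple possibility, shown purely as an assumption, is a small MLP regressed from frozen CLIP embeddings of the LAION portraits onto the latents of the trained 3D autoencoder. All dimensions and the loss are assumptions.

```python
import torch
import torch.nn as nn

mapper = nn.Sequential(
    nn.Linear(512, 1024), nn.SiLU(),
    nn.Linear(1024, 1024), nn.SiLU(),
    nn.Linear(1024, 768),   # assumed width of the diffusion model's latent
)
opt = torch.optim.Adam(mapper.parameters(), lr=1e-4)

def train_step(clip_embed: torch.Tensor, avatar_latent: torch.Tensor) -> float:
    # regress the diffusion latent from the CLIP embedding (an assumption;
    # the paper may train this mapping differently)
    loss = nn.functional.mse_loss(mapper(clip_embed), avatar_latent)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```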
△Generating an avatar from a given photo
Not only can the look be changed with one sentence, such as "a man with curly hair and a beard, wearing a black leather jacket":
Even the gender can be changed at will, as in "a woman in red clothes with an afro hairstyle": (tongue firmly in cheek)
The researchers also provided an application demo; creating your own avatar takes just a few clicks:
△Editing a 3D portrait with text
For more examples, head over to the project page~
△More randomly generated avatars

Now that RODIN is built, what are the team's next plans? According to its authors at Microsoft Research Asia, RODIN's current results mainly focus on 3D half-body portraits, which reflects the fact that it was trained mostly on face data; but the demand for 3D image generation is not limited to faces.
Next, the team will consider trying to use RODIN models to create more 3D scenes, including flowers, trees, buildings, cars, homes, and more, working toward the ultimate goal of "generating everything in 3D with one model".

Paper: https://arxiv.org/abs/2212.06135
Project page: https://3d-avatar-diffusion.microsoft.com