Digital humans light the main torch of the Asian Games, and this ICCV paper reveals Ant's cutting-edge generative AI technology
Look inside a digital human and you will find generative AI throughout.
On the evening of September 23, at the opening ceremony of the Hangzhou Asian Games, the lighting of the main torch began with the "little flames" of hundreds of millions of online digital torchbearers gathering on the Qiantang River to form the figure of a digital human. The digital human torchbearer and the sixth on-site torchbearer then walked to the torch stage together and lit the main torch together.
As the core idea of the opening ceremony, the torch lighting that linked digital and on-site torchbearers became a trending search topic and attracted wide attention.
The digital human torch lighting was an unprecedented undertaking, with hundreds of millions of participants and a large number of advanced, complex technologies. One of the most important questions is how to make digital humans "move". With the rapid development of generative AI and large models, digital human research is clearly changing as well.
We noticed that a study on generating 3D digital human motion has been accepted at ICCV 2023, the international computer vision conference taking place in early October. The paper, titled "Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models", is a joint work by Zhejiang University and Ant Group.
Reportedly, this research goes some way toward solving the problem of synthesizing complex, long-range motion for digital humans, and achieves results that the original models or path planning could not. Related digital human driving technology was also used in the Asian Games online torch relay, which involved digital humans on the scale of 100 million.
Generative AI drives digital humans to move
We often need to synthesize 3D human motion in a given 3D scene so that virtual humans can walk around the scene naturally and interact with objects. This capability has many applications in AR/VR, film production, and video games.
Traditional character-control methods aim to generate short-term or repetitive motions guided by the user's control signals. The new research instead focuses on generating longer human-object interactions, given only a starting position and a target object model.
Although this setting is more useful, it is also clearly more challenging. First, human-object interactions should be coherent, which requires the ability to model long-range interactions between humans and objects. Second, in the context of content generation, generative models should be able to synthesize diverse motions, since there are many plausible ways for real people to approach and interact with a target object.
Existing methods for synthesizing digital human motion can be roughly divided into online and offline generation. Most online methods focus on real-time control of the character. Given a target object, they typically use autoregressive models that generate future motion recurrently by feeding their own predictions back in. Although this approach is widely used in interactive scenarios such as video games, its quality is still unsatisfactory for long-term generation.
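The feedback loop of an online autoregressive method can be sketched as follows. This is a toy illustration, not the model of any method mentioned here: `predict_next` is a hypothetical stand-in for a learned network, and the character state is reduced to a 2D root position.

```python
import math

def predict_next(pos, target, step=0.1):
    """Hypothetical one-step predictor: move a fixed step toward the target."""
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return target
    return (pos[0] + step * dx / dist, pos[1] + step * dy / dist)

def rollout(start, target, max_frames=1000):
    """Autoregressive rollout: each prediction is fed back as the next input."""
    frames = [start]
    while frames[-1] != target and len(frames) < max_frames:
        frames.append(predict_next(frames[-1], target))
    return frames

frames = rollout((0.0, 0.0), (1.0, 0.0))
```

Because every frame is computed from the previous prediction rather than from ground truth, any error a real predictor makes is carried into all later frames, which is one plausible reason quality degrades over long horizons.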
To improve motion quality, some recent offline methods have adopted a multi-level framework that first generates a trajectory and then synthesizes the motion along it. Although this strategy can produce reasonable paths, the diversity of those paths is limited.
In this new study, the authors propose a new offline method for synthesizing long-term and diverse human-object interactions. The innovation of the method lies in its hierarchical generation strategy: it first predicts a series of milestones and then generates the human motion between those milestones.
Specifically, given a starting position and a target object, the authors designed a milestone generation module to synthesize a set of nodes along the movement trajectory. Each milestone encodes a local pose and marks a transition point in the human movement. Based on these milestones, the algorithm uses a motion generation module to produce the complete motion sequence. Thanks to the milestones, generating a long sequence reduces to synthesizing several short motion sequences.
The local pose at each milestone is generated by a transformer model that takes global dependencies into account to produce temporally consistent results, which further promotes coherent motion.
In addition to the hierarchical generation framework, the researchers used diffusion models to synthesize human-object interactions. Some earlier diffusion models for motion synthesis combined transformers with denoising diffusion probabilistic models (DDPM).
It is worth noting that, because the motion sequences are long, applying such models directly to the new setting requires a great deal of computation and may exhaust GPU memory. Because the new hierarchical framework converts long-term generation into the synthesis of multiple short sequences, the GPU memory required drops to the same level as short-term motion generation.
As a result, the researchers can use a Transformer DDPM effectively to synthesize long-term motion sequences, which improves generation quality.
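The memory argument can be illustrated with a back-of-the-envelope count. The sketch below assumes standard self-attention, whose score matrix grows with the square of the sequence length, and assumes the segments between milestones are processed one at a time; the frame counts are made up for illustration and do not come from the paper.

```python
def attention_entries(num_frames):
    """Entries in one self-attention score matrix for num_frames tokens."""
    return num_frames * num_frames

def peak_entries_full(total_frames):
    # Denoising the whole sequence at once: one T x T score matrix.
    return attention_entries(total_frames)

def peak_entries_segmented(total_frames, num_segments):
    # Segments are denoised one at a time, so the peak cost is set by
    # the longest segment (ceiling division).
    longest = -(-total_frames // num_segments)
    return attention_entries(longest)

full = peak_entries_full(600)                # 360000
segmented = peak_entries_segmented(600, 10)  # 3600
```

Under these assumptions, splitting a 600-frame sequence into 10 segments cuts the peak attention matrix by a factor of 100, which is the sense in which long-term generation costs about as much memory as short-term generation.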
To this end, the researchers designed a hierarchical generation framework, as shown in the figure below.
First, they use GoalNet to predict interaction targets on the object, and then generate target poses to explicitly model the human-object interaction. Next, they use the milestone generation module to estimate the number of milestones, generate the milestone trajectory from the starting point to the target, and place a pose at each milestone.
In this way, long-range motion generation is decomposed into a combination of several short-range motion generation tasks. Finally, the authors designed a motion generation module to synthesize the trajectories between milestones and fill in the motion.
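The three stages described above can be laid out as a skeleton. Everything in this sketch is a placeholder: the function names are hypothetical, and each learned module (GoalNet, the milestone generator, the motion generator) is replaced by simple linear interpolation purely to show how the stages connect. It assumes at least two milestones.

```python
def predict_goal(object_position):
    """Stand-in for GoalNet: here the goal is simply the object position."""
    return object_position

def generate_milestones(start, goal, num_milestones):
    """Stand-in for the milestone module: evenly spaced points, start to goal."""
    n = num_milestones
    return [
        tuple(s + (g - s) * i / (n - 1) for s, g in zip(start, goal))
        for i in range(n)
    ]

def infill_segment(a, b, frames_per_segment):
    """Stand-in for the motion module: frames from a up to, not including, b."""
    return [
        tuple(x + (y - x) * k / frames_per_segment for x, y in zip(a, b))
        for k in range(frames_per_segment)
    ]

def synthesize(start, object_position, num_milestones=4, frames_per_segment=5):
    goal = predict_goal(object_position)
    milestones = generate_milestones(start, goal, num_milestones)
    motion = []
    # Each pair of consecutive milestones is a short, independent sub-problem.
    for a, b in zip(milestones, milestones[1:]):
        motion.extend(infill_segment(a, b, frames_per_segment))
    motion.append(milestones[-1])
    return milestones, motion

milestones, motion = synthesize((0.0, 0.0), (3.0, 0.0))
```

The point of the structure is that the loop body only ever sees one short segment, regardless of how long the overall sequence is.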
Artificial Intelligence (AI) Posture Generation
The researchers call the pose in which a person interacts with an object and remains stationary the target pose. Most previous methods used cVAE models to generate human poses, but the researchers found that this approach performed poorly in their experiments.
To address this, they adopted a VQ-VAE to model the data distribution. This model uses a discrete representation that clusters the data into a finite set of points. They also observed that different human poses can share similar properties. For example, when a person sits down, the hand positions may differ while the leg position stays the same. They therefore divided the joints into L (L = 5) non-overlapping groups.
As shown in Figure 3, the target pose is divided into independent joint groups.
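The quantization step at the heart of a VQ-VAE, applied per joint group, can be sketched as a nearest-neighbor lookup. This is a minimal illustration under assumptions: it uses one codebook per group, 2 groups instead of the paper's 5, and hand-picked 2D features; the group labels in the comments are hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def quantize_pose(group_features, codebooks):
    """Quantize each joint group independently against its own codebook."""
    return [nearest_code(f, cb) for f, cb in zip(group_features, codebooks)]

# Toy setup: 2 groups, 2D features, 3 codes per group.
codebooks = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],   # hypothetical "legs" group
    [(0.0, 0.0), (2.0, 2.0), (-1.0, 1.0)],  # hypothetical "arms" group
]
sitting_a = [(0.9, 0.1), (1.8, 2.1)]
sitting_b = [(1.1, -0.1), (-0.9, 0.8)]

codes_a = quantize_pose(sitting_a, codebooks)  # [1, 1]
codes_b = quantize_pose(sitting_b, codebooks)  # [1, 2]
```

The two toy poses share the same code for the first group and differ in the second, mirroring the sitting example: grouping lets poses reuse a common leg configuration while varying the arms.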
Given the starting pose and the target pose, the algorithm can generate the milestone trajectory and synthesize the local pose at each milestone. Because the length of the motion is unknown and can be arbitrary (for example, a person may walk quickly to a chair and sit down, or walk slowly around the chair and then sit down), the number of milestones, denoted N, must be predicted. Then N milestone points are synthesized and a local pose is placed at each of them.
The last step is motion generation. Rather than predicting motion frame by frame, the researchers synthesize the entire sequence hierarchically from the generated milestones. They first generate trajectories and then synthesize motion. Specifically, between two consecutive milestones they complete the trajectory first, and then fill in the motion guided by the poses at those milestones. The two steps are carried out by two separate Transformer DDPMs.
The researchers carefully designed the conditioning of the DDPM at each step so that it generates the target output.
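For readers unfamiliar with DDPM, the forward noising process that such a model learns to reverse can be sketched in a few lines. This is the generic DDPM formulation with a linear schedule, not the paper's configuration: the step count, schedule, and toy pose vector are assumptions, and the transformer denoiser and its conditioning are omitted.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule, as in the original DDPM formulation."""
    return [
        beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        for t in range(num_steps)
    ]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    scale, sigma = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]

abar = alpha_bars(linear_betas(1000))
rng = random.Random(0)
x0 = [0.5, -0.2, 1.0]                  # toy "pose" vector (assumption)
x_early = q_sample(x0, 10, abar, rng)  # still close to x0
x_late = q_sample(x0, 999, abar, rng)  # almost pure Gaussian noise
```

Generation runs this process in reverse: a denoiser, conditioned on inputs such as the milestone poses, starts from noise and removes it step by step.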
Results ahead of other methods
The researchers compared the results of different methods on the SAMP dataset. The method proposed in the paper achieves a lower FD, a higher user study score, and a higher APD. Its trajectory diversity is also higher than that of SAMP.
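As a rough guide to reading the diversity numbers, the sketch below assumes APD stands for average pairwise distance, the usual diversity measure in motion synthesis work, and computes it over toy 2D samples. The paper's exact feature space and normalization are not given here, so treat this as an illustration of the idea only.

```python
import math
from itertools import combinations

def average_pairwise_distance(samples):
    """Mean Euclidean distance over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

identical = [(0.0, 0.0)] * 3
spread = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

apd_identical = average_pairwise_distance(identical)  # 0.0
apd_spread = average_pairwise_distance(spread)        # (5 + 10 + 5) / 3
```

A model that always produces the same motion scores 0, and the score rises as the generated samples spread apart, which is why a higher APD is read as greater diversity.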
The new method also produces satisfactory results in complex scenes. The percentage of frames with penetration is 3.8% for this method, compared with 4.9% for SAMP.
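One plausible way to compute such a percentage is sketched below. It assumes each frame has already been reduced to the minimum signed distance between the body and the scene, with negative values meaning the body is inside scene geometry; this definition and the sample values are assumptions, not the paper's evaluation code.

```python
def penetration_frame_percentage(min_signed_distances, tolerance=0.0):
    """Percentage of frames whose minimum body-to-scene distance is negative."""
    if not min_signed_distances:
        return 0.0
    bad = sum(1 for d in min_signed_distances if d < -tolerance)
    return 100.0 * bad / len(min_signed_distances)

# Toy per-frame minimum signed distances in meters (made-up values).
per_frame = [0.05, 0.02, -0.01, 0.03, 0.00, -0.04, 0.10, 0.08]
pct = penetration_frame_percentage(per_frame)  # 2 of 8 frames -> 25.0
```

Lower is better: a frame counts against the method as soon as any part of the body intersects the scene.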
On SAMP, COUCH and other datasets, the method proposed in the study achieves better results than the baseline methods.
A complete full-link layout
A digital human is a combination of multimodal technologies spanning speech, semantics, vision and more. With the recent breakthroughs in generative AI, the digital human field is developing by leaps and bounds. Modeling, generation, interaction, rendering and other steps that previously required manual work are now being taken over by AI. As engineers continue to optimize it, the experience of this technology on mobile devices keeps improving. The just-concluded online Asian Games torch relay is a good example: to become a torchbearer, users only needed to open the mini program in the Alipay app.
Reportedly, to ensure the opening ceremony project ran smoothly, Ant Group's engineers ran more than 100,000 tests on hundreds of different phone models and wrote more than 200,000 lines of code. They combined the self-developed Web3D interactive engine Galacean with AI digital humans, cloud services, blockchain and other technologies so that everyone could become a digital torchbearer and take part in the torch relay. The Asian Games digital torchbearer platform can serve hundreds of millions of users and supports 97% of common smartphones.
To give digital torchbearers a realistic sense of participation, Ant's technical team developed 58 face customization controllers. Using facial recognition and AI algorithms, the system can generate a digital torchbearer's face from each person's facial features. Users can also freely adjust the face shape, hairstyle, nose, mouth, eyebrows and other features to customize their look. This technology offers 2 trillion different digital avatars.
In addition, after the torch lighting at the opening ceremony, each digital torchbearer can receive an exclusive digital torch-lighting certificate bearing that torchbearer's unique avatar. The certificate is stored on the blockchain through distributed technology.
It is easy to see from the research paper and the Asian Games project that a complete digital human technology system stands behind them. Ant Group is reportedly actively exploring digital human technology and has completed a self-developed layout of the full-link core technologies.
Unlike most companies on the market, Ant Group develops its digital human technology in-house and has chosen a direction that combines it with generative AI. Its technical layout covers the entire digital human life cycle: modeling, rendering, driving, and interaction. Combining AIGC and large models significantly reduces the full-link production cost of digital humans. The platform currently supports both 2D and 3D digital humans and offers a range of solutions, including broadcast and interactive types.
Beyond the Asian Games, the Ant digital human platform also supports Ant Group businesses such as Alipay, digital finance, government services and Wufu, and this year it began providing basic services to partners through short videos, live streams, mini programs and other channels.
In the near future, as digital humans powered by generative AI continue to improve, we can expect better interactions in more scenarios and a smart life that truly integrates the digital and the physical.