Wang Wenbing, Head of Rokid's Algorithm Team: Sound in AR, an Immersive "Wonderland"
Sound is ubiquitous and indispensable in our daily lives, and the same is true in the metaverse. Achieving full immersion in metaverse scenes requires the continuous upgrading and development of a range of sound technologies. At the AISummit Global Artificial Intelligence Technology Conference recently held by 51CTO, Wang Wenbing, head of Rokid's algorithm team, delivered the keynote speech "Sound in AR: An Immersive 'Wonderland'", introducing the concept behind Rokid's self-developed 6DoF spatial sound field, its main technical modules and difficulties, its development trends in combination with AR, and the original motivation for developing the technology, explaining how spatial sound field technology manifests in the metaverse.
The content of the speech is organized as follows:
To approach this question, first set aside technical constraints and imagine how sound in AR ought to be presented. Most of the TVs and mobile phones we use today are two-channel stereo; home theaters already use multi-channel setups, and professional venues such as movie theaters place speakers throughout the space.
How should sound be presented in AR? Imagine a scene such as an online meeting or an online class, both very popular now. If the digital human you see on your right in the metaverse keeps talking while the voice comes from your left, doesn't that feel strange?
We can also imagine an AR game. On a 2D screen, sound can follow the visual focus, but in a 360-degree 3D scene the human eye cannot cover the entire field of view, whereas sound offers a global focus. This is why players in many games switch perspective according to what they hear. From this we can see some characteristics that sound in AR must have: it must satisfy people's high sensitivity to sound, provide sound's global focus, and meet the requirement of realism.
Next, we introduce the development of sound presentation along three dimensions.
First, the dimension of spatial expression. Sound expression has evolved from mono and stereo, to planar multi-channel formats such as 5.1/7.1/9.1, and on to spatial multi-channel formats such as 5.1.x/7.1.x. The number of speakers keeps growing, and their placement has expanded from a plane into space.
Second, the dimension of encoding methods. Encoding began as channel-based: each channel mixes a variety of sounds, as in our familiar left/right stereo. It then moved to object-based encoding, which encodes the sounding object itself; this includes the Dolby Atmos sources shown in cinemas. For example, when a shell is fired, that shell is encoded as an object, its trajectory is recorded in metadata, and playback maps it to the appropriate speaker positions. The ultimate goal, however, is fully scene-based encoding, in the panoramic style of HOA (Higher-Order Ambisonics): not just the shells, but every flower, blade of grass, and falling leaf should carry a sense of space (see the scene-based encoding sketch after this three-part list).
Third, the XR experience dimension. Virtual sound used to be divorced from the real world; now in XR, and especially in AR, what we have been pursuing is the fusion of the virtual and the real.
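To make scene-based encoding concrete, here is a minimal first-order Ambisonics encoder (the foundation HOA builds on) in Python. This is a textbook sketch, not Rokid's implementation; it uses the classic B-format convention (W/X/Y/Z), and modern HOA toolchains may use different orderings and normalizations (e.g. ACN/SN3D).

```python
# Scene-based encoding in a nutshell: first-order Ambisonics projects a
# mono source at a given direction onto four spherical-harmonic
# channels. Classic B-format convention; illustrative only.
import numpy as np

def encode_foa(mono: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Encode a mono signal into W/X/Y/Z B-format channels for a
    source at (azimuth, elevation) in radians."""
    w = mono * (1.0 / np.sqrt(2.0))                   # omnidirectional
    x = mono * np.cos(azimuth) * np.cos(elevation)    # front-back
    y = mono * np.sin(azimuth) * np.cos(elevation)    # left-right
    z = mono * np.sin(elevation)                      # up-down
    return np.stack([w, x, y, z], axis=1)

# The encoded scene can later be rotated as a whole (e.g. to follow
# head rotation) and decoded to any speaker layout or to binaural.
sig = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)
bfmt = encode_foa(sig, azimuth=np.radians(30), elevation=0.0)
print(bfmt.shape)  # (48000, 4)
```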
The reason people can distinguish sounds so finely is binaural hearing; technically, ITD and ILD, the interaural time difference and interaural level difference. These two differences let us quickly localize the direction of a sounding object.
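As an illustration of the ITD cue, the sketch below uses Woodworth's classic spherical-head approximation. The head radius, sample rate, and the crude 3 dB level difference are illustrative assumptions; it is a toy lateralizer, not a production renderer.

```python
# Toy model of ITD (interaural time difference) using Woodworth's
# spherical-head approximation. Not Rokid's implementation.
import numpy as np

HEAD_RADIUS = 0.0875    # metres, average adult head (assumption)
SPEED_OF_SOUND = 343.0  # m/s at roughly room temperature

def itd_woodworth(azimuth_rad: float) -> float:
    """ITD in seconds for a far source at the given azimuth
    (0 = straight ahead, positive = to the right)."""
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (azimuth_rad + np.sin(azimuth_rad))

def lateralize(mono: np.ndarray, azimuth_rad: float, fs: int = 48000) -> np.ndarray:
    """Pan a mono signal by delaying and attenuating the far ear.
    A real renderer would use HRTFs; this only models ITD plus a
    crude broadband ILD."""
    delay = int(round(itd_woodworth(abs(azimuth_rad)) * fs))  # samples
    near = mono
    far = np.concatenate([np.zeros(delay), mono])[: len(mono)] * 0.7  # ~3 dB shadow (toy value)
    left, right = (far, near) if azimuth_rad > 0 else (near, far)
    return np.stack([left, right], axis=1)

# A source at 45 degrees to the right reaches the right ear ~0.4 ms earlier.
print(f"ITD at 45 deg: {itd_woodworth(np.radians(45)) * 1e3:.2f} ms")
```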
So how do we popularize 3D sound, break through venue limitations, lower the cost for users, and let everyone enjoy the technology? Rokid's self-developed 6DoF spatial sound field helps solve these problems.
The name "6DoF spatial sound field" divides into two parts: 6DoF and spatial sound field. 6DoF denotes six degrees of freedom: the gyroscope provides rotation about the X, Y, and Z axes, and the accelerometer provides acceleration, and hence translation, along the same three axes.
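A minimal sketch of how a 6DoF head pose might feed a spatializer: the pose (rotation plus translation) maps a world-space source into listener-local coordinates, from which azimuth and distance follow. The coordinate convention and function names are assumptions for illustration, not Rokid's SDK.

```python
# How a 6DoF head pose (3 rotations from the gyroscope, 3 translations
# integrated from the accelerometer) relativizes a virtual source to
# the listener. Conventions here are assumptions, not Rokid's SDK.
import numpy as np
from scipy.spatial.transform import Rotation

def source_relative_to_listener(source_pos, head_pos, head_rot: Rotation):
    """Return the source position in head-local coordinates, plus the
    azimuth (deg) and distance a spatializer would feed to HRTF
    selection and distance attenuation."""
    local = head_rot.inv().apply(np.asarray(source_pos) - np.asarray(head_pos))
    x, y, z = local  # x = right, y = up, z = back (OpenGL-style, an assumption)
    azimuth = np.degrees(np.arctan2(x, -z))
    distance = np.linalg.norm(local)
    return local, azimuth, distance

# A listener at the origin turns 90 degrees left; a source straight
# ahead in world space now sits 90 degrees to the listener's right.
rot = Rotation.from_euler('y', 90, degrees=True)
_, az, dist = source_relative_to_listener([0, 0, -2], [0, 0, 0], rot)
print(f"azimuth {az:.0f} deg, distance {dist:.1f} m")
```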
A 6DoF spatial sound field covers the generation, propagation, rendering, encoding, and decoding of sound, as well as the fusion and interaction of virtual and real sounds throughout the pipeline.
The main technical modules of the 6DoF spatial sound field are HRTFs, sound field rendering, and sound effects. An HRTF (head-related transfer function) characterizes how sound travels from a source in the free field to the eardrum; it is obtained by simulating, in an anechoic-chamber environment, the transmission of sound from all directions to the human ear. Sound field rendering gives people the ability to locate sounds by listening and blends virtual and real objects, properly handling the influence of real objects on virtual sound sources. Sound effects enrich the sound quality, using open speakers designed for privacy that reduce sound leakage while preserving volume.
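To show what HRTF-based rendering amounts to in code, here is a minimal binaural renderer that convolves a mono source with a left/right HRIR (head-related impulse response) pair. The HRIRs below are placeholder delay-and-gain impulses standing in for real anechoic measurements; this is a generic sketch, not Rokid's module.

```python
# Minimal binaural rendering: convolve a mono source with the HRIR
# pair for its direction. Real HRIRs come from measured databases;
# the placeholders below only mimic ITD and ILD, not pinna filtering.
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray) -> np.ndarray:
    """Produce a 2-channel signal carrying the ITD, ILD and spectral
    cues encoded in the given HRIR pair."""
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right], axis=1)

fs = 48000
hrir_l = np.zeros(256); hrir_l[40] = 0.6   # later, quieter: far ear
hrir_r = np.zeros(256); hrir_r[20] = 1.0   # earlier, louder: near ear
mono = np.random.randn(fs)                 # 1 s of noise as a test signal
stereo = render_binaural(mono, hrir_l, hrir_r)
print(stereo.shape)  # (48255, 2)
```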
The SDK at the top of the architecture diagram exposes the spatial modules externally, namely the spatial engine interface and the speech engine interface. Spatial information can be acquired and modeled, helping integrate the digital and physical worlds.
In addition, we made some modifications to the Room Effect. Its overall framework resembles the classic network structure: first the network is constructed and a theoretically lossless network is produced; then, on top of that, the various attenuation and loss settings are applied, including absorption, occlusion, and reflection. Our goal is not to produce every possible effect; we provide sound effects matched to the product's usage scenarios, such as theater or music, so that users get a good audio-visual experience. These can be experienced on the next-generation AR glasses, Rokid Max.
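The talk does not name the network structure, but the description, build a lossless prototype and then layer attenuation onto it, matches the classic feedback delay network (FDN) reverb design. The sketch below is a generic FDN under assumed delay lengths and decay time, not Rokid's Room Effect.

```python
# Generic feedback delay network (FDN) reverb: an orthogonal feedback
# matrix gives a lossless prototype; per-line gains then set the decay.
# Delay lengths and T60 are assumptions for illustration.
import numpy as np

def fdn_reverb(x: np.ndarray, fs: int = 48000, t60: float = 1.2) -> np.ndarray:
    delays = np.array([1031, 1327, 1523, 1871])  # mutually prime lengths (samples)
    n = len(delays)
    # Lossless prototype: a Householder reflection is orthogonal, so
    # the undamped network preserves energy and would ring forever.
    A = np.eye(n) - (2.0 / n) * np.ones((n, n))
    # Attenuation: per-line gains chosen so energy decays 60 dB in t60 s.
    g = 10 ** (-3.0 * delays / (fs * t60))
    buffers = [np.zeros(d) for d in delays]
    idx = np.zeros(n, dtype=int)
    y = np.zeros(len(x))
    for t in range(len(x)):
        outs = np.array([buffers[i][idx[i]] for i in range(n)])
        y[t] = outs.sum() / n
        feedback = A @ (g * outs)
        for i in range(n):
            buffers[i][idx[i]] = x[t] + feedback[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y

wet = fdn_reverb(np.r_[1.0, np.zeros(47999)])  # 1 s impulse response
print(f"tail energy after 0.5 s: {np.sum(wet[24000:] ** 2):.4f}")
```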
Comparing 6DoF spatial sound fields: the left side shows a third-party SDK. As the source rotates from 0 to 90 degrees, the response in each frequency band does not change smoothly; it drops sharply at first and then barely changes. On the right, Rokid's 6DoF spatial sound field shows clear changes across frequency bands as your position changes. The figure plots the behavior at different angles, frequency bands, and amplitudes.
With the advent of the metaverse era and the rise of AR and VR technologies, the development of spatial sound fields has ushered in new opportunities.
The development trend of spatial sound fields is mainly reflected in three aspects:
First, immersion: the virtual world should respond to feedback from the real world, so that virtual and real integrate and interact, achieving a truly immersive experience. Sounds in the virtual world should not be immune to the influence of objects in the real world, because that makes the two feel separate. Beyond integration, interaction is also needed. For example, in the virtual world you can interact with the augmented sound on an AR device through voice, gestures, and other methods, choosing to pause, play, switch between windows at different levels and perspectives, or focus on the sounds you are interested in.
Second, refinement, which involves refined exploration and practice around the HRTF, resolution, test methodology, and customization. The hardest to refine is the HRTF, because generating one is time-consuming and laborious: every point at different distances across an entire spherical space must be played back and then sampled at the ear canal. Some researchers are currently studying how to reach the same level of refinement with fewer sampling points, and how to achieve higher accuracy through interpolation or other techniques (see the interpolation sketch after this list); in the longer term, the limit of refinement is per-user customization.
Third, privacy and sound effects: experiencing the auditory feast brought by different frequency bands. Different harmonics and frequency bands feel different. Severe reverberation harms hearing, while appropriate reverberation enriches the listening experience; early reflections in particular are often used to judge timbre. Reverberation below 3 kHz and lateral reflections help create a better sense of space and depth, while high-frequency components help create a sense of envelopment.
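As a concrete example of the interpolation idea mentioned under refinement, the sketch below linearly crossfades between the two nearest measured HRIRs on one elevation ring. The 15-degree measurement grid is an assumption, and production systems typically use spherical/barycentric or spectral-domain interpolation instead.

```python
# Generating HRIRs for unmeasured directions from a sparse set of
# measurements. A simple nearest-neighbour crossfade on one elevation
# ring; the 15-degree grid is an assumed measurement spacing.
import numpy as np

def interpolate_hrir(hrirs: dict[int, np.ndarray], azimuth: float) -> np.ndarray:
    """hrirs maps measured azimuths (deg) to HRIR arrays; returns an
    interpolated HRIR for an unmeasured azimuth on the same ring."""
    grid = np.array(sorted(hrirs))
    lo = grid[grid <= azimuth].max()
    hi = grid[grid >= azimuth].min()
    if lo == hi:
        return hrirs[lo]
    w = (azimuth - lo) / (hi - lo)
    # Time-domain crossfade; interpolating magnitude and delay
    # separately avoids comb-filter artifacts in practice.
    return (1 - w) * hrirs[lo] + w * hrirs[hi]

# Toy database: one HRIR per 15 degrees, random data as a stand-in.
rng = np.random.default_rng(0)
db = {az: rng.standard_normal(256) for az in range(0, 360, 15)}
h = interpolate_hrir(db, azimuth=22.5)  # halfway between 15 and 30 deg
print(h.shape)  # (256,)
```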
Why does Rokid create spatial sound fields? There are three main reasons:
First, immersion. We have been pursuing the integration of the digital and physical worlds, such as the vividness of games and the realism of online meetings or online classes.
Second, virtual-real interaction. We believe the future of this world will be a fusion of the virtual and the real, and on top of that fusion many interactions can be built, including spatial perception and subjective behavioral interaction. Spatial perception means sensing aspects of the world such as object size, room size, and materials, and letting that perception shape virtual sounds; subjective behavioral interaction is the human side, intervening in, selecting, and interacting with sounds in the digital world.
Third, ultimate quality. AR glasses differ from phones, tablets, TVs, and similar products. On a phone, a dropped connection or a stutter is tolerable, but the real-time requirements for AR glasses worn on your eyes are very high. Meeting them takes end-to-end optimization across algorithms, engineering, systems, hardware, and applications.
These are the missions we have been pursuing. Rokid hopes to bring these capabilities directly to the public through its AR glasses; at the same time, we hope to release these technologies as basic capabilities of our YodaOS, indirectly benefiting users and empowering industries through developers.
The replay and slides of the conference talks are now online; visit the official website to view them (https://www.php.cn/link/53253027fef2ab5162a602f2acfed431).