See-through 3D representation and generative models of objects: NUS team proposes X-Ray
Artificial intelligence is developing rapidly. In computer vision, image and video generation technology has become increasingly mature, and models such as Midjourney and Stable Video Diffusion are widely used. However, generative models in the field of 3D vision still face challenges.
Current 3D model generation methods are usually based on multi-view video generation followed by reconstruction. The SV3D model, for example, generates multi-view videos and combines them with neural radiance fields (NeRF) or 3D Gaussian Splatting to build 3D objects step by step. This approach is mainly limited to simple three-dimensional objects without self-occlusion and cannot represent an object's internal structure, and the multi-stage pipeline makes the generation process complex and imperfect.
The underlying reason is that the field currently lacks a 3D representation that is flexible, efficient and easy to generalize.
Figure 1. X-Ray serialized 3D representation
A research team at the National University of Singapore (NUS) led by Dr. Hu has released a new 3D representation, X-Ray, which sequentially represents the surface shape and texture of an object as seen from the camera's viewpoint. It can take full advantage of video generation models to generate 3D objects, and it can generate both the internal and external 3D structure of an object at the same time.
This article will demonstrate in detail the principles, advantages and broad application prospects of X-Ray technology.
Figure 2. Comparison with rendering-based 3D model generation methods.
X-Ray representation: starting from the camera center, one ray is cast toward the object through each point of an H×W grid. Along each ray, the 3D attributes (depth, normal vector, color, etc.) at each of the L intersections with the object's surfaces are recorded in order. These data are then organized into an L×H×W tensor that can represent any 3D model. This is the X-Ray representation proposed by the team.
Notably, this representation has the same format as a video, so video generation models can be used to build 3D generative models. The specific process is as follows.
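To make the video analogy concrete, here is a minimal sketch of how an X-Ray could be laid out in memory. The dictionary keys, the helper names `empty_xray` and `as_video`, and the split into 3 normal and 3 color components are illustrative assumptions, not details taken from the team's code.

```python
import numpy as np

def empty_xray(num_layers, height, width):
    """Allocate one array per attribute; the leading axis indexes the L layers."""
    return {
        "hit": np.zeros((num_layers, height, width), dtype=bool),
        "depth": np.zeros((num_layers, height, width), dtype=np.float32),
        "normal": np.zeros((num_layers, height, width, 3), dtype=np.float32),
        "color": np.zeros((num_layers, height, width, 3), dtype=np.float32),
    }

def as_video(xray):
    """Stack attributes on a channel axis: (L, H, W, C), i.e. L video-like frames."""
    return np.concatenate(
        [
            xray["hit"][..., None].astype(np.float32),
            xray["depth"][..., None],
            xray["normal"],
            xray["color"],
        ],
        axis=-1,
    )

xray = empty_xray(num_layers=8, height=64, width=64)
video = as_video(xray)
print(video.shape)  # (8, 64, 64, 8)
```

Each of the L layers plays the role of one video frame, which is why a video model can consume the tensor without a custom data format.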
Figure 3. X-Ray sample with different number of layers.
1. Encoding process: 3D model to X-Ray
Given a 3D model, usually a three-dimensional mesh, first set up a camera to observe the model, and then use a ray casting algorithm to record the attributes of every surface that each camera ray intersects, including the surface's depth, normal vector, color, etc. For convenience, a hit indicator is used to represent whether a surface exists at a given location.
Then, by collecting the intersected surface points along all camera rays, a complete X-Ray 3D representation is obtained, as shown in Figure 3.
Through the encoding process, an arbitrary 3D model is converted into an X-Ray. It has the same format as a video, with a number of frames that varies from object to object. In practice, L=8 frames are usually enough to represent a 3D object.
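The encoding step can be sketched on a toy object. The example below is not the team's implementation: it swaps the mesh for an analytic sphere so that ray casting reduces to solving a quadratic, and it records only the hit and depth attributes. The function name, the camera placement on the +z axis and the field of view are all assumptions chosen for illustration.

```python
import numpy as np

def encode_sphere_xray(height=17, width=17, num_layers=8,
                       cam_dist=1.5, radius=0.5, fov_deg=60.0):
    """Cast H x W rays at a sphere centered at the origin; return (hit, depth)."""
    # Assumed setup: camera on the +z axis, looking toward the origin.
    origin = np.array([0.0, 0.0, cam_dist])
    half = np.tan(np.radians(fov_deg) / 2.0)
    px, py = np.meshgrid(np.linspace(-half, half, width),
                         np.linspace(-half, half, height))
    dirs = np.stack([px, py, -np.ones_like(px)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Ray-sphere intersection for unit directions: t^2 + 2*b*t + c = 0.
    b = dirs @ origin
    c = origin @ origin - radius ** 2
    disc = b ** 2 - c
    valid = disc > 0.0
    root = np.sqrt(np.where(valid, disc, 0.0))

    hit = np.zeros((num_layers, height, width), dtype=bool)
    depth = np.zeros((num_layers, height, width), dtype=np.float32)
    # Layer 0 is the front surface, layer 1 the back surface; the rest stay empty.
    for layer, t in enumerate([-b - root, -b + root]):
        hit[layer] = valid
        depth[layer] = np.where(valid, t, 0.0)
    return hit, depth

hit, depth = encode_sphere_xray()
print(depth[0, 8, 8], depth[1, 8, 8])  # central ray: front and back depths
```

A sphere fills only two of the eight layers. Objects with internal or self-occluding structure produce more intersections per ray, which is where the additional layers come in.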
2. Decoding process: X-Ray to 3D model
Given an X-Ray, it can also be converted back into a 3D model through the decoding process, so generating a 3D model only requires generating an X-Ray. Decoding consists of two steps: point cloud generation and surface reconstruction from the point cloud.
Each surface point is recovered from its camera ray and the recorded depth, where r_0 and r_d are the starting point and normalized direction of the camera ray, respectively. Processing every camera ray in this way yields a complete point cloud.
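The point cloud step can be sketched as follows, assuming each point is placed at r_0 + depth · r_d and that only locations flagged by a hit mask are kept. The function name and array shapes are illustrative, and the surface reconstruction step is not covered.

```python
import numpy as np

def xray_to_points(hit, depth, ray_origin, ray_dirs):
    """hit, depth: (L, H, W); ray_origin: (3,); ray_dirs: (H, W, 3) unit vectors."""
    # p = r_0 + depth * r_d, broadcast over all layers and pixels.
    points = ray_origin + depth[..., None] * ray_dirs[None, ...]
    # Keep only locations where a surface was actually hit.
    return points[hit]  # shape (N, 3)

# One ray looking down the -z axis that crosses two surfaces.
origin = np.array([0.0, 0.0, 1.5])
dirs = np.array([[[0.0, 0.0, -1.0]]])              # H = W = 1
hit = np.array([[[True]], [[True]], [[False]]])    # L = 3, last layer empty
depth = np.array([[[1.0]], [[2.0]], [[0.0]]])
points = xray_to_points(hit, depth, origin, dirs)
print(points)  # two points, at z = 0.5 and z = -0.5
```

If normals and colors are stored in the same layout, they can be gathered with the same `hit` mask so that every point keeps its attributes.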
To generate high-resolution and diverse 3D X-Ray models, the team used a video diffusion model architecture, since X-Ray has the same format as video. This architecture can process sequential 3D information and improves the quality of the X-Ray through an upsampling module to produce high-precision 3D output. The diffusion model gradually generates a detailed X-Ray from noisy data, while the upsampling module enhances resolution and detail. The overall structure is shown in Figure 4.
Running the diffusion model in a latent space for X-Ray generation would usually require custom development of a vector-quantized variational autoencoder (VQ-VAE) [3] for data compression. No ready-made model exists for this, so it would increase the training burden.
To train a high-resolution generator efficiently, the team adopted a cascaded synthesis strategy, training progressively from low to high resolution in the manner of Imagen and Stable Cascade, to cope with limited computing resources and improve X-Ray quality.
Specifically, the 3D U-Net architecture from Stable Video Diffusion is used as the diffusion model to generate low-resolution X-Rays. It extracts features across the frame sequence, which improves the processing and interpretation of the X-Ray and is critical for high-quality results.
The diffusion model in the previous stage can only generate low-resolution X-Ray images. The subsequent stage focuses on upsampling these low-resolution X-Rays to a higher resolution.
The team explored two main approaches: point cloud upsampling and video upsampling.
Since a rough representation of shape and appearance has been obtained, encoding this data into a point cloud with color and normals is a straightforward process.
However, the point cloud representation is too loosely structured and unsuitable for dense prediction. Traditional point cloud upsampling techniques usually just increase the number of points, which may not be effective enough for improving attributes such as texture and color. To simplify the process and ensure consistency throughout the pipeline, the team chose to use a video upsampling model.
This model is adapted from the spatiotemporal VAE decoder of Stable Video Diffusion (SVD) and is trained from scratch to upsample synthesized X-Ray frames by a factor of 4 while keeping the original number of layers. The decoder can perform attention operations independently at the frame level and at the layer level. This two-level attention mechanism not only increases the resolution but also significantly improves overall image quality, which makes the video upsampling model a more consistent and efficient solution for high-resolution X-Ray generation.
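The shape bookkeeping of this stage can be illustrated with a deliberately naive stand-in. The sketch below uses nearest-neighbor repetition, not the learned SVD-based decoder described above. It only shows that the spatial size grows by a factor of 4 while the number of layers is unchanged; the function name is an assumption.

```python
import numpy as np

def upsample_nearest(xray, factor=4):
    """xray: (L, H, W). Repeat every pixel `factor` times along H and W only."""
    return np.repeat(np.repeat(xray, factor, axis=1), factor, axis=2)

low = np.zeros((8, 64, 64), dtype=np.float32)
high = upsample_nearest(low)
print(high.shape)  # (8, 256, 256)
```

A learned upsampler replaces the pixel repetition with predicted detail, but its input and output shapes follow the same pattern.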
Figure 4: 3D model generation framework based on X-Ray representation, including X-Ray diffusion model and X-Ray upsampling model.
1. Data set:
The experiments use a filtered subset of the Objaverse dataset, from which entries with missing textures or insufficient prompts were removed.
This subset contains over 60,000 3D objects. For each object, 4 camera views are randomly selected, covering azimuth angles from -180 to 180 degrees and elevation angles from -45 to 45 degrees, and the distance from the camera to the center of the object is fixed to 1.5.
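The view sampling described above can be sketched as follows. The uniform distribution, the z-up spherical convention and the function name `sample_camera` are assumptions. The text only specifies the angle ranges, the fixed distance of 1.5 and the four views per object.

```python
import math
import random

def sample_camera(rng, distance=1.5):
    """Return (azimuth_deg, elevation_deg, xyz position) for one random view."""
    azimuth = rng.uniform(-180.0, 180.0)
    elevation = rng.uniform(-45.0, 45.0)
    az, el = math.radians(azimuth), math.radians(elevation)
    # Spherical to Cartesian, with z as the up axis (assumed convention).
    position = (
        distance * math.cos(el) * math.cos(az),
        distance * math.cos(el) * math.sin(az),
        distance * math.sin(el),
    )
    return azimuth, elevation, position

rng = random.Random(0)  # fixed seed so the sampled views are reproducible
views = [sample_camera(rng) for _ in range(4)]
for azimuth, elevation, position in views:
    print(round(azimuth, 1), round(elevation, 1), position)
```

Every sampled position lies on a sphere of radius 1.5 around the object center, so the cameras differ only in viewing direction.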
The images are then rendered with Blender, and the corresponding X-Rays are generated with the ray casting algorithm provided by the trimesh library. This process yields over 240,000 image and X-Ray pairs for training the generative models.
2. Implementation details:
The X-Ray diffusion model is based on the spatiotemporal UNet architecture used in Stable Video Diffusion (SVD), with slight adjustments: the model is configured to synthesize 8 channels (1 hit channel, 1 depth channel and 6 normal channels), compared to the 4 channels of the original network.
Given the significant differences between X-Ray imaging and traditional video, the model was trained from scratch to bridge the large gap between the X-Ray and video fields. Training took place over a week on 8 NVIDIA A100 GPU servers. During this period, the learning rate was kept at 0.0001, using the AdamW optimizer.
Since different X-Rays have different numbers of layers, they are padded or cropped to the same 8 layers for easier batching and training, and the frame size of each layer is 64×64. For the upsampling model, the number of output layers L is still 8, but the resolution of each frame is increased to 256×256, which enhances the detail and clarity of the upsampled X-Ray. The results are shown in Figures 5 and 6.
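A minimal sketch of the pad-or-crop step, assuming the layer axis comes first and that missing layers are filled with zeros (which reads as "no hit"). The function name and the zero fill value are assumptions.

```python
import numpy as np

def fix_num_layers(xray, num_layers=8):
    """xray: array of shape (L, ...). Crop extra layers or zero-pad missing ones."""
    current = xray.shape[0]
    if current >= num_layers:
        return xray[:num_layers]
    # Pad only at the end of the layer axis; leave the other axes untouched.
    pad_width = [(0, num_layers - current)] + [(0, 0)] * (xray.ndim - 1)
    return np.pad(xray, pad_width, mode="constant")

shallow = np.ones((3, 64, 64), dtype=np.float32)
deep = np.ones((11, 64, 64), dtype=np.float32)
print(fix_num_layers(shallow).shape, fix_num_layers(deep).shape)
```

Both calls return arrays of shape (8, 64, 64), so X-Rays from different objects can be stacked into a single batch.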
Figure 5: Image to X-Ray and to 3D model generation
Figure 6: Text to X-Ray and to 3D model generation
With the continued advancement of machine learning and image processing technology, X-Ray has broad application prospects.
In the future, this technology may be combined with augmented reality (AR) and virtual reality (VR) technology to create a fully immersive 3D experience for users. Education and training fields can also benefit from this, such as providing more intuitive learning materials and simulation experiments through 3D reconstruction.
In addition, applying X-Ray technology to medical imaging and biotechnology may change how people understand and study complex biological structures. It will be interesting to see how it changes the way we interact with the three-dimensional world.