
See through 3D representation and generative models of objects: NUS team proposes X-Ray

王林 (forwarded) | 2024-05-06 18:30


  • Project homepage: https://tau-yihouxiang.github.io/projects/X-Ray/X-Ray.html
  • Paper address: https://arxiv.org/abs/2404.14329
  • Code address: https://github.com/tau-yihouxiang/X-Ray
  • Dataset: https://huggingface.co/datasets/yihouxiang/X-Ray


Artificial intelligence is developing rapidly. In computer vision, image and video generation technology has matured, and models such as Midjourney and Stable Video Diffusion are widely used. Generative models for 3D vision, however, still face challenges.

Current 3D model generation techniques are usually based on multi-view video generation and reconstruction. The SV3D model, for example, first generates multi-view videos and then builds the 3D object step by step with neural radiance fields (NeRF) or 3D Gaussian Splatting. This approach is largely limited to simple, non-self-occluding objects and cannot capture an object's internal structure, making the whole generation pipeline complex and incomplete.

The root cause is the current lack of a flexible, efficient, and easily generalizable 3D representation.


Figure 1. X-Ray serialized 3D representation

A research team at the National University of Singapore (NUS), led by Dr. Hu, has released a new 3D representation, X-Ray, which sequentially represents the surface shapes and textures an object presents along the camera's viewing rays. It can fully exploit the strengths of video generation models to generate 3D objects, producing an object's internal and external 3D structure at the same time.

This article presents in detail the principles, advantages, and broad application prospects of the X-Ray technique.


Figure 2. Comparison with rendering-based 3D model generation methods.

Technological innovation: a 3D representation of an object's inner and outer surfaces

X-Ray representation: from the camera center, cast an H×W grid of rays toward the object. Along each ray, record up to L sets of 3D attributes, including depth, normal vector, and color, one set for each intersection with an object surface, then organize these data into an L×H×W tensor. This tensor, which can represent any 3D model, is the X-Ray representation proposed by the team.
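The tensor layout described above can be sketched with plain numpy. The dimensions and the channel ordering (hit, depth, normal, color) here are illustrative assumptions, not the paper's exact storage format:

```python
import numpy as np

# Assumed dimensions: L layers (surface hits per ray), 8 attribute
# channels (1 hit + 1 depth + 3 normal + 3 color), an H x W grid of rays.
L, C, H, W = 8, 8, 64, 64

# An X-Ray is just a dense L x C x H x W tensor, like a short video clip.
xray = np.zeros((L, C, H, W), dtype=np.float32)

# Record a single intersection for the ray through pixel (32, 32):
# first surface hit at depth 1.2, normal (0, 0, -1), red color.
layer, y, x = 0, 32, 32
xray[layer, 0, y, x] = 1.0                    # hit indicator
xray[layer, 1, y, x] = 1.2                    # depth along the ray
xray[layer, 2:5, y, x] = [0.0, 0.0, -1.0]     # surface normal
xray[layer, 5:8, y, x] = [1.0, 0.0, 0.0]      # RGB color

# The hit channel tells us how many surfaces each ray pierced.
hits_per_ray = xray[:, 0].sum(axis=0)
print(hits_per_ray[y, x])   # 1.0 for our ray, 0.0 elsewhere
```

Because empty space is simply a zero hit channel, the same fixed-size tensor covers both solid and hollow regions of the object.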

Notably, this representation shares the video format, so video generation models can be repurposed as 3D generative models. The specific process is as follows.


Figure 3. X-Ray samples with different numbers of layers.

1. Encoding process: 3D model to X-Ray

Given a 3D model, usually a triangle mesh, first set up a camera to observe it, then use a ray casting algorithm to record the attributes of every surface each camera ray intersects: the depth D of the surface, its normal vector N, and its color C. For convenience, an indicator H is also recorded to mark whether a surface exists at each position.

Then, by collecting the intersecting surface points of all camera rays, the complete X-Ray 3D representation is obtained, as shown in the following expression and Figure 3.

X-Ray = (x_1, x_2, …, x_L),  where x_i = (H_i, D_i, N_i, C_i)

Through this encoding process, an arbitrary 3D model is converted into an X-Ray. It has the same format as a video, with a variable number of frames; normally L = 8 frames are enough to represent a 3D object.
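The key idea of the encoding, recording every surface a ray pierces rather than only the first, can be illustrated with an analytic toy case: a single ray cast against a sphere, where the two intersection depths become layers 0 and 1 of the X-Ray along that ray. This is a hypothetical stand-in for real mesh ray casting, not the paper's implementation:

```python
import numpy as np

def sphere_xray_depths(ray_o, ray_d, center, radius):
    """Depths of all intersections of one camera ray with a sphere.
    A toy stand-in for ray casting against a full mesh: each returned
    depth becomes one layer of the X-Ray along this ray."""
    oc = ray_o - center
    b = 2.0 * np.dot(ray_d, oc)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - 4.0 * c          # ray_d is assumed normalized, so a = 1
    if disc < 0:
        return []                   # ray misses the sphere entirely
    sq = np.sqrt(disc)
    return sorted(float(t) for t in ((-b - sq) / 2.0, (-b + sq) / 2.0)
                  if t > 0)

# Camera at the origin looking down +z at a unit sphere 3 units away.
depths = sphere_xray_depths(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                            np.array([0.0, 0.0, 3.0]), 1.0)
print(depths)   # [2.0, 4.0] -> front surface at layer 0, back at layer 1
```

A first-hit-only representation (a depth map) would keep only the 2.0; the X-Ray keeps the hidden back surface at depth 4.0 as well, which is exactly what lets it capture internal structure.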

2. Decoding process: X-Ray to 3D model

Given an X-Ray, the decoding process converts it back into a 3D model, so generating a 3D model reduces to generating an X-Ray. Decoding consists of two steps: point cloud generation and point cloud surface reconstruction.

  • X-Ray to point cloud: each decoded point carries not only its 3D position but also color and normal-vector information.

p = r_0 + D · r_d

where r_0 and r_d are the origin and normalized direction of the camera ray, respectively. Processing every camera ray yields the complete point cloud.

  • Point cloud to 3D mesh: the next step, converting the point cloud into a mesh, is a problem studied for many years. Because these points carry normal vectors, the Screened Poisson algorithm can convert the point cloud directly into a triangle mesh, which is the final 3D model.
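The first step above, decoding an X-Ray into a colored, oriented point cloud via p = r_0 + D · r_d, can be sketched as follows. The channel layout is the same illustrative assumption as before (hit, depth, normal, color), not the paper's exact format:

```python
import numpy as np

def xray_to_points(xray, ray_o, ray_dirs):
    """Decode an X-Ray tensor (L x 8 x H x W) into a point cloud.
    Assumed channel layout: [hit, depth, nx, ny, nz, r, g, b].
    ray_o: (3,) camera center; ray_dirs: (H, W, 3) normalized directions."""
    pts, normals, colors = [], [], []
    for i in range(xray.shape[0]):
        hit = xray[i, 0] > 0.5                        # valid-surface mask
        depth = xray[i, 1][hit]                       # (M,)
        dirs = ray_dirs[hit]                          # (M, 3)
        pts.append(ray_o + depth[:, None] * dirs)     # p = r_0 + D * r_d
        normals.append(xray[i, 2:5].transpose(1, 2, 0)[hit])
        colors.append(xray[i, 5:8].transpose(1, 2, 0)[hit])
    return (np.concatenate(pts), np.concatenate(normals),
            np.concatenate(colors))

# Minimal check: one ray hitting one surface at depth 2.
xray = np.zeros((1, 8, 1, 1), dtype=np.float32)
xray[0, :2, 0, 0] = [1.0, 2.0]                        # hit at depth 2
dirs = np.array([[[0.0, 0.0, 1.0]]])                  # single +z ray
pts, normals, colors = xray_to_points(xray, np.zeros(3), dirs)
print(pts)   # [[0. 0. 2.]]
```

The returned normals and colors are what allow the second step (e.g. Screened Poisson reconstruction, as available in libraries such as Open3D) to produce a textured watertight mesh.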

3D model generation based on X-Ray representation

To generate high-resolution, diverse X-Rays, the team adopted a video diffusion model architecture, since X-Ray shares the video format. The architecture processes sequential 3D information and improves X-Ray quality through an upsampling module to produce high-precision 3D output. The diffusion model gradually generates detailed X-Rays from noise, while the upsampling module raises resolution and detail to meet quality standards. The specific structure is shown in Figure 4.

X-Ray Diffusion Generation Model

Diffusion models typically operate in a latent space, which for X-Ray generation would require custom development of a vector-quantized variational autoencoder (VQ-VAE) [3] for data compression, a process that lacks ready-made models and increases the training burden.

To train a high-resolution generator effectively under limited computing resources, the team instead adopted a cascaded synthesis strategy, training progressively from low to high resolution in the spirit of Imagen and Stable Cascade, to improve X-Ray quality.

Specifically, the 3D U-Net architecture from Stable Video Diffusion is used as the diffusion model to generate low-resolution X-Ray sequences and extract features from them, enhancing the processing and interpretation of X-Rays, which is critical for high-quality results.

X-Ray upsampling model

The diffusion model of the previous stage can only generate low-resolution X-Rays. Subsequent stages focus on upscaling these low-resolution X-Rays to higher resolutions.

The team explored two main approaches: point cloud upsampling and video upsampling.

Since a rough representation of shape and appearance has already been obtained, converting this data into a point cloud with colors and normals is a straightforward process.

However, the point cloud structure is too loose and unsuitable for dense prediction. Traditional point cloud upsampling techniques simply increase the number of points, which may not be effective for improving attributes such as texture and color. To simplify the process and ensure consistency throughout the pipeline, the team chose a video upsampling model.

This model is adapted from the spatiotemporal VAE decoder of Stable Video Diffusion (SVD) and trained from scratch to upsample synthesized X-Ray frames by a factor of 4x while preserving the original number of layers. The decoder performs attention operations independently at the frame level and the layer level. This dual attention mechanism not only raises resolution but also significantly improves overall quality, making the video upsampling model a more coherent and efficient solution for high-resolution X-Ray generation.
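As a shape-level sanity check, a naive nearest-neighbor version of this 4x spatial upsampling (layer count unchanged, resolution quadrupled per axis) looks like the following; the actual system uses the learned SVD-style decoder rather than this baseline:

```python
import numpy as np

def upsample_4x_nearest(xray):
    """Nearest-neighbor 4x spatial upsampling of every X-Ray layer:
    (L, C, 64, 64) -> (L, C, 256, 256). The number of layers L and the
    channel count C are left untouched, matching the paper's setup."""
    return xray.repeat(4, axis=2).repeat(4, axis=3)

low = np.random.rand(8, 8, 64, 64).astype(np.float32)
high = upsample_4x_nearest(low)
print(high.shape)   # (8, 8, 256, 256)
```

The learned upsampler replaces this blocky interpolation with attention over frames and layers, which is where the detail gain comes from.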


Figure 4: 3D model generation framework based on X-Ray representation, including X-Ray diffusion model and X-Ray upsampling model.

Experiment

1. Data set:

The experiments use a filtered subset of the Objaverse dataset, from which entries with missing textures or insufficient prompts were removed.

This subset contains over 60,000 3D objects. For each object, 4 camera views are randomly selected, covering azimuth angles from -180 to 180 degrees and elevation angles from -45 to 45 degrees, with the distance from the camera to the object center fixed at 1.5.
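The view sampling just described can be sketched with standard spherical coordinates; the axis convention below is an assumption, as the article does not specify one:

```python
import numpy as np

def sample_camera_positions(n=4, radius=1.5, seed=0):
    """Sample n camera centers on a sphere around the object, using the
    ranges from the paper: azimuth in [-180, 180) degrees, elevation in
    [-45, 45] degrees, fixed distance 1.5 to the object center."""
    rng = np.random.default_rng(seed)
    azim = np.radians(rng.uniform(-180.0, 180.0, n))
    elev = np.radians(rng.uniform(-45.0, 45.0, n))
    x = radius * np.cos(elev) * np.cos(azim)
    y = radius * np.cos(elev) * np.sin(azim)
    z = radius * np.sin(elev)
    return np.stack([x, y, z], axis=1)      # (n, 3) camera centers

cams = sample_camera_positions()
print(np.linalg.norm(cams, axis=1))         # all ~1.5: fixed distance
```

Fixing the radius while varying azimuth and elevation keeps the object at a constant scale in every rendered view, which simplifies training.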

Blender is then used for rendering, and the corresponding X-Rays are generated with the ray casting algorithm provided by the trimesh library. These steps yield over 240,000 image/X-Ray pairs for training the generative models.

2. Implementation details:

The X-Ray diffusion model is based on the spatiotemporal UNet architecture used in Stable Video Diffusion (SVD), with a slight adjustment: the model is configured to synthesize 8 channels, comprising 1 hit channel, 1 depth channel, 3 normal channels, and 3 color channels, compared with the 4 channels of the original network.

Given the significant differences between X-Rays and conventional video, the model was trained from scratch to bridge the large gap between the two domains. Training took about a week on a server with 8 NVIDIA A100 GPUs, with the learning rate kept at 0.0001 using the AdamW optimizer.

Since different X-Rays have different numbers of layers, they are padded or cropped to a fixed 8 layers for better batch processing and training, with each layer's frame size being 64×64. For the upsampling model, the number of layers L remains 8, but the resolution of each frame is increased to 256×256, enhancing the detail and clarity of the enlarged X-Ray. The results are shown in Figures 5 and 6.
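The pad-or-crop step mentioned above is simple to write down; padding with zeros is safe here because, under the layout assumed in the earlier sketches, a zero hit channel already means "no surface at this layer":

```python
import numpy as np

def pad_or_crop_layers(xray, target_l=8):
    """Bring an X-Ray with an arbitrary number of layers to exactly
    target_l layers: crop the deepest layers, or zero-pad (a zero hit
    channel is interpreted as 'no surface here')."""
    l, c, h, w = xray.shape
    if l >= target_l:
        return xray[:target_l]
    pad = np.zeros((target_l - l, c, h, w), dtype=xray.dtype)
    return np.concatenate([xray, pad], axis=0)

short = np.ones((3, 8, 64, 64), dtype=np.float32)
long = np.ones((12, 8, 64, 64), dtype=np.float32)
print(pad_or_crop_layers(short).shape, pad_or_crop_layers(long).shape)
# (8, 8, 64, 64) (8, 8, 64, 64)
```

Cropping keeps the shallowest (first-hit) layers, which carry most of the visible geometry; only very deeply nested surfaces are discarded.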


Figure 5: Image to X-Ray and to 3D model generation


Figure 6: Text to X-Ray and to 3D model generation

Future Outlook: New representation brings endless possibilities

With the continuous advancement of machine learning and image processing technology, the application prospects of X-Ray are extremely broad.

In the future, this technology may be combined with augmented reality (AR) and virtual reality (VR) technology to create a fully immersive 3D experience for users. Education and training fields can also benefit from this, such as providing more intuitive learning materials and simulation experiments through 3D reconstruction.

In addition, applying the X-Ray technique to medical imaging and biotechnology may change how people understand and study complex biological structures. We look forward to seeing how it changes the way we interact with the three-dimensional world.


Statement: this article is reproduced from 51cto.com.