# CVPR 2024 full score paper: Zhejiang University proposes a new method of high-quality monocular dynamic reconstruction based on deformable three-dimensional Gaussian
Monocular dynamic scene reconstruction means observing and reconstructing a dynamic environment with a single camera, in which objects may move freely. It is central to tasks such as understanding dynamic changes in an environment, predicting object motion trajectories, and generating dynamic digital assets. With monocular vision alone, three-dimensional reconstruction and model estimation of dynamic scenes become possible, helping us understand and handle the many situations that arise in dynamic environments. Beyond computer vision itself, the technology plays an important role in autonomous driving, augmented reality, and virtual reality, since it allows the motion of objects in the environment to be captured more accurately.
With the rise of neural rendering represented by Neural Radiance Fields (NeRF), more and more work has turned to implicit representations for 3D reconstruction of dynamic scenes. Although representative NeRF-based works such as D-NeRF, Nerfies, and K-Planes achieve satisfactory rendering quality, they are still far from truly photo-realistic rendering.
The research team from Zhejiang University and ByteDance pointed out that the core of the problem lies in the ray-casting-based NeRF pipeline, which maps the observation space back to the canonical space through a backward flow. This inverse mapping makes accuracy and sharpness hard to achieve and is unfavorable for the convergence of the learned structure, which is why current methods only reach a PSNR of around 30 on the D-NeRF dataset.
To address this challenge, the research team proposed a rasterization-based monocular dynamic scene modeling pipeline. They combined deformation fields with 3D Gaussians for the first time, creating a new method that enables high-quality reconstruction and novel-view rendering. The paper, "Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction", has been accepted by CVPR 2024, the top international academic conference in computer vision. It is the first study to apply deformation fields to 3D Gaussians and thereby extend them to monocular dynamic scenes.
Project homepage: https://ingra14m.github.io/Deformable-Gaussians/
Paper link: https://arxiv.org/abs/2309.13101
Code: https://github.com/ingra14m/Deformable-3D-Gaussians
Experimental results show that the deformation field can accurately forward-map the 3D Gaussians from canonical space to observation space, achieving a PSNR improvement of more than 10% on the D-NeRF dataset. Moreover, in real scenes, it still improves rendering detail even when the camera poses are not sufficiently accurate.
# Figure 1 Experimental results on a real-world HyperNeRF scene.
Related work
Dynamic scene reconstruction has long been a hot topic in three-dimensional reconstruction. As neural rendering represented by NeRF achieved high-quality rendering, a series of works based on implicit representations emerged in the field of dynamic reconstruction. D-NeRF and Nerfies introduce deformation fields on top of the NeRF ray-casting pipeline to achieve robust dynamic scene reconstruction. TiNeuVox, K-Planes, and HexPlane further introduce grid structures, which greatly speed up training and improve rendering speed. However, these methods are all based on inverse mapping and cannot truly achieve a high-quality decoupling of the canonical space and the deformation field. 3D Gaussian Splatting is a rasterization-based point-cloud rendering pipeline; its CUDA-customized differentiable Gaussian rasterizer and innovative densification allow 3D Gaussians not only to reach SOTA rendering quality but also to render in real time. Dynamic 3D Gaussians first extended static 3D Gaussians to the dynamic domain, but its ability to handle only multi-view scenes severely limits its application in more general settings, such as single-view captures from a mobile phone.

Research approach
The core of Deformable-GS is to extend static 3D Gaussians to monocular dynamic scenes. Each 3D Gaussian carries position, rotation, scale, opacity, and SH coefficients for image-level rendering. From the 3D Gaussian alpha-blending formula, it is not hard to see that position over time, together with the rotation and scale that control the Gaussian's shape, are the decisive parameters of a dynamic 3D Gaussian. However, unlike traditional point-cloud-based rendering methods, parameters such as position and opacity are continuously updated during optimization after the 3D Gaussians are initialized, which makes learning dynamic Gaussians harder.

This study innovatively proposes a dynamic scene rendering framework in which the deformation field and the 3D Gaussians are jointly optimized. Specifically, the 3D Gaussians initialized from COLMAP or a random point cloud are treated as the canonical space; the deformation field then takes the coordinates of these canonical 3D Gaussians as input and predicts each Gaussian's position and shape parameters over time. Through the deformation field, a 3D Gaussian can be transformed from canonical space to observation space for rasterized rendering. This strategy does not interfere with the differentiable rasterization pipeline of 3D Gaussians, and the gradients it computes can be used to update the parameters of the canonical-space 3D Gaussians.
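To make the joint-optimization idea concrete, below is a minimal PyTorch-style sketch of a deformation network that takes positionally encoded canonical Gaussian centers plus a time value and predicts offsets for position, rotation, and scale. The network width, encoding frequencies, and all names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    """NeRF-style encoding: concatenate x with sin/cos at increasing frequencies."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * x))
        feats.append(torch.cos((2.0 ** k) * x))
    return torch.cat(feats, dim=-1)

class DeformationField(nn.Module):
    """Predicts per-Gaussian offsets (dx, d_rotation, d_scale) from canonical position and time."""
    def __init__(self, pos_freqs=10, time_freqs=6, hidden=256, depth=8):
        super().__init__()
        in_dim = 3 * (2 * pos_freqs + 1) + 1 * (2 * time_freqs + 1)
        layers, dim = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(dim, hidden), nn.ReLU(inplace=True)]
            dim = hidden
        self.trunk = nn.Sequential(*layers)
        self.d_xyz = nn.Linear(hidden, 3)    # position offset
        self.d_rot = nn.Linear(hidden, 4)    # quaternion offset
        self.d_scale = nn.Linear(hidden, 3)  # scale offset
        self.pos_freqs, self.time_freqs = pos_freqs, time_freqs

    def forward(self, xyz_canonical, t):
        # xyz_canonical: (N, 3) canonical Gaussian centers; t: (N, 1) normalized time
        h = self.trunk(torch.cat([
            positional_encoding(xyz_canonical, self.pos_freqs),
            positional_encoding(t, self.time_freqs),
        ], dim=-1))
        return self.d_xyz(h), self.d_rot(h), self.d_scale(h)

# The deformed Gaussians for time t are rasterized by the differentiable pipeline;
# gradients then flow back to both the deformation MLP and the canonical Gaussians.
```

In such a setup, the canonical-space attributes and the deformation MLP share one optimizer step per rendered frame, which is the joint optimization described above.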
In addition, introducing the deformation field benefits the densification of Gaussians in parts of the scene with larger motion. This is because the gradients of the deformation field are relatively higher in regions with larger motion amplitude, which guides those regions to be refined more finely during densification. Even though the number and positions of the canonical-space 3D Gaussians keep changing in the early stage, the experiments show that this joint optimization strategy eventually converges robustly: after roughly 20,000 iterations, the positional parameters of the canonical-space 3D Gaussians hardly change anymore.
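For reference, here is a hedged sketch of the gradient-driven densification heuristic that this mechanism interacts with, in the spirit of the original 3D Gaussian Splatting rule of cloning small Gaussians and splitting large ones when their accumulated view-space positional gradient is high. The thresholds and tensor names are assumptions for illustration only.

```python
import torch

def densify_by_gradient(scales, grad_accum, denom,
                        grad_threshold=0.0002, scale_threshold=0.01):
    """Select Gaussians for densification from accumulated positional gradients.

    grad_accum / denom is the average view-space gradient per Gaussian; regions
    with larger motion tend to accumulate larger gradients and are therefore
    densified more aggressively.
    """
    avg_grad = grad_accum / denom.clamp(min=1)
    candidates = avg_grad.squeeze(-1) >= grad_threshold
    max_scale = scales.max(dim=-1).values
    clone_mask = candidates & (max_scale <= scale_threshold)  # clone small Gaussians
    split_mask = candidates & (max_scale > scale_threshold)   # split large Gaussians
    return clone_mask, split_mask
```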
The research team found that camera poses in real scenes are often not accurate enough, and dynamic scenes exacerbate this problem. It has little impact on structures based on neural radiance fields, because a radiance field built on multilayer perceptrons (MLPs) is a very smooth structure. 3D Gaussians, however, rely on the explicit structure of a point cloud, and slightly inaccurate camera poses are difficult to correct robustly through Gaussian splatting.
To alleviate this problem, this study innovatively introduces Annealing Smooth Training (AST). This training mechanism is designed to smooth the learning of the 3D Gaussians in the early stage and to recover rendering detail in the later stage. It not only improves rendering quality but also greatly improves the stability and smoothness of temporal interpolation tasks.
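One way to realize such an annealing schedule is to perturb the time input with noise whose magnitude decays linearly over training; the sketch below assumes this formulation, and the specific constants are illustrative rather than the paper's exact settings.

```python
import torch

def annealed_time_noise(t, iteration, anneal_iters=20000, beta=0.1, frame_interval=1.0 / 150):
    """Annealing Smooth Training (AST), sketched: add Gaussian noise to the time
    input with a weight that decays linearly to zero.

    Early in training the noise forces the deformation field toward a temporally
    smooth solution; as the noise vanishes, high-frequency detail can be recovered.
    """
    weight = max(0.0, 1.0 - iteration / anneal_iters)
    noise = torch.randn_like(t) * beta * frame_interval * weight
    return t + noise
```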
Figure 2 shows the pipeline of this research. For details, please see the original text of the paper.
Results
This study first ran experiments on the synthetic D-NeRF dataset, which is widely used in the field of dynamic reconstruction. From the visualization results in Figure 3, it is not hard to see that Deformable-GS achieves a huge improvement in rendering quality over previous methods.
# Figure 3 Qualitative experimental comparison results of this study on the D-NeRF data set.
The method proposed in this study not only achieves substantial improvements in visual quality but also improves the quantitative rendering metrics accordingly. It is worth noting that the research team found an error in the Lego scene of the D-NeRF dataset: the scenes in the training and test sets differ slightly, which shows up as an inconsistent flip angle of the Lego model's shovel. This is also the fundamental reason why previous methods could not improve their metrics on the Lego scene. To enable a meaningful comparison, the study used Lego's validation set as the basis for metric measurement.

# Figure 4 Quantitative comparison on synthetic datasets.
As shown in Figure 4, this study compared against SOTA methods at full resolution (800x800), including D-NeRF (CVPR 2021), TiNeuVox (SIGGRAPH Asia 2022), and Tensor4D and K-Planes (CVPR 2023). The proposed method achieves substantial improvements across all rendering metrics (PSNR, SSIM, LPIPS) and across all scenes. It is not only applicable to synthetic scenes but also achieves SOTA results in real scenes where the camera poses are not accurate enough. As shown in Figure 5, this study compares with SOTA methods on the NeRF-DS dataset. The results show that, even without special treatment of highly reflective surfaces, the proposed method can still surpass NeRF-DS, which is specifically designed for highly reflective scenes, and achieve the best rendering quality.

# Figure 5 Comparison of methods on real scenes.
In addition, this research is the first to apply a differentiable Gaussian rasterization pipeline with both forward and backward depth propagation. As shown in Figure 6, the rendered depth also demonstrates that Deformable-GS obtains robust geometric representations. Depth backpropagation can benefit many future tasks that require depth supervision, such as inverse rendering, SLAM, and autonomous driving.
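As an illustration of what depth backpropagation enables, here is a hedged sketch of a simple depth-supervision loss applied to a rasterizer's depth output; the tensor names and masking scheme are assumptions, not the paper's interface.

```python
import torch

def depth_supervision_loss(rendered_depth, reference_depth, valid_mask):
    """L1 depth loss over valid pixels.

    Because the rasterizer propagates gradients through its depth output,
    this loss can update the Gaussian (and deformation-field) parameters,
    which is the point of forward-and-backward depth propagation.
    """
    diff = torch.abs(rendered_depth - reference_depth)
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```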
# Figure 6 Depth visualization.
## About the author
The corresponding author of the paper is Professor Jin Xiaogang from the School of Computer Science and Technology, Zhejiang University.
Email: jin@cad.zju.edu.cn