


A monocular dynamic scene is a dynamic environment observed and analyzed with a single (monocular) camera, in which objects may move freely. Reconstructing such scenes is critical for understanding dynamic changes in an environment, predicting object motion trajectories, and generating dynamic digital assets. With monocular vision alone, three-dimensional reconstruction and model estimation of dynamic scenes become possible, helping us better understand and handle the variety of situations that arise in dynamic environments. The technology matters not only in computer vision but also in fields such as autonomous driving, augmented reality, and virtual reality, where accurately capturing the motion of objects in the environment is essential.
With the rise of neural rendering, represented by the Neural Radiance Field (NeRF), a growing body of work has used implicit representations for 3D reconstruction of dynamic scenes. Although representative NeRF-based methods such as D-NeRF, Nerfies, and K-Planes achieve satisfactory rendering quality, they remain far from true photo-realistic rendering.
A research team from Zhejiang University and ByteDance points out that the root of this problem is that the ray-casting-based NeRF pipeline maps the observation space back to the canonical space through a backward flow. This inverse mapping is not ideal for the convergence of the learned structure and limits both accuracy and sharpness, which is why current methods only reach a PSNR of around 30 dB on the D-NeRF dataset.
To address this challenge, the team proposes a rasterization-based pipeline for monocular dynamic scene modeling. They combine a deformation field with 3D Gaussians for the first time, creating a new method for high-quality reconstruction and novel-view rendering. The paper, "Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction," has been accepted to CVPR 2024, the top international academic conference in computer vision. The work's distinguishing contribution is that it is the first study to apply a deformation field to 3D Gaussians and thereby extend them to monocular dynamic scenes.
Project homepage: https://ingra14m.github.io/Deformable-Gaussians/
Paper link: https://arxiv.org/abs/2309.13101
Code: https://github.com/ingra14m/Deformable-3D-Gaussians
Experimental results show that the deformation field can accurately forward-map 3D Gaussians from the canonical space to the observation space. On the D-NeRF dataset, it achieves a PSNR improvement of more than 10%. Moreover, in real scenes, it recovers finer rendering details even when camera poses are not sufficiently accurate.
# Figure 1 Experimental results on real scenes from the HyperNeRF dataset.
Related work
Dynamic scene reconstruction has long been a central problem in three-dimensional reconstruction. As neural rendering, represented by NeRF, achieved high-quality results, a series of works based on implicit representations emerged in the field of dynamic reconstruction. D-NeRF and Nerfies add a deformation field on top of the NeRF ray-casting pipeline to achieve robust dynamic scene reconstruction. TiNeuVox, K-Planes, and HexPlane further introduce grid structures, which greatly accelerate training and improve rendering speed. However, all of these methods rely on inverse mapping and therefore cannot truly achieve a high-quality decoupling of the canonical space and the deformation field. 3D Gaussian Splatting is a rasterization-based point cloud rendering pipeline; its CUDA-customized differentiable Gaussian rasterizer and innovative densification enable it to reach SOTA rendering quality while rendering in real time. Dynamic 3D Gaussians first extended static 3D Gaussians to the dynamic domain, but its restriction to multi-view scenes severely limits its use in more general settings, such as single-view capture with a mobile phone.
Research Approach
The core of Deformable-GS is extending static 3D Gaussians to monocular dynamic scenes. Each 3D Gaussian carries a position, rotation, scale, opacity, and SH coefficients for image-level rendering. From the 3D Gaussian alpha-blending formula, it is not hard to see that the position over time, together with the rotation and scale that control the Gaussian's shape, are the decisive parameters of a dynamic 3D Gaussian. Unlike traditional point-cloud-based rendering methods, however, parameters such as position and opacity are continually updated during optimization after the 3D Gaussians are initialized, which makes learning dynamic Gaussians harder.
This study innovatively proposes a dynamic scene rendering framework in which a deformation field and the 3D Gaussians are jointly optimized. Specifically, the 3D Gaussians initialized from COLMAP or a random point cloud are treated as the canonical space; the deformation field then takes the coordinates of each canonical 3D Gaussian together with time as input and predicts how that Gaussian's position, rotation, and scale change over time. Through the deformation field, each 3D Gaussian is transformed from the canonical space to the observation space for rasterized rendering. This strategy leaves the differentiable rasterization pipeline of 3D Gaussians untouched, and the gradients it computes can be used to update the parameters of the canonical-space 3D Gaussians.
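Below is a minimal sketch of this design, assuming a NeRF-style positional encoding of both position and time; the network widths, frequency counts, and all names (e.g. `DeformationField`) are illustrative assumptions, not the authors' exact implementation:

```python
# Sketch of a deformation field mapping (canonical position, time) to
# per-Gaussian offsets in position, rotation (quaternion), and scale.
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs):
    """NeRF-style encoding: concatenate x with sin/cos at doubling frequencies."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * x))
        feats.append(torch.cos((2.0 ** k) * x))
    return torch.cat(feats, dim=-1)

class DeformationField(nn.Module):
    def __init__(self, pos_freqs=10, time_freqs=6, hidden=256):
        super().__init__()
        in_dim = 3 * (2 * pos_freqs + 1) + (2 * time_freqs + 1)
        self.pos_freqs, self.time_freqs = pos_freqs, time_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),  # dx (3), dr (4), ds (3)
        )

    def forward(self, xyz, t):
        # xyz: (N, 3) canonical positions; t: (N, 1) normalized timestamps
        h = torch.cat([positional_encoding(xyz, self.pos_freqs),
                       positional_encoding(t, self.time_freqs)], dim=-1)
        out = self.mlp(h)
        return out[:, :3], out[:, 3:7], out[:, 7:]

# Usage: deform canonical Gaussians into the observation space, then rasterize.
field = DeformationField()
xyz = torch.randn(1000, 3, requires_grad=True)  # canonical positions
t = torch.full((1000, 1), 0.25)                 # current frame's time
dx, dr, ds = field(xyz, t)
deformed_xyz = xyz + dx  # gradients flow back to both the MLP and xyz
```

Because the rasterizer stays differentiable, a single backward pass updates the deformation MLP and the canonical Gaussian parameters jointly.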
In addition, the introduction of the deformation field benefits the densification of Gaussians in parts of the scene with larger motion. This is because the deformation field's gradients are relatively higher in regions of larger motion amplitude, guiding those regions to be densified more finely. Even though the number and position parameters of the canonical-space 3D Gaussians keep changing in the early stage, the experiments show that this joint optimization strategy eventually converges robustly: after roughly 20,000 iterations, the position parameters of the canonical-space 3D Gaussians barely change anymore.
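For context, the densification rule of 3D Gaussian Splatting, which these gradients feed into, can be sketched as follows; the thresholds and the split factor are approximately the 3DGS defaults, and the tensor names are assumptions:

```python
# Simplified gradient-driven densification in the style of 3D Gaussian
# Splatting: clone small Gaussians and split large ones wherever the
# accumulated view-space positional gradient is high.
import torch

def densify(xyz, scales, grad_accum, grad_thresh=2e-4, scale_thresh=0.01):
    high_grad = grad_accum > grad_thresh
    small = scales.max(dim=-1).values <= scale_thresh

    clone_mask = high_grad & small   # under-reconstructed: duplicate in place
    split_mask = high_grad & ~small  # over-reconstructed: split into two

    # Split each large Gaussian into two smaller, randomly offset copies.
    offsets = torch.randn_like(xyz[split_mask]) * scales[split_mask]
    split_xyz = torch.cat([xyz[split_mask] + offsets,
                           xyz[split_mask] - offsets])
    split_scales = torch.cat([scales[split_mask]] * 2) / 1.6

    keep = ~split_mask  # the originals of split Gaussians are removed
    new_xyz = torch.cat([xyz[keep], xyz[clone_mask], split_xyz])
    new_scales = torch.cat([scales[keep], scales[clone_mask], split_scales])
    return new_xyz, new_scales
```

In dynamic regions the deformation field raises `grad_accum`, so exactly those areas receive more Gaussians.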
The research team also found that camera poses in real scenes are often not accurate enough, and dynamic scenes exacerbate this problem. For methods built on neural radiance fields this has little impact, because a radiance field based on a multilayer perceptron (MLP) is a very smooth structure. But 3D Gaussians are an explicit, point-cloud-based representation, and slightly inaccurate camera poses are difficult to correct robustly through Gaussian splatting.
To alleviate this problem, the study innovatively introduces Annealing Smooth Training (AST). This training mechanism is designed to smooth the learning of 3D Gaussians in the early stage and to let rendering details emerge in the later stage. Its introduction not only improves rendering quality but also greatly improves the stability and smoothness of temporal-interpolation tasks.
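A plausible minimal form of AST is to perturb the time input with noise whose amplitude decays linearly to zero over the early iterations; the schedule and the constant `beta` below are assumptions, not the paper's published values:

```python
# Sketch of Annealing Smooth Training: noisy time inputs early in training
# encourage a smooth deformation field; the noise anneals away so that
# fine details can be learned later.
import torch

def ast_time(t, iteration, anneal_until=20000, beta=0.1):
    """Return a perturbed time input; noise amplitude decays linearly."""
    weight = max(0.0, 1.0 - iteration / anneal_until)
    return t + torch.randn_like(t) * beta * weight
```

After `anneal_until` iterations the deformation field sees clean timestamps again, which is when rendering detail sharpens.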
Figure 2 shows the pipeline of this study; for details, please refer to the original paper.
Results
The study first ran experiments on the synthetic D-NeRF dataset, which is widely used in the field of dynamic reconstruction. The visualization results in Figure 3 make it easy to see that Deformable-GS improves rendering quality enormously over previous methods.
# Figure 3 Qualitative comparison of this study against prior methods on the D-NeRF dataset.
The proposed method not only achieves substantial improvements in visual quality but also improves the quantitative rendering metrics accordingly. Notably, the team discovered an error in the Lego scene of the D-NeRF dataset: the training and test sets differ slightly, visible as an inconsistent flip angle of the Lego model's shovel. This is the fundamental reason previous methods could not improve their metrics on the Lego scene. To enable a meaningful comparison, the study used Lego's validation set as the basis for metric measurement.
# Figure 4 Quantitative comparison on synthetic datasets.
As shown in Figure 4, the study compared against SOTA methods at full resolution (800x800), including D-NeRF (CVPR 2021), TiNeuVox (SIGGRAPH Asia 2022), and Tensor4D and K-Planes (CVPR 2023). The proposed method achieves substantial improvements on all rendering metrics (PSNR, SSIM, LPIPS) and across all scenes. Nor is it limited to synthetic scenes: it also achieves SOTA results in real scenes where camera poses are not accurate enough. As shown in Figure 5, the study compares against SOTA methods on the NeRF-DS dataset. The results show that even without special treatment of highly reflective surfaces, the proposed method still surpasses NeRF-DS, which was designed specifically for highly reflective scenes, and achieves the best rendering quality.
# Figure 5 Method comparison on real scenes.
In addition, this research is the first to apply a differentiable Gaussian rasterization pipeline with forward and backward depth propagation. As shown in Figure 6, the rendered depth demonstrates that Deformable-GS also obtains a robust geometric representation. Depth backpropagation can enable many future tasks that require depth supervision, such as inverse rendering, SLAM, and autonomous driving.
# Figure 6 Depth visualization.
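Although the depth-enabled rasterizer itself is CUDA code, its intended use can be illustrated with a hypothetical wrapper; `rasterize_with_depth` below stands in for a differentiable rasterizer that returns a color image and a depth map with gradients attached, and is not an actual API of the released code:

```python
# Hypothetical training step showing how backward depth propagation
# enables direct depth supervision of the Gaussians.
import torch
import torch.nn.functional as F

def depth_supervised_step(gaussians, camera, gt_depth, rasterize_with_depth):
    image, depth = rasterize_with_depth(gaussians, camera)
    loss = F.l1_loss(depth, gt_depth)
    loss.backward()  # gradients flow through the depth map to the Gaussians
    return loss.item()
```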
About the author
The corresponding author of the paper is Professor Jin Xiaogang from the School of Computer Science and Technology, Zhejiang University.
Email: jin@cad.zju.edu.cn
Personal homepage: http://www.cad.zju.edu.cn/home/jin/
