The first paradigm for training multi-view 3D occupancy models using only 2D labels
This article is reprinted with the authorization of the Autonomous Driving Heart public account. Please contact the source for reprinting.
[RenderOcc, the first paradigm for training multi-view 3D occupancy models using only 2D labels] The authors extract NeRF-style 3D volume representations from multi-view images and use volume rendering to build 2D reconstructions, which enables direct 3D supervision from 2D semantic and depth labels and reduces reliance on expensive 3D occupancy annotations. Extensive experiments show that RenderOcc performs comparably to fully supervised models trained with 3D labels, highlighting the value of this approach in real-world applications. The code is open source.
Title: RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision
Author affiliations: Peking University, Xiaomi Automobile, MMLab at The Chinese University of Hong Kong
Open source address: GitHub - pmj110119/RenderOcc
3D occupancy prediction, which quantizes a 3D scene into grid cells with semantic labels, holds significant promise for robot perception and autonomous driving. Recent work mainly relies on complete occupancy labels in 3D voxel space for supervision. However, the expensive annotation process and sometimes ambiguous labels severely limit the usability and scalability of 3D occupancy models. To address this, the authors propose RenderOcc, a new paradigm for training 3D occupancy models using only 2D labels. Specifically, they extract NeRF-style 3D volumetric representations from multi-view images and use volume rendering to build 2D reconstructions, enabling direct 3D supervision from 2D semantic and depth labels. In addition, the authors introduce an auxiliary ray method to address the sparse viewpoint problem in autonomous driving scenes; it uses sequential frames to build a comprehensive 2D rendering for each target. RenderOcc is the first attempt to train a multi-view 3D occupancy model using only 2D labels, reducing the reliance on expensive 3D occupancy annotations. Extensive experiments show that RenderOcc performs comparably to fully supervised models trained with 3D labels, highlighting the value of this approach in real-world applications.
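To make the volume rendering step concrete, here is a minimal NumPy sketch of how per-sample density and semantics along one camera ray could be composited into a 2D depth value and a 2D semantic vector. It uses the standard NeRF-style quadrature and is not taken from the RenderOcc code; the function name `render_ray`, its arguments, and the choice to reuse the last sample spacing for the final interval are all assumptions made for illustration.

```python
import numpy as np

def render_ray(density, semantics, t_vals):
    """Composite samples along one ray (hypothetical helper, not RenderOcc's API).

    density:   (N,) non-negative density at N >= 2 samples along the ray
    semantics: (N, C) per-sample class scores
    t_vals:    (N,) increasing sample distances from the camera
    Returns (depth, rendered_semantics, weights).
    """
    density = np.asarray(density, dtype=np.float64)
    semantics = np.asarray(semantics, dtype=np.float64)
    t_vals = np.asarray(t_vals, dtype=np.float64)

    # Distance covered by each sample; the last interval reuses the previous
    # spacing (an assumption of this sketch).
    deltas = np.diff(t_vals)
    deltas = np.append(deltas, deltas[-1])

    # Opacity contributed by each interval.
    alpha = 1.0 - np.exp(-density * deltas)

    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))

    # Per-sample rendering weights.
    weights = trans * alpha

    depth = float(np.sum(weights * t_vals))   # expected ray termination distance
    rendered_semantics = weights @ semantics  # weighted sum of class scores, shape (C,)
    return depth, rendered_semantics, weights
```

Because the weights depend only on the predicted density, a 2D depth label and a 2D semantic label for the pixel behind this ray are enough to produce a training signal for every voxel the ray passes through.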
Figure 1. The new training paradigm of RenderOcc. Unlike previous methods that rely on expensive 3D occupancy labels for supervision, RenderOcc uses 2D labels to train the 3D occupancy network. With 2D rendering supervision, the model benefits from fine-grained, pixel-level 2D semantic and depth supervision.
Figure 2. Overall framework of RenderOcc. The authors extract volumetric features through a 2D-to-3D network and predict the density and semantics of each voxel. This yields a Semantic Density Field, on which volume rendering can be performed to generate rendered 2D semantics and depth. To generate the rays GT, the authors extract auxiliary rays from adjacent frames to supplement the rays of the current frame and purify them with the proposed weighted ray sampling strategy. The loss is then computed between the rays GT and the rendered 2D semantics and depth, which achieves rendering supervision from 2D labels.
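As a rough illustration of the final step in the caption, the sketch below compares rendered per-ray outputs with 2D labels. The article does not say which loss functions are used, so the cross-entropy term for semantics, the L1 term for depth, the `depth_weight` parameter, and the function name `rendering_loss` are all assumptions of this sketch rather than the authors' choices.

```python
import numpy as np

def rendering_loss(sem_logits, depth_pred, sem_gt, depth_gt, depth_weight=1.0):
    """Combine a semantic and a depth term over R rays (hypothetical helper).

    sem_logits: (R, C) rendered per-ray class scores, treated as logits
    depth_pred: (R,)   rendered per-ray depth
    sem_gt:     (R,)   integer class label of the pixel behind each ray
    depth_gt:   (R,)   depth label of the pixel behind each ray
    """
    sem_logits = np.asarray(sem_logits, dtype=np.float64)
    depth_pred = np.asarray(depth_pred, dtype=np.float64)
    depth_gt = np.asarray(depth_gt, dtype=np.float64)
    sem_gt = np.asarray(sem_gt, dtype=np.int64)

    # Numerically stable log-softmax over the class dimension.
    shifted = sem_logits - sem_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Cross-entropy against the 2D semantic labels (assumed loss).
    ce = -log_probs[np.arange(sem_gt.shape[0]), sem_gt].mean()

    # Mean absolute error against the 2D depth labels (assumed loss).
    l1 = np.abs(depth_pred - depth_gt).mean()

    return float(ce + depth_weight * l1)
```

In a real training loop the same computation would be written in an autodiff framework so that gradients flow back through the rendering weights into the predicted density and semantics.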
Figure 3. Auxiliary rays. A single-frame image cannot capture multi-view information about an object well: adjacent cameras overlap only in a small area, and the difference in their viewing angles is limited. By introducing auxiliary rays from adjacent frames, the model benefits significantly from multi-view consistency constraints.
Original link: https://mp.weixin.qq.com/s/WzI8mGoIOTOdL8irXrbSPQ
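One plausible way to reuse rays from an adjacent frame is to express them in the current frame's coordinate system before rendering. The sketch below assumes a known 4x4 rigid transform between the two frames (for example, derived from ego poses); the article does not describe this step, so the function name `transform_rays` and the matrix `adj_to_cur` are hypothetical.

```python
import numpy as np

def transform_rays(origins, directions, adj_to_cur):
    """Move rays from an adjacent frame into the current frame (hypothetical helper).

    origins:    (R, 3) ray origins in the adjacent frame
    directions: (R, 3) ray directions in the adjacent frame
    adj_to_cur: (4, 4) rigid transform from adjacent-frame to current-frame coordinates
    """
    origins = np.asarray(origins, dtype=np.float64)
    directions = np.asarray(directions, dtype=np.float64)
    adj_to_cur = np.asarray(adj_to_cur, dtype=np.float64)

    rotation = adj_to_cur[:3, :3]
    translation = adj_to_cur[:3, 3]

    # Points are rotated and translated; directions are only rotated.
    new_origins = origins @ rotation.T + translation
    new_directions = directions @ rotation.T
    return new_origins, new_directions
```

Once transformed, the auxiliary rays can be rendered through the current frame's volume and supervised with the 2D labels of the frame they came from, which is what supplies the extra viewpoints.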