


Foreword & the author's summary
Bird's-eye-view (BEV) detection fuses multiple surround-view cameras to detect objects. Most current algorithms are trained and evaluated on the same dataset, so they overfit to fixed camera intrinsics (camera type) and extrinsics (camera placement). This paper proposes a BEV detection framework based on implicit rendering that tackles object detection in unseen domains. The framework uses implicit rendering to relate an object's 3D position to its perspective position in a single view, and this relationship is used to correct perspective bias. The method achieves significant performance gains under both domain generalization (DG) and unsupervised domain adaptation (UDA). It is also the first attempt to train BEV detection only on virtual datasets and evaluate it in real scenes, breaking the barrier between virtual and real and enabling closed-loop testing.
- Paper link: https://arxiv.org/pdf/2310.11346.pdf
- Code link: https://github.com/EnVision-Research/Generalizable-BEV
Background: the domain generalization problem in BEV detection
Multi-camera detection uses multiple cameras to detect and localize objects in three-dimensional space. By combining information from different viewpoints, multi-camera 3D object detection can provide more accurate and robust results, especially when targets are occluded or only partially visible from some viewpoints. In recent years, bird's-eye-view (BEV) methods have received great attention for multi-camera detection. Although these methods excel at fusing multi-camera information, their performance can degrade severely when the test environment differs significantly from the training environment.
Currently, most BEV detection algorithms are trained and evaluated on the same dataset, which makes them overly sensitive to changes in camera intrinsics, camera extrinsics, and urban road conditions, leading to serious overfitting. In practice, however, BEV detection algorithms must adapt to new vehicle models and new cameras, where these algorithms fail, so studying the generalizability of BEV detection is important. In addition, closed-loop simulation matters greatly for autonomous driving, yet it can currently only be evaluated in virtual engines such as Carla; the domain gap between virtual engines and real scenes therefore needs to be bridged. Domain generalization (DG) and unsupervised domain adaptation (UDA) are two promising directions for alleviating such distribution shifts. DG methods often decouple and eliminate domain-specific features to improve generalization to unseen domains. Recent UDA methods mitigate domain shift by generating pseudo-labels or aligning latent feature distributions. However, without data covering diverse viewpoints, camera parameters, and environments, learning viewpoint- and environment-independent features for purely visual perception is very challenging.
Observation shows that 2D detection in a single view (the camera plane) often generalizes better than multi-view 3D detection, as shown in the figure. Some studies have explored integrating 2D detection into BEV detection, for example by fusing 2D information into 3D detectors or by establishing 2D-3D consistency. 2D information fusion is a learning-based rather than a mechanism-modeling approach and is still severely affected by domain shift. Existing 2D-3D consistency methods project the 3D results onto the 2D plane and enforce consistency there; this constraint can harm the semantic information in the target domain instead of correcting its geometric information, and it makes a unified treatment across different detection heads challenging.
This paper proposes a generalizable BEV detection framework based on perspective debiasing, which not only helps the model learn perspective- and context-invariant features in the source domain, but also uses 2D detectors to further correct spurious geometric features in the target domain.
- This paper is the first attempt to study unsupervised domain adaptation in BEV detection and establishes a benchmark. State-of-the-art results are achieved on both UDA and DG protocols.
- This paper explores for the first time training on a virtual engine without real scene annotations to achieve real-world BEV detection tasks.
Problem definition
The research focuses on enhancing the generalization ability of BEV detection. To this end, this paper explores two protocols with wide practical application: domain generalization (DG) and unsupervised domain adaptation (UDA).
- Domain generalization (DG) of BEV detection: train a BEV detection algorithm on an existing dataset (source domain) so that it performs well on an unknown dataset (target domain). For example, a model trained on one specific vehicle or scenario should generalize directly to a variety of different vehicles and scenarios.
- Unsupervised domain adaptation (UDA) of BEV detection: train a BEV detection algorithm on an existing dataset (source domain) and use unlabeled data from the target domain to improve detection there. For example, on a new vehicle or in a new city, collecting some unlabeled data is enough to improve the model's performance in the new setting. The only difference between DG and UDA is whether unlabeled target-domain data may be used.

Viewing angle deviation definition
To detect the unknown location L = [x, y, z] of an object, most BEV detectors have two key stages: (1) extracting image features from the different views; (2) fusing these features into BEV space and decoding the final prediction. Schematically,

L = D_bev(F_bev(F_img(I_1), ..., F_img(I_N), K, T)),

where I_1, ..., I_N are the camera images and K and T are the camera intrinsics and extrinsics. This formulation shows that domain bias can originate either in the feature-extraction stage or in the BEV-fusion stage. Projecting the final 3D prediction onto the 2D image plane, the appendix derives the perspective bias

Δu = (k_u·u + b_u) / d(u, v),   Δv = (k_v·v + b_v) / d(u, v),

where k_u, b_u, k_v and b_v are related to the domain offset of the BEV encoder, d(u, v) is the depth finally predicted by the model, and c_u and c_v are the coordinates of the camera's optical center on the uv image plane. This equation yields two important corollaries: (1) any residual position offset produces a perspective bias, so optimizing the perspective bias helps alleviate the domain offset; (2) even a point on the ray through the camera's optical center (u = c_u, v = c_v) shifts on the single-view imaging plane, because the bias does not vanish there.
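To make the 2D-3D relationship concrete, the sketch below projects a 3D object center onto the image plane with a standard pinhole model and evaluates a toy bias of the linear-in-coordinates, depth-scaled form described above. The function names, the camera matrix values, and the exact bias expression are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_to_image(center_3d, K):
    """Project a 3D object center (camera coordinates, metres) onto the
    image plane with a pinhole model: [u, v, 1]^T ~ K @ [x, y, z]^T."""
    x, y, z = center_3d
    assert z > 0, "object must be in front of the camera"
    uvw = K @ np.array([x, y, z], dtype=np.float64)
    return uvw[:2] / uvw[2]  # pixel coordinates (u, v)

def perspective_bias(uv, d, k_u, b_u, k_v, b_v):
    """Toy bias of the form described in the text: linear in the image
    coordinates and inversely scaled by the predicted depth d(u, v).
    (k_u, b_u, k_v, b_v) stand in for the BEV encoder's domain offset."""
    u, v = uv
    return np.array([(k_u * u + b_u) / d, (k_v * v + b_v) / d])

# A camera with 1000 px focal length and optical centre (800, 450):
K = np.array([[1000.0,    0.0, 800.0],
              [   0.0, 1000.0, 450.0],
              [   0.0,    0.0,   1.0]])
uv = project_to_image(np.array([2.0, 0.5, 10.0]), K)  # (u, v) = (1000, 500)

# Corollary (2): even at the optical centre (u, v) = (c_u, c_v) the bias
# is generally nonzero because of the constant terms b_u, b_v.
bias_at_center = perspective_bias((800.0, 450.0), 10.0, 0.01, 5.0, 0.01, 5.0)
```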
Detailed explanation of the PD-BEV algorithm
Intuitively, domain shift changes the position of BEV features, an overfitting effect caused by the limited viewpoints and camera parameters in the training data. To alleviate this, it is crucial to re-render images of new views from the BEV features, so that the network learns view- and environment-independent features. Accordingly, this work corrects the perspective bias associated with different rendering viewpoints to improve generalization. PD-BEV is divided into three parts, as shown in Figure 1: semantic rendering, source-domain debiasing, and target-domain debiasing. Semantic rendering establishes the perspective relationship between 2D and 3D through BEV features. Source-domain debiasing uses semantic rendering in the source domain to improve generalization. Target-domain debiasing uses unlabeled target-domain data, again via semantic rendering, to improve generalization further.
Semantic Rendering
Since many algorithms compress the BEV volume into 2D features, a BEV decoder first converts the BEV features back into a volume; in effect, this lifts the BEV plane by adding a height dimension. Given camera intrinsics and extrinsics, this volume can then be sampled into a 2D feature map, and the feature map together with the camera parameters is fed to a RenderNet that predicts the heatmap and object attributes for the corresponding view. These NeRF-like operations build a bridge between 2D and 3D.
Source domain debiasing
Several points are key to improving generalization in the source domain. First, the source-domain 3D boxes can supervise the heatmaps and attributes of newly rendered views, reducing perspective bias.
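A minimal sketch of the rendering step, assuming a dense feature volume already decoded from the BEV features: each pixel of the target view is lifted to several candidate depths, transformed into ego coordinates, and looked up in the volume. Nearest-neighbour lookup replaces the paper's differentiable sampling, and `render_view` with all its parameter names is hypothetical.

```python
import numpy as np

def render_view(volume, vol_origin, voxel_size, K, cam_to_ego, hw, depths):
    """Sample a decoded BEV volume of shape (X, Y, Z, C) along camera rays
    to obtain a per-view 2D feature map. The real pipeline feeds this map
    plus the camera parameters to a RenderNet head, which is omitted here."""
    H, W = hw
    Kinv = np.linalg.inv(K)
    feat = np.zeros((H, W, volume.shape[-1]))
    for v in range(H):
        for u in range(W):
            for d in depths:                       # lift pixel (u, v) to depth d
                p_cam = Kinv @ np.array([u * d, v * d, d])
                p_ego = cam_to_ego[:3, :3] @ p_cam + cam_to_ego[:3, 3]
                idx = np.floor((p_ego - vol_origin) / voxel_size).astype(int)
                if np.all(idx >= 0) and np.all(idx < volume.shape[:3]):
                    feat[v, u] += volume[tuple(idx)]  # accumulate along the ray
    return feat

# Tiny example: a 2x2x2 volume with one nonzero voxel, an identity camera.
vol = np.zeros((2, 2, 2, 1))
vol[0, 0, 1, 0] = 5.0
feat = render_view(vol, np.zeros(3), 1.0, np.eye(3), np.eye(4), (1, 1), [1.0])
```

In the real method this sampling is differentiable, so gradients from the rendered-view losses flow back into the BEV features.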
Second, normalized depth information can help the image encoder learn geometric information better.
Perspective semantic supervision: based on semantic rendering, heatmaps and attributes are rendered from different perspectives (the output of RenderNet). At the same time, a set of camera intrinsics and extrinsics is randomly sampled, and the objects' 3D boxes are projected onto the corresponding 2D camera plane using these parameters. Focal loss and L1 loss then constrain the rendered results against the projected 2D boxes. This reduces overfitting to the camera's intrinsics and extrinsics and improves robustness to new viewpoints. Notably, the supervision target is the heatmap of object centers rather than RGB images, which sidesteps the lack of novel-view RGB supervision in driving scenes.
Geometry supervision: providing explicit depth information effectively improves multi-camera 3D object detection, but network-predicted depth tends to overfit the intrinsics. This paper therefore adopts a virtual-depth formulation supervised with a binary cross-entropy (BCE) loss, where D_pre is the depth predicted by DepthNet, f_u and f_v are the focal lengths along u and v of the image plane, and U is a constant. Note that the depth targets are foreground depths derived from 3D boxes rather than point clouds, which encourages DepthNet to focus on the depth of foreground objects. Finally, when the semantic features are lifted to the BEV plane, the virtual depth is converted back to the actual depth using the real camera parameters.
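The geometry supervision above can be sketched as follows. The normalisation d·U/sqrt(f_u·f_v) is an assumption consistent with the symbols in the text (f_u, f_v, U), and the BCE is written over discretised depth bins in the style of BEVDepth-like heads; none of this is the paper's exact code.

```python
import numpy as np

def to_virtual_depth(d, f_u, f_v, U=1000.0):
    """Normalise actual depth by the focal lengths so the depth targets do
    not depend on the camera intrinsics (exact form is an assumption)."""
    return d * U / np.sqrt(f_u * f_v)

def to_actual_depth(d_virtual, f_u, f_v, U=1000.0):
    """Invert the normalisation when lifting features to the BEV plane."""
    return d_virtual * np.sqrt(f_u * f_v) / U

def depth_bin_bce(pred_logits, gt_depth, bins):
    """Binary cross-entropy between per-bin depth probabilities and a
    one-hot target built from the foreground ground-truth depth."""
    probs = 1.0 / (1.0 + np.exp(-pred_logits))        # per-bin sigmoid
    target = np.zeros_like(probs)
    target[np.argmin(np.abs(bins - gt_depth))] = 1.0  # nearest depth bin
    eps = 1e-7
    return -np.mean(target * np.log(probs + eps)
                    + (1.0 - target) * np.log(1.0 - probs + eps))

bins = np.array([5.0, 10.0, 15.0])
good = depth_bin_bce(np.array([-5.0, 5.0, -5.0]), 10.0, bins)  # confident, correct bin
bad = depth_bin_bce(np.array([5.0, -5.0, 5.0]), 10.0, bins)    # confident, wrong bins
```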
Debiasing the target domain
The target domain has no annotations, so 3D-box supervision cannot be used to improve generalization there. As argued above, 2D detection results are more robust than 3D ones; this paper therefore uses a 2D detector pre-trained on the source domain to supervise the rendered views, together with a pseudo-label mechanism. This effectively uses accurate 2D detections to correct the foreground object positions in BEV space, acting as an unsupervised regularizer on the target domain. To further strengthen the corrective power of the 2D predictions, a pseudo-labeling scheme sharpens the confidence of the predicted heatmaps. Section 3.2 and the supplementary material give mathematical proofs of why the 2D projections of 3D results are biased and why this procedure removes the bias; see the original paper for details.
Overall Supervision
Although several networks are added to aid training, none of them is needed at inference time. In other words, the method applies to most BEV detection methods as a way of learning perspective-invariant features. To test the framework, BEVDepth is chosen for evaluation, and the original BEVDepth loss serves as the main 3D detection supervision on the source domain; the final loss is the sum of this loss and the rendering-based debiasing terms.
Cross-domain experimental results
Table 1 compares different methods under the domain generalization (DG) and unsupervised domain adaptation (UDA) protocols. Target-Free denotes the DG protocol, while Pseudo Label, Coral and AD are common UDA methods. As the table shows, these methods all achieve significant improvements in the target domain, suggesting that semantic rendering serves as a bridge for learning perspective-invariant features against domain shift.
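The target-domain supervision can be sketched as: keep only confident 2D detections from the source-pretrained detector, then render them as centre heatmaps that the rendered views are pulled toward. Plain Gaussians replace the paper's focal-loss targets, and the 0.6 threshold is a hypothetical tuning choice.

```python
import math

def filter_pseudo_labels(detections, conf_thresh=0.6):
    """Keep only confident (box, score) pairs from the 2D detector; the
    threshold is a tuning choice, not a value from the paper."""
    return [(box, score) for box, score in detections if score >= conf_thresh]

def center_heatmap(boxes, hw, sigma=2.0):
    """Render a Gaussian centre heatmap from kept 2D boxes (x1, y1, x2, y2).
    The rendered view's predicted heatmap is supervised toward these peaks."""
    H, W = hw
    hm = [[0.0] * W for _ in range(H)]
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        for y in range(H):
            for x in range(W):
                g = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
                hm[y][x] = max(hm[y][x], g)  # overlapping objects: take the max
    return hm

# One confident detection survives filtering; its centre is (2, 2).
kept = filter_pseudo_labels([((0, 0, 4, 4), 0.9), ((5, 5, 7, 7), 0.3)])
hm = center_heatmap([box for box, _ in kept], hw=(8, 8))
```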
Furthermore, these gains do not come at the cost of source-domain performance; in most cases the source domain even improves slightly. It is worth noting that DeepAccident is built on the Carla virtual engine, and after training on DeepAccident the algorithm still achieves satisfactory generalization. Other BEV detection methods were also tested, but without special design their generalization is very poor. To further verify how well unlabeled target-domain data can be exploited, a UDA benchmark was established and common UDA methods (Pseudo Label, Coral and AD) were applied on top of DG-BEV; experiments show significant performance improvements. Implicit rendering makes full use of the better-generalizing 2D detector to correct the spurious geometric information of the 3D detector. Moreover, while most of these algorithms tend to degrade source-domain performance, the proposed method is comparatively mild in this respect. Notably, AD and Coral improve markedly when transferring from virtual to real datasets but degrade in real-to-real tests: both are designed to handle style changes, and in scenes with small style changes they may instead destroy semantic information. The Pseudo Label algorithm can improve generalization by raising confidence in target domains where predictions are already fairly good, but blindly raising confidence in the target domain makes the model worse. Overall, the experiments show that the proposed algorithm achieves significant gains in both DG and UDA. Table 2 reports ablations on the three key components: 2D detector pre-training (DPT), source-domain debiasing (SDB) and target-domain debiasing (TDB).
The results show that each component brings an improvement, with SDB and TDB contributing the most. Table 3 shows that the approach also transfers to BEVFormer and FB-OCC: because it only adds operations on image features and BEV features, it can improve any algorithm that maintains BEV features. Figure 5 visualizes detections of unlabeled objects: the first row shows the labeled 3D boxes and the second row the algorithm's detections, with blue boxes marking objects the method finds that were never labeled, such as vehicles that are very far away or beside buildings on both sides of the street.
Summary
This paper proposes a generalizable multi-camera 3D object detection framework based on perspective debiasing that addresses detection in unseen domains. The framework projects 3D detection results onto the 2D camera plane and corrects the perspective bias to achieve consistent and accurate detection, and it renders images from different perspectives to enhance the model's robustness. Experiments show significant improvements in both domain generalization and unsupervised domain adaptation. The method can also be trained purely on virtual datasets without real-scene annotations, which is convenient for practical use and large-scale deployment. In short, the paper uses NeRF-style ideas to improve the generalization of BEV detection, exploits both labeled source-domain data and unlabeled target-domain data, and explores the Sim2Real experimental paradigm, which has potential value for closed-loop autonomous driving.
Both the qualitative and quantitative results are strong, and the open-source code is worth a look.
Original link: https://mp.weixin.qq.com/s/GRLu_JW6qZ_nQ9sLiE0p2g
The above is the detailed content of "NeRF's breakthrough in BEV generalization performance: the first cross-domain open source code successfully implements Sim2Real".

