
DualBEV: significantly surpassing BEVFormer and BEVDet4D!

PHPz · 2024-03-21


This paper addresses accurate object detection in autonomous driving, and in particular how to effectively transform features from the perspective view (PV) into the bird's-eye view (BEV); this transformation is implemented by the view transformation (VT) module. Existing methods broadly follow two strategies: 2D-to-3D and 3D-to-2D conversion. 2D-to-3D methods lift dense 2D features by predicting depth probabilities, but the inherent uncertainty of depth prediction, especially in distant regions, can introduce inaccuracies. 3D-to-2D methods usually use 3D queries to sample 2D features, learning attention weights for the correspondence between 3D and 2D features through a Transformer, which increases computational and deployment complexity.


The paper points out that existing methods such as HeightFormer and FB-BEV try to combine these two VT strategies, but because the two transformations differ in character, they usually adopt a two-stage scheme that is limited by the quality of the initial features and hinders seamless fusion between the dual VTs. Furthermore, these methods still face challenges for real-time deployment in autonomous driving.

In response to these problems, the paper proposes a unified feature transformation that applies to both 2D-to-3D and 3D-to-2D visual transformation, and uses three probability measurements to evaluate the correspondence between 3D and 2D features: BEV probability, projection probability, and image probability. The new formulation aims to mitigate the effect of blank BEV-grid regions on feature construction, disambiguate multiple correspondences, and exclude background features during the transformation.

Applying this unified feature transformation, the paper explores a new 3D-to-2D visual transformation built on convolutional neural networks (CNNs), called HeightTrans. Besides demonstrating superior performance, HeightTrans can be accelerated through precomputation, making it suitable for real-time autonomous driving. At the same time, integrating the same feature transformation into the traditional LSS pipeline enhances it, demonstrating the universality of the formulation for current detectors.

Combining HeightTrans and Prob-LSS, the paper introduces DualBEV, an innovative method that considers and fuses the correspondences from both BEV and perspective views in a single stage, eliminating the dependence on initial features. In addition, a powerful BEV feature fusion module, the Dual Feature Fusion (DFF) module, is proposed; it uses a channel attention module and a spatial attention module to further refine the BEV probability prediction. DualBEV follows the principle of "broad input, strict output" and captures the probability distribution of the scene through precise dual-view probabilistic correspondences.

The main contributions of the paper are as follows:

  1. Reveals the inherent similarity between 3D-to-2D and 2D-to-3D visual transformation and proposes a unified feature transformation that accurately establishes correspondences from both the BEV and perspective views, narrowing the gap between the two strategies.
  2. Proposes HeightTrans, a new CNN-based 3D-to-2D visual transformation that establishes accurate 3D-2D correspondences effectively and efficiently through probabilistic sampling and a precomputed lookup table.
  3. Introduces DFF for dual-view feature fusion, a strategy that captures information from both near and far regions in a single stage, producing comprehensive BEV features.
  4. The resulting efficient framework, DualBEV, achieves 55.2% mAP and 63.4% NDS on the nuScenes test set even without a Transformer, highlighting the importance of capturing accurate dual-view correspondences for view transformation.

Through these innovations, the paper proposes a new strategy to overcome the limitations of existing methods and achieve more efficient and accurate object detection in real-time application scenarios such as autonomous driving.

Detailed explanation of DualBEV


The method proposed in this paper addresses the BEV (bird's-eye view) object detection problem in autonomous driving through a unified feature transformation framework, DualBEV. The following outlines the Methods section, its sub-sections, and key innovations.

DualBEV Overview

DualBEV's processing flow starts from image features obtained from multiple cameras; SceneNet then generates instance masks and depth maps from them. Next, features are extracted and transformed through the HeightTrans module and the Prob-LSS pipeline, and finally the two streams are fused and used to predict the probability distribution of the BEV space, yielding the final BEV features for subsequent tasks.
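To make the flow concrete, here is a minimal, hypothetical skeleton of the pipeline described above. The module names follow the text (SceneNet, HeightTrans, Prob-LSS, DFF), but the constructor arguments and tensor interfaces are assumptions for illustration, not the official implementation:

```python
import torch.nn as nn

class DualBEV(nn.Module):
    # Hypothetical skeleton; component internals are sketched in the
    # sections below, not taken from the released code.
    def __init__(self, scene_net, height_trans, prob_lss, dff, bev_head):
        super().__init__()
        self.scene_net = scene_net        # predicts instance masks + depth maps
        self.height_trans = height_trans  # 3D-to-2D view transformation
        self.prob_lss = prob_lss          # 2D-to-3D (LSS-style) transformation
        self.dff = dff                    # dual feature fusion + BEV probability
        self.bev_head = bev_head          # downstream detection head

    def forward(self, img_feats, cam_params):
        inst_mask, depth = self.scene_net(img_feats)
        f_ht = self.height_trans(img_feats, inst_mask, depth, cam_params)
        f_lss = self.prob_lss(img_feats, inst_mask, depth, cam_params)
        bev_feat, bev_prob = self.dff(f_ht, f_lss)   # one-stage fusion
        return self.bev_head(bev_feat)
```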

HeightTrans

HeightTrans is built on 3D-to-2D visual transformation: it selects 3D positions from a predefined bird's-eye-view (BEV) map, projects them into image space, and evaluates the resulting 3D-2D correspondences. The method first samples a set of 3D points over the BEV map, then filters and weights these correspondences to generate BEV features. By adopting a multi-resolution sampling strategy and a probabilistic sampling scheme, HeightTrans strengthens attention on small objects and avoids being misled by background pixels, while the problem of blank BEV grids is addressed by introducing the BEV probability. HeightTrans is one of the key techniques proposed in the paper; the following details how it works.

BEV Height

When handling height, HeightTrans adopts a multi-resolution sampling strategy covering the entire height range (from -5 m to 3 m): a resolution of 0.5 m inside the region of interest (ROI, defined as -2 m to 2 m) and 1.0 m outside it. This increases the focus on small objects that might be missed at a coarser sampling resolution.
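As a sketch of this height discretization, the sampled heights could be generated as follows; exact boundary handling is an assumption, since the text only specifies the ranges and step sizes:

```python
import numpy as np

def bev_height_samples(z_min=-5.0, z_max=3.0, roi=(-2.0, 2.0),
                       fine=0.5, coarse=1.0):
    below = np.arange(z_min, roi[0], coarse)         # coarse steps below the ROI
    inside = np.arange(roi[0], roi[1], fine)         # fine steps inside the ROI
    above = np.arange(roi[1], z_max + 1e-6, coarse)  # coarse steps above the ROI
    return np.concatenate([below, inside, above])

print(bev_height_samples())
# [-5.  -4.  -3.  -2.  -1.5 -1.  -0.5  0.   0.5  1.   1.5  2.   3. ]
```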

Prob-Sampling

HeightTrans performs probabilistic sampling in the following steps (a hedged sketch follows the list):

  1. Define 3D sampling points: predefine a set of 3D sampling points, each given by its position (x, y, z) in 3D space.
  2. Project to 2D space: use the camera's extrinsic and intrinsic matrices to project each 3D point to a position (u, v) in image space, with d denoting the point's depth.
  3. Feature sampling: sample the image features at each projected position with a bilinear grid sampler.
  4. Apply the instance mask: to prevent projected positions from landing on background pixels, the instance mask generated by SceneNet serves as the image probability and is applied to the sampled image features, reducing the influence of misleading information.
  5. Handle multiple correspondences: a trilinear grid sampler evaluates the depth map at (u, v, d) to obtain the projection probability, which resolves cases where multiple 3D points map to the same 2D position.
  6. Introduce the BEV probability: since empty BEV grids provide no useful information, the BEV probability is introduced to represent the occupancy probability of each grid cell in BEV space.
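Below is a hedged PyTorch sketch of these steps for a single camera. The tensor shapes, the normalization of coordinates to [-1, 1] for `grid_sample`, and all variable names are assumptions; only the multiplicative weighting structure of steps 3-6 is taken from the description above:

```python
import torch
import torch.nn.functional as F

def prob_sampling(img_feat, inst_mask, depth_map, pts_2d, pts_depth, bev_prob):
    """Hedged sketch of Prob-Sampling for one camera (names are assumptions).
    img_feat:  (1, C, H, W)    image features
    inst_mask: (1, 1, H, W)    image probability from SceneNet
    depth_map: (1, 1, D, H, W) depth distribution over D discrete bins
    pts_2d:    (1, N, 1, 2)    projected (u, v), normalized to [-1, 1]
    pts_depth: (1, N, 1, 1)    projected depth d, normalized to [-1, 1]
    bev_prob:  (N,)            occupancy probability of the target BEV cells
    """
    # Step 3: bilinear sampling of image features at the projected positions
    feat = F.grid_sample(img_feat, pts_2d, mode='bilinear',
                         align_corners=False)                      # (1, C, N, 1)
    # Step 4: image probability suppresses background pixels
    p_img = F.grid_sample(inst_mask, pts_2d, align_corners=False)  # (1, 1, N, 1)
    # Step 5: trilinear sampling of the depth map -> projection probability
    grid_3d = torch.cat([pts_2d, pts_depth], dim=-1).unsqueeze(1)  # (1,1,N,1,3)
    p_proj = F.grid_sample(depth_map, grid_3d, align_corners=False)
    p_proj = p_proj.view(1, 1, -1, 1)                              # (1, 1, N, 1)
    # Step 6: weight by the BEV occupancy probability
    weighted = feat * p_img * p_proj * bev_prob.view(1, 1, -1, 1)
    return weighted.squeeze(0).squeeze(-1).t()                     # (N, C)
```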

Acceleration

By precomputing the indices of the 3D points in BEV space and fixing the image-feature and depth-map indices during inference, HeightTrans accelerates the visual transformation: the projection math is done once offline and reduces to table lookups at runtime. The final HeightTrans feature for each BEV grid aggregates the sampled image features weighted by the image, projection, and BEV probabilities described above.
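A sketch of what such offline precomputation might look like follows; the matrix conventions (a 4x4 world-to-camera extrinsic, a 3x3 intrinsic) and the returned index layout are assumptions for illustration:

```python
import torch

def build_lut(points_3d, intrinsics, extrinsics, img_hw, depth_bin_size):
    """Offline precomputation (run once for a fixed camera rig): project all
    predefined 3D sampling points into the image and keep only those that
    land inside it. At inference these fixed indices turn the projection
    into simple lookups. Layout and names are illustrative assumptions."""
    n = points_3d.shape[0]
    homo = torch.cat([points_3d, torch.ones(n, 1)], dim=1)   # (N, 4)
    cam = (extrinsics @ homo.t()).t()[:, :3]                 # camera frame
    d = cam[:, 2]                                            # depth
    pix = (intrinsics @ cam.t()).t()                         # (N, 3)
    uv = pix[:, :2] / d.clamp(min=1e-6).unsqueeze(1)         # pixel coords
    h, w = img_hw
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & \
            (uv[:, 1] >= 0) & (uv[:, 1] < h) & (d > 0)
    return {
        'point_index': valid.nonzero(as_tuple=False).squeeze(1),
        'feat_index': uv[valid].long(),                     # image-feature lookup
        'depth_index': (d[valid] / depth_bin_size).long(),  # depth-map lookup
    }
```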

Prob-LSS

Prob-LSS extends the traditional LSS (Lift, Splat, Shoot) pipeline, which projects each pixel into BEV space by predicting its depth probability. The method further integrates the BEV probability into the construction of the LSS features.

Doing so can better handle the uncertainty in depth estimation, thereby reducing redundant information in the BEV space.
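The original formula is not reproduced here, but a minimal sketch of the splatting step, assuming the lifted features are weighted by the depth (projection) probability, the image probability, and finally the BEV probability, might look like this:

```python
import torch

def prob_lss_splat(point_feat, depth_prob, img_prob, bev_index, bev_prob, bev_hw):
    """Hedged sketch of Prob-LSS splatting (shapes and names are assumptions).
    point_feat: (P, C)  features of pixels lifted along their depth bins
    depth_prob: (P,)    depth probability of each lifted point
    img_prob:   (P,)    image probability of the source pixel
    bev_index:  (P,)    flattened long index of the BEV cell each point hits
    bev_prob:   (H*W,)  occupancy probability of each BEV cell
    """
    h, w = bev_hw
    weighted = point_feat * (depth_prob * img_prob).unsqueeze(1)
    bev = torch.zeros(h * w, point_feat.shape[1])
    bev.index_add_(0, bev_index, weighted)        # splat: sum points per cell
    return (bev * bev_prob.unsqueeze(1)).view(h, w, -1)  # suppress empty cells
```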

Dual Feature Fusion (DFF)

The DFF module is designed to fuse the features from HeightTrans and Prob-LSS and to predict the BEV probability effectively. By combining a channel attention module with a spatial-attention-enhanced ProbNet, DFF optimizes both feature selection and BEV probability prediction, strengthening the representation of near and distant objects alike. The fusion strategy exploits the complementarity of the two streams while improving the accuracy of the BEV probability through local and global attention.
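A minimal sketch of such a fusion module follows, assuming an SE-style channel attention for CAF and a small convolutional head for the spatial attention; layer sizes and the exact gating form are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DualFeatureFusion(nn.Module):
    """Hedged sketch of DFF: channel attention picks per-channel weights for
    the two streams; a small spatial-attention branch predicts the BEV
    probability used to gate the fused features."""
    def __init__(self, c):
        super().__init__()
        # Channel attention over the concatenated streams (SE-style)
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * c, c, 1),
            nn.ReLU(inplace=True), nn.Conv2d(c, c, 1), nn.Sigmoid())
        # Spatial attention head predicting the BEV occupancy probability
        self.sa = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, 1, 1), nn.Sigmoid())

    def forward(self, f_ht, f_lss):
        w = self.ca(torch.cat([f_ht, f_lss], dim=1))   # (B, C, 1, 1)
        fused = w * f_ht + (1.0 - w) * f_lss           # one-stage fusion
        bev_prob = self.sa(fused)                      # (B, 1, H, W)
        return fused * bev_prob, bev_prob
```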

In short, the proposed DualBEV framework achieves efficient evaluation and transformation of 3D-2D feature correspondences by combining HeightTrans and Prob-LSS with an innovative dual feature fusion module. This not only bridges the gap between the 2D-to-3D and 3D-to-2D transformation strategies, but also accelerates the feature transformation through precomputation and probability measurements, making it suitable for real-time autonomous driving applications.

The key to the method is establishing precise correspondences and efficiently fusing features from different viewpoints, which yields excellent performance in BEV object detection.

Experiment


The DualBEV variant (DualBEV*, with an asterisk) performs best under single-frame input, achieving 35.2% mAP and 42.5% NDS and surpassing the other methods in both accuracy and overall performance. On mAOE in particular, DualBEV* scores 0.542, the best among single-frame methods; its mATE and mASE, however, are not significantly better than those of the other methods.

When the input is increased to two frames, DualBEV improves further, reaching 38.0% mAP and 50.4% NDS, the highest NDS among all listed methods, indicating that DualBEV can exploit more complex inputs to understand the scene more fully. Among multi-frame methods it also performs strongly on mATE, mASE, and mAAE, with an especially significant improvement in mAOE, showing its advantage in estimating object orientation.

These results show that DualBEV and its variants perform well on several important metrics, especially in the multi-frame setting, indicating better accuracy and robustness for the BEV object detection task. They also highlight the importance of multi-frame data for improving the model's overall performance and estimation accuracy.


The following is an analysis of the results of each ablation experiment:

  • Progressively adding ProbNet, HeightTrans, CAF (Channel Attention Fusion), SAE (Spatial Attention Enhanced), and other components steadily improves the baseline.
  • Adding HeightTrans significantly improves mAP and NDS, showing that introducing height information into the visual transformation is effective.
  • CAF further improves mAP, at the cost of slightly higher latency.
  • Introducing SAE raises NDS to its peak of 42.5% and also improves mAP, indicating that the spatial attention mechanism effectively enhances model performance.
  • The three probability measurements (projection probability, image probability, BEV probability) are added one at a time in the comparison test.
  • The model achieves its highest mAP and NDS when all three probabilities are used together, indicating that their combination is critical to performance.
  • Prob-Sampling attains a higher NDS (39.0%) than other VT operations at similar latency (0.32 ms), underscoring the performance advantage of probabilistic sampling.
  • The multi-resolution (MR) sampling strategy achieves similar or better performance than uniform sampling with the same number of sampling points.
  • By adding the projection, image, and BEV probabilities to the LSS process, Prob-LSS outperforms the other LSS variants in both mAP and NDS, showing the effectiveness of combining these probabilities.
  • Compared with the multi-stage Refine strategy, both the single-stage Add strategy and the DFF module achieve higher NDS, and DFF also slightly improves mAP, showing that DFF is beneficial in both efficiency and performance as a single-stage fusion strategy.

The ablation experiments show that components and strategies such as HeightTrans, the probability measurements, Prob-Sampling, and DFF are crucial to improving model performance. The multi-resolution sampling strategy over height also proves its effectiveness. These findings support the authors' argument that each technique presented in the methods section contributes positively to model performance.

Discussion


This paper demonstrates the performance of its method through a series of ablation experiments. It can be seen from the experimental results that the DualBEV framework proposed in the paper and its various components have a positive impact on improving the accuracy of bird's-eye view (BEV) object detection.

The paper progressively introduces the ProbNet, HeightTrans, CAF (Channel Attention Fusion), and SAE (Spatial Attention Enhanced) modules into the baseline model, showing significant improvements in both mAP and NDS. This confirms that each component plays an important role in the overall architecture. After SAE is introduced, in particular, the NDS score rises to its peak of 42.5% while latency increases only slightly, showing that the method strikes a good balance between accuracy and latency.

The probabilistic ablation experimental results further confirm the importance of projection probability, image probability and BEV probability in improving detection performance. When these probabilities are introduced one by one, the mAP and NDS scores of the system improve steadily, demonstrating the importance of integrating these probabilistic measures into the BEV object detection task.

In the comparison of visual transformation (VT) operations, the proposed Prob-Sampling shows lower latency and a higher NDS score than alternatives such as SCAda and bilinear sampling, emphasizing its advantages in efficiency and performance. For height sampling, adopting the multi-resolution (MR) strategy instead of uniform sampling further improves the NDS score, demonstrating the value of capturing information at different heights in the scene.

For feature fusion, the paper shows that DFF maintains a high NDS score while simplifying the model, meaning that fusing the dual-stream features in a single-stage process is effective.

However, although the proposed method performs well in many respects, each improvement also increases system complexity and computational cost. Every new component (ProbNet, HeightTrans, and so on) adds latency; the increase is small, but it may become a consideration in applications with strict real-time or low-latency requirements. Likewise, while the probability measurements contribute to performance, estimating them requires additional computing resources, potentially raising resource consumption.

The DualBEV method proposed in the paper has achieved remarkable results in improving the accuracy and comprehensive performance of BEV object detection, especially in combining the latest advances in deep learning with visual transformation technology. However, these advances come at the cost of slightly increased computational latency and resource consumption, and practical applications need to weigh these factors on a case-by-case basis.

Conclusion

The method performs well on the BEV object detection task, significantly improving accuracy and overall performance. By introducing probabilistic sampling, the height transformation, attention mechanisms, and a spatial-attention-enhanced network, DualBEV improves multiple key metrics, particularly BEV accuracy and scene understanding. The experimental results show that the method is especially effective at handling complex scenes and data from different viewpoints, which is crucial for autonomous driving and other real-time applications.
