
DualBEV: significantly surpassing BEVFormer and BEVDet4D!

PHPz · 2024-03-21


This paper addresses accurate object detection in autonomous driving, and in particular how to effectively transform features from the perspective view (PV) into the bird's-eye view (BEV); this transformation is implemented by the view transformation (VT) module. Existing methods broadly follow two strategies: 2D-to-3D and 3D-to-2D conversion. 2D-to-3D methods lift dense 2D features by predicting depth probabilities, but the inherent uncertainty of depth prediction, especially in distant regions, can introduce inaccuracies. 3D-to-2D methods usually use 3D queries to sample 2D features, learning attention weights for the correspondence between 3D and 2D features through a Transformer, which increases computational and deployment complexity.


The paper points out that existing methods such as HeightFormer and FB-BEV try to combine these two VT strategies, but because the two transformations differ in character, they usually adopt a two-stage scheme that is limited by the quality of the initial features and hinders seamless fusion between the dual VTs. Furthermore, these methods still face challenges for real-time deployment in autonomous driving.

In response to these problems, the paper proposes a unified feature transformation that applies to both 2D-to-3D and 3D-to-2D visual transformation, and uses three probability measurements to evaluate the correspondence between 3D and 2D features: BEV probability, projection probability, and image probability. The new formulation aims to mitigate the effect of blank BEV-grid regions on feature construction, disambiguate multiple correspondences, and exclude background features during the transformation.

Applying this unified feature transformation, the paper explores a new 3D-to-2D visual transformation built on convolutional neural networks (CNNs), called HeightTrans. Besides demonstrating superior performance, HeightTrans can be accelerated through precomputation, making it suitable for real-time autonomous driving. At the same time, integrating the same feature transformation into the traditional LSS pipeline enhances it, demonstrating the universality of the formulation for current detectors.

Combining HeightTrans and Prob-LSS, the paper introduces DualBEV, an innovative method that considers and fuses the correspondences from both BEV and perspective views in a single stage, eliminating the dependence on initial features. In addition, a powerful BEV feature fusion module, the Dual Feature Fusion (DFF) module, is proposed; it uses a channel attention module and a spatial attention module to further refine the BEV probability prediction. DualBEV follows the principle of "broad input, strict output" and captures the probability distribution of the scene through precise dual-view probabilistic correspondences.

The main contributions of the paper are as follows:

  1. Reveals the inherent similarity between 3D-to-2D and 2D-to-3D visual transformation and proposes a unified feature transformation that accurately establishes correspondences from both the BEV and perspective views, narrowing the gap between the two strategies.
  2. Proposes HeightTrans, a new CNN-based 3D-to-2D visual transformation that establishes accurate 3D-2D correspondences effectively and efficiently through probabilistic sampling and a precomputed lookup table.
  3. Introduces DFF for dual-view feature fusion, a strategy that captures information from both near and far regions in a single stage, producing comprehensive BEV features.
  4. The resulting efficient framework, DualBEV, achieves 55.2% mAP and 63.4% NDS on the nuScenes test set even without a Transformer, highlighting the importance of capturing accurate dual-view correspondences for view transformation.

Through these innovations, the paper proposes a new strategy to overcome the limitations of existing methods and achieve more efficient and accurate object detection in real-time application scenarios such as autonomous driving.

Detailed explanation of DualBEV


The method proposed in this paper addresses the BEV (bird's-eye view) object detection problem in autonomous driving through a unified feature transformation framework, DualBEV. The following outlines the Methods section, its sub-sections, and key innovations.

DualBEV Overview

DualBEV's processing flow starts from image features obtained from multiple cameras; SceneNet then generates instance masks and depth maps from them. Next, features are extracted and transformed through the HeightTrans module and the Prob-LSS pipeline, and finally the two streams are fused and used to predict the probability distribution of the BEV space, yielding the final BEV features for subsequent tasks.
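To make the flow concrete, here is a minimal, hypothetical skeleton of the pipeline described above. The module names follow the text (SceneNet, HeightTrans, Prob-LSS, DFF), but the constructor arguments and tensor interfaces are assumptions for illustration, not the official implementation:

```python
import torch.nn as nn

class DualBEV(nn.Module):
    # Hypothetical skeleton; component internals are sketched in the
    # sections below, not taken from the released code.
    def __init__(self, scene_net, height_trans, prob_lss, dff, bev_head):
        super().__init__()
        self.scene_net = scene_net        # predicts instance masks + depth maps
        self.height_trans = height_trans  # 3D-to-2D view transformation
        self.prob_lss = prob_lss          # 2D-to-3D (LSS-style) transformation
        self.dff = dff                    # dual feature fusion + BEV probability
        self.bev_head = bev_head          # downstream detection head

    def forward(self, img_feats, cam_params):
        inst_mask, depth = self.scene_net(img_feats)
        f_ht = self.height_trans(img_feats, inst_mask, depth, cam_params)
        f_lss = self.prob_lss(img_feats, inst_mask, depth, cam_params)
        bev_feat, bev_prob = self.dff(f_ht, f_lss)   # one-stage fusion
        return self.bev_head(bev_feat)
```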

HeightTrans

HeightTrans is built on 3D-to-2D visual transformation: it selects 3D positions from a predefined bird's-eye-view (BEV) map, projects them into image space, and evaluates the resulting 3D-2D correspondences. The method first samples a set of 3D points over the BEV map, then filters and weights these correspondences to generate BEV features. By adopting a multi-resolution sampling strategy and a probabilistic sampling scheme, HeightTrans strengthens attention on small objects and avoids being misled by background pixels, while the problem of blank BEV grids is addressed by introducing the BEV probability. HeightTrans is one of the key techniques proposed in the paper; the following details how it works.

BEV Height

When handling height, HeightTrans adopts a multi-resolution sampling strategy covering the entire height range (from -5 m to 3 m): a resolution of 0.5 m inside the region of interest (ROI, defined as -2 m to 2 m) and 1.0 m outside it. This increases the focus on small objects that might be missed at a coarser sampling resolution.
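As a sketch of this height discretization, the sampled heights could be generated as follows; exact boundary handling is an assumption, since the text only specifies the ranges and step sizes:

```python
import numpy as np

def bev_height_samples(z_min=-5.0, z_max=3.0, roi=(-2.0, 2.0),
                       fine=0.5, coarse=1.0):
    below = np.arange(z_min, roi[0], coarse)         # coarse steps below the ROI
    inside = np.arange(roi[0], roi[1], fine)         # fine steps inside the ROI
    above = np.arange(roi[1], z_max + 1e-6, coarse)  # coarse steps above the ROI
    return np.concatenate([below, inside, above])

print(bev_height_samples())
# [-5.  -4.  -3.  -2.  -1.5 -1.  -0.5  0.   0.5  1.   1.5  2.   3. ]
```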

Prob-Sampling

HeightTrans performs probabilistic sampling in the following steps (a hedged sketch follows the list):

  1. Define 3D sampling points: predefine a set of 3D sampling points, each given by its position (x, y, z) in 3D space.
  2. Project to 2D space: use the camera's extrinsic and intrinsic matrices to project each 3D point to a position (u, v) in image space, with d denoting the point's depth.
  3. Feature sampling: sample the image features at each projected position with a bilinear grid sampler.
  4. Apply the instance mask: to prevent projected positions from landing on background pixels, the instance mask generated by SceneNet serves as the image probability and is applied to the sampled image features, reducing the influence of misleading information.
  5. Handle multiple correspondences: a trilinear grid sampler evaluates the depth map at (u, v, d) to obtain the projection probability, which resolves cases where multiple 3D points map to the same 2D position.
  6. Introduce the BEV probability: since empty BEV grids provide no useful information, the BEV probability is introduced to represent the occupancy probability of each grid cell in BEV space.
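Below is a hedged PyTorch sketch of these steps for a single camera. The tensor shapes, the normalization of coordinates to [-1, 1] for `grid_sample`, and all variable names are assumptions; only the multiplicative weighting structure of steps 3-6 is taken from the description above:

```python
import torch
import torch.nn.functional as F

def prob_sampling(img_feat, inst_mask, depth_map, pts_2d, pts_depth, bev_prob):
    """Hedged sketch of Prob-Sampling for one camera (names are assumptions).
    img_feat:  (1, C, H, W)    image features
    inst_mask: (1, 1, H, W)    image probability from SceneNet
    depth_map: (1, 1, D, H, W) depth distribution over D discrete bins
    pts_2d:    (1, N, 1, 2)    projected (u, v), normalized to [-1, 1]
    pts_depth: (1, N, 1, 1)    projected depth d, normalized to [-1, 1]
    bev_prob:  (N,)            occupancy probability of the target BEV cells
    """
    # Step 3: bilinear sampling of image features at the projected positions
    feat = F.grid_sample(img_feat, pts_2d, mode='bilinear',
                         align_corners=False)                      # (1, C, N, 1)
    # Step 4: image probability suppresses background pixels
    p_img = F.grid_sample(inst_mask, pts_2d, align_corners=False)  # (1, 1, N, 1)
    # Step 5: trilinear sampling of the depth map -> projection probability
    grid_3d = torch.cat([pts_2d, pts_depth], dim=-1).unsqueeze(1)  # (1,1,N,1,3)
    p_proj = F.grid_sample(depth_map, grid_3d, align_corners=False)
    p_proj = p_proj.view(1, 1, -1, 1)                              # (1, 1, N, 1)
    # Step 6: weight by the BEV occupancy probability
    weighted = feat * p_img * p_proj * bev_prob.view(1, 1, -1, 1)
    return weighted.squeeze(0).squeeze(-1).t()                     # (N, C)
```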

Acceleration

By precomputing the indices of the 3D points in BEV space and fixing the image-feature and depth-map indices during inference, HeightTrans accelerates the visual transformation: the projection math is done once offline and reduces to table lookups at runtime. The final HeightTrans feature for each BEV grid aggregates the sampled image features weighted by the image, projection, and BEV probabilities described above.
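A sketch of what such offline precomputation might look like follows; the matrix conventions (a 4x4 world-to-camera extrinsic, a 3x3 intrinsic) and the returned index layout are assumptions for illustration:

```python
import torch

def build_lut(points_3d, intrinsics, extrinsics, img_hw, depth_bin_size):
    """Offline precomputation (run once for a fixed camera rig): project all
    predefined 3D sampling points into the image and keep only those that
    land inside it. At inference these fixed indices turn the projection
    into simple lookups. Layout and names are illustrative assumptions."""
    n = points_3d.shape[0]
    homo = torch.cat([points_3d, torch.ones(n, 1)], dim=1)   # (N, 4)
    cam = (extrinsics @ homo.t()).t()[:, :3]                 # camera frame
    d = cam[:, 2]                                            # depth
    pix = (intrinsics @ cam.t()).t()                         # (N, 3)
    uv = pix[:, :2] / d.clamp(min=1e-6).unsqueeze(1)         # pixel coords
    h, w = img_hw
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & \
            (uv[:, 1] >= 0) & (uv[:, 1] < h) & (d > 0)
    return {
        'point_index': valid.nonzero(as_tuple=False).squeeze(1),
        'feat_index': uv[valid].long(),                     # image-feature lookup
        'depth_index': (d[valid] / depth_bin_size).long(),  # depth-map lookup
    }
```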

Prob-LSS

Prob-LSS extends the traditional LSS (Lift, Splat, Shoot) pipeline, which projects each pixel into BEV space by predicting its depth probability. The method further integrates the BEV probability into the construction of the LSS features.

Doing so can better handle the uncertainty in depth estimation, thereby reducing redundant information in the BEV space.
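The original formula is not reproduced here, but a minimal sketch of the splatting step, assuming the lifted features are weighted by the depth (projection) probability, the image probability, and finally the BEV probability, might look like this:

```python
import torch

def prob_lss_splat(point_feat, depth_prob, img_prob, bev_index, bev_prob, bev_hw):
    """Hedged sketch of Prob-LSS splatting (shapes and names are assumptions).
    point_feat: (P, C)  features of pixels lifted along their depth bins
    depth_prob: (P,)    depth probability of each lifted point
    img_prob:   (P,)    image probability of the source pixel
    bev_index:  (P,)    flattened long index of the BEV cell each point hits
    bev_prob:   (H*W,)  occupancy probability of each BEV cell
    """
    h, w = bev_hw
    weighted = point_feat * (depth_prob * img_prob).unsqueeze(1)
    bev = torch.zeros(h * w, point_feat.shape[1])
    bev.index_add_(0, bev_index, weighted)        # splat: sum points per cell
    return (bev * bev_prob.unsqueeze(1)).view(h, w, -1)  # suppress empty cells
```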

Dual Feature Fusion (DFF)

The DFF module is designed to fuse the features from HeightTrans and Prob-LSS and to predict the BEV probability effectively. By combining a channel attention module with a spatial-attention-enhanced ProbNet, DFF optimizes both feature selection and BEV probability prediction, strengthening the representation of near and distant objects alike. The fusion strategy exploits the complementarity of the two streams while improving the accuracy of the BEV probability through local and global attention.
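A minimal sketch of such a fusion module follows, assuming an SE-style channel attention for CAF and a small convolutional head for the spatial attention; layer sizes and the exact gating form are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DualFeatureFusion(nn.Module):
    """Hedged sketch of DFF: channel attention picks per-channel weights for
    the two streams; a small spatial-attention branch predicts the BEV
    probability used to gate the fused features."""
    def __init__(self, c):
        super().__init__()
        # Channel attention over the concatenated streams (SE-style)
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * c, c, 1),
            nn.ReLU(inplace=True), nn.Conv2d(c, c, 1), nn.Sigmoid())
        # Spatial attention head predicting the BEV occupancy probability
        self.sa = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, 1, 1), nn.Sigmoid())

    def forward(self, f_ht, f_lss):
        w = self.ca(torch.cat([f_ht, f_lss], dim=1))   # (B, C, 1, 1)
        fused = w * f_ht + (1.0 - w) * f_lss           # one-stage fusion
        bev_prob = self.sa(fused)                      # (B, 1, H, W)
        return fused * bev_prob, bev_prob
```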

In short, the proposed DualBEV framework achieves efficient evaluation and transformation of 3D-2D feature correspondences by combining HeightTrans and Prob-LSS with an innovative dual feature fusion module. This not only bridges the gap between the 2D-to-3D and 3D-to-2D transformation strategies, but also accelerates the feature transformation through precomputation and probability measurements, making it suitable for real-time autonomous driving applications.

The key to the method is establishing precise correspondences and efficiently fusing features from different viewpoints, which yields excellent performance in BEV object detection.

Experiment


The DualBEV variant (DualBEV*, with an asterisk) performs best under single-frame input, achieving 35.2% mAP and 42.5% NDS and surpassing the other methods in both accuracy and overall performance. On mAOE in particular, DualBEV* scores 0.542, the best among single-frame methods; its mATE and mASE, however, are not significantly better than those of the other methods.

When the input is increased to two frames, DualBEV improves further, reaching 38.0% mAP and 50.4% NDS, the highest NDS among all listed methods, indicating that DualBEV can exploit more complex inputs to understand the scene more fully. Among multi-frame methods it also performs strongly on mATE, mASE, and mAAE, with an especially significant improvement in mAOE, showing its advantage in estimating object orientation.

These results show that DualBEV and its variants perform well on several important metrics, especially in the multi-frame setting, indicating better accuracy and robustness for the BEV object detection task. They also highlight the importance of multi-frame data for improving the model's overall performance and estimation accuracy.


The following is an analysis of the results of each ablation experiment:

  • Progressively adding ProbNet, HeightTrans, CAF (Channel Attention Fusion), SAE (Spatial Attention Enhanced), and other components steadily improves the baseline.
  • Adding HeightTrans significantly improves mAP and NDS, showing that introducing height information into the visual transformation is effective.
  • CAF further improves mAP, at the cost of slightly higher latency.
  • Introducing SAE raises NDS to its peak of 42.5% and also improves mAP, indicating that the spatial attention mechanism effectively enhances model performance.
  • The three probability measurements (projection probability, image probability, BEV probability) are added one at a time in the comparison test.
  • The model achieves its highest mAP and NDS when all three probabilities are used together, indicating that their combination is critical to performance.
  • Prob-Sampling attains a higher NDS (39.0%) than other VT operations at similar latency (0.32 ms), underscoring the performance advantage of probabilistic sampling.
  • The multi-resolution (MR) sampling strategy achieves similar or better performance than uniform sampling with the same number of sampling points.
  • By adding the projection, image, and BEV probabilities to the LSS process, Prob-LSS outperforms the other LSS variants in both mAP and NDS, showing the effectiveness of combining these probabilities.
  • Compared with the multi-stage Refine strategy, both the single-stage Add strategy and the DFF module achieve higher NDS, and DFF also slightly improves mAP, showing that DFF is beneficial in both efficiency and performance as a single-stage fusion strategy.

The ablation experiments show that components and strategies such as HeightTrans, the probability measurements, Prob-Sampling, and DFF are crucial to improving model performance. The multi-resolution sampling strategy over height also proves its effectiveness. These findings support the authors' argument that each technique presented in the methods section contributes positively to model performance.

Discussion


This paper demonstrates the performance of its method through a series of ablation experiments. It can be seen from the experimental results that the DualBEV framework proposed in the paper and its various components have a positive impact on improving the accuracy of bird's-eye view (BEV) object detection.

The paper progressively introduces the ProbNet, HeightTrans, CAF (Channel Attention Fusion), and SAE (Spatial Attention Enhanced) modules into the baseline model, showing significant improvements in both mAP and NDS. This confirms that each component plays an important role in the overall architecture. After SAE is introduced, in particular, the NDS score rises to its peak of 42.5% while latency increases only slightly, showing that the method strikes a good balance between accuracy and latency.

The probabilistic ablation experimental results further confirm the importance of projection probability, image probability and BEV probability in improving detection performance. When these probabilities are introduced one by one, the mAP and NDS scores of the system improve steadily, demonstrating the importance of integrating these probabilistic measures into the BEV object detection task.

In the comparison of visual transformation (VT) operations, the proposed Prob-Sampling shows lower latency and a higher NDS score than alternatives such as SCAda and bilinear sampling, emphasizing its advantages in efficiency and performance. For height sampling, adopting the multi-resolution (MR) strategy instead of uniform sampling further improves the NDS score, demonstrating the value of capturing information at different heights in the scene.

For feature fusion, the paper shows that DFF maintains a high NDS score while simplifying the model, meaning that fusing the dual-stream features in a single-stage process is effective.

However, although the proposed method performs well in many respects, each improvement also increases system complexity and computational cost. Every new component (ProbNet, HeightTrans, and so on) adds latency; the increase is small, but it may become a consideration in applications with strict real-time or low-latency requirements. Likewise, while the probability measurements contribute to performance, estimating them requires additional computing resources, potentially raising resource consumption.

The DualBEV method proposed in the paper has achieved remarkable results in improving the accuracy and comprehensive performance of BEV object detection, especially in combining the latest advances in deep learning with visual transformation technology. However, these advances come at the cost of slightly increased computational latency and resource consumption, and practical applications need to weigh these factors on a case-by-case basis.

Conclusion

The method performs well on the BEV object detection task, significantly improving accuracy and overall performance. By introducing probabilistic sampling, the height transformation, attention mechanisms, and a spatial-attention-enhanced network, DualBEV improves multiple key metrics, particularly BEV accuracy and scene understanding. The experimental results show that the method is especially effective at handling complex scenes and data from different viewpoints, which is crucial for autonomous driving and other real-time applications.
