Choose camera or lidar? A recent review on achieving robust 3D object detection

WBOY
2024-01-26 11:18:28

0. Preface and personal understanding

Camera or Lidar? How to achieve robust 3D object detection? The latest review!

Autonomous driving systems rely on advanced perception, decision-making, and control technologies. They use various sensors (such as cameras, lidar, and radar) to sense the surrounding environment and apply algorithms and models for real-time analysis and decision-making. This enables vehicles to recognize road signs, detect and track other vehicles, and predict pedestrian behavior, so that they can operate safely and adapt to complex traffic environments. The technology is currently attracting widespread attention and is considered an important development area for the future of transportation. What makes autonomous driving difficult, however, is getting the car to understand what is going on around it. This requires 3D object detection algorithms that can accurately perceive and describe objects in the surrounding environment, including their location, shape, size, and category. Such comprehensive environmental awareness helps autonomous driving systems better understand the driving environment and make more precise decisions.

We conducted a comprehensive evaluation of 3D object detection algorithms in autonomous driving, mainly considering robustness. Three key factors were identified in the evaluation: environmental variability, sensor noise, and misalignment. These factors are important for the performance of detection algorithms under real-world changing conditions.

  1. Environmental variability: The article emphasizes that the detection algorithm needs to adapt to different environmental conditions, such as changes in lighting, weather, and seasons.
  2. Sensor noise: The algorithm must effectively deal with sensor noise, which may include camera motion blur and other issues.
  3. Misalignment: For misalignment caused by calibration errors or other factors, the algorithm needs to take these factors into account, whether they are external (such as uneven road surfaces) or internal (e.g. system clock misalignment).

The review also examines three key areas of performance evaluation: accuracy, latency, and robustness.

  • Accuracy: Although studies often focus on accuracy as a key performance indicator, performance under complex and extreme conditions requires a deeper understanding to ensure real-world reliability.
  • Latency: Real-time capabilities in autonomous driving are crucial. Delays in detection methods impact the system's ability to make timely decisions, especially in emergency situations.
  • Robustness: Calls for a more comprehensive assessment of the stability of systems under different conditions, as many current assessments may not fully account for the diversity of real-world scenarios.

The paper points out the significant advantages of multi-modal 3D detection methods for safety-aware perception. By fusing data from different sensors, they provide richer and more diverse perception capabilities, thereby improving the safety of autonomous driving systems.

1. Dataset

This section briefly introduces the 3D object detection datasets used in autonomous driving systems, focusing mainly on the advantages and limitations of different sensor modalities and on the characteristics of public datasets.

First, the table shows three types of sensors: camera, point cloud, and multimodal (camera and lidar). For each type, their hardware costs, advantages, and limitations are listed. The advantage of camera data is that it provides rich color and texture information, but its limitations are its lack of depth information and its susceptibility to light and weather effects. LiDAR can provide accurate depth information, but is expensive and has no color information.

Next, several public datasets are available for 3D object detection in autonomous driving, including KITTI, nuScenes, and Waymo. Details of these datasets are as follows:

  • KITTI: This dataset contains data released over multiple years and collected with different types of sensors. It provides a large number of frames and annotations, as well as a variety of scenes, including scene counts and categories, and different scene types such as day, sunny, night, and rainy.
  • nuScenes: This is also an important dataset containing data released over multiple years. It uses a variety of sensors and provides a large number of frames and annotations. It covers a variety of scenarios, including different scene counts and categories, as well as various scene types.
  • Waymo: This is another autonomous driving dataset with data from multiple years. It uses different types of sensors, provides a rich number of frames and annotations, and covers various scenarios.

Additionally, research on “clean” autonomous driving datasets is mentioned, and the importance of evaluating model robustness under noisy scenarios is emphasized. Some studies focus on camera single-modality methods under harsh conditions, while other multi-modal datasets focus on noise issues. For example, the GROUNDED dataset focuses on ground-penetrating radar positioning under different weather conditions, while the ApolloScape open dataset includes lidar, camera and GPS data, covering a variety of weather and lighting conditions.

Because collecting large-scale noisy data in the real world is prohibitively expensive, many studies turn to synthetic datasets. For example, ImageNet-C is a benchmark for evaluating the robustness of image classification models to common corruptions. This research direction was subsequently extended to robustness datasets tailored for 3D object detection in autonomous driving.
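As a rough illustration of how a synthetic robustness benchmark can be derived from clean data, the sketch below adds Gaussian noise to a grayscale image at increasing severity levels. The function name `add_gaussian_noise` and the noise levels in `SIGMA_BY_SEVERITY` are assumptions made for this example; they are not the settings of ImageNet-C or of any 3D detection benchmark.

```python
import random

# Hypothetical noise levels (standard deviation in 8-bit intensity units),
# indexed by severity 1..5. These values are illustrative assumptions.
SIGMA_BY_SEVERITY = {1: 4.0, 2: 8.0, 3: 16.0, 4: 32.0, 5: 64.0}


def add_gaussian_noise(image, severity, seed=0):
    """Return a noisy copy of a grayscale image (list of rows of 0..255 ints)."""
    rng = random.Random(seed)  # fixed seed so the corrupted set is reproducible
    sigma = SIGMA_BY_SEVERITY[severity]
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            value = pixel + rng.gauss(0.0, sigma)
            # Clamp back to the valid 8-bit range.
            new_row.append(min(255, max(0, int(round(value)))))
        noisy.append(new_row)
    return noisy


clean = [[128] * 4 for _ in range(3)]
corrupted = add_gaussian_noise(clean, severity=3)
```

Applying the same function to every validation image at each severity level yields a corrupted copy of the validation set, while the clean annotations stay unchanged.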

2. Vision-based 3D object detection

2.1 Monocular 3D object detection

This part discusses the concept of monocular 3D object detection and three main approaches: prior-guided monocular 3D object detection, camera-only monocular 3D object detection, and depth-assisted monocular 3D object detection.

Prior-guided monocular 3D object detection
This approach uses prior knowledge of object shapes and scene geometry hidden in the image to address the challenges of monocular 3D object detection. By introducing pre-trained sub-networks or auxiliary tasks, prior knowledge provides additional information or constraints that help locate 3D objects accurately and improve the accuracy and robustness of detection. Common priors include object shape, geometric consistency, temporal constraints, and segmentation information. For example, the Mono3D algorithm first assumes that 3D objects lie on a fixed ground plane and then uses the object's prior 3D shape to reconstruct the bounding box in 3D space.

Camera-only monocular 3D object detection
This approach uses only images captured by a single camera to detect and localize 3D objects. It uses a convolutional neural network (CNN) to regress 3D bounding box parameters directly from images, estimating the size and pose of objects in three-dimensional space. This direct regression can be trained end to end, which supports joint learning and inference of 3D objects. For example, the SMOKE algorithm abandons the regression of 2D bounding boxes and predicts the 3D box of each detected object by combining the estimation of a single keypoint with the regression of 3D variables.

Depth-assisted monocular 3D object detection
Depth estimation plays a key role in depth-assisted monocular 3D object detection. To achieve more accurate monocular detection results, many studies utilize pre-trained auxiliary depth estimation networks. The process starts by converting the monocular image into a depth image by using a pre-trained depth estimator such as MonoDepth. Then, two main methods are adopted to process depth images and monocular images. For example, the Pseudo-LiDAR detector uses a pretrained depth estimation network to generate Pseudo-LiDAR representations, but there is a huge performance gap between Pseudo-LiDAR and LiDAR-based detectors due to errors in image-to-LiDAR generation.
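The conversion from a depth image to a pseudo-LiDAR representation can be sketched as a back-projection through a pinhole camera model. This is a minimal illustration, not the implementation of any particular detector: the function name `depth_to_pseudo_lidar` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions, and the depth map is taken as already predicted by some depth estimator.

```python
def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (rows of depths in metres) into
    3D points in the camera frame, assuming a pinhole camera model."""
    points = []
    for v, row in enumerate(depth):          # v: pixel row
        for u, z in enumerate(row):          # u: pixel column
            if z <= 0:                       # skip pixels without valid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with hypothetical intrinsics.
depth = [[10.0, 10.0], [0.0, 20.0]]
points = depth_to_pseudo_lidar(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Any error in the predicted depth moves the resulting points along the viewing ray, which is one way to see why pseudo-LiDAR detectors lag behind detectors that use measured LiDAR points.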

Through the exploration and application of these methods, monocular 3D object detection has made significant progress in the fields of computer vision and intelligent systems, bringing breakthroughs and opportunities to these fields.

2.2 Stereo-based 3D object detection

This part discusses 3D object detection based on stereo vision. Stereo 3D object detection uses a pair of stereo images to identify and locate 3D objects. By exploiting the two views captured by stereo cameras, these methods obtain high-precision depth information through stereo matching and calibration, which distinguishes them from monocular camera setups. Despite these advantages, stereo methods still suffer from a considerable performance gap compared with lidar-based methods. Furthermore, 3D object detection from stereo images remains relatively unexplored, with only limited research devoted to it.

  1. 2D-detection based methods: The traditional 2D object detection framework can be modified to solve the stereo detection problem. For example, Stereo R-CNN uses an image-based 2D detector to predict 2D proposals, generating left and right regions of interest (RoIs) for the corresponding left and right images. Subsequently, in the second stage, it directly estimates the 3D object parameters based on the previously generated RoIs. This paradigm was widely adopted in subsequent work.
  2. Pseudo-LiDAR based methods: The disparity map predicted from the stereo image can be converted into a depth map and further converted into pseudo LiDAR points. Therefore, similar to monocular detection methods, pseudo-lidar representation can also be used in stereo vision-based 3D object detection methods. These methods aim to enhance disparity estimation in stereo matching to achieve more accurate depth prediction. For example, Wang et al. were pioneers in introducing pseudo-lidar representation. This representation is generated from an image with a depth map, requiring the model to perform depth estimation tasks to assist detection. Subsequent work followed this paradigm and refined it by introducing additional color information to enhance pseudo-point clouds, auxiliary tasks (such as instance segmentation, foreground and background segmentation, domain adaptation) and coordinate transformation schemes. It is worth noting that PatchNet proposed by Ma et al. challenges the traditional concept of using pseudo-lidar representation for monocular 3D object detection. By encoding 3D coordinates for each pixel, PatchNet can achieve comparable monocular detection results without pseudo-lidar representation. This observation suggests that the power of the pseudo-lidar representation comes from the coordinate transformation rather than the point cloud representation itself.
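For a rectified stereo pair, the step from a disparity map to a depth map follows the standard relation depth = focal length × baseline / disparity. The sketch below assumes that relation; the function name `disparity_to_depth` and the focal length and baseline values are hypothetical.

```python
def disparity_to_depth(disparity, focal_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) to a depth map (metres) for a
    rectified stereo pair. Pixels with no valid disparity get depth 0.0."""
    depth = []
    for row in disparity:
        depth.append([
            focal_px * baseline_m / d if d > min_disparity else 0.0
            for d in row
        ])
    return depth


# Hypothetical camera: 700 px focal length, 0.5 m baseline.
depth = disparity_to_depth([[35.0, 70.0], [0.0, 7.0]],
                           focal_px=700.0, baseline_m=0.5)
```

Because depth is inversely proportional to disparity, a fixed disparity error causes a much larger depth error for distant objects, which is why these methods concentrate on improving disparity estimation.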

2.3 Multi-view 3D object detection

Recently, multi-view 3D object detection has improved in accuracy and robustness and now outperforms the monocular and stereo 3D object detection methods described above. Unlike LiDAR-based 3D object detection, the latest surround-view Bird's Eye View (BEV) methods eliminate the need for high-precision maps and lift detection from 2D to 3D. This progress has led to significant developments in multi-view 3D object detection. In multi-camera 3D object detection, the key challenge is to identify the same object in different images and aggregate object features from multiple viewpoints. A common practice in current methods is to map the multiple views uniformly into BEV space.

Depth-based Multi-view methods

Direct conversion from 2D to BEV space poses a significant challenge. LSS was the first to propose a depth-based method that uses 3D space as an intermediary. It first predicts a depth distribution over a grid for the 2D features and then lifts these features into voxel space, offering a more efficient route from 2D to BEV space. Following LSS, CaDDN adopts a similar depth representation: it compresses voxel-space features into BEV space and performs the final 3D detection there. Note that CaDDN is a single-view rather than a multi-view 3D object detector, but it has influenced subsequent depth-based research. The main difference between LSS and CaDDN is that CaDDN uses ground-truth depth values to supervise the prediction of its categorical depth distribution, yielding a better depth network that extracts 3D information from 2D space more accurately.
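The "lift" step can be illustrated for a single pixel: the predicted depth distribution is combined with the pixel's feature vector through an outer product, so that each depth bin receives a copy of the feature scaled by the probability of that depth. This is a simplified sketch of the idea only; `lift_pixel` and the toy inputs are assumptions, and a real model does this for all pixels and cameras at once.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(feature, depth_logits):
    """Outer product of a depth distribution (D bins) and a feature vector
    (C channels). Returns D feature vectors, one per depth bin."""
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]


# A uniform depth prediction spreads the feature equally over 4 depth bins.
lifted = lift_pixel(feature=[1.0, 2.0], depth_logits=[0.0, 0.0, 0.0, 0.0])
```

Summing the lifted features over the depth bins recovers the original feature, since the depth probabilities sum to one. In a CaDDN-style setup the depth logits would additionally be supervised with ground-truth depth.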

Query-based Multi-view methods

Under the influence of Transformer technology, query-based multi-view methods retrieve 2D image features from 3D space. DETR3D introduces 3D object queries to solve the aggregation problem of multi-view features. It projects learned 3D reference points onto the 2D images from different viewpoints and samples the image features at the projected locations, thereby obtaining features in Bird's Eye View (BEV) space. Unlike depth-based multi-view methods, query-based multi-view methods obtain sparse BEV features through this reverse query technique, which has fundamentally shaped subsequent query-based work. However, because of the potential inaccuracies associated with explicit 3D reference points, PETR adopted an implicit position encoding method to construct the BEV space, which also influenced subsequent work.
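The reference-point lookup can be sketched as follows: a 3D point is projected into each camera with a 3x4 projection matrix, and the features found at the projected pixels are averaged over the cameras that see the point. This is a simplified illustration, not DETR3D's implementation: the names `project_point` and `sample_reference_point` are hypothetical, and nearest-pixel lookup is used here as a stand-in for interpolated sampling.

```python
def project_point(P, point):
    """Project a 3D point with a 3x4 projection matrix.
    Returns (u, v) in pixels, or None if the point is behind the camera."""
    x, y, z = point
    u_h, v_h, w = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in P]
    if w <= 0:
        return None
    return u_h / w, v_h / w


def sample_reference_point(point, cameras):
    """cameras: list of (P, feature_map) pairs, feature_map[v][u] is a list
    of channels. Averages features over the cameras that see the point."""
    gathered = []
    for P, fmap in cameras:
        uv = project_point(P, point)
        if uv is None:
            continue
        u, v = int(round(uv[0])), int(round(uv[1]))
        if 0 <= v < len(fmap) and 0 <= u < len(fmap[0]):
            gathered.append(fmap[v][u])
    if not gathered:
        return None  # the point is not visible in any camera
    n = len(gathered)
    return [sum(channel) / n for channel in zip(*gathered)]


def blank_map(value):
    return [[[value, value] for _ in range(3)] for _ in range(3)]


# Toy setup: two cameras looking forward, one looking backward.
P_front = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P_back = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0], [0.0, 0.0, -1.0, 0.0]]
fmap_a, fmap_b, fmap_c = blank_map(0.0), blank_map(100.0), blank_map(0.0)
fmap_a[1][1] = [5.0, 7.0]
fmap_c[1][1] = [7.0, 9.0]
cameras = [(P_front, fmap_a), (P_back, fmap_b), (P_front, fmap_c)]
feature = sample_reference_point((0.0, 0.0, 2.0), cameras)
```

The backward-facing camera is skipped because the point lies behind it, so only the two forward-facing views contribute to the aggregated feature.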

2.4 Analysis: Accuracy, Latency, Robustness

Currently, 3D object detection solutions based on Bird's Eye View (BEV) perception are developing rapidly. Despite the existence of many review articles, comprehensive coverage of this field is still insufficient. Shanghai AI Lab and SenseTime Research Institute provide an in-depth review of the technology roadmap for BEV solutions. Unlike existing reviews, however, we consider key aspects such as safety-aware perception for autonomous driving. After analyzing the technology roadmap and current development status of camera-based solutions, we structure the discussion around three basic principles: accuracy, latency, and robustness. We integrate the safety perspective to guide the practical implementation of safety-aware perception in autonomous driving.

  1. Accuracy: There is a lot of focus on accuracy in most research articles and reviews, and it is really important. Although accuracy can be reflected by AP (average precision), considering AP alone may not provide a comprehensive perspective as different methods may exhibit significant differences due to different paradigms. As shown in the figure, we selected 10 representative methods for comparison, and the results show that there are significant metric differences between monocular 3D object detection and stereoscopic 3D object detection. The current situation shows that the accuracy of monocular 3D object detection is much lower than that of stereoscopic 3D object detection. Stereo vision 3D object detection utilizes images captured from two different perspectives of the same scene to obtain depth information. The larger the baseline between cameras, the wider the range of depth information captured. Over time, multi-view (bird's-eye view perception) 3D object detection gradually replaced monocular methods, significantly improving mAP. The increase in the number of sensors has a significant impact on mAP.
  2. Latency: In autonomous driving, latency is crucial. It refers to the time a system takes to react to an input signal, covering the entire process from sensor data collection to decision-making and the execution of actions. The requirements are very strict, as any delay can lead to serious consequences. The importance of latency is reflected in real-time responsiveness, safety, user experience, interactivity, and emergency response. In 3D object detection, speed (measured in frames per second, FPS) and accuracy are key indicators of algorithm performance. As shown in the figure, the plot for monocular and stereo 3D object detection shows average precision (AP) versus FPS at the moderate difficulty level of the KITTI dataset. For autonomous driving, 3D object detection algorithms must strike a balance between latency and accuracy. Monocular detection is fast but lacks accuracy; conversely, stereo and multi-view methods are accurate but slower. Future research should not only maintain high accuracy but also pay more attention to improving FPS and reducing latency, to meet the dual requirements of real-time responsiveness and safety in autonomous driving.
  3. Robustness: Robustness is a key factor in safety-aware perception for autonomous driving and an important topic that previous comprehensive reviews have overlooked. It is often not addressed by current well-designed clean datasets and benchmarks such as KITTI, nuScenes, and Waymo. Research works such as RoboBEV and Robo3D now incorporate robustness considerations, such as sensor loss, into 3D object detection. Their methodology introduces perturbations into 3D object detection datasets to assess robustness, covering various types of noise such as changes in weather conditions, sensor failures, motion disturbances, and object-related perturbations, with the aim of revealing how different noise sources affect the model. Most papers studying robustness evaluate by adding noise to the validation sets of clean datasets (such as KITTI, nuScenes, and Waymo). Additionally, we highlight the findings in Ref., which evaluate camera-only 3D object detection methods on KITTI-C and nuScenes-C. The table provides an overall comparison showing that camera-only approaches are less robust than lidar-only and multi-modal fusion approaches and are very susceptible to various types of noise. In KITTI-C, three representative works (SMOKE, PGD, and ImVoxelNet) show consistently lower overall performance and reduced robustness to noise. In nuScenes-C, methods such as DETR3D and BEVFormer show greater robustness than FCOS3D and PGD, indicating that overall robustness increases with the number of sensors. In summary, future camera-only approaches need to consider not only cost and accuracy metrics (mAP, NDS, etc.) but also factors related to safety-aware perception and robustness. Our analysis aims to provide valuable insights into the safety of future autonomous driving systems.
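One simple way to summarize the kind of robustness evaluation described above is to compare a detector's score on the perturbed validation sets with its score on the clean set. The ratio used below is an assumption chosen for illustration, not the metric defined by any specific benchmark, and the scores are made-up numbers.

```python
def relative_robustness(clean_score, corrupted_scores):
    """Ratio of the mean score under corruptions to the clean score.
    1.0 means no degradation; lower values mean a less robust detector."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    if not corrupted_scores:
        raise ValueError("at least one corrupted score is required")
    mean_corrupted = sum(corrupted_scores.values()) / len(corrupted_scores)
    return mean_corrupted / clean_score


# Hypothetical AP values for one detector (not taken from any paper).
scores_under_noise = {"fog": 30.0, "motion_blur": 20.0, "sensor_dropout": 10.0}
ratio = relative_robustness(clean_score=40.0, corrupted_scores=scores_under_noise)
```

Reporting such a ratio next to the clean score makes it visible when a method with high clean accuracy degrades sharply under noise.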

3. Lidar-based 3D object detection

3.1 Voxel-based 3D object detection

Voxel-based 3D object detection methods segment sparse point clouds and assign the points to regular voxels, resulting in a dense data representation; this process is called voxelization. Compared with view-based methods, voxel-based methods use spatial convolution to perceive 3D spatial information effectively and achieve higher detection accuracy, which is crucial for safety-aware perception in autonomous driving. However, these methods still face the following challenges:

  1. High computational complexity: Compared with camera-based methods, voxel-based methods require large amounts of memory and computing resources because of the huge number of voxels used to represent 3D space.
  2. Loss of spatial information: Because voxels are a discretization, details and shape information may be lost or blurred during voxelization. In addition, the limited resolution of voxels makes it difficult to detect small objects accurately.
  3. Scale and density inconsistency: Voxel-based methods usually need to detect on voxel grids of different scales and densities, but because the scale and density of targets vary greatly across scenes, choosing a scale and density that suit different targets is a challenge.

To overcome these challenges, it is necessary to address the limitations of the data representation, improve the network's feature capacity and localization accuracy, and strengthen the algorithm's understanding of complex scenes. Although optimization strategies vary, they generally aim to improve both the data representation and the model structure.
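The voxelization step used by voxel-based detectors can be sketched as grouping points by the integer index of the grid cell they fall into. This is a minimal illustration under stated assumptions: the function names, the voxel size, and the cap on points per voxel are hypothetical, and the mean of the points stands in for a learned voxel feature encoder.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size, max_points_per_voxel=32):
    """Group 3D points into voxels keyed by integer grid index (ix, iy, iz)."""
    voxels = defaultdict(list)
    for point in points:
        key = tuple(math.floor(c / s) for c, s in zip(point, voxel_size))
        if len(voxels[key]) < max_points_per_voxel:  # drop overflow points
            voxels[key].append(point)
    return dict(voxels)


def voxel_mean_features(voxels):
    """Use the mean of the points in each voxel as a simple voxel feature."""
    return {
        key: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for key, pts in voxels.items()
    }


points = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.1, 0.1)]
voxels = voxelize(points, voxel_size=(0.5, 0.5, 0.5))
features = voxel_mean_features(voxels)
```

The first two points share a voxel and are reduced to a single averaged feature, which shows both the efficiency gain and the loss of fine detail that discretization brings.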

3.2 Point-based 3D object detection

Point-based 3D object detection methods inherit many deep learning frameworks for point clouds and detect 3D objects directly from the raw point cloud without preprocessing. Compared with voxel-based methods, the raw point cloud retains as much of the original information as possible, which helps in acquiring fine-grained features and achieving high accuracy. The PointNet series of work also provides a strong foundation for point-based methods. So far, however, the performance of point-based methods is still affected by two factors: the number of context points and the context radius used in feature learning. For example, increasing the number of context points yields more detailed 3D information but significantly increases the model's inference time. Similarly, reducing the context radius has the same effect. Choosing appropriate values for these two factors therefore lets the model balance accuracy and speed. In addition, because calculations must be performed for every point in the point cloud, the point cloud sampling process is the main factor limiting the real-time operation of point-based methods. To address these problems, existing methods mainly optimize the two basic components of point-based 3D object detectors: 1) point cloud sampling; 2) feature learning.

Farthest Point Sampling (FPS), derived from PointNet++, is a point cloud sampling method widely used in point-based methods. Its goal is to select a representative set of points from the original point cloud that maximizes the distance between them, so as to best cover the spatial distribution of the entire point cloud. PointRCNN is a groundbreaking two-stage detector among point-based methods, using PointNet++ as the backbone network. In the first stage, it generates 3D proposals from point clouds in a bottom-up manner. In the second stage, the proposals are refined by combining semantic features and local spatial features. However, existing FPS-based methods still face some problems: 1) points unrelated to detection also take part in sampling, adding computational burden; 2) points are unevenly distributed across different parts of an object, resulting in suboptimal sampling. To address these issues, subsequent work kept an FPS-like design paradigm and improved on it, for example with segmentation-guided background point filtering, random sampling, feature space sampling, voxel-based sampling, and ray grouping-based sampling.
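Farthest point sampling can be sketched as a greedy loop: start from one point, then repeatedly pick the point whose distance to the already selected set is largest. The version below is a plain quadratic-time illustration with a hypothetical function name and a fixed starting index; practical implementations run on the GPU.

```python
import math


def farthest_point_sampling(points, num_samples, start_index=0):
    """Return indices of points chosen greedily so that each new point is
    the one farthest from all previously selected points."""
    n = len(points)
    if n == 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, n)
    selected = [start_index]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        next_index = max(range(n), key=lambda i: min_dist[i])
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected


points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0.5, 0, 0), (5, 0, 0)]
indices = farthest_point_sampling(points, num_samples=3)
```

In this toy example the three clustered points near the origin contribute only one sample, while the distant points at 10 and 5 are both kept. Each new sample requires a pass over all points, which is why sampling limits real-time operation on large point clouds.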

The feature learning stage of point-based 3D object detection methods aims to extract discriminative feature representations from sparse point cloud data. The neural network used in this stage should have the following characteristics: 1) permutation invariance: the point cloud backbone network should be insensitive to the order of the input points; 2) local perception: it should be able to sense and model local regions and extract local features; 3) context integration: it should be able to extract features from both global and local context information. Based on these characteristics, a large number of detectors have been designed to process raw point clouds. Most methods can be grouped by the core operator they use: 1) PointNet-based methods; 2) graph neural network-based methods; 3) Transformer-based methods.

PointNet-based methods

PointNet-based methods mainly rely on set abstraction to downsample the original points, aggregate local information, and integrate contextual information while preserving the symmetry invariance of the original points. PointRCNN is the first two-stage work among point-based methods and achieves excellent performance, but it still faces the problem of high computational cost. Subsequent work addressed this by introducing an additional semantic segmentation task into the detection process to filter out background points that contribute little to detection.
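The symmetry invariance mentioned above comes from applying the same function to every point and then aggregating with a symmetric operation such as max pooling. The sketch below shows this with one hand-set linear layer; the names `shared_mlp` and `global_feature` and the weights are assumptions for illustration, not a trained network.

```python
def shared_mlp(point, weights, bias):
    """Apply the same linear layer + ReLU to a single point.
    weights has shape (out_channels, in_channels)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, point)) + b)
        for row, b in zip(weights, bias)
    ]


def global_feature(points, weights, bias):
    """Per-point shared MLP followed by channel-wise max pooling."""
    per_point = [shared_mlp(p, weights, bias) for p in points]
    return [max(channel) for channel in zip(*per_point)]


# Hypothetical weights: 3 input coordinates -> 3 output channels.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
bias = [0.0, 0.0, -1.0]
points = [(1.0, 2.0, 3.0), (4.0, 0.0, -1.0), (0.0, 5.0, 0.0)]

feat = global_feature(points, weights, bias)
feat_shuffled = global_feature(list(reversed(points)), weights, bias)
```

Both calls return the same vector, because the maximum over a set does not depend on the order in which the points are visited.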

Graph neural network-based methods

Graph neural networks (GNNs) offer adaptive structures, dynamic neighborhoods, the ability to build local and global context relationships, and robustness to irregular sampling. Point-GNN is a pioneering work that designs a single-stage graph neural network to predict the category and shape of objects through an auto-registration mechanism together with merging and scoring operations, demonstrating the potential of graph neural networks as a new approach to 3D object detection.

Transformer-based methods

In recent years, the Transformer has been explored in point cloud analysis and has performed well on many tasks. For example, Pointformer introduces local and global attention modules to process 3D point clouds: the local Transformer module models interactions between points in local regions, and the global Transformer learns scene-level context-aware representations. Group-free directly uses all points in the point cloud to compute the features of each object candidate, with the contribution of each point determined by an automatically learned attention module. These methods demonstrate the potential of Transformer-based methods for processing unstructured and unordered raw point clouds.

3.3 Point-Voxel based 3D object detection

Point-based 3D object detection methods provide high resolution and retain the spatial structure of the original data, but they face high computational complexity and inefficiency when dealing with sparse data. In contrast, voxel-based methods provide a structured data representation, improve computational efficiency, and make it easy to apply conventional convolutional neural network techniques. However, they often lose fine spatial details because of discretization. Point-voxel (PV) based methods were developed to solve these problems. They aim to combine the fine-grained information capture of point-based methods with the computational efficiency of voxel-based methods. By integrating the two, point-voxel based methods can process point cloud data in more detail, capturing both global structure and fine geometric details. This is crucial for safety-aware perception in autonomous driving, because the decision-making accuracy of the autonomous driving system depends on high-precision detection results.

The key goal of the point-voxel method is to achieve feature interaction between voxels and points through point-to-voxel or voxel-to-point conversion. Many works have explored the idea of using point-voxel feature fusion in backbone networks. These methods can be divided into two categories: 1) early fusion; 2) late fusion.

a) Early fusion: Some methods have explored new convolution operators to fuse voxel and point features, and PVCNN may be the first work in this direction. In this approach, the voxel-based branch first converts points into a low-resolution voxel grid and aggregates neighboring voxel features through convolution. Then, through a process called devoxelization, the voxel-level features are converted back to point-level features and fused with the features obtained by the point-based branch. The point-based branch extracts features for each individual point; since it does not aggregate neighborhood information, it can run at higher speed. SPVCNN subsequently built on PVCNN and extended it to object detection. Other methods try to improve from different perspectives, such as auxiliary tasks or multi-scale feature fusion.

b) Late fusion: This family of methods mainly uses a two-stage detection framework. First, preliminary object proposals are generated using a voxel-based approach. Then, point-level features are used to refine the bounding boxes precisely. PV-RCNN, proposed by Shi et al., is a milestone among point-voxel based methods. It uses SECOND as the first-stage detector and proposes a second refinement stage with RoI grid pooling to fuse keypoint features. Subsequent work mainly follows this paradigm and focuses on improving second-stage detection. Notable developments include attention mechanisms, scale-aware pooling, and point density-aware refinement modules.

Point-voxel based methods combine the computational efficiency of voxel-based methods with the ability of point-based methods to capture fine-grained information. However, constructing point-to-voxel or voxel-to-point relationships, as well as fusing voxel and point features, brings additional computational overhead. Point-voxel based methods can therefore achieve better detection accuracy than voxel-based methods, but at the cost of increased inference time.
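The early-fusion flow described above (voxelize, aggregate, devoxelize, fuse with per-point features) can be sketched in a few lines. This is a heavily simplified illustration, not PVCNN itself: features are scalars, averaging within a voxel stands in for voxel convolution, devoxelization copies the feature of the containing voxel instead of interpolating, and fusion is a plain sum. The function name `point_voxel_fuse` is hypothetical.

```python
import math
from collections import defaultdict


def point_voxel_fuse(points, point_feats, voxel_size):
    """Fuse a coarse voxel-branch feature with each point's own feature."""
    # Voxel branch: accumulate the sum and count of features per voxel.
    sums = defaultdict(float)
    counts = defaultdict(int)
    keys = []
    for point, feat in zip(points, point_feats):
        key = tuple(math.floor(c / voxel_size) for c in point)
        keys.append(key)
        sums[key] += feat
        counts[key] += 1
    voxel_feat = {key: sums[key] / counts[key] for key in sums}

    # Devoxelize (nearest voxel) and fuse with the point-branch feature.
    return [feat + voxel_feat[key] for feat, key in zip(point_feats, keys)]


points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (3.0, 3.0, 3.0)]
point_feats = [1.0, 3.0, 10.0]
fused = point_voxel_fuse(points, point_feats, voxel_size=1.0)
```

Even in this toy form, every point needs a voxel lookup in both directions, which hints at the extra overhead that point-voxel methods pay for their accuracy.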

4. Multi-modal 3D object detection

4.1 Projection-based 3D object detection

Projection-based 3D object detection methods use a projection matrix in the feature fusion stage to integrate point cloud and image features. The focus here is on projection during feature fusion, not on other uses of projection in the pipeline, such as data augmentation. According to the type of projection used in the fusion stage, projection-based 3D object detection methods can be further subdivided into the following categories:

  1. 3D object detection based on point projection: This type of method enhances the representation of the raw point cloud by projecting image features onto it. The first step is to use a calibration matrix to establish a tight correspondence between lidar points and image pixels. The point features are then augmented with additional data, in one of two forms: appending segmentation scores (as in PointPainting) or attaching CNN features from the corresponding pixels (as in MVP). PointPainting enhances lidar points by appending segmentation scores, but it is limited in capturing the color and texture details of the image. To address this, more sophisticated methods such as FusionPainting were developed.
  2. 3D object detection based on feature projection: Unlike point-projection methods, this type of method fuses point cloud features with image features during point cloud feature extraction. The two modalities are fused by applying a calibration matrix that transforms the 3D coordinates of voxels into the pixel coordinate system of the image. For example, ContFuse fuses multi-scale convolutional feature maps through continuous convolution.
  3. 3D object detection based on automatic projection: Many studies perform fusion through direct projection but do not address projection error. Some works (such as AutoAlignV2) mitigate these errors by learning offsets and neighborhood projections. For example, HMFI and GraphAlign use prior knowledge of the projection calibration matrix for image projection and local graph modeling.
  4. 3D object detection based on decision projection: This type of method uses a projection matrix to align features in a region of interest (RoI) or for a specific result. For example, Graph-RCNN projects graph nodes to positions in the camera image and collects the feature vector at each position through bilinear interpolation. F-PointNet determines the category and location of objects through 2D image detection, then obtains the corresponding point cloud in 3D space using the calibrated sensor parameters and transformation matrices.

These methods show how projection can be used for feature fusion in multi-modal 3D object detection, but they remain limited in accuracy and in how well they handle the interaction between modalities.
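The point-projection idea above (project each lidar point into the image, then append the segmentation scores found at that pixel) can be sketched in NumPy as follows. This is an illustrative sketch in the spirit of PointPainting, not its actual code: `paint_points` is a hypothetical name, and `proj` is assumed to be a single 3x4 lidar-to-image matrix that already combines extrinsics and intrinsics.

```python
import numpy as np

def paint_points(points, seg_scores, proj):
    """Append per-pixel segmentation scores to lidar points.

    points:     (N, 3) lidar points in the lidar frame
    seg_scores: (H, W, K) per-pixel class scores from an image segmenter
    proj:       (3, 4) lidar-to-image projection matrix (assumed given)
    """
    h, w, k = seg_scores.shape

    # Project homogeneous lidar points into the image plane.
    homo = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = homo @ proj.T
    depth = cam[:, 2]

    # Only points in front of the camera can be projected meaningfully.
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)
    u = np.floor(cam[:, 0] / safe_depth).astype(int)
    v = np.floor(cam[:, 1] / safe_depth).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Points outside the image keep all-zero scores.
    scores = np.zeros((points.shape[0], k))
    scores[valid] = seg_scores[v[valid], u[valid]]
    return np.hstack([points, scores]), valid
```

The sketch also makes the weakness discussed above visible: any error in `proj` shifts `u` and `v`, so a point silently picks up the scores of the wrong pixel.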

4.2 Non-projection-based 3D object detection

Non-projection-based 3D object detection methods achieve fusion without relying on feature alignment, which yields robust feature representations. They circumvent the limitations of camera-to-lidar projection, which often reduces the semantic density of camera features and limits the effectiveness of techniques such as Focals Conv and PointPainting. Non-projection methods usually adopt a cross-attention mechanism or construct a unified space to address the misalignment inherent in direct feature projection. They fall into two main categories: (1) query-learning-based methods and (2) unified-feature-based methods. Query-learning-based methods avoid the need for alignment during fusion altogether. Unified-feature-based methods, although they construct a unified feature space, do not avoid projection entirely; it usually occurs within a single modality. For example, BEVFusion uses LSS for camera-to-BEV projection. This step takes place before fusion and shows considerable robustness in scenarios where features are misaligned.

  1. 3D object detection based on query learning: Query-learning-based methods, such as TransFusion, DeepFusion, DeepInteraction, AutoAlign, CAT-Det and MixedFusion, avoid the need for projection during feature fusion. Instead, they align features through a cross-attention mechanism before fusing them. Point cloud features are usually used as queries, and image features as keys and values. Highly robust multi-modal features are obtained through global feature queries. In addition, DeepInteraction introduces multi-modal interaction, in which point cloud and image features each serve as queries to achieve further feature interaction. Integrating image features this thoroughly yields more robust multi-modal features than using only point cloud features as queries. In general, query-learning-based methods use a Transformer-based structure to perform feature queries and thereby achieve feature alignment. Finally, the multi-modal features are fed into a lidar-based pipeline such as CenterPoint.
  2. 3D object detection based on unified features: Unified-feature-based methods, such as EA-BEV, BEVFusion, BEVFusion4D, FocalFormer3D, FUTR3D, UniTR, Uni3D, VirConv, MSMDFusion, SFD, CMT, UVTR and SparseFusion, usually unify the heterogeneous modalities through projection before feature fusion. In the BEVFusion family, LSS is used for depth estimation, the front-view features are converted into BEV features, and the image BEV features are then fused with the point cloud BEV features. CMT and UniTR, on the other hand, use a Transformer to tokenize point clouds and images and construct an implicit unified space through Transformer encoding. CMT uses projection in its position encoding but avoids any reliance on projection relationships at the feature-learning level. FocalFormer3D, FUTR3D and UVTR use Transformer queries in a scheme similar to DETR3D and build a unified sparse BEV feature space through these queries, which alleviates the instability caused by direct projection.
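The query-learning scheme from item 1, with point cloud features as queries and image features as keys and values, reduces to standard scaled dot-product cross-attention. The sketch below is a single-head NumPy version; the function name and the projection weights `w_q`, `w_k`, `w_v` are hypothetical stand-ins for learned parameters, and real methods use multi-head Transformer layers with additional components.

```python
import numpy as np

def cross_attention(lidar_feats, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention from lidar queries to image tokens.

    lidar_feats: (N, C_l) point cloud (or BEV) features used as queries
    image_feats: (M, C_i) image features used as keys and values
    w_q: (C_l, d), w_k: (C_i, d), w_v: (C_i, d) stand-in learned weights
    """
    q = lidar_feats @ w_q
    k = image_feats @ w_k
    v = image_feats @ w_v

    # Scaled dot-product similarity between every query and every image token.
    logits = q @ k.T / np.sqrt(q.shape[1])

    # Numerically stable softmax over the image tokens.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each lidar query receives a weighted mix of image values.
    return weights @ v, weights
```

No calibration matrix appears anywhere in the function: the correspondence between modalities is carried entirely by the attention weights, which is why these methods tolerate misalignment better than direct projection.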

VirConv, MSMDFusion and SFD construct a unified space through pseudo point clouds, with projection occurring before feature learning. The problems introduced by direct projection are then addressed by the subsequent feature learning. In summary, unified-feature-based 3D object detection methods currently represent highly accurate and robust solutions. Although they use a projection matrix, the projection does not take place during multi-modal fusion, so they are still classed as non-projection-based 3D object detection methods. Unlike automatic-projection methods, they do not tackle projection error directly; instead, they construct a unified space and consider multiple dimensions of multi-modal 3D object detection to obtain highly robust multi-modal features.
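As a rough illustration of the unified-space idea, the following NumPy sketch lifts image features into a BEV grid using per-pixel depth probabilities, loosely in the spirit of LSS, and then concatenates them with lidar BEV features. All names are hypothetical, and `frustum_xy` (the ground-plane position of every pixel at every depth bin) is assumed to be precomputed from the camera calibration; this is not the BEVFusion implementation.

```python
import numpy as np

def lift_splat(img_feats, depth_probs, frustum_xy, bev_shape, cell_size):
    """Lift image features along depth and sum them into BEV cells.

    img_feats:   (H, W, C) image feature map
    depth_probs: (D, H, W) per-pixel probabilities over D depth bins
    frustum_xy:  (D, H, W, 2) ground-plane x, y of each frustum point
                 (assumed precomputed from the camera calibration)
    bev_shape:   (X, Y) number of BEV cells along x and y
    cell_size:   edge length of one BEV cell, same unit as frustum_xy
    """
    # Lift: weight each pixel feature by its probability at every depth bin.
    lifted = depth_probs[..., None] * img_feats[None]

    # Splat: find the BEV cell of each frustum point and sum features there.
    cells = np.floor(frustum_xy / cell_size).astype(int)
    nx, ny = bev_shape
    cx, cy = cells[..., 0], cells[..., 1]
    valid = (cx >= 0) & (cx < nx) & (cy >= 0) & (cy < ny)
    bev = np.zeros((nx, ny, img_feats.shape[-1]))
    np.add.at(bev, (cx[valid], cy[valid]), lifted[valid])
    return bev

def fuse_bev(camera_bev, lidar_bev):
    """Fuse two BEV maps of equal spatial size by channel concatenation."""
    return np.concatenate([camera_bev, lidar_bev], axis=-1)
```

The projection here stays inside the camera branch; the fusion step itself only stacks two maps that already share the BEV grid, which matches the distinction drawn above.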

5. Conclusion

3D object detection plays a vital role in autonomous driving perception. In recent years, the field has developed rapidly and produced a large number of research papers. Based on the forms of data generated by the sensors, these methods fall into three main types: image-based, point cloud-based and multi-modal. They are mainly evaluated on accuracy and latency. Many reviews summarize these approaches, focusing on the core principle of "high accuracy and low latency" and describing their technical trajectories.

However, as autonomous driving technology moves from breakthroughs to practical applications, existing reviews do not take safety perception as their core focus and fail to cover the current technical paths related to it. For example, recent multi-modal fusion methods are often tested for robustness in their experiments, an aspect that existing reviews have not fully considered.

Therefore, we re-examine 3D object detection algorithms with "accuracy, latency and robustness" as the key aspects. We reclassify the methods covered in previous reviews, with special emphasis on classification from a safety perception perspective. We hope this work will provide new insights for future research on 3D object detection that go beyond the pursuit of high accuracy alone.

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.