Spread everything? 3DifFusionDet: Diffusion model enters LV fusion 3D target detection!
In recent years, diffusion models have been very successful in generation tasks, and they have naturally been extended to object detection. This approach models object detection as a denoising diffusion process from noisy boxes to object boxes. During training, object boxes are diffused from ground-truth boxes toward a random distribution, and the model learns to reverse this noising process. During inference, the model progressively refines a set of randomly generated boxes into the output results. Unlike traditional object detection methods, which rely on a fixed set of learnable queries, 3DifFusionDet does not require learnable queries.
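The training-time noising step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cosine schedule, the 1000-step horizon, the seven-value box encoding and the names `cosine_alpha_bar` and `diffuse_box` are all assumptions made for the example.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level alpha_bar(t) of a cosine noise schedule (assumed)."""
    f = lambda u: math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def diffuse_box(box, t, num_steps=1000, rng=random):
    """Sample x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps, with a = alpha_bar(t)."""
    a = cosine_alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in box]


# Hypothetical normalised ground-truth box: (x, y, z, length, width, height, yaw).
gt_box = [0.2, -0.1, 0.05, 0.4, 0.18, 0.15, 0.3]
rng = random.Random(0)
slightly_noisy = diffuse_box(gt_box, t=10, rng=rng)     # still close to gt_box
almost_pure_noise = diffuse_box(gt_box, t=999, rng=rng)  # nearly Gaussian noise
```

At t = 0 the box is returned unchanged, and as t approaches the last step the ground-truth signal all but vanishes. The model is trained to undo this corruption.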
The 3DifFusionDet framework formulates 3D object detection as a denoising diffusion process from noisy 3D boxes to object boxes. During training, ground-truth boxes are diffused toward a random distribution and the model learns the reverse denoising process. During inference, the model gradually refines a set of randomly generated boxes. Combined with the feature alignment strategy, this progressive refinement can make an important contribution to LiDAR-camera fusion. The iterative refinement process also makes the framework highly adaptable, since it can be applied to detection settings that require different levels of accuracy and speed. Extensive experiments on KITTI, a benchmark for real-world traffic object recognition, show that 3DifFusionDet achieves good performance compared with earlier detectors.
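The inference-time refinement loop can be illustrated with a short sketch. It assumes a deterministic DDIM-style update and a cosine noise schedule, neither of which is confirmed by the article. `toy_decoder` is a hypothetical stand-in for the trained detection decoder, and `alpha_bar` and `refine_boxes` are names chosen for the example.

```python
import math
import random


def alpha_bar(t, num_steps=1000, s=0.008):
    """Cumulative signal level of a cosine noise schedule (an assumed choice)."""
    f = lambda u: math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def refine_boxes(noisy_boxes, predict_boxes, num_eval_steps, num_steps=1000):
    """Refine random boxes with deterministic DDIM-style updates."""
    # Evenly spaced timesteps, starting from the noisiest level.
    times = [round((num_steps - 1) * (1 - i / num_eval_steps))
             for i in range(num_eval_steps)]
    boxes = noisy_boxes
    for i, t in enumerate(times):
        pred = predict_boxes(boxes, t)  # current estimate of the clean boxes
        if i == num_eval_steps - 1:
            return pred  # last step: keep the clean estimate
        a_t = alpha_bar(t, num_steps)
        a_next = alpha_bar(times[i + 1], num_steps)
        refined = []
        for box, p in zip(boxes, pred):
            # Noise implied by the current boxes and the clean estimate.
            eps = [(b - math.sqrt(a_t) * v) / math.sqrt(1 - a_t)
                   for b, v in zip(box, p)]
            # Re-noise the estimate to the next, less noisy level.
            refined.append([math.sqrt(a_next) * v + math.sqrt(1 - a_next) * e
                            for v, e in zip(p, eps)])
        boxes = refined
    return boxes


target = [0.2, -0.1, 0.05, 0.4, 0.18, 0.15, 0.3]


def toy_decoder(boxes, t):
    # Hypothetical stand-in for the trained decoder: pulls each box halfway to one target.
    return [[0.5 * (b + g) for b, g in zip(box, target)] for box in boxes]


rng = random.Random(0)
proposals = [[rng.gauss(0.0, 1.0) for _ in range(7)] for _ in range(3)]
coarse = refine_boxes(proposals, toy_decoder, num_eval_steps=1)  # one pass, fastest
finer = refine_boxes(proposals, toy_decoder, num_eval_steps=4)   # four passes, slower
```

The number of evaluation steps is a free parameter at inference time, which is what lets the same trained model trade speed for accuracy.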
The main contributions of 3DifFusionDet are as follows:
For 3D object detection, cameras and LiDAR are two complementary sensor types. LiDAR sensors focus on 3D localization and provide rich information about 3D structure, while cameras provide color information from which rich semantic features can be derived. Many efforts have been made to detect 3D objects accurately by fusing data from cameras and LiDAR. Because LiDAR-based detection methods perform significantly better than camera-based ones, state-of-the-art methods are mainly built on LiDAR-based 3D object detectors and try to incorporate image information into various stages of the LiDAR detection pipeline. Given the complexity of LiDAR-based and camera-based detection systems, combining the two modalities inevitably increases computational cost and inference latency. Therefore, the problem of fusing multimodal information effectively remains open.
The diffusion model is a generative model that gradually destroys the observed data by adding noise and restores the original data by reversing that process. Diffusion models and denoising score matching are connected through the denoising diffusion probabilistic model (Ho, Jain, and Abbeel 2020a), which has recently sparked interest in computer vision applications. Diffusion models have been applied in many fields, such as graph generation, language understanding, robust learning and temporal data modeling.
Diffusion models have achieved great success in image generation and synthesis. Some pioneering works adopt diffusion models for image segmentation tasks. Compared to these fields, their potential for object detection has not yet been fully exploited. Previous approaches to object detection using diffusion models have been limited to 2D bounding boxes. Compared with 2D detection, 3D detection provides richer spatial information about objects and enables accurate depth perception and volume understanding. This is crucial for applications such as autonomous driving, where the precise distance and direction of surrounding vehicles must be identified.
Figure 1 shows the overall architecture of 3DifFusionDet. It accepts multimodal inputs consisting of RGB images and point clouds. As with DiffusionDet, the model is divided into a feature extraction part and a feature decoding part, because it would be difficult to apply the decoder directly to the raw 3D features at every iteration step. The feature extraction part is run only once to extract a deep feature representation from the original input X, while the feature decoding part is conditioned on this deep feature and trained to gradually produce box predictions from noisy boxes. To take full advantage of the complementary information provided by the two modalities, each modality has its own encoder and decoder. Furthermore, the image decoder and the point cloud decoder are trained separately to refine 2D and 3D features, using the noisy 2D and 3D boxes generated by the diffusion process, respectively. As for joining these two feature branches, simply concatenating them causes information shearing and degrades performance. To this end, a multi-head cross-attention mechanism is introduced to deeply align these features. The aligned features are fed to the detection head, which predicts the final noise-free boxes.
For the point cloud encoder, a voxel-based method is used for feature extraction and a sparsity-based method is used for processing. Voxel-based methods convert LiDAR points into voxels. Compared with other families of point feature extraction methods (such as point-based methods), they discretize the point cloud into an equally spaced 3D grid, reducing memory requirements while retaining as much of the original 3D shape information as possible. Sparsity-based processing further improves the network's computational efficiency. These benefits offset the relatively high computational requirements of diffusion models.
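A minimal sketch of the voxelization idea follows. It is illustrative only: the voxel size, the dictionary layout and the names `voxelize` and `voxel_centroids` are assumptions, and the learned voxel feature encoder of a real network is replaced here by a plain centroid.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Group (x, y, z) points by the integer index of the voxel containing them."""
    voxels = defaultdict(list)
    for p in points:
        idx = tuple(math.floor((c - o) / s) for c, o, s in zip(p, origin, voxel_size))
        voxels[idx].append(p)
    # Only occupied voxels are stored, so the result is sparse by construction.
    return dict(voxels)


def voxel_centroids(voxels):
    """Summarise each occupied voxel by the mean of its points."""
    return {idx: tuple(sum(c) / len(pts) for c in zip(*pts))
            for idx, pts in voxels.items()}


points = [(0.1, 0.1, 0.1), (0.3, 0.2, 0.1), (1.2, 0.1, 0.1), (-0.1, 0.0, 0.0)]
voxels = voxelize(points, voxel_size=(0.5, 0.5, 0.5))
centroids = voxel_centroids(voxels)
```

Here the four points fall into three occupied voxels. Empty cells of the grid never appear in the dictionary, which is the property that sparse processing exploits.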
Compared with 2D features, 3D features contain an extra dimension, which makes learning more challenging. With this in mind, in addition to extracting features from each original modality, a fusion path is added that feeds the extracted image features into the point encoder as another input, facilitating information exchange and letting the point branch learn from more diverse sources. A PointFusion strategy is employed, in which points from the LiDAR sensor are projected onto the image plane. The image features are then concatenated with their corresponding points and jointly processed by the VoxelNet architecture.
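The projection-and-concatenation step can be sketched as below. This is a simplified illustration under stated assumptions: `P` is a hypothetical 3x4 matrix that already maps 3D points to pixel coordinates, the image feature is looked up at the nearest pixel, and `project_point` and `decorate_points` are names invented for the example.

```python
def project_point(P, point):
    """Project a 3D point to pixel coordinates (u, v) with a 3x4 matrix P."""
    x, y, z = point
    hom = (x, y, z, 1.0)
    u, v, w = (sum(p * h for p, h in zip(row, hom)) for row in P)
    if w <= 0:
        return None  # behind the camera, no valid projection
    return u / w, v / w


def decorate_points(points, P, feature_map):
    """Append the image feature at each point's projected pixel to its coordinates."""
    height, width = len(feature_map), len(feature_map[0])
    decorated = []
    for pt in points:
        uv = project_point(P, pt)
        if uv is None:
            continue
        col, row = int(round(uv[0])), int(round(uv[1]))
        if 0 <= row < height and 0 <= col < width:
            decorated.append(list(pt) + list(feature_map[row][col]))
    return decorated


# Toy setup: focal length 2, principal point (1, 1), 3x3 map of 2-channel features.
P = [[2.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 1.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
feature_map = [[(float(r), float(c)) for c in range(3)] for r in range(3)]
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 0.0, -1.0)]
decorated = decorate_points(points, P, feature_map)
```

Points that project outside the image or lie behind the camera are dropped in this sketch. A real pipeline might keep them with zero-valued image features instead.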
Feature decoder. The extracted image features and extracted point features are used as inputs to the corresponding image and point decoders. In addition to these extracted features, each decoder also takes its own set of generated noisy boxes as input (2D boxes for the image branch, 3D boxes for the point branch) and learns to refine 2D and 3D features, respectively.
Inspired by Sparse RCNN, the image decoder receives a set of 2D proposal boxes as input and crops RoI features from the feature map created by the image encoder. For the point decoder, the input is a set of 3D proposal boxes, which are used to crop 3D RoI features from the feature map generated by the point encoder.
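To make the cropping step concrete, here is a deliberately simplified sketch for the 2D case. It swaps RoIAlign for plain mean pooling over the cells inside the box (no bilinear sampling, no fixed output grid), and `roi_mean_pool` is a hypothetical helper rather than part of any library.

```python
def roi_mean_pool(feature_map, box):
    """Average the feature vectors of all cells inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    height, width = len(feature_map), len(feature_map[0])
    rows = range(max(0, int(y1)), min(height, int(y2) + 1))
    cols = range(max(0, int(x1)), min(width, int(x2) + 1))
    cells = [feature_map[r][c] for r in rows for c in cols]
    if not cells:
        return None  # the box lies entirely outside the feature map
    return [sum(channel) / len(cells) for channel in zip(*cells)]


# 4x4 single-channel feature map whose value encodes the cell position.
feature_map = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
roi_feature = roi_mean_pool(feature_map, box=(1, 1, 2, 2))
```

Each proposal box yields one pooled feature vector in this sketch. The 3D branch follows the same pattern with a volumetric box and a 3D feature map.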
Cross Attention Module. After decoding the two feature branches, a way to combine them is needed. A straightforward approach is simply to concatenate the two feature branches. This method is too crude and may cause the model to suffer from information shearing, leading to performance degradation. Therefore, a multi-head cross-attention mechanism is introduced to deeply align and refine these features, as shown in Figure 1. Specifically, the output of the point decoder serves as the source of the keys k and values v, while the output of the image decoder is projected to the queries q.
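The attention pattern described here can be sketched in a few lines. This is a minimal version that omits the learned projection matrices for q, k, v and the output (effectively treating them as identity), which is an assumption made to keep the example short. The function names are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def multi_head_cross_attention(queries, keys, values, num_heads):
    """Split channels into heads, attend per head, then concatenate the heads."""
    head_dim = len(queries[0]) // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        heads.append(cross_attention([q[sl] for q in queries],
                                     [k[sl] for k in keys],
                                     [v[sl] for v in values]))
    return [sum((head[i] for head in heads), []) for i in range(len(queries))]


# q comes from the image decoder output, k and v from the point decoder output.
image_feats = [[1.0, 0.0]]
point_feats_k = [[1.0, 1.0], [1.0, 1.0]]
point_feats_v = [[0.0, 2.0], [4.0, 6.0]]
fused = multi_head_cross_attention(image_feats, point_feats_k, point_feats_v, num_heads=2)
```

Because the two keys are identical in this toy input, every head weights both values equally, so the fused feature is their mean.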
Experiments were conducted on the KITTI 3D object detection benchmark. Following the standard KITTI evaluation protocol for measuring detection performance (IoU = 0.7), Table 1 shows the mean average precision (mAP) of 3DifFusionDet compared with state-of-the-art methods on the KITTI validation set. Following [diffusionDet, difficileist], the two best-performing models for each task are shown in bold. According to the results in Table 1, the method shows a significant performance improvement over the baselines. When D=4, the method surpasses most baseline models with a shorter inference time. When D is further increased to 8, it achieves the best performance among all models, although the inference time is longer. This flexibility suggests that the method has a wide range of potential applications.
To design a 3D object detector from camera and LiDAR data using diffusion models, the most straightforward approach would be to apply the generated noisy 3D boxes directly to the fused 3D features. However, this approach may suffer from information shearing, resulting in performance degradation, as shown in Table 2. To address this, in addition to the branch that applies point cloud RoIAlign to the encoded 3D features, a second branch is created that applies image RoIAlign to the encoded 2D features. The significantly improved performance suggests that the complementary information provided by the two modalities can be better exploited this way.
We then analyze the impact of different fusion strategies: given the learned 2D and 3D feature representations, how can they be combined more effectively? Compared with 2D features, 3D features have an extra dimension, which makes the learning process more challenging. We add an information flow path from image features to point features by projecting points from the LiDAR sensor onto the image features and concatenating those features with the corresponding points, which are then jointly processed by the VoxelNet architecture. As can be seen from Table 3, this fusion strategy greatly benefits detection accuracy.
The other part that needs fusing is the connection of the two feature branches after decoding. Here, a multi-head cross-attention mechanism is applied to deeply align and refine these features. More direct alternatives were also studied, namely concatenation, summation, direct product, and a multilayer perceptron (MLP). The results are shown in Table 4. Among them, the cross-attention mechanism shows the best performance, with almost the same training and inference speed.
Trade-off between accuracy and inference speed. The impact of the number of proposal boxes and of D is shown by comparing 3D detection accuracy and frames per second (FPS). The number of proposal boxes is chosen from 100 and 300, while D is chosen from 1, 4 and 8. The running time is evaluated on a single NVIDIA RTX A6000 GPU with a batch size of 1. Increasing the number of proposal boxes from 100 to 300 yields significant accuracy gains at negligible latency cost (1.3 FPS vs. 1.2 FPS). Increasing D, on the other hand, improves detection accuracy at the price of longer inference time. When D changes from 1 to 8, the 3D detection accuracy rises sharply at first (Easy: 87.1 mAP to 90.5 mAP) and then relatively slowly (Easy: 90.5 mAP to 91.3 mAP), while FPS keeps decreasing.
Case study and future work. Based on its unique properties, this article discusses the potential uses of 3DifFusionDet. Generally speaking, accuracy, robustness and real-time inference are three requirements for object detection tasks. In perception for self-driving cars, models are particularly sensitive to real-time requirements, because a car traveling at high speed needs extra time and distance to slow down or change direction due to inertia.
More importantly, to ensure a comfortable ride, the car should drive as smoothly as possible, with the smallest absolute acceleration that safety allows. A smoother ride experience is a major advantage over comparable self-driving car products. To achieve this, a self-driving car should start reacting quickly, whether it is accelerating, decelerating or turning. The faster the car responds, the more room it has for subsequent maneuvers and adjustments. This matters more than first obtaining the most precise classification or location of the detected object: once the car has started to respond, there is still time and distance to adjust its behavior, and that time can be used for further, more precise inference. The refined results are then used to fine-tune the car's driving behavior.
According to the results in Table 4, when the number of inference steps is small, our 3DifFusionDet model can perform inference quickly and still obtain relatively high accuracy. This initial perception is accurate enough to let the self-driving car begin a new response. As the number of inference steps increases, we are able to generate more accurate object detections and further fine-tune the response. This progressive approach to detection is well suited to the task. Furthermore, since our model can adjust the number of proposal boxes during inference, we can use the prior information obtained from the first few steps to optimize the number of proposal boxes in real time. According to the results in Table 4, performance also varies with the number of prior proposal boxes. Therefore, developing such adaptive detectors is a promising direction.
In addition to self-driving cars, our model is a natural match for any realistic scenario that requires short inference time in a continuous reaction space, especially scenarios in which the detector itself moves based on the detection results.
Benefiting from the properties of the diffusion model, 3DifFusionDet can quickly find an approximately correct region of interest in real space, triggering the machine to start new operations and self-optimization. Subsequent higher-precision perception then further fine-tunes the machine's operation. For deploying the model in such moving detectors, one open question is how to combine inference information between earlier inferences at larger steps and more recent inferences at smaller steps.
This article introduces a new 3D object detector called 3DifFusionDet, which has powerful LiDAR and camera fusion capabilities. By formulating 3D object detection as a generative denoising process, this is the first work to apply diffusion models to 3D object detection. Within this generative denoising framework, the study explores the most effective camera-LiDAR fusion alignment strategies and proposes a fusion alignment strategy that fully exploits the complementary information provided by the two modalities. Compared with mature detectors, 3DifFusionDet performs well, demonstrating the broad application prospects of diffusion models in object detection tasks. Its strong learned performance and flexible inference give it broad potential uses.
Original link: https://mp.weixin.qq.com/s/0Fya4RYelNUU5OdAQp9DVA