In-depth discussion on the application of multi-modal fusion perception algorithms in autonomous driving
This article was originally published by the Autonomous Driving Heart public account; please contact the source for permission to reprint.
Multi-modal sensor fusion provides complementary, stable, and reliable information, and has long been a key component of autonomous driving perception. However, insufficient use of the available information, noise in the raw data, and misalignment between sensors (such as unsynchronized timestamps) all limit fusion performance. This article comprehensively surveys existing multi-modal autonomous driving perception algorithms based on LiDAR and cameras, focusing on object detection and semantic segmentation, and analyzes more than 50 papers. Unlike the traditional way of categorizing fusion algorithms, it divides the field into two major categories and four sub-categories according to the stage at which fusion takes place. In addition, it analyzes open problems in the field and points to directions for future research.
This is because single-modal perception algorithms have inherent flaws. For example, LiDAR is generally mounted higher than the camera; in complex real-world driving scenarios, an object occluded in the front-view camera may still be captured by the LiDAR. However, due to its mechanical structure, LiDAR has different resolutions at different distances and is easily affected by extreme weather such as heavy rain. Although each sensor performs well on its own, in the long run the complementary information from LiDAR and cameras will make autonomous driving safer at the perception level.
Recently, multi-modal perception algorithms for autonomous driving have made great progress, including cross-modal feature representations, more reliable sensors, and more sophisticated and stable multi-modal fusion algorithms and techniques. However, only a few surveys [15, 81] focus on the fusion methodology itself, and most of the literature is organized according to the traditional classification rule of pre-fusion, deep (feature) fusion, and post-fusion, which mainly considers the stage at which features are fused in the algorithm, i.e., data level, feature level, or proposal level. This classification rule has two problems: first, the feature representation at each level is not clearly defined; second, it treats the LiDAR and camera branches symmetrically, which blurs cases where feature-level information in the LiDAR branch is fused with data-level information in the camera branch. In summary, although the traditional taxonomy is intuitive, it no longer fits the development of current multi-modal fusion algorithms and, to some extent, prevents researchers from studying and analyzing the field from a systematic perspective.
Common perception tasks include object detection, semantic segmentation, depth completion and prediction, etc. This article focuses on detection and segmentation, such as the detection of obstacles, traffic lights, and traffic signs, and the segmentation of lane lines and free space. The autonomous driving perception tasks are shown in the figure below:
Common public datasets mainly include KITTI, Waymo, and nuScenes. The following figure summarizes the datasets related to autonomous driving perception and their features.
Multi-modal fusion is inseparable from the way data is represented. The representation of the image branch is relatively simple, generally referring to RGB or grayscale images, but the LiDAR branch is highly dependent on the data format, and different formats lead to completely different downstream model designs. In summary, they fall into three general directions: point-based, voxel-based, and 2D-mapping-based point cloud representations.
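To make the voxel-based representation concrete, here is a minimal NumPy sketch (not taken from the survey) that maps a raw point cloud to voxel indices and groups points per voxel; the voxel size, point-cloud range, and the random input cloud are illustrative assumptions.

```python
# Minimal sketch: voxelizing a LiDAR point cloud with NumPy.
# Voxel size and range are illustrative assumptions, not values from the survey.
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Map raw points (N, 4: x, y, z, reflectance) to integer voxel indices."""
    pts = points[:, :3]
    low, high = np.array(pc_range[:3]), np.array(pc_range[3:])
    mask = np.all((pts >= low) & (pts < high), axis=1)   # keep points inside the range
    pts, feats = pts[mask], points[mask]
    idx = ((pts - low) / np.array(voxel_size)).astype(np.int32)  # voxel index per point
    # group points that fall into the same voxel
    uniq, inverse = np.unique(idx, axis=0, return_inverse=True)
    voxels = [feats[inverse == i] for i in range(len(uniq))]
    return uniq, voxels

# Example: 1000 random points standing in for a LiDAR sweep
cloud = np.random.uniform(-1, 1, size=(1000, 4)).astype(np.float32)
cloud[:, 0] = np.abs(cloud[:, 0]) * 70   # x forward
cloud[:, 1] *= 40                        # y lateral
coords, grouped = voxelize(cloud)
print(coords.shape, len(grouped))
```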
Traditional classification methods divide multi-modal fusion into three types: pre-fusion (data level), deep fusion (feature level), and post-fusion (proposal level).
This article instead uses the taxonomy shown in the figure below, which first divides methods into strong fusion and weak fusion; strong fusion is further subdivided into pre-fusion, deep fusion, asymmetric fusion, and post-fusion.
This article uses KITTI's 3D detection and BEV detection tasks to compare the performance of various multi-modal fusion algorithms side by side. The following figure shows the results on the BEV detection test set:
The following shows the results on the 3D detection test set:
According to the stage at which the LiDAR and camera data representations are combined, this article subdivides strong fusion into pre-fusion, deep fusion, asymmetric fusion, and post-fusion. As shown in the figure above, every sub-category of strong fusion depends heavily on the LiDAR point cloud rather than on the camera data.
Different from the pre-fusion defined by traditional classification methods, which directly fuses the data of each modality at the raw data level through spatial alignment and projection, the pre-fusion defined in this article fuses LiDAR data at the data level with image data at either the data level or the feature level. The schematic diagram is as follows:
In the LiDAR branch, the point cloud has many representations, such as reflectance maps, voxelized tensors, front-view/range-view/BEV projections, and pseudo point clouds. Although these representations have different intrinsic characteristics in different backbone networks, except for pseudo point clouds [79] most of them are generated by rule-based processing. In addition, compared with embeddings in feature space, these LiDAR representations are highly interpretable and can be visualized directly. In the image branch, the strict data-level definition refers only to RGB or grayscale images, but this definition lacks generality and rationality. Therefore, this article extends the data-level definition of image data in the pre-fusion stage to include both data-level and feature-level data. It is worth mentioning that this article also treats semantic segmentation predictions as a form of pre-fusion input (image feature level): on the one hand, they help 3D object detection, and on the other hand, the "object-level" features of semantic segmentation differ from the final object-level proposals of the whole task.
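As an illustration of the spatial alignment and projection step that pre-fusion relies on, the sketch below projects LiDAR points into the image plane and attaches per-pixel semantic segmentation scores to them (in the spirit of PointPainting-style decoration). The function name, calibration matrices, and segmentation scores are placeholders; in practice the extrinsics and intrinsics come from the dataset's calibration files (e.g. KITTI).

```python
# Minimal sketch of the spatial alignment step used in pre-fusion: project LiDAR
# points into the image and decorate them with per-pixel segmentation scores.
# All calibration values and inputs below are dummy placeholders.
import numpy as np

def paint_points(points_xyz, seg_scores, T_cam_from_lidar, K):
    """points_xyz: (N, 3) LiDAR points; seg_scores: (H, W, C) image segmentation scores."""
    N = points_xyz.shape[0]
    homo = np.concatenate([points_xyz, np.ones((N, 1))], axis=1)   # (N, 4)
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]                     # LiDAR -> camera frame
    in_front = cam[:, 2] > 0.1                                     # keep points in front of camera
    uv = (K @ cam.T).T                                             # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]
    H, W, _ = seg_scores.shape
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    painted = np.zeros((N, seg_scores.shape[2]), dtype=np.float32)
    painted[valid] = seg_scores[v[valid], u[valid]]                # per-point class scores
    return np.concatenate([points_xyz, painted], axis=1)           # decorated point cloud

# Dummy inputs standing in for real calibration and segmentation outputs
T = np.eye(4)
K = np.array([[700., 0., 620.], [0., 700., 190.], [0., 0., 1.]])
pts = np.random.uniform([0, -10, -2], [40, 10, 1], size=(500, 3))
scores = np.random.rand(375, 1242, 4).astype(np.float32)
print(paint_points(pts, scores, T, K).shape)   # (500, 3 + 4)
```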
The difference between weak fusion and strong fusion is that weak fusion does not directly fuse data, features, or proposals from the multi-modal branches, but uses the data in other forms. The figure below shows the basic framework of a weak-fusion algorithm. Weak-fusion methods usually rely on rule-based approaches in which the data of one modality serves as a supervision signal to guide the other modality. For example, a 2D proposal from the CNN in the image branch can be used to crop a frustum from the original LiDAR point cloud; weak fusion then feeds only the raw LiDAR point cloud into the LiDAR backbone to output the final proposal.
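A minimal sketch of this weak-fusion pattern, assuming the same kind of calibration-based projection as above: a 2D proposal from the image branch is used only to select which raw LiDAR points are passed on to the LiDAR backbone, and no image features are fused. The proposal coordinates and calibration matrices are dummy values for illustration.

```python
# Minimal sketch of weak fusion: a 2D box predicted by the image branch selects a
# frustum of raw LiDAR points; only those points go into the LiDAR backbone.
# Calibration and the proposal box are illustrative placeholders.
import numpy as np

def project_to_image(points_xyz, T_cam_from_lidar, K):
    homo = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3], cam[:, 2]

def frustum_points(points, box_2d, T_cam_from_lidar, K):
    """Keep only LiDAR points whose image projection falls inside a 2D proposal."""
    u_min, v_min, u_max, v_max = box_2d
    uv, depth = project_to_image(points[:, :3], T_cam_from_lidar, K)
    mask = ((depth > 0.1)
            & (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max)
            & (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max))
    return points[mask]          # this subset is fed to the LiDAR backbone

# Usage with dummy calibration and a hypothetical proposal from the image detector
T = np.eye(4)
K = np.array([[700., 0., 620.], [0., 700., 190.], [0., 0., 1.]])
cloud = np.random.uniform([0, -10, -2, 0], [40, 10, 1, 1], size=(2000, 4))
proposal = (500, 100, 800, 300)  # (u_min, v_min, u_max, v_max)
print(frustum_points(cloud, proposal, T, K).shape)
```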
There are also some works that do not fit any of the above paradigms because they use multiple fusion methods within a single model design. For example, [39] combines deep fusion and post-fusion, and [77] combines pre-fusion with deep fusion. These are not mainstream approaches to fusion algorithm design and are grouped under "other fusion methods" in this article.
In recent years, multi-modal fusion methods for autonomous driving perception tasks have made rapid progress, ranging from more advanced cross-modal feature representations to more complex and robust deep learning models. However, some open problems remain. This article summarizes several possible directions for future improvement as follows.
Current fusion models suffer from misalignment and information loss [13, 67, 98]. In addition, overly simple fusion operations, such as plain concatenation or element-wise addition, also hinder further improvement of perception performance.
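For concreteness, the sketch below shows the kind of "flat" fusion operation referred to here: camera and LiDAR BEV feature maps are simply concatenated channel-wise and mixed with a 1x1 convolution, with no learned alignment or attention. The shapes and channel counts are illustrative assumptions, written in PyTorch.

```python
# Minimal PyTorch sketch of a simple concatenation-based fusion block.
# Channel counts and BEV resolution are illustrative assumptions.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, cam_ch=64, lidar_ch=64, out_ch=128):
        super().__init__()
        # 1x1 conv mixes the concatenated channels; no alignment or attention is learned
        self.mix = nn.Conv2d(cam_ch + lidar_ch, out_ch, kernel_size=1)

    def forward(self, cam_bev, lidar_bev):
        return self.mix(torch.cat([cam_bev, lidar_bev], dim=1))

cam_bev = torch.randn(2, 64, 200, 200)    # camera features projected to BEV
lidar_bev = torch.randn(2, 64, 200, 200)  # LiDAR features in BEV
fused = ConcatFusion()(cam_bev, lidar_bev)
print(fused.shape)                        # torch.Size([2, 128, 200, 200])
```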
Forward-looking single-frame images are the typical input for autonomous driving perception tasks, yet most frameworks use only this limited information and do not carefully design auxiliary tasks that could deepen the understanding of the driving scene.
Real-world deployment scenarios introduce domain bias, and differences in sensor mounting height and type lead to resolution gaps; these shortcomings hinder the large-scale training and real-time operation of deep learning models for autonomous driving.
[1] https://zhuanlan.zhihu.com/p/470588787
[2] Multi-modal Sensor Fusion for Auto Driving Perception: A Survey
Original link: https://mp.weixin.qq.com/s/usAQRL18vww9YwMXRvEwLw