arXiv paper "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", June 22, work of the University of Science and Technology of China, Harbin Institute of Technology and SenseTime.
Detecting 3D objects from multiple image views is a fundamental yet challenging task in visual scene understanding. Thanks to its low cost and high efficiency, multi-view 3D object detection has broad application prospects. However, because perspective views lack depth information, it is extremely difficult to localize objects accurately in 3D space. Recently, DETR3D introduced a novel 3D-2D query paradigm that aggregates multi-view images for 3D object detection and achieves state-of-the-art performance.
Through intensive pilot experiments, the paper quantifies objects located in different regions and finds that "truncated instances" (i.e., objects at the border regions of each image) are the main bottleneck limiting DETR3D's performance. Although DETR3D merges features from two adjacent views in the overlapping regions, it still suffers from insufficient feature aggregation and therefore misses the chance to fully boost detection performance.
To address this problem, Graph-DETR3D is proposed to automatically aggregate multi-view image information through graph structure learning (GSL). It builds a dynamic 3D graph between each object query and the 2D feature maps to enhance object representations, especially in the border regions. In addition, Graph-DETR3D benefits from a novel depth-invariant multi-scale training strategy, which maintains visual depth consistency by simultaneously scaling the image size and the object depth.
Graph-DETR3D differs in two respects, as shown in the figure: (1) a dynamic graph feature aggregation module; (2) a depth-invariant multi-scale training strategy. It follows the basic structure of DETR3D and consists of three components: an image encoder, a transformer decoder, and an object prediction head. Given a set of images I = {I1, I2, …, IN} (captured by N surround-view cameras), Graph-DETR3D aims to predict the location and category of each bounding box of interest. First, an image encoder (a ResNet plus an FPN) encodes these images into a set of multi-level feature maps F. Then, a dynamic 3D graph is constructed to extensively aggregate 2D information through the dynamic graph feature aggregation (DGFA) module, refining the representations of the object queries. Finally, the enhanced object queries are used to produce the final predictions.
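As a rough illustration of this three-stage pipeline (encoder, decoder, prediction head), here is a minimal PyTorch skeleton. All layer choices, sizes, and names are assumptions for exposition, not the authors' implementation:

```python
import torch
import torch.nn as nn

class GraphDETR3DSkeleton(nn.Module):
    """Minimal sketch: image encoder -> transformer decoder -> heads.
    The real encoder is a ResNet + FPN, and the DGFA module would
    replace the plain cross-attention used here."""

    def __init__(self, num_queries=900, embed_dim=256, num_classes=10):
        super().__init__()
        # Stand-in "encoder": one strided conv instead of ResNet + FPN.
        self.backbone = nn.Conv2d(3, embed_dim, kernel_size=4, stride=4)
        # Learnable object queries, each decoded to a 3D reference point.
        self.query_embed = nn.Embedding(num_queries, embed_dim)
        self.ref_head = nn.Linear(embed_dim, 3)            # (x, y, z)
        # One decoder layer refining the queries against image tokens.
        self.decoder = nn.TransformerDecoderLayer(embed_dim, nhead=8,
                                                  batch_first=True)
        # Prediction heads: class logits and a 7-DoF box.
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.box_head = nn.Linear(embed_dim, 7)            # x y z w l h yaw

    def forward(self, images):                  # (B, N_cam, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.backbone(images.flatten(0, 1))        # (B*N, C, h', w')
        tokens = feats.flatten(2).transpose(1, 2)          # (B*N, h'w', C)
        tokens = tokens.reshape(b, -1, tokens.shape[-1])   # merge cameras
        q = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        q = self.decoder(q, tokens)                        # refine queries
        return self.cls_head(q), self.box_head(q), self.ref_head(q)

# Example: a batch with 6 surround-view cameras, as on nuScenes.
cls_logits, boxes, refs = GraphDETR3DSkeleton()(torch.randn(1, 6, 3, 128, 128))
```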
The figure shows the dynamic graph feature aggregation (DGFA) process: first, a learnable 3D graph is constructed for each object query; then features are sampled from the 2D image planes at the projections of the graph nodes; finally, the object-query representation is enhanced through the graph connections. This interleaved message-propagation scheme supports iterative refinement of graph construction and feature enhancement.
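A hedged sketch of this idea in PyTorch follows: each query spawns a few 3D nodes around its reference point, samples features at their projections in every view, and fuses the result back into the query. The shapes, names, and mean-fusion are assumptions, and validity masking of out-of-view points is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphFeatureAggregation(nn.Module):
    """Illustrative DGFA-style module: build per-query 3D graph nodes,
    sample multi-view 2D features at their projections, fuse into the
    query embedding."""

    def __init__(self, embed_dim=256, num_nodes=8):
        super().__init__()
        self.num_nodes = num_nodes
        # Per-query 3D offsets that define the dynamic graph nodes.
        self.offset_net = nn.Linear(embed_dim, num_nodes * 3)
        self.fuse = nn.Linear(embed_dim, embed_dim)

    def forward(self, queries, ref_points, feats, proj_mats):
        # queries:    (B, Q, C)       object-query embeddings
        # ref_points: (B, Q, 3)       3D reference points per query
        # feats:      (B, N, C, H, W) per-camera feature maps
        # proj_mats:  (B, N, 3, 4)    projection matrices, assumed to map
        #                             3D points to FEATURE-MAP pixel coords
        b, q, c = queries.shape
        offsets = self.offset_net(queries).view(b, q, self.num_nodes, 3)
        nodes = ref_points.unsqueeze(2) + offsets          # (B, Q, M, 3)
        nodes_h = F.pad(nodes, (0, 1), value=1.0)          # homogeneous

        agg = queries.new_zeros(b, q, self.num_nodes, c)
        h, w = feats.shape[-2:]
        for cam in range(feats.shape[1]):
            pts = torch.einsum('bij,bqmj->bqmi', proj_mats[:, cam], nodes_h)
            z = pts[..., 2:3].clamp(min=1e-5)              # avoid div by 0
            uv = pts[..., :2] / z                          # pixel coords
            grid = torch.stack([uv[..., 0] / w * 2 - 1,    # to [-1, 1]
                                uv[..., 1] / h * 2 - 1], dim=-1)
            sampled = F.grid_sample(feats[:, cam], grid, align_corners=False)
            agg = agg + sampled.permute(0, 2, 3, 1)        # (B, Q, M, C)

        # Graph-style fusion: average node features, project, add residual.
        return queries + self.fuse(agg.mean(dim=2))
```

Points projecting outside a view sample zeros under grid_sample's default padding, so each query effectively gathers features only from the cameras that see it, which is how objects in overlapping border regions receive contributions from both adjacent views.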
Multi-scale training is a commonly used data-augmentation strategy in 2D and 3D object detection and has proven effective at no extra inference cost. However, it rarely appears in vision-based 3D detection methods. Since varying the input image size improves model robustness, the common multi-scale training strategy resizes the image and modifies the camera intrinsics accordingly.
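The common recipe can be sketched as follows; the intrinsic scaling is standard pinhole geometry, and the numeric values are illustrative:

```python
import numpy as np

def rescale_intrinsics(K, rx, ry):
    """Common multi-scale training step: when the image is resized by
    (rx, ry), scale the intrinsic matrix to match (illustrative sketch)."""
    S = np.diag([rx, ry, 1.0])
    return S @ K

# Illustrative intrinsics: focal length 1000 px, principal point (800, 450).
K = np.array([[1000.0,    0.0, 800.0],
              [   0.0, 1000.0, 450.0],
              [   0.0,    0.0,   1.0]])
K_half = rescale_intrinsics(K, 0.5, 0.5)   # train at half resolution
```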
An interesting phenomenon is that the final performance drops sharply under this strategy. Careful analysis of the input data shows that simply rescaling the image leads to a perspective-ambiguity problem: when an object is resized to a larger or smaller scale, its absolute properties (i.e., its physical size and its distance to the ego point) do not change.
As a concrete example, the figure illustrates this ambiguity: although the absolute 3D position of the selected region in (a) and (b) is the same, the number of image pixels it occupies differs. Depth-prediction networks tend to estimate depth from the area an object occupies in the image, so this training pattern may confuse the depth-prediction model and further degrade the final performance.
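To see the ambiguity numerically, take the pinhole relation s = fH/d (pixel footprint s of an object of physical size H at depth d, focal length f). The numbers below are illustrative, not from the paper:

```latex
s = \frac{fH}{d} = \frac{1000 \times 1.6}{20} = 80\ \text{px}, \qquad
s' = r\, s = 2 \times 80 = 160\ \text{px}
\ \Rightarrow\
d_{\mathrm{apparent}} = \frac{fH}{s'} = 10\ \text{m} \neq d = 20\ \text{m}
```

After a 2x upscale, the footprint matches what the unscaled camera would see at 10 m, yet the ground-truth label still says 20 m.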
To resolve this, the depth is recalculated from the pixel perspective. The pseudocode of the algorithm is as follows:
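The original pseudocode figure is not reproduced here; the following is a hedged sketch of the idea, consistent with the description above: resize the image by r (without touching the intrinsics), then re-label each ground-truth center so its depth matches the new pixel footprint. All names are assumptions:

```python
import numpy as np

def depth_invariant_rescale(K, centers, r):
    """Sketch of depth-invariant multi-scale training for the 3D targets.
    The image itself is resized by r elsewhere; here each ground-truth
    center is re-encoded so that pixel coordinates scale with the image
    while depth scales inversely (d -> d / r), keeping the pixel-footprint
    vs. depth relation consistent.

    K:       (3, 3) camera intrinsic matrix (kept unchanged)
    centers: (M, 3) ground-truth box centers in camera coordinates
    r:       scalar image resize factor
    """
    # 1. Encode: project 3D centers to pixel coordinates and depth.
    uvd = centers @ K.T                   # rows are (u*d, v*d, d)
    d = uvd[:, 2:3]
    uv = uvd[:, :2] / d
    # 2. Rescale: pixels follow the image, depth scales inversely.
    uv_new, d_new = uv * r, d / r
    # 3. Decode: back-project with the ORIGINAL intrinsics.
    uvd_new = np.concatenate([uv_new * d_new, d_new], axis=1)
    return uvd_new @ np.linalg.inv(K).T   # relabelled 3D centers
```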
The following is the decoding operation:
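The decoding operation itself (shown as an image in the original) is the standard pinhole back-projection from pixel coordinates (u, v) and depth d, with K the camera intrinsic matrix:

```latex
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= K^{-1}
\begin{bmatrix} u \cdot d \\ v \cdot d \\ d \end{bmatrix}
```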
The recalculated pixel size is:
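The original formula image is likewise missing; reconstructed from the pinhole model, the pixel extent s of an object of physical size H at depth d becomes, after resizing by rx:

```latex
s = \frac{f H}{d}, \qquad s' = r_x\, s = \frac{f H}{\,d / r_x\,}
```

That is, the resized footprint is exactly what the same object would produce at depth d / rx.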
Assuming a single scale factor r = rx = ry, this simplifies to:
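Reconstructed under that assumption, the depth that stays visually consistent with the resized image is

```latex
d' = \frac{d}{r}
```

which matches the earlier statement that the image size and the target depth are scaled simultaneously.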
The experimental results are as follows:
Note: DI = Depth-Invariant.