Sparse4D v3 is here! Advancing end-to-end 3D detection and tracking
Paper link: https://arxiv.org/pdf/2311.11722.pdf
Code link: https://github.com/linxuewu/Sparse4D
Author affiliation: Horizon Robotics
Paper idea:
3D detection and tracking are two fundamental tasks in an autonomous driving perception system. The paper explores this area in more depth, building on the Sparse4D framework. It introduces two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and proposes decoupled attention as a structural improvement, which together significantly improve detection performance. It also extends the detector into a tracker using a simple approach that assigns instance IDs during inference, further highlighting the advantages of query-based algorithms. Extensive experiments on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, mAP, NDS and AMOTA increase by 3.0%, 2.2% and 7.6%, reaching 46.9%, 56.1% and 49.0% respectively. The best model achieves 71.9% NDS and 67.7% AMOTA on the nuScenes test set.
Main contributions:
Sparse4D v3 is a powerful 3D perception framework that proposes three effective strategies: temporal instance denoising, quality estimation, and decoupled attention.
The paper extends Sparse4D into an end-to-end tracking model.
The paper demonstrates the effectiveness of these improvements on nuScenes, achieving state-of-the-art performance in both detection and tracking.
Network Design:
First, sparse algorithms are observed to converge less easily than dense algorithms, which hurts final performance. This problem has been studied extensively in 2D detection [17, 48, 53], and it arises mainly because sparse algorithms use one-to-one positive sample matching. This matching is unstable in the early stages of training, and it yields fewer positive samples than one-to-many matching, which reduces the efficiency of decoder training. Furthermore, Sparse4D uses sparse feature sampling instead of global cross-attention, so the scarcity of positive samples further hinders the convergence of the encoder. Sparse4Dv2 introduced dense depth supervision to partially alleviate these convergence issues for the image encoder.
The main goal of this paper is to improve model performance by stabilizing decoder training. It uses a denoising task as auxiliary supervision and extends denoising from 2D single-frame detection to 3D temporal detection. This not only ensures stable positive sample matching but also significantly increases the number of positive samples. The paper also introduces quality estimation as an auxiliary supervision task, which makes the output confidence scores more reasonable, improves the ranking of detection results, and thus yields higher evaluation metrics.
In addition, the paper improves the structure of the instance self-attention and temporal cross-attention modules in Sparse4D by introducing a decoupled attention mechanism that reduces feature interference in the attention weight calculation. When anchor embeddings and instance features are added together as the input to the attention calculation, outliers appear in the resulting attention weights. Such weights do not accurately reflect the correlation between target features, so the correct features are not aggregated. Replacing the addition with concatenation significantly reduces this error. This improvement is similar to Conditional DETR, but the key difference is that this paper targets attention between queries, whereas Conditional DETR focuses on cross-attention between queries and image features. The paper's encoding method is also different.
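One way to see where the interference comes from is to compare the raw attention logits under the two ways of combining instance features and anchor embeddings. The sketch below is an illustration only, not the paper's implementation: it ignores the learned query/key projections, and the function names (`logits_add`, `logits_concat`, `cross_terms`) and toy sizes are hypothetical. Under that simplification, the additive logits equal the concatenated logits plus feature-embedding cross terms, and those cross terms are exactly what concatenation removes.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logits_add(F, E):
    """Logits when query and key are the sum F + E (assumption: no learned projections)."""
    X = [[f + e for f, e in zip(fi, ei)] for fi, ei in zip(F, E)]
    return [[dot(xi, xj) for xj in X] for xi in X]

def logits_concat(F, E):
    """Logits when query and key are the concatenation [F, E]."""
    X = [fi + ei for fi, ei in zip(F, E)]  # list concatenation, not addition
    return [[dot(xi, xj) for xj in X] for xi in X]

def cross_terms(F, E):
    """Terms f_i.e_j + e_i.f_j that mix features with embeddings."""
    n = len(F)
    return [[dot(F[i], E[j]) + dot(E[i], F[j]) for j in range(n)] for i in range(n)]

# Toy data: n instances, each with a d-dimensional feature and anchor embedding.
random.seed(0)
n, d = 4, 8
F = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
E = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

add, cat, cross = logits_add(F, E), logits_concat(F, E), cross_terms(F, E)

# additive logits = concatenated logits + cross terms (up to float rounding)
max_gap = max(abs(add[i][j] - cat[i][j] - cross[i][j])
              for i in range(n) for j in range(n))
print(f"max |add - concat - cross| = {max_gap:.2e}")
```

With learned projections the decomposition is no longer this clean, so the sketch should be read as intuition for why mixing the two inputs by addition can produce unexpected attention weights, not as a description of the trained model.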
To improve the end-to-end capability of the perception system, the paper studies how to integrate 3D multi-object tracking into the Sparse4D framework so that it directly outputs object motion trajectories. Unlike tracking-by-detection methods, it folds all tracking functionality into the detector, eliminating the need for data association and filtering. Furthermore, unlike existing joint detection and tracking methods, the tracker requires no modification or adjustment of the loss function during training. It does not need ground-truth IDs, yet it achieves predefined instance-to-track regression. The tracking implementation fully integrates the detector and the tracker, without modifying the detector's training process and without additional fine-tuning.
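The idea of assigning instance IDs at inference time can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the `Instance` and `IdAssigner` names, the confidence threshold, and its value are all hypothetical. The only point it shows is that when instances persist from frame to frame, an ID attached once stays attached, so no separate association or filtering step is needed.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional

@dataclass
class Instance:
    score: float                    # detection confidence for the current frame
    track_id: Optional[int] = None  # None until the instance is first reported

class IdAssigner:
    """Hands out track IDs during inference (hypothetical helper)."""

    def __init__(self, score_threshold: float = 0.5):  # threshold value is an assumption
        self.score_threshold = score_threshold
        self._next_id = count()

    def update(self, instances: List[Instance]) -> List[Instance]:
        tracked = []
        for inst in instances:
            if inst.score < self.score_threshold:
                continue  # low-confidence instances are not reported this frame
            if inst.track_id is None:
                inst.track_id = next(self._next_id)  # first confident sighting
            tracked.append(inst)  # instances carried over keep their existing ID
        return tracked

# Two frames; the same Instance objects stand in for instances
# propagated from one frame to the next.
assigner = IdAssigner(score_threshold=0.5)
car, pedestrian = Instance(score=0.9), Instance(score=0.1)
frame1 = assigner.update([car, pedestrian])
car.score, pedestrian.score = 0.8, 0.6
frame2 = assigner.update([car, pedestrian])
print([i.track_id for i in frame1], [i.track_id for i in frame2])
```

Because the ID lives on the instance itself, nothing in training has to change; the bookkeeping happens only when results are read out.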
Figure 1: Overview of the Sparse4D framework. The input is multi-view video, and the output is the perception result for all frames.
Figure 2: Inference efficiency (FPS) versus perception performance (mAP) of different algorithms on the nuScenes validation set.
Figure 3: Visualization of attention weights in instance self-attention. 1) The first row shows the attention weights in ordinary self-attention, where the pedestrian in the red circle shows an unexpected correlation with the target vehicle (green box). 2) The second row shows the attention weights in decoupled attention, which effectively solves this problem.
Figure 4: Illustration of temporal instance denoising. During the training phase, instances consist of two parts: learnable and noisy. Noisy instances are composed of temporal and non-temporal elements. For noisy instances, the paper uses a pre-matching approach to assign positive and negative samples, that is, anchors are matched with the ground truth, while learnable instances are matched using predictions and the ground truth. During the testing phase, only the green blocks remain. An attention mask is used to prevent features from propagating between groups. Gray indicates that there is no attention between queries and keys, and green indicates the opposite.
Figure 5: Architecture of the anchor encoder and attention. The paper independently encodes the multiple components of an anchor into high-dimensional features and then concatenates them. This approach reduces computational and parameter overhead compared to the original Sparse4D. E and F represent the anchor embedding and the instance feature respectively.
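The denoising setup described for Figure 4 can be sketched as two small steps: build groups of noisy anchors from the ground truth, then build an attention mask that keeps the groups separate. The code below is a simplified sketch, not the paper's procedure. The function names, the uniform noise, and the toy sizes are hypothetical; each noisy anchor is simply paired with the ground-truth box it was generated from as a stand-in for pre-matching; and the block-diagonal mask (each group attends only to itself) is one assumed reading of "no feature propagation between groups".

```python
import random

def make_noisy_groups(gt_anchors, num_groups, noise_scale, rng):
    """Return num_groups noisy copies of the ground-truth anchors.

    Each group is (noisy_anchors, matched_gt_indices). Pairing a noisy anchor
    with the box it came from is a simplification of pre-matching.
    """
    groups = []
    for _ in range(num_groups):
        noisy = [[v + rng.uniform(-noise_scale, noise_scale) for v in anchor]
                 for anchor in gt_anchors]
        matched_gt = list(range(len(gt_anchors)))
        groups.append((noisy, matched_gt))
    return groups

def build_group_mask(num_learnable, group_sizes):
    """mask[q][k] is True where query q may attend to key k.

    Learnable instances come first, followed by each noisy group.
    Assumption: attention is allowed only inside a block.
    """
    sizes = [num_learnable] + list(group_sizes)
    total = sum(sizes)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for size in sizes:
        for q in range(start, start + size):
            for k in range(start, start + size):
                mask[q][k] = True
        start += size
    return mask

rng = random.Random(0)
gt = [[10.0, 2.0, 0.5], [-4.0, 7.5, 0.2]]  # toy (x, y, z) box centers
groups = make_noisy_groups(gt, num_groups=2, noise_scale=0.5, rng=rng)
mask = build_group_mask(num_learnable=3, group_sizes=[len(g[0]) for g in groups])

for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

At test time the noisy groups are dropped, which corresponds to keeping only the learnable block of the mask.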
Experimental results:
Summary:
The paper first proposes methods to enhance the detection performance of Sparse4D in three main areas: temporal instance denoising, quality estimation, and decoupled attention. It then explains how Sparse4D is extended into an end-to-end tracking model. Experiments on nuScenes show that these enhancements significantly improve performance, placing Sparse4D v3 at the forefront of the field.
Citation:
Lin, X., Pei, Z., Lin, T., Huang, L., & Su, Z. (2023). Sparse4D v3: Advancing End-to-End 3D Detection and Tracking. arXiv:2311.11722. https://arxiv.org/abs/2311.11722