Title: Sparse4D v3: Advancing End-to-End 3D Detection and Tracking
Paper link: https://arxiv.org/pdf/2311.11722.pdf
Code link: https://github.com/linxuewu/Sparse4D
Author affiliation: Horizon Robotics
Thesis idea:
3D detection and tracking are two fundamental tasks in an autonomous driving perception system. Building on the Sparse4D framework, this paper introduces two auxiliary training tasks (temporal instance denoising and quality estimation) and proposes decoupled attention as a structural improvement, significantly boosting detection performance. Furthermore, the paper extends the detector into a tracker with a simple method that assigns instance IDs during inference, further highlighting the advantages of query-based algorithms. Extensive experiments on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, mAP, NDS, and AMOTA increase by 3.0%, 2.2%, and 7.6%, reaching 46.9%, 56.1%, and 49.0% respectively. The best model achieves 71.9% NDS and 67.7% AMOTA on the nuScenes test set.
Main contribution:
Sparse4D-v3 is a powerful 3D perception framework that proposes three effective strategies: temporal instance denoising, quality estimation, and decoupled attention.
This article extends Sparse4D into an end-to-end tracking model.
This paper demonstrates the effectiveness of the proposed improvements on nuScenes, achieving state-of-the-art performance in both detection and tracking tasks.
Network Design:
First, it is observed that sparse algorithms face greater convergence challenges than dense algorithms, which limits final performance. This problem has been well studied in 2D detection [17, 48, 53], and stems mainly from the one-to-one positive-sample matching that sparse algorithms use: this matching is unstable in the early stages of training, and compared with one-to-many matching it yields fewer positive samples, reducing the efficiency of decoder training. Moreover, Sparse4D uses sparse feature sampling instead of global cross-attention, so the scarcity of positive samples further hinders encoder convergence. Sparse4Dv2 introduced dense depth supervision to partially alleviate these convergence issues on the image-encoder side.

The main goal of this paper is to enhance model performance by stabilizing decoder training. A denoising task is used as auxiliary supervision, extending the denoising technique from 2D single-frame detection to 3D temporal detection. This not only ensures stable positive-sample matching but also significantly increases the number of positive samples. In addition, a quality estimation task is introduced as auxiliary supervision. This makes the output confidence scores more reasonable and improves the ranking accuracy of detection results, which in turn yields higher evaluation metrics.

The paper also improves the structure of the instance self-attention and temporal cross-attention modules in Sparse4D, introducing a decoupled attention mechanism that reduces feature interference during attention-weight computation. By keeping anchor embeddings and instance features separate as inputs to the attention computation, outliers in the attention weights are reduced. This more accurately reflects the correlation between target features and thereby achieves correct feature aggregation.
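The quality estimation idea can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: a centerness-style score that decays with the center error, and a yawness-style score from orientation agreement. The specific formulas below (exponential of the center distance, cosine similarity of [sin, cos] yaw encodings) are a common formulation and should be treated as assumptions here.

```python
import numpy as np

def centerness(pred_xyz, gt_xyz):
    # Decays from 1 toward 0 as the predicted box center drifts from ground truth.
    return float(np.exp(-np.linalg.norm(np.asarray(pred_xyz) - np.asarray(gt_xyz))))

def yawness(pred_yaw, gt_yaw):
    # Cosine similarity between [sin(yaw), cos(yaw)] encodings:
    # 1 for identical headings, -1 for opposite headings.
    v_pred = np.array([np.sin(pred_yaw), np.cos(pred_yaw)])
    v_gt = np.array([np.sin(gt_yaw), np.cos(gt_yaw)])
    return float(v_pred @ v_gt)
```

Supervising such scores alongside classification gives the network a confidence output that correlates with localization quality, which is what improves the ranking of detection results.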
This paper uses concatenation instead of addition in the attention computation to significantly reduce this interference. This design has similarities with Conditional DETR, but the key difference is that this paper applies it to attention between queries, whereas Conditional DETR focuses on cross-attention between queries and image features. In addition, this paper also uses a distinct anchor encoding scheme.
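The decoupled attention idea can be sketched as a simplified single-head version (the projection names and shapes here are my own; the real module is multi-head and learned end to end). Vanilla DETR-style attention adds the positional (anchor) embedding to the instance feature before projecting; concatenating instead keeps the two signals in separate channels, so they cannot produce cross-interference terms in the query-key dot product.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_self_attention(inst_feat, anchor_embed, w_q, w_k, w_v):
    """Single-head sketch: queries and keys are built from the concatenation
    of instance features and anchor embeddings rather than their sum."""
    qk_in = np.concatenate([inst_feat, anchor_embed], axis=-1)  # (N, 2D)
    q = qk_in @ w_q                                             # (N, D)
    k = qk_in @ w_k                                             # (N, D)
    v = inst_feat @ w_v                                         # (N, D)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))              # (N, N)
    return attn @ v
```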
To improve the end-to-end capability of the perception system, this paper studies how to integrate the 3D multi-object tracking task into the Sparse4D framework so that it directly outputs target motion trajectories. Unlike tracking-by-detection methods, this approach integrates all tracking functionality into the detector, eliminating the need for data association and filtering. Furthermore, unlike existing joint detection-and-tracking methods, the tracker requires no modification or adjustment of the loss function during training and needs no ground-truth IDs; instead it performs predefined instance-to-tracklet regression. The tracking implementation fully integrates the detector and the tracker, without modifying the detector's training process and without additional fine-tuning.
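The inference-time ID assignment can be sketched like this (a minimal sketch with hypothetical field names and threshold; the point from the text is only that temporal instances carry their identity across frames, so no association or filtering step is needed):

```python
import itertools

_next_id = itertools.count()

def assign_track_ids(instances, score_thresh=0.3):
    """`instances` is a list of dicts with 'score' and 'track_id' keys.
    Temporal instances propagated from the previous frame already carry a
    track_id and keep it; confident new instances receive a fresh ID.
    There is no data-association step and no filtering step."""
    for inst in instances:
        if inst["track_id"] is None and inst["score"] >= score_thresh:
            inst["track_id"] = next(_next_id)
    return instances
```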
Figure 1: Overview of the Sparse4D framework. The input is multi-view video, and the output is the perception results for all frames.
Figure 2: Inference efficiency (FPS) vs. perception performance (mAP) of different algorithms on the nuScenes validation set.
Figure 3: Visualization of attention weights in instance self-attention. 1) The first row shows the attention weights from vanilla self-attention, where the pedestrian in the red circle exhibits an unexpected correlation with the target vehicle (green box). 2) The second row shows the attention weights from decoupled attention, which effectively resolves this problem.
Figure 4: An example of temporal instance denoising. During training, instances consist of two parts: learnable and noisy. Noisy instances comprise temporal and non-temporal elements. The paper adopts a pre-matching approach to assign positive and negative samples: noisy anchors are matched against the ground truth, while learnable instances are matched between predictions and the ground truth. At test time, only the green blocks remain. To prevent features from spreading between groups, an attention mask is used: gray indicates no attention between queries and keys, and green indicates the opposite.
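The group attention mask described above can be sketched as a block-diagonal boolean matrix. This is a simplified sketch: here every group, including the learnable queries, attends only within itself, whereas the actual mask in the paper may permit some asymmetric visibility between parts.

```python
import numpy as np

def build_group_mask(n_learnable, group_sizes):
    """True = attention blocked (the gray cells). The learnable queries and
    each noise group form a block that only attends within itself, so
    ground-truth information cannot leak between groups."""
    sizes = [n_learnable] + list(group_sizes)
    total = sum(sizes)
    mask = np.ones((total, total), dtype=bool)
    start = 0
    for s in sizes:
        mask[start:start + s, start:start + s] = False  # allow within-block attention
        start += s
    return mask
```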
Figure 5: Architecture of the anchor encoder and attention. The high-dimensional features of the anchor's multiple components are encoded independently and then concatenated, which reduces computational and parameter overhead compared to the original Sparse4D. E and F denote anchor embedding and instance features respectively.
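The split-and-concatenate anchor encoding can be sketched as follows. This is a sketch under assumptions: the component split (position / size / yaw / velocity), the 11-dim anchor layout, and the one-layer encoders are illustrative stand-ins for the paper's MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, out_dim):
    # Hypothetical one-layer ReLU encoder standing in for each component's MLP.
    w = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: np.maximum(np.asarray(x) @ w, 0.0)

# One small encoder per anchor component instead of one large encoder over all
# anchor dims; the outputs are concatenated into the anchor embedding E.
enc_pos, enc_size, enc_yaw, enc_vel = mlp(3, 32), mlp(3, 32), mlp(2, 32), mlp(3, 32)

def encode_anchor(anchor):
    # assumed layout: [x, y, z, w, l, h, sin_yaw, cos_yaw, vx, vy, vz]
    a = np.asarray(anchor, dtype=float)
    parts = [enc_pos(a[0:3]), enc_size(a[3:6]), enc_yaw(a[6:8]), enc_vel(a[8:11])]
    return np.concatenate(parts)  # anchor embedding E
```

Because each small encoder sees only its own component, the weight matrices are smaller than one dense layer over the full anchor vector, which is where the parameter savings come from.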
Experimental results:
Summary:
This article first presents three enhancements to Sparse4D's detection performance: temporal instance denoising, quality estimation, and decoupled attention. It then explains how Sparse4D is extended into an end-to-end tracking model. Experiments on nuScenes show that these enhancements significantly improve performance, placing Sparse4Dv3 at the forefront of the field.
Citation:
Lin, X., Pei, Z., Lin, T., Huang, L., & Su, Z. (2023). Sparse4D v3: Advancing End-to-End 3D Detection and Tracking. arXiv preprint arXiv:2311.11722.
