


UniPAD: A universal pre-training paradigm for autonomous driving that supports a variety of perception tasks
New papers have been appearing so quickly lately that it is hard to keep up. Fusing multi-modal large models for language and vision has clearly become an industry consensus. This article on UniPAD is representative: it takes multi-modal input, pre-trains a world-model-like base model, and extends easily to multiple traditional vision applications. It also addresses how to apply the pre-training recipe of large language models to 3D scenes, opening the door to a unified perception foundation model.
UniPAD is a self-supervised learning method based on masked autoencoding (MAE) and 3D differentiable rendering. It trains a strong base model that can then be fine-tuned on downstream tasks such as depth estimation, object detection, and segmentation. The study designs a unified 3D representation that integrates easily into both 2D and 3D frameworks, giving it the flexibility expected of a foundation model.
Questions to keep in mind while reading:
What is the relationship between masked autoencoding and 3D differentiable rendering? In short: masked autoencoding exploits the autoencoder's capacity for self-supervised training, while the rendering step computes a loss between the generated image and the original image, providing the supervision signal. The logic is quite clear.
The article follows the pattern of pre-training a base model and then fine-tuning downstream detection and segmentation heads, which also helps illustrate how current large models connect to downstream tasks.
The method apparently does not incorporate temporal information. After all, its pure-vision NDS of 50.2 on nuScenes is still weaker than temporal detection methods such as StreamPETR and Sparse4D. A 4D (spatio-temporal) MAE approach is therefore worth trying; in fact, GAIA-1 has already proposed a similar idea.
What are the computational cost and memory usage?
Method:
UniPAD implicitly encodes 3D spatial information, drawing mainly on masked autoencoding (MAE, Voxel-MAE, etc.). A generative mask is used to augment the voxel features, which are then used to reconstruct the continuous 3D geometry of the scene and its detailed appearance on the 2D image plane.
The experimental results demonstrate UniPAD's superiority. Compared with traditional LiDAR-only, camera-only, and LiDAR-camera fusion baselines, UniPAD improves NDS by 9.1, 7.7, and 6.9 points respectively. Notably, on the nuScenes validation set the pre-training pipeline reaches 73.2 NDS, and it achieves 79.4 mIoU on 3D semantic segmentation, the best results among prior methods.
Overall architecture:
The framework takes LiDAR point clouds and multi-view images as input. These multi-modal data are masked (the masked regions filled with zeros) by the Mask Generator, the masked embeddings are converted into voxel space, and rendering techniques generate RGB or depth predictions from this 3D volume. The original, unmasked images then serve as supervision targets for learning.
Mask Generator
The mask in the masked autoencoder is produced by the Mask Generator; intuitively, it improves the model's representation and generalization ability by making training harder. The Mask Generator treats point clouds and images differently when selecting regions to occlude: for point cloud data a block-masking strategy is used, while for image data sparse convolution is used so that computation happens only in visible regions. Once the input is masked, the encoded features in the masked regions are set to zero and ignored during processing. The masked regions also supply the prediction targets and the corresponding ground truth for the subsequent self-supervised learning.
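As a rough illustration of what the Mask Generator does for image features, here is a minimal NumPy sketch. It uses a simple random patch mask rather than the paper's exact block-masking strategy, and the function names are invented for this example:

```python
import numpy as np

def random_patch_mask(grid_h, grid_w, mask_ratio=0.3, seed=None):
    """Return a boolean (grid_h, grid_w) array; True marks visible patches."""
    rng = np.random.default_rng(seed)
    n = grid_h * grid_w
    n_masked = int(n * mask_ratio)
    visible = np.ones(n, dtype=bool)
    visible[rng.choice(n, size=n_masked, replace=False)] = False
    return visible.reshape(grid_h, grid_w)

def apply_mask(features, visible, patch=16):
    """Zero out masked patches of a (C, H, W) feature map; the encoder
    would then skip computation in those regions."""
    out = features.copy()
    for i in range(visible.shape[0]):
        for j in range(visible.shape[1]):
            if not visible[i, j]:
                out[:, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0.0
    return out
```

The masked patches then serve as prediction targets: the model must reconstruct exactly the regions that were zeroed out.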
Unified representation
To make the pre-training method applicable to different data modalities, finding a unified representation is essential. Previous approaches such as BEV and occupancy (OCC) also seek a unified form: projecting 3D points onto the image plane loses depth information, while collapsing them into a bird's-eye view loses height-related details. This article therefore converts both modalities into a 3D volumetric space, i.e. a 3D voxel space similar to OCC.
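To make the idea of a shared voxel space concrete, here is a hedged sketch that mean-pools point features into a dense 3D grid. In UniPAD the lifting of image and point features into the volume is learned, so this fixed pooling is only illustrative, and the function name is invented:

```python
import numpy as np

def points_to_voxel_grid(points, feats, grid=(8, 8, 8), lo=-1.0, hi=1.0):
    """Mean-pool per-point features (N, C) at coordinates (N, 3)
    into a dense voxel volume of shape (X, Y, Z, C)."""
    c = feats.shape[1]
    vol = np.zeros(grid + (c,))
    cnt = np.zeros(grid + (1,))
    # Map continuous coordinates in [lo, hi) to integer voxel indices.
    idx = ((points - lo) / (hi - lo) * np.array(grid)).astype(int)
    idx = np.clip(idx, 0, np.array(grid) - 1)
    for (i, j, k), f in zip(idx, feats):
        vol[i, j, k] += f
        cnt[i, j, k] += 1
    return vol / np.maximum(cnt, 1)
```

Both modalities ending up in the same (X, Y, Z, C) volume is what lets one rendering decoder supervise them jointly.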
Rendering method:
Differentiable rendering is, according to the author, the biggest highlight of the paper. Similar to NeRF, rays are cast through the multi-view images or point clouds, a neural network predicts the color or depth of each sampled 3D point, and the ray then maps these predictions back to 2D. This exploits the geometric and texture cues in the images and broadens the model's learning capability and range of applications.
The scene is represented as an implicit signed distance function (SDF) field. Given a sampled point's 3D coordinate P (with depth D along the ray) and a feature embedding F (extracted from the volumetric representation by trilinear interpolation), an MLP predicts the SDF value at that point; F can be understood as the encoding of the location where P lies. The network also outputs N (the surface normal, used to condition the color field) and H (a geometry feature vector). A second MLP taking P, D, F, N, and H as input then produces the RGB value and depth of the 3D sample point, and the samples are composited along each ray into 2D space to obtain the rendering. The ray-marching procedure is essentially the same as in NeRF.
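The ray compositing can be sketched in a few lines of NumPy. This follows the NeuS-style conversion of SDF values into ray weights; the real model predicts SDF and color with the MLPs described above, so the fixed sphere SDF and constant color below are stand-ins for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def render_ray(origin, direction, sdf_fn, color_fn,
               near=0.0, far=2.0, n_samples=64, s=32.0):
    """Composite color and depth along one ray through an SDF field.
    s controls how sharply the logistic density concentrates at sdf = 0."""
    t = np.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction
    cdf = sigmoid(s * sdf_fn(pts))
    # Discrete opacity between adjacent samples (NeuS-style).
    alpha = np.clip((cdf[:-1] - cdf[1:]) / np.clip(cdf[:-1], 1e-6, None), 0.0, 1.0)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha)))[:-1]
    w = trans * alpha                      # per-sample ray weights
    rgb = (w[:, None] * color_fn(pts[:-1])).sum(axis=0)
    depth = (w * t[:-1]).sum()
    return rgb, depth

# Toy scene: a sphere of radius 0.5 at the origin, uniform red color.
sphere_sdf = lambda p: np.linalg.norm(p, axis=-1) - 0.5
red = lambda p: np.broadcast_to(np.array([1.0, 0.0, 0.0]), (len(p), 3))
rgb, depth = render_ray(np.array([0.0, 0.0, -2.0]),
                        np.array([0.0, 0.0, 1.0]), sphere_sdf, red)
# depth comes out near 1.5, the distance to the sphere surface
```

Because every step is differentiable, the losses on the rendered RGB and depth propagate back into the volumetric features during pre-training.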
The rendering method also requires optimizing memory consumption, which is not detailed here, though it is one of the more critical implementation issues.
In essence, the masking and rendering steps exist to train a pre-trained model: the model can be trained purely from the masked prediction objective, even without the later branches. After pre-training, separate branches generate the RGB and depth predictions, and tasks such as object detection and semantic segmentation are fine-tuned on top, giving plug-and-play capability.
Loss function:
The loss function is not complicated: it is the reconstruction loss between the rendered RGB/depth predictions and the original unmasked data.
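A minimal sketch of such a loss, assuming an L1 reconstruction term for each rendered quantity (the exact terms and weighting in the paper may differ):

```python
import numpy as np

def render_loss(pred_rgb, gt_rgb, pred_depth, gt_depth,
                w_rgb=1.0, w_depth=1.0):
    """Weighted sum of L1 color and L1 depth reconstruction errors,
    computed over the rendered rays."""
    l_rgb = np.abs(pred_rgb - gt_rgb).mean()
    l_depth = np.abs(pred_depth - gt_depth).mean()
    return w_rgb * l_rgb + w_depth * l_depth
```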
Experimental results:
Comparison with other recent work:
GAIA-1 already applies the masked-autoencoder idea along the time axis, but there the supervision targets are whole frames at different timestamps, whereas UniPAD randomly masks out portions of 3D space and supervises their prediction. A method combining the two would be well worth seeing. UniPAD can also be viewed as an attempt at a multi-modal large model, and even as a world model, although the article does not emphasize either framing.
Summary:
This article presents a fairly new masked autoencoder method for the 3D domain. Because MAE is used in the base model's pre-training stage and supports multiple input modalities, it extends naturally to many downstream fine-tuning tasks. This is very close to the design philosophy of LLMs: capture multi-modal information during pre-training and provide a unified foundation for diverse tasks. The method offers new ideas for 3D research, has potential for extension to the 4D temporal domain, and leaves room for plenty of follow-up work on optimizing its memory and computational cost.
