SOTA for self-supervised pre-training on lidar point clouds
Masked autoencoding has become a successful pre-training paradigm for Transformer models on text, images, and, most recently, point clouds. Raw automotive data are well suited for self-supervised pre-training because they are generally cheap to collect compared with annotations for tasks such as 3D object detection (OD). However, the development of masked autoencoders for point clouds has so far focused on synthetic and indoor data. Consequently, existing methods have tailored their representations and models to point clouds that are small, dense, and of uniform point density. In this work, we investigate masked autoencoding of point clouds in an automotive setting, where point clouds are sparse and their density can vary significantly between objects in the same scene. To this end, the paper proposes Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations. The backbone of a Transformer-based 3D object detector is pre-trained to reconstruct masked voxels and to distinguish empty from non-empty voxels. The method improves 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset. Furthermore, with Voxel-MAE pre-training, only 40% of the annotated data is needed to outperform a randomly initialized equivalent.
This paper proposes Voxel-MAE, a method for deploying MAE-style self-supervised pre-training on voxelized point clouds, and evaluates it on the large automotive point cloud dataset nuScenes. It is the first self-supervised pre-training scheme for an automotive point cloud Transformer backbone.
We tailor our method to the voxel representation and use a unique set of reconstruction tasks to capture the characteristics of voxelized point clouds.
The paper demonstrates that the method is data efficient and reduces the need for annotated data: with pre-training, the model outperforms its fully supervised counterpart while using only 40% of the annotated data.
Additionally, the paper finds that Voxel-MAE improves the performance of the Transformer-based detector by 1.75 percentage points in mAP and 1.05 points in NDS, roughly doubling the performance gain of existing self-supervised methods.
The purpose of this work is to extend MAE-style pre-training to voxelized point clouds. The core idea is still to use an encoder to create a rich latent representation from partial observations of the input, and then use a decoder to reconstruct the original input, as shown in Figure 2. After pre-training, the encoder is used as the backbone of the 3D object detector. However, due to fundamental differences between images and point clouds, some modifications are required for efficient training of Voxel-MAE.
Figure 2: The Voxel-MAE method. First, the point cloud is voxelized with a fixed voxel size; voxel sizes in the figure are exaggerated for visualization purposes. During pre-training, a large fraction (70%) of the non-empty voxels is randomly masked. The encoder is then applied only to the visible voxels, embedding them with a dynamic voxel feature embedding [46]. Masked non-empty voxels and randomly selected empty voxels are embedded with the same learnable mask token. The decoder processes the sequence of mask tokens together with the encoded visible voxels to reconstruct the masked point cloud and to distinguish empty from non-empty voxels. After pre-training, the decoder is discarded and the encoder is applied to unmasked point clouds.
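The masking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: names such as `split_voxels`, `mask_ratio`, and `empty_sample_size` are assumptions, and voxels are represented simply as ids.

```python
import random

def split_voxels(non_empty_voxels, empty_voxels, mask_ratio=0.7, empty_sample_size=100):
    """Randomly mask non-empty voxels and sample a subset of empty ones.

    non_empty_voxels: list of voxel ids that contain lidar points
    empty_voxels:     list of voxel ids with no points
    Returns (visible, masked, sampled_empty) voxel-id lists.
    """
    shuffled = non_empty_voxels[:]
    random.shuffle(shuffled)
    n_masked = int(len(shuffled) * mask_ratio)
    masked = shuffled[:n_masked]    # reconstructed by the decoder via mask tokens
    visible = shuffled[n_masked:]   # the only voxels the encoder sees
    # empty voxels are sampled so the decoder also learns to tell empty from non-empty
    sampled_empty = random.sample(empty_voxels, min(empty_sample_size, len(empty_voxels)))
    return visible, masked, sampled_empty

visible, masked, empties = split_voxels(list(range(1000)), list(range(1000, 5000)))
assert len(masked) == 700 and len(visible) == 300
```

Only the `visible` set is fed through the encoder, which is what makes MAE-style pre-training cheap relative to processing the full voxel grid.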
Figure 1: MAE (left) divides the image into fixed-size, non-overlapping patches. Existing masked point modeling methods (middle) create a fixed number of point cloud patches using farthest point sampling and k-nearest neighbors. Our method (right) uses non-overlapping voxels with a dynamic number of points per voxel.
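The contrast in Figure 1 can be made concrete with a toy voxelization: unlike fixed-size image patches or FPS/k-NN point patches, each voxel holds however many points happen to fall inside it. The voxel size and the grouping-by-grid-index scheme below are illustrative assumptions, not the paper's code.

```python
from collections import defaultdict

def voxelize(points, voxel_size=0.5):
    """Group (x, y, z) points into non-overlapping voxels of edge `voxel_size`.

    Returns a dict mapping integer grid indices to the (variable-length)
    list of points falling inside that voxel.
    """
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append((x, y, z))
    return voxels

pts = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.4), (1.1, 0.1, 0.1)]
vox = voxelize(pts)
# the first two points share one voxel, the third lands in another:
# 2 non-empty voxels, holding 2 and 1 points respectively
```

Because point counts per voxel vary (and most voxels are empty), the encoder needs an embedding that handles variable-length inputs, which is what the dynamic voxel feature embedding in Figure 2 provides.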
## Reference
Hess G, Jaxing J, Svensson E, et al. Masked autoencoder for self-supervised pre-training on lidar point clouds[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023: 350-359.