NeuRAD: Application of leading multi-dataset neural rendering technology in autonomous driving
The paper "NeuRAD: Neural Rendering for Autonomous Driving" comes from Zenseact, Chalmers University of Technology, Linkoping University and Lund University.
Neural Radiance Fields (NeRFs) are becoming increasingly popular in the autonomous driving (AD) community. Recent methods have shown the potential of NeRFs for closed-loop simulation, AD system testing, and training-data augmentation. However, existing methods often require long training times, dense semantic supervision, and lack generalizability, which hinders the large-scale application of NeRFs in AD. This paper proposes NeuRAD, a robust novel view synthesis method for dynamic AD data. The method features a simple network design, sensor modeling for both cameras and lidar (including rolling shutter, beam divergence, and ray drop), and works on multiple datasets out of the box.
As shown in the figure, NeuRAD is a neural rendering method tailored to dynamic automotive scenes. The poses of the ego vehicle and other road users can be changed, and actors can be freely added and/or removed. These features make NeuRAD suitable as the foundation for components such as sensor-realistic closed-loop simulators or powerful data augmentation engines.
The goal of this paper is to learn a representation from which realistic sensor data can be generated, under changes to the vehicle platform, the actors' poses, or both. It is assumed that data collected by a moving platform is available, consisting of posed camera images and lidar point clouds, as well as estimates of the size and pose of any moving actors. To be practical, the method needs to perform well in terms of reconstruction error on the major automotive datasets while keeping training and inference times low.
The figure gives an overview of NeuRAD, the method proposed in this paper: a joint neural feature field is learned for the static and dynamic parts of an automotive scene, distinguished by an actor-aware hash encoding. Points falling within an actor's bounding box are converted to that actor's local coordinates and, together with the actor index, used to query a 4D hash grid. The volume-rendered ray-level features are decoded into RGB values with an upsampling CNN, and into ray drop probabilities and intensities with an MLP.
Building on prior work in novel view synthesis [4, 47], the authors model the world with a neural feature field (NFF), a generalization of NeRFs [25] and similar methods [23].
To render an image, a set of camera rays is volume-rendered to produce a feature map F. As described in [47], a convolutional neural network (CNN) then renders the final image. In practice, the feature map has low resolution and is upsampled by the CNN, which significantly reduces the number of ray queries.
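As a rough illustration of this rendering path, the sketch below volume-renders per-ray features into a low-resolution feature map and decodes it with a small upsampling CNN. This is not the authors' implementation: the channel counts, the decoder layout, and the 4x upsampling factor are assumptions chosen for the example.

```python
# Minimal sketch (not the authors' code): volume-render per-ray features at low
# resolution, then upsample with a small CNN decoder to obtain the RGB image.
import torch
import torch.nn as nn

def render_ray_features(feats, weights):
    """feats: (R, S, C) per-sample features, weights: (R, S) volume-rendering weights."""
    return (weights.unsqueeze(-1) * feats).sum(dim=1)  # (R, C)

class CNNDecoder(nn.Module):
    def __init__(self, in_ch=32, upsample=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=upsample, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, feature_map):  # (B, C, H/4, W/4) -> (B, 3, H, W)
        return self.net(feature_map)

# Example: a 32x48 feature map rendered from 32*48 rays with 64 samples each.
R, S, C = 32 * 48, 64, 32
feats, weights = torch.rand(R, S, C), torch.softmax(torch.rand(R, S), dim=-1)
fmap = render_ray_features(feats, weights).reshape(1, 32, 48, C).permute(0, 3, 1, 2)
rgb = CNNDecoder(C)(fmap)  # (1, 3, 128, 192)
```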
Lidar sensors allow autonomous vehicles to measure the depth and reflectivity (intensity) of a discrete set of points. They do so by emitting laser pulses and measuring the time of flight to determine distance, and the return power to determine reflectivity. To capture these properties, the pulses emitted by a posed lidar sensor are modeled as a set of rays and rendered with volume rendering techniques.
Consider a lidar beam that does not return any point. If the return power is too low, a phenomenon known as ray drop occurs, and modeling it is important for reducing the sim-to-real gap [21]. Typically, such rays either travel far enough not to hit any surface, or hit a surface from which the beam bounces off into open space, such as a mirror, glass, or wet pavement. Modeling these effects is important for realistic sensor simulation but, as noted in [14], is hard to capture purely from physics, since it depends on (often undisclosed) details of the low-level sensor detection logic. Therefore, ray drop is learned from data. As with intensity, ray features are volume-rendered and passed through a small MLP that predicts the ray drop probability p_d(r). Note that, unlike [14], second returns of the lidar beam are not modeled, since this information is not present in the five datasets used in the experiments.
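The sketch below illustrates this lidar decoding step under assumed layer sizes: a per-ray feature is volume-rendered, a small MLP decodes intensity and the ray drop probability p_d(r), and expected depth comes from the sample weights. Names such as `LidarDecoder` are illustrative, not taken from the paper.

```python
# Illustrative sketch: volume-render a per-ray feature for lidar beams, decode
# intensity and ray-drop probability with a small MLP, and compute expected depth
# from the sample weights. Layer sizes are assumptions.
import torch
import torch.nn as nn

class LidarDecoder(nn.Module):
    def __init__(self, feat_dim=32, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, ray_feat):
        out = self.mlp(ray_feat)                   # (R, 2)
        intensity = torch.sigmoid(out[..., 0])     # reflectivity in [0, 1]
        p_drop = torch.sigmoid(out[..., 1])        # ray drop probability p_d(r)
        return intensity, p_drop

def render_lidar(weights, feats, t_vals, decoder):
    """weights: (R, S), feats: (R, S, C), t_vals: (R, S) sample distances along each ray."""
    ray_feat = (weights.unsqueeze(-1) * feats).sum(dim=1)   # (R, C) volume-rendered feature
    depth = (weights * t_vals).sum(dim=1)                   # expected depth per ray (R,)
    intensity, p_drop = decoder(ray_feat)
    return depth, intensity, p_drop
```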
The definition of the neural feature field (NFF) is extended to a learned function (s, f) = NFF(x, t, d), where x is a spatial coordinate, t represents time, and d represents the view direction. This definition introduces time as an input, which is crucial for modeling the dynamic aspects of the scene.
The NFF architecture follows well-established best practices from NeRF [4, 27]. Given a position x and time t, the actor-aware hash encoding is queried. This encoding is fed into a small MLP, which computes the signed distance s and an intermediate feature g. Encoding the view direction d with spherical harmonics [27] enables the model to capture reflections and other view-dependent effects. Finally, the direction encoding and the intermediate feature are jointly processed by a second MLP, augmented with a skip connection of g, producing the feature f.
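The following sketch shows one plausible realization of this forward pass, under assumed layer widths: an injected encoder (standing in for the actor-aware hash encoding) feeds a geometry MLP that outputs s and g, a low-order direction encoding stands in for spherical harmonics, and a second MLP with a skip connection of g produces f. Everything beyond the structure described above is an assumption.

```python
# Hedged sketch of the NFF forward pass; layer sizes and the direction encoding
# are illustrative, not the paper's exact choices.
import torch
import torch.nn as nn

def dir_encoding(d):
    """Stand-in for the spherical harmonics encoding: raw direction plus second-order terms."""
    x, y, z = d.unbind(-1)
    return torch.stack([x, y, z, x * y, y * z, x * z, x * x - y * y, 3 * z * z - 1], dim=-1)

class NeuralFeatureField(nn.Module):
    def __init__(self, encoder, enc_dim, geo_dim=15, feat_dim=32, hidden=64):
        super().__init__()
        self.encoder = encoder  # (x, t) -> (N, enc_dim); in NeuRAD, the actor-aware hash encoding
        self.geo_mlp = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1 + geo_dim))
        self.dir_mlp = nn.Sequential(nn.Linear(8, hidden), nn.ReLU())
        self.feat_mlp = nn.Sequential(nn.Linear(hidden + geo_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, feat_dim))

    def forward(self, x, t, d):
        out = self.geo_mlp(self.encoder(x, t))
        s, g = out[..., :1], out[..., 1:]                 # signed distance s, intermediate feature g
        h = self.dir_mlp(dir_encoding(d))
        f = self.feat_mlp(torch.cat([h, g], dim=-1))      # skip connection of g
        return s, f

# Toy usage with a trivial encoder standing in for the actor-aware hash encoding.
encoder = lambda x, t: torch.cat([x, t], dim=-1)          # (N, 3) + (N, 1) -> (N, 4)
nff = NeuralFeatureField(encoder, enc_dim=4)
s, f = nff(torch.rand(8, 3), torch.rand(8, 1), torch.randn(8, 3))
```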
As in previous work [18, 29, 46, 47], the world is divided into two parts: a static background and a set of rigid dynamic actors, each defined by a 3D bounding box and a set of SE(3) poses. This serves the dual purpose of simplifying the learning process and enabling a degree of editability, allowing dynamic actors to be moved after training to generate new scenarios. Unlike previous approaches that use separate NFFs for different scene elements, a single unified NFF is used, in which all networks are shared and the distinction between static and dynamic components is handled transparently by the actor-aware hash encoding. The encoding strategy is simple: a given sample (x, t) is encoded with one of two functions depending on whether it lies within an actor's bounding box.
Multi-resolution hash grids have proven to be a highly expressive and efficient representation for static scenes. However, to map unbounded scenes onto a grid, the contraction proposed in MipNeRF-360 is adopted. This makes it possible to accurately represent both nearby road elements and distant clouds with a single hash grid. In contrast, existing methods use specialized NFFs to capture the sky and other distant regions.
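For reference, the MipNeRF-360 contraction has a simple closed form: points within the unit ball are left unchanged, and points outside are squashed so that all of space fits inside a ball of radius 2 before the hash-grid lookup. A minimal sketch:

```python
# Sketch of the MipNeRF-360-style contraction for unbounded scenes.
import torch

def contract(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """x: (..., 3) world-space points -> contracted points with norm < 2."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    contracted = (2.0 - 1.0 / norm) * (x / norm)
    return torch.where(norm <= 1.0, x, contracted)
```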
When a sample (x, t) falls within an actor's bounding box, its spatial coordinate x and view direction d are transformed to the actor's coordinate system at time t. The temporal aspect is then ignored and features are sampled from a time-independent multi-resolution hash grid, just as for the static scene. Naively, this would require sampling multiple hash grids, one per actor. Instead, a single 4D hash grid is used, where the fourth dimension corresponds to the actor index. This allows the features of all actors to be sampled in parallel, yielding significant speedups while matching the performance of individual hash grids.
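The sketch below illustrates the core idea under strong simplifications (a single resolution level and nearest-vertex lookup instead of multi-resolution trilinear interpolation): world points are mapped into an actor's local frame and looked up in one 4D hash table whose fourth coordinate is the actor index. The hashing primes, table size, and helper names are assumptions.

```python
# Illustrative, simplified sketch of actor-aware 4D hash lookup; not the paper's implementation.
import torch
import torch.nn as nn

PRIMES = (1, 2654435761, 805459861, 3674653429)

class Actor4DHashGrid(nn.Module):
    def __init__(self, table_size=2**19, feat_dim=4, resolution=128):
        super().__init__()
        self.table = nn.Embedding(table_size, feat_dim)
        self.table_size, self.res = table_size, resolution

    def forward(self, x_local, actor_idx):
        """x_local: (N, 3) actor-frame coords normalized to [0, 1]; actor_idx: (N,) integer index."""
        cell = (x_local.clamp(0, 1) * (self.res - 1)).long()           # (N, 3) grid cell
        coords = torch.cat([cell, actor_idx[:, None].long()], dim=-1)  # (N, 4): cell + actor index
        h = torch.zeros(coords.shape[0], dtype=torch.long)
        for i, p in enumerate(PRIMES):                                 # spatial hash over 4D coords
            h ^= coords[:, i] * p
        return self.table(h % self.table_size)                         # (N, feat_dim)

def world_to_actor(x, R_t, t_t):
    """World points (N, 3) -> actor frame, given the actor-to-world pose (R_t, t_t) at time t."""
    return (x - t_t) @ R_t   # equivalent to applying R_t^T to (x - t_t)
```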
One of the biggest challenges in applying neural rendering to automotive data is handling the multiple levels of detail present in the data. As a car travels a long distance, it observes many surfaces both from afar and up close. In this multi-scale setting, naively applying the positional embeddings of iNGP [27] or NeRF leads to aliasing artifacts [2]. To address this, many methods model rays as frustums, whose longitudinal extent is determined by the bin size and whose radial extent is determined by the pixel area and the distance from the sensor [2, 3, 13].
Zip-NeRF [4] is currently the only anti-aliasing method for iNGP hash grids; it combines two frustum-modeling techniques: multi-sampling and downweighting. In multi-sampling, the positional embeddings at multiple positions within the frustum are averaged, capturing both its longitudinal and radial extent. For downweighting, each sample is modeled as an isotropic Gaussian, and grid features are weighted in proportion to the ratio between the cell size and the Gaussian variance, effectively suppressing finer resolutions. While the combination significantly improves performance, multi-sampling also significantly increases runtime. The goal here is therefore to incorporate scale information with minimal runtime impact. Inspired by Zip-NeRF, the authors propose an intuitive downweighting scheme that downweights hash grid features based on their cell size relative to the frustum.
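A downweighting scheme in this spirit could look like the sketch below, where levels whose cells are finer than the sample's frustum footprint are scaled down. The exact functional form used in the paper is not given here, so this clamp-of-a-ratio rule should be read as an assumed stand-in.

```python
# Hedged sketch of level-wise downweighting; the paper's exact weighting formula may differ.
import torch

def downweight_levels(level_feats, cell_sizes, sample_scale):
    """
    level_feats:  (N, L, C) features from L hash-grid resolution levels for N samples
    cell_sizes:   (L,) metric cell size of each level
    sample_scale: (N,) frustum radius per sample (e.g. pixel footprint times distance)
    """
    ratio = cell_sizes[None, :] / sample_scale[:, None]   # (N, L): cell size relative to footprint
    weights = ratio.clamp(max=1.0)                        # levels finer than the footprint get < 1
    return level_feats * weights[..., None]               # (N, L, C)
```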
Another difficulty in rendering large-scale scenes is the need for efficient sampling strategies. A single image may require rendering detailed text on a nearby traffic sign while also capturing the parallax effect between skyscrapers several kilometers away. Achieving both with uniform sampling along the ray would require thousands of samples per ray, which is computationally infeasible. Previous work has relied heavily on lidar data to prune samples [47], making it difficult to render outside the lidar's field of view.
In contrast, samples are placed along the ray according to a power function [4], so that the spacing between samples increases with distance from the ray origin. Even so, it is impossible to cover all relevant regions without drastically increasing the number of samples. Therefore, two rounds of proposal sampling [25] are also used, in which a lightweight version of the neural feature field (NFF) is queried to produce a weight distribution along the ray, and a new set of samples is drawn according to these weights. After two such rounds, a refined set of samples concentrated at the relevant positions along the ray is obtained and used to query the full-size NFF. To supervise the proposal networks, an anti-aliased online distillation method [4] is adopted, with additional supervision from lidar.
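The spacing behavior of power-function sampling can be illustrated with the simple warp below, which places uniform samples through a power so that gaps grow with distance from the origin. The exact transform used in [4] differs in its details; this is only a hedged illustration of the principle, and the exponent is an arbitrary choice.

```python
# Hedged sketch of power-function ray sampling: dense near the sensor, sparse far away.
import torch

def power_samples(near, far, num_samples, gamma=3.0):
    """Returns (num_samples,) distances along the ray, densest near `near`."""
    u = torch.linspace(0.0, 1.0, num_samples)
    return near + (far - near) * u.pow(gamma)

# Example: 16 samples between 1 m and 200 m; the last gap is far larger than the first.
t = power_samples(1.0, 200.0, 16)
```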
In the standard NeRF-based formulation, each image is assumed to be captured from a single origin o. However, many camera sensors have a rolling shutter, where rows of pixels are captured sequentially. The camera can therefore move between the capture of the first and last rows, breaking the single-origin assumption. While this is not an issue for synthetic data [24] or data captured with slow-moving handheld cameras, rolling shutter becomes noticeable in captures from fast-moving vehicles, especially for side-facing cameras. The same effect is present in lidar, where each scan is typically collected over 0.1 s, corresponding to several meters of motion at highway speeds. Even for ego-motion-compensated point clouds, these differences can lead to harmful line-of-sight errors, where 3D points are converted into rays that pass through other geometry. To mitigate these effects, the rolling shutter is modeled by assigning each ray its own time and adjusting its origin based on the estimated motion. Since the rolling shutter affects all dynamic elements of the scene, actor poses are also linearly interpolated to each individual ray time.
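A minimal sketch of this idea follows: each ray's timestamp is derived from its pixel row, and the ray origin is linearly interpolated between the sensor positions at the start and end of the exposure (actor poses would be interpolated analogously). The variable names and the purely linear motion model are assumptions for illustration.

```python
# Simplified rolling-shutter sketch: per-ray time and origin from the pixel row.
import torch

def rolling_shutter_rays(rows, img_height, t_start, t_end, origin_start, origin_end):
    """
    rows:                (R,) pixel row of each ray
    t_start / t_end:     capture time of the first / last image row
    origin_start / _end: (3,) sensor position at t_start / t_end
    Returns per-ray times (R,) and origins (R, 3).
    """
    alpha = rows.float() / (img_height - 1)                        # 0 at first row, 1 at last
    times = t_start + alpha * (t_end - t_start)
    origins = origin_start[None] + alpha[:, None] * (origin_end - origin_start)[None]
    return times, origins
```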
Another issue when simulating autonomous driving sequences is that the images come from different cameras with potentially different capture parameters, such as exposure. Here, inspiration is taken from research on "NeRFs in the wild" [22], where an appearance embedding is learned for each image and passed to the second MLP along with its features. However, since it is known which image comes from which sensor, a single embedding is instead learned per sensor, minimizing the risk of overfitting and allowing these sensor embeddings to be used when generating novel views. The embeddings are applied after volume rendering, which significantly reduces computational overhead since features, rather than colors, are rendered.
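The sketch below shows one way such per-sensor embeddings could be applied after volume rendering: a learned vector per physical camera is merged with the rendered ray features before decoding. The merging layer (a small linear projection here) and all dimensions are assumptions, not the paper's exact design.

```python
# Hedged sketch of per-sensor appearance embeddings applied to volume-rendered features.
import torch
import torch.nn as nn

class SensorAppearance(nn.Module):
    def __init__(self, num_sensors, embed_dim=16, feat_dim=32):
        super().__init__()
        self.embeddings = nn.Embedding(num_sensors, embed_dim)  # one embedding per camera
        self.mix = nn.Linear(feat_dim + embed_dim, feat_dim)

    def forward(self, ray_feats, sensor_ids):
        """ray_feats: (R, feat_dim) volume-rendered features; sensor_ids: (R,) camera index per ray."""
        emb = self.embeddings(sensor_ids)                        # (R, embed_dim)
        return self.mix(torch.cat([ray_feats, emb], dim=-1))     # (R, feat_dim)
```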
The model relies on estimates of dynamic actor poses, either in the form of annotations or as the output of an object tracker. To compensate for inaccuracies in these estimates, actor poses are included in the model as learnable parameters and optimized jointly. Poses are parameterized as a translation t and a rotation R, using a 6D representation for the rotation [50].
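The 6D rotation representation [50] maps two learnable 3-vectors to a valid rotation matrix via Gram-Schmidt orthogonalization, which keeps the optimized pose on SO(3) without explicit constraints. A minimal sketch of that mapping:

```python
# Standard 6D-to-rotation-matrix mapping (Gram-Schmidt), as used for learnable poses.
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    """d6: (..., 6) learnable parameters -> (..., 3, 3) rotation matrices."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)
```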
NeuRAD is implemented within the Nerfstudio [33] open-source project. Training runs for 20,000 iterations using the Adam [17] optimizer and takes about one hour on an NVIDIA A100.
Reproducing UniSim: UniSim [47] is a neural closed-loop sensor simulator. It features photorealistic rendering and makes few assumptions about the available supervision, i.e., it requires only camera images, lidar point clouds, sensor poses, and 3D bounding boxes with trajectories for dynamic actors. These properties make UniSim a suitable baseline, as it is easily applicable to new autonomous driving datasets. However, its code is closed-source and no unofficial implementation exists, so this paper re-implements UniSim, also within Nerfstudio [33]. Since the main UniSim paper omits many model details, the re-implementation relies on the supplementary material provided through IEEE Xplore. Nonetheless, some details remain unknown, and the corresponding hyperparameters were tuned to match the reported performance on the 10 selected PandaSet [45] sequences.