
NeuRAD: Application of leading multi-dataset neural rendering technology in autonomous driving


The paper "NeuRAD: Neural Rendering for Autonomous Driving" comes from Zenseact, Chalmers University of Technology, Linköping University, and Lund University.

Neural Radiance Fields (NeRFs) are becoming increasingly popular in the autonomous driving (AD) community. Recent methods have shown the potential of NeRFs for closed-loop simulation, AD system testing, and training-data augmentation. However, existing methods often require long training times, dense semantic supervision, and lack generalizability, which hinders the large-scale application of NeRFs in AD. This paper proposes NeuRAD, a robust novel view synthesis method for dynamic AD data. The method features a simple network design, sensor modeling for both cameras and lidar (including rolling shutter, beam divergence, and ray drop), and works on multiple datasets out of the box.

As shown in the figure, NeuRAD is a neural rendering method tailored for dynamic automotive scenes. The poses of the ego vehicle and other road users can be changed, and actors can be added and/or removed freely. These properties make NeuRAD suitable as the foundation for components such as sensor-realistic closed-loop simulators or powerful data augmentation engines.

The goal of this paper is to learn a representation from which realistic sensor data can be generated, and in which the ego-vehicle platform, the actors' poses, or both can be changed. It is assumed that data collected by a moving platform is available, consisting of posed camera images and lidar point clouds, together with estimates of the size and pose of any moving actors. For practicality, the method must perform well in terms of reconstruction error on the major automotive datasets while keeping training and inference time to a minimum.

The figure gives an overview of NeuRAD, the method proposed in this paper: a joint static and dynamic neural feature field is learned for automotive scenes, with the two parts distinguished by an actor-aware hash encoding. Points falling within an actor's bounding box are transformed to actor-local coordinates and, together with the actor index, used to query a 4D hash grid. Volume-rendered ray-level features are decoded into RGB values with an upsampling CNN, and into ray drop probabilities and intensities with an MLP.

Building on recent work in novel view synthesis [4, 47], the authors model the world with a neural feature field (NFF), a generalization of NeRFs [25] and similar methods [23].

To render an image, a set of camera rays is volume-rendered to produce a feature map F. As in [47], a convolutional neural network (CNN) then renders the final image. In practice, the feature map has a low resolution and is upsampled by the CNN, which significantly reduces the number of ray queries.
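As an illustration of this rendering pipeline, the following PyTorch sketch decodes an already volume-rendered, low-resolution feature map into a full-resolution RGB image with an upsampling CNN. The feature dimension, upsampling factor, and layer sizes are assumptions for illustration, not NeuRAD's exact architecture.

```python
# Minimal sketch: decode a low-resolution, volume-rendered feature map to RGB.
# FEATURE_DIM and UPSAMPLE are illustrative assumptions, not NeuRAD's values.
import torch
import torch.nn as nn

FEATURE_DIM = 32   # assumed size of each volume-rendered ray feature
UPSAMPLE = 4       # assumed upsampling factor from feature map to image

class RGBDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(FEATURE_DIM, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=UPSAMPLE, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
            nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (B, FEATURE_DIM, H / UPSAMPLE, W / UPSAMPLE)
        return self.net(feature_map)

# Usage: one ray per low-resolution pixel, so a 1920x1080 image only needs
# (1920 / 4) * (1080 / 4) ray queries before the CNN upsamples to full size.
decoder = RGBDecoder()
low_res_features = torch.randn(1, FEATURE_DIM, 270, 480)
rgb = decoder(low_res_features)  # (1, 3, 1080, 1920)
```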

Lidar sensors allow autonomous vehicles to measure the depth and reflectivity (intensity) of a discrete set of points. They do so by emitting laser pulses and measuring the time of flight to determine distance and the returned power to determine reflectivity. To capture these properties, the pulses emitted by the posed lidar sensor are modeled as a set of rays and rendered with volume-rendering-style techniques.

Consider a lidar ray that does not return any point. This happens when the returned power is too low, a phenomenon known as ray drop, and modeling it is important for reducing the gap between simulation and reality [21]. Typically, such rays either travel far enough not to hit any surface, or hit a surface that bounces the beam off into open space, such as a mirror, glass, or wet road. Modeling these effects is important for realistic sensor simulation but, as noted in [14], is difficult to capture from physics alone, as it depends on (often undisclosed) details of the low-level sensor detection logic. Therefore, ray drop is learned from data. As with intensity, ray features are volume-rendered and passed through a small MLP to predict the ray drop probability pd(r). Note that, unlike [14], second returns of the lidar beam are not modeled, since this information is not present in the five datasets used in the experiments.
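The sketch below illustrates such a learned lidar decoder: volume-rendered ray features are mapped by a small MLP to an intensity value and a ray drop probability. The layer sizes and feature dimension are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a lidar decoder: ray features -> (intensity, ray drop probability).
import torch
import torch.nn as nn

class LidarDecoder(nn.Module):
    def __init__(self, feature_dim: int = 32, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # [intensity logit, ray-drop logit]
        )

    def forward(self, ray_features: torch.Tensor):
        out = self.mlp(ray_features)           # (num_rays, 2)
        intensity = torch.sigmoid(out[:, 0])   # predicted return intensity
        p_drop = torch.sigmoid(out[:, 1])      # probability the ray returns nothing
        return intensity, p_drop

decoder = LidarDecoder()
intensity, p_drop = decoder(torch.randn(1024, 32))
```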

The definition of the neural feature field (NFF) is extended to a learned function (s, f) = NFF(x, t, d), where x is a spatial coordinate, t represents time, and d is the viewing direction. Introducing time as an input is crucial for modeling the dynamic aspects of the scene.

Neural Architecture

The NFF architecture follows established best practices from the NeRF literature [4, 27]. Given a position x and a time t, the actor-aware hash encoding is queried. This encoding is fed into a small MLP, which computes the signed distance s and an intermediate feature g. Encoding the view direction d with spherical harmonics [27] enables the model to capture reflections and other view-dependent effects. Finally, the direction encoding and the intermediate feature are processed jointly by a second MLP, augmented with a skip connection from g, producing the feature f.
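A minimal PyTorch sketch of this two-MLP structure is given below, with the actor-aware hash encoding and the spherical-harmonics direction encoding assumed to be computed elsewhere and passed in as plain tensors; all dimensions are illustrative.

```python
# Sketch of the two-MLP NFF head: hash encoding -> (signed distance s, feature g),
# then (direction encoding, g) -> feature f. Encodings are assumed precomputed.
import torch
import torch.nn as nn

class NeuralFeatureField(nn.Module):
    def __init__(self, enc_dim=32, dir_enc_dim=16, geo_feat_dim=15, out_feat_dim=32):
        super().__init__()
        # First MLP: actor-aware hash encoding -> signed distance + intermediate feature g
        self.geo_mlp = nn.Sequential(
            nn.Linear(enc_dim, 64), nn.ReLU(),
            nn.Linear(64, 1 + geo_feat_dim),
        )
        # Second MLP: direction encoding concatenated with g (skip connection) -> feature f
        self.feat_mlp = nn.Sequential(
            nn.Linear(dir_enc_dim + geo_feat_dim, 64), nn.ReLU(),
            nn.Linear(64, out_feat_dim),
        )

    def forward(self, hash_enc: torch.Tensor, dir_enc: torch.Tensor):
        h = self.geo_mlp(hash_enc)
        s, g = h[..., :1], h[..., 1:]                          # signed distance, intermediate feature
        f = self.feat_mlp(torch.cat([dir_enc, g], dim=-1))     # view-dependent feature
        return s, f
```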

Scene Composition

Similar to previous work [18, 29, 46, 47], the world is divided into two parts: a static background and a set of rigid dynamic actors, each defined by a 3D bounding box and a set of SO(3) poses. This serves the dual purpose of simplifying the learning process and providing a degree of editability, so that dynamic actors can be manipulated after training to generate new scenarios. Unlike previous approaches that use separate NFFs for different scene elements, a single unified NFF is used, in which all networks are shared and the distinction between static and dynamic components is handled transparently by the actor-aware hash encoding. The encoding strategy is simple: a given sample (x, t) is encoded with one of two functions, depending on whether it lies within an actor's bounding box.
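The following sketch illustrates this routing, assuming axis-aligned actor boxes already expressed in world coordinates at the sample time and two user-supplied encoding functions; it is a simplified illustration rather than the paper's implementation.

```python
# Hedged sketch of actor-aware routing: samples inside any actor box are encoded
# by the actor branch, all others by the static branch. Boxes are simplified to
# axis-aligned world-space boxes at the sample time.
import torch

def actor_aware_encode(x, t, boxes_min, boxes_max, static_encode, actor_encode):
    """
    x: (N, 3) sample positions, t: (N,) sample times
    boxes_min / boxes_max: (A, 3) actor box corners (assumed precomputed for time t)
    static_encode / actor_encode: callables returning (M, C) feature encodings
    """
    inside = ((x[:, None, :] >= boxes_min[None]) &
              (x[:, None, :] <= boxes_max[None])).all(-1)   # (N, A) containment tests
    in_any = inside.any(dim=1)                               # sample falls in some box
    actor_idx = inside.int().argmax(dim=1)                   # which box (valid where in_any)

    enc = static_encode(x, t)                                # default: static background encoding
    if in_any.any():
        # overwrite with actor encoding for samples inside a bounding box
        enc[in_any] = actor_encode(x[in_any], t[in_any], actor_idx[in_any])
    return enc
```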

Unbounded Static Scene

Using multi-resolution hash grids to represent static scenes has proven to be a highly expressive and efficient representation. However, to map an unbounded scene onto the grid, the contraction approach proposed in MipNeRF-360 is adopted. This makes it possible to accurately represent both nearby road elements and distant clouds with a single hash grid. In contrast, existing methods use dedicated NFFs to capture the sky and other distant regions.
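For reference, a standard MipNeRF-360-style contraction can be written as below; this is the commonly used formulation and may differ in details from NeuRAD's exact implementation.

```python
# Sketch of the MipNeRF-360-style contraction: points with norm <= 1 are kept,
# farther points are squashed so that all coordinates lie within radius 2.
import torch

def contract(x: torch.Tensor) -> torch.Tensor:
    # x: (..., 3) coordinates (assumed normalized so the "near" region is the unit ball)
    norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * (x / norm))
```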

Rigid Dynamic Actors

When a sample (x, t) falls within an actor's bounding box, its spatial coordinate x and viewing direction d are transformed into the actor's coordinate system at time t. The temporal dimension is then ignored and features are sampled from a time-independent multi-resolution hash grid, just as for the static scene. The straightforward approach would be to sample multiple separate hash grids, one per actor. Instead, a single 4D hash grid is used, where the fourth dimension corresponds to the actor index. This allows the features of all actors to be sampled in parallel, achieving significant speedups while matching the performance of individual hash grids.
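The sketch below illustrates the actor-local transform and the 4D query, with the multi-resolution hash grid stubbed as a generic callable and actor poses assumed to be given per discrete time index; it is an illustration of the idea, not the paper's code.

```python
# Hedged sketch: transform samples into actor-local coordinates and query a
# shared 4D grid (local xyz + actor index) so all actors are encoded in parallel.
import torch

def actor_local_query(x, d, t_idx, actor_idx, actor_poses, grid_4d):
    """
    x: (N, 3) world positions, d: (N, 3) world view directions
    t_idx: (N,) discrete time index per sample (assumed nearest / precomputed)
    actor_idx: (N,) which actor each sample belongs to
    actor_poses: (A, T, 4, 4) world-from-actor transforms per actor and time index
    grid_4d: callable mapping (N, 4) [local xyz, actor index] -> (N, C) features
    """
    world_from_actor = actor_poses[actor_idx, t_idx]          # (N, 4, 4)
    actor_from_world = torch.inverse(world_from_actor)
    R, tr = actor_from_world[:, :3, :3], actor_from_world[:, :3, 3]
    x_local = torch.einsum("nij,nj->ni", R, x) + tr           # positions to actor frame
    d_local = torch.einsum("nij,nj->ni", R, d)                # rotate view directions
    coords = torch.cat([x_local, actor_idx[:, None].float()], dim=-1)
    return grid_4d(coords), d_local
```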

Multi-Scale Scene Problem

One of the biggest challenges in applying neural rendering to automotive data is handling the multiple levels of detail present in the data. When a car travels a long distance, many surfaces are observed both from far away and up close. In this multi-scale setting, naively applying the positional embeddings of iNGP [27] or NeRF leads to aliasing artifacts [2]. To address this, many methods model rays as frustums, whose longitudinal extent is determined by the bin size and whose radial extent is determined by the pixel area and the distance from the sensor [2, 3, 13].

Zip-NeRF [4] is currently the only anti-aliasing method for iNGP hash grids; it combines two frustum-modeling techniques: multi-sampling and down-weighting. In multi-sampling, the positional embeddings at multiple locations within the frustum are averaged, capturing both its longitudinal and radial extent. For down-weighting, each sample is modeled as an isotropic Gaussian, and grid features are weighted proportionally to the ratio between the cell size and the Gaussian variance, effectively suppressing finer resolutions. While combining the two techniques significantly improves performance, multi-sampling also significantly increases runtime. The goal here is therefore to incorporate scale information with minimal runtime impact. Inspired by Zip-NeRF, the authors propose an intuitive down-weighting scheme that reduces the weight of hash grid features based on their size relative to the frustum.
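The following sketch shows one way such a down-weighting could look: features from hash-grid levels whose cells are much finer than the ray frustum at a sample are attenuated. The specific weighting function (a clamped size ratio) is an assumption for illustration and is not the paper's exact formula.

```python
# Illustrative sketch of level down-weighting for anti-aliasing. The clamped
# size-ratio weighting used here is an assumption, not NeuRAD's exact formula.
import torch

def downweight_levels(level_features, cell_sizes, frustum_size):
    """
    level_features: (N, L, C) features per sample and hash-grid resolution level
    cell_sizes: (L,) metric size of a grid cell at each level
    frustum_size: (N, 1) approximate footprint of the ray frustum at each sample
    """
    # weight -> 1 when the cell is at least as large as the frustum,
    # and shrinks as the cell becomes much finer than the frustum
    w = (cell_sizes[None, :] / frustum_size).clamp(max=1.0)   # (N, L)
    return level_features * w[..., None]                       # (N, L, C)
```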

Efficient Sampling

Another difficulty in rendering large-scale scenes is the need for efficient sampling strategies. A single image might contain detailed text on a nearby traffic sign as well as parallax effects between skyscrapers several kilometers away. Covering both with uniform sampling along each ray would require thousands of samples per ray, which is computationally infeasible. Previous work has relied heavily on lidar data to prune samples [47], making it difficult to render outside the lidar's coverage.

Instead, samples are drawn along each ray according to a power function [4], so that the spacing between samples grows with distance from the ray origin. Even so, it is impossible to cover all relevant regions without a drastic increase in the number of samples. Therefore, two rounds of proposal sampling [25] are also used, in which a lightweight version of the neural feature field (NFF) is queried to produce a weight distribution along the ray, and a new set of samples is then drawn according to these weights. After two such rounds, a refined set of samples is obtained, concentrated at the relevant positions along the ray, which is used to query the full-size NFF. To supervise the proposal networks, an anti-aliased online distillation method [4] is adopted, further supervised with lidar.
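A minimal sketch of the distance-biased initial sampling is shown below: uniform samples are warped through a power function so that spacing grows with distance from the ray origin. The exponent and exact warp are illustrative assumptions; proposal sampling then refines these samples.

```python
# Sketch of power-function ray sampling: dense near the origin, sparse far away.
# The exponent gamma is an illustrative assumption, not the paper's exact choice.
import torch

def power_spaced_samples(near: float, far: float, num_samples: int, gamma: float = 3.0):
    u = torch.linspace(0.0, 1.0, num_samples)     # uniform in [0, 1]
    return near + (far - near) * u.pow(gamma)     # warped distances along the ray

t = power_spaced_samples(near=0.5, far=3000.0, num_samples=128)
```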

Modeling Rolling Shutter

In the standard NeRF-style formulation, each image is assumed to be captured from a single origin o. However, many camera sensors have a rolling shutter, where rows of pixels are captured sequentially. The camera can therefore move between the capture of the first and last rows, breaking the single-origin assumption. While this is not an issue for synthetic data [24] or data captured with slow-moving handheld cameras, rolling shutter becomes noticeable in captures from fast-moving vehicles, especially for side-facing cameras. The same effect is present in lidar, where each scan is typically collected over 0.1 s, corresponding to several meters of movement at highway speeds. Even for ego-motion-compensated point clouds, these differences lead to harmful line-of-sight errors, where 3D points are converted into rays that pass through other geometry. To mitigate these effects, the rolling shutter is modeled by assigning each ray its own time and adjusting its origin based on the estimated motion. Since the rolling shutter affects all dynamic elements of the scene, actor poses are also linearly interpolated to each individual ray time.
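The sketch below illustrates the per-ray treatment for a rolling-shutter camera: each ray receives a timestamp based on its image row, and its origin is shifted by linearly interpolating the sensor motion over the readout interval. The linear-motion assumption and the function signature are illustrative.

```python
# Hedged sketch of rolling-shutter handling: per-ray timestamps and origins,
# linearly interpolated over the readout interval of a single image.
import torch

def rolling_shutter_rays(origin_start, origin_end, t_start, t_end, rows, num_rows):
    """
    origin_start / origin_end: (3,) sensor position at first / last row capture
    t_start / t_end: capture times of the first / last row
    rows: (N,) image row index of each ray, num_rows: total rows in the image
    """
    alpha = rows.float() / max(num_rows - 1, 1)                        # (N,) in [0, 1]
    ray_times = t_start + alpha * (t_end - t_start)                    # per-ray timestamps
    ray_origins = origin_start[None] + alpha[:, None] * (origin_end - origin_start)[None]
    return ray_origins, ray_times
# Actor poses would be interpolated to these per-ray times in the same spirit.
```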

Different Camera Settings

Another issue when simulating autonomous driving sequences is that the images come from different cameras with potentially different capture parameters, such as exposure. Here, inspiration is taken from the work on "NeRFs in the wild" [22], where an appearance embedding is learned for each image and passed to the second MLP along with its features. However, since it is known which image comes from which sensor, a single embedding is instead learned per sensor, minimizing the risk of overfitting and allowing these sensor embeddings to be used when generating novel views. The embeddings are applied after volume rendering, which, since features rather than colors are rendered, significantly reduces the computational overhead.
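A minimal sketch of such per-sensor embeddings is shown below: one learned vector per physical camera is concatenated to the volume-rendered ray features before decoding. Dimensions are illustrative.

```python
# Sketch of per-sensor appearance embeddings applied after volume rendering.
import torch
import torch.nn as nn

class SensorEmbedding(nn.Module):
    def __init__(self, num_sensors: int, embed_dim: int = 8):
        super().__init__()
        self.embed = nn.Embedding(num_sensors, embed_dim)

    def forward(self, ray_features: torch.Tensor, sensor_ids: torch.Tensor):
        # ray_features: (N, C) rendered features, sensor_ids: (N,) camera index per ray
        return torch.cat([ray_features, self.embed(sensor_ids)], dim=-1)
```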

Noisy Actor Poses

The model relies on estimates of the dynamic actors' poses, either in the form of annotations or as the output of a tracker. To cope with imperfections in these estimates, the actor poses are included in the model as learnable parameters and optimized jointly. Each pose is parameterized as a translation t and a rotation R, using the 6D rotation representation [50].
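The sketch below shows how such learnable poses could be parameterized: translations as free parameters and rotations in the 6D representation of [50], converted to rotation matrices via Gram-Schmidt. Initialization values and shapes are assumptions for illustration.

```python
# Hedged sketch of jointly optimizable actor poses with a 6D rotation parameterization.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotation_6d_to_matrix(d6: torch.Tensor) -> torch.Tensor:
    # Standard 6D-to-rotation conversion via Gram-Schmidt orthogonalization.
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)      # (..., 3, 3)

class LearnablePoses(nn.Module):
    def __init__(self, init_translations: torch.Tensor, init_rot6d: torch.Tensor):
        super().__init__()
        self.t = nn.Parameter(init_translations)   # (A, T, 3) translations
        self.r6 = nn.Parameter(init_rot6d)         # (A, T, 6) rotations in 6D form

    def forward(self):
        return self.t, rotation_6d_to_matrix(self.r6)
```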

NeuRAD is implemented in the open-source Nerfstudio [33] framework. Training runs for 20,000 iterations using the Adam [17] optimizer and takes about one hour on an NVIDIA A100.

Reproducing UniSim: UniSim [47] is a neural closed-loop sensor simulator. It features photorealistic rendering and makes few assumptions about the available supervision, requiring only camera images, lidar point clouds, sensor poses, and 3D bounding boxes with trajectories for dynamic actors. These properties make UniSim a suitable baseline, as it is easily applicable to new autonomous driving datasets. However, the code is closed-source and no unofficial implementation exists. The authors therefore re-implemented UniSim themselves, also within Nerfstudio [33]. Since the main UniSim paper omits many model details, they relied on the supplementary material available through IEEE Xplore. Nonetheless, some details remain unknown, and the corresponding hyperparameters were tuned to match the reported performance on 10 selected PandaSet [45] sequences.

