The first multi-view autonomous driving scene video generation world model | DrivingDiffusion: New ideas for BEV data and simulation
In the field of autonomous driving, with the development of BEV-based sub-tasks and end-to-end solutions, high-quality multi-view training data and the construction of corresponding simulation scenes are becoming increasingly important. In response to the pain points of current tasks, "high quality" can be decoupled into three aspects: consistency between the generated results and the 3D layout, consistency across camera views and across frames, and the fidelity of the generated images.
For simulation, generating videos that meet the above conditions directly from a layout is undoubtedly the most direct way to construct multi-view sensor input. DrivingDiffusion solves these problems from a new perspective.
Layout-controlled multi-view image generation
The figure shows multi-view image generation results with projected layouts as input.

Adjust the layout: precisely control the generated results
The upper part of the figure shows the diversity of the generated results and the importance of the module designs described below. The lower part shows the results of perturbing the vehicle directly behind the ego vehicle, including the generation results for moving, turning, colliding, and even floating in the air.

Layout-controlled multi-view video generation
Generate subsequent frames based on the text description of the input frame
Using a single frame as input, subsequent frames of the driving scene are constructed based on text descriptions of the ego vehicle and of other vehicles. The first three rows and the fourth row in the figure show the generation results after controlling the behavior of the ego vehicle and of other vehicles with text descriptions, respectively. (The green box is the input, the blue box is the output.)
Generate subsequent frames directly based on the input frame
No other control is needed: a single frame is used as input to predict subsequent frames of the driving scene. (The green box is the input, the blue box is the output.)

DrivingDiffusion first artificially constructs all of the 3D ground truth in the scene (obstacles and road structure). This ground truth is projected into layout images using the camera parameters, and these images serve as the model input from which realistic multi-camera images/videos are obtained. The reason the 3D ground truth (BEV views or encoded instances) is not fed to the model directly, but is instead projected with the camera parameters before being fed in, is to eliminate systematic 3D-2D consistency errors. (In such a dataset, the 3D ground truth and the camera parameters are constructed artificially according to actual needs: the former brings the ability to construct rare-scene data at will, and the latter eliminates the geometric-consistency errors of traditional data production.)
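To make the projection step concrete, below is a minimal sketch of how 3D box corners could be projected into a per-camera layout image with the camera intrinsics and extrinsics. This is not DrivingDiffusion's actual code; the function name, the NumPy interface, and the simple bounding-rectangle rasterization are all assumptions made for illustration.

```python
import numpy as np

def project_boxes_to_layout(corners_3d, T_world_to_cam, K, image_size=(900, 1600)):
    """Project 3D box corners (N, 8, 3) into a single-camera 2D layout image.

    corners_3d:     8 corners per box in world/ego coordinates.
    T_world_to_cam: 4x4 extrinsic matrix of the camera.
    K:              3x3 intrinsic matrix of the camera.
    Returns a single-channel layout mask; a real pipeline would draw
    class-colored boxes and road-structure polylines the same way.
    """
    h, w = image_size
    layout = np.zeros((h, w), dtype=np.uint8)

    for box in corners_3d:
        # Homogeneous coordinates, then world -> camera.
        pts = np.concatenate([box, np.ones((8, 1))], axis=1)   # (8, 4)
        cam = (T_world_to_cam @ pts.T).T[:, :3]                # (8, 3)

        # Skip boxes entirely behind the camera.
        if (cam[:, 2] <= 0.1).all():
            continue

        # Pinhole projection with the intrinsics.
        uv = (K @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]

        # Rasterize the projected corners (here: a simple bounding rectangle).
        u_min, v_min = np.clip(uv.min(axis=0), 0, [w - 1, h - 1]).astype(int)
        u_max, v_max = np.clip(uv.max(axis=0), 0, [w - 1, h - 1]).astype(int)
        layout[v_min:v_max + 1, u_min:u_max + 1] = 255

    return layout
```

Because the projection is a fixed geometric operation rather than a learned mapping, the resulting layout images are exactly consistent with the constructed 3D ground truth, which is the property the article emphasizes.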
One question remains at this point: can the quality of the generated images/videos meet the usage requirements? When it comes to constructing scenes, people often think of using a simulation engine, but there is a large domain gap between the data it generates and real data. The results of GAN-based methods often deviate somewhat from the distribution of real data. Diffusion models, which generate data by learning to denoise along a Markov chain, produce results with higher fidelity and are better suited as substitutes for real data. DrivingDiffusion directly generates sequential multi-view images according to artificially constructed scenes and camera parameters; these can be used both as training data for downstream autonomous driving tasks and to build a simulation system that provides feedback to autonomous driving algorithms.
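As background, the Markov-chain denoising mentioned here is the standard DDPM reverse process. The sketch below is textbook DDPM rather than DrivingDiffusion-specific code, and all names are illustrative.

```python
import torch

@torch.no_grad()
def ddpm_reverse_step(model, x_t, t, betas):
    """One reverse step x_t -> x_{t-1} of a standard DDPM Markov chain.

    model(x_t, t) is assumed to predict the noise added at step t.
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    eps = model(x_t, t)                                   # predicted noise
    coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps) / torch.sqrt(alphas[t])     # posterior mean

    if t > 0:
        noise = torch.randn_like(x_t)
        return mean + torch.sqrt(betas[t]) * noise        # sample x_{t-1}
    return mean                                           # final step is deterministic
```

Iterating this step from pure Gaussian noise down to t = 0 yields a sample from the learned data distribution, which is what gives diffusion models their fidelity advantage over GANs for this use case.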
The "artificially constructed scene" here only contains obstacles and road structure information, but DrivingDiffusion's framework can easily introduce layout information such as signboards, traffic lights, construction areas, and even low-level occupation grid/depth map and other control modes.There are several difficulties when generating multi-view videos:
DrivingDiffusion's long-video generation pipeline
The pipeline is a cascade of two stages. First, a single-frame multi-view model generates the multi-view keyframes. Then a single-view temporal model, shared across views and using the keyframes as additional control, extends each view in time, with all views processed in parallel.
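A rough pseudocode view of this two-stage cascade is given below; the model interfaces (`multiview_model`, `temporal_model`) and the data layout are hypothetical placeholders used only to show the data flow.

```python
def generate_long_multiview_video(layout_keyframe, layout_sequence,
                                  multiview_model, temporal_model, num_views=6):
    """Two-stage cascade, roughly following the pipeline described above.

    1. A single-frame multi-view model turns the keyframe layout into
       consistent keyframe images for all camera views.
    2. A shared single-view temporal model, conditioned on each view's
       keyframe, extends that view through time; views run in parallel.
    """
    # Stage 1: multi-view keyframes from the keyframe layout.
    keyframes = multiview_model(layout_keyframe)            # list of num_views images

    # Stage 2: temporal extension per view, keyframe as additional control.
    videos = []
    for v in range(num_views):
        video_v = temporal_model(first_frame=keyframes[v],
                                 layouts=[frame[v] for frame in layout_sequence])
        videos.append(video_v)
    return videos
```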
The consistency module is divided into two parts: a consistency attention mechanism and a consistency correlation loss.
The consistency attention mechanism focuses on interactions between adjacent views and between temporally related frames. Specifically, for cross-view consistency it only attends to the information exchange between left and right neighboring views, which overlap; for the temporal model, each frame only attends to the keyframe and the previous frame. This avoids the huge computational cost of global interaction. The consistency correlation loss adds geometric constraints through pixel-level correlation and pose regression, with gradients provided by a pre-trained pose regressor. The regressor adds a pose-regression head on top of LoFTR and is trained with the ground-truth poses of the corresponding real dataset. For the multi-view model and the temporal model, this module supervises the relative camera poses and the ego-vehicle motion pose, respectively.

Local prompts and global prompts cooperate, reusing the parameter semantics of CLIP and stable-diffusion-v1-4 to locally enhance instance regions of specific categories. As shown in the figure, on top of the cross-attention between image tokens and the global text prompt, the authors design a local prompt for each category and use the image tokens inside that category's mask region to query the local prompt. This process makes maximal use of the open-domain, text-guided image generation concepts already present in the original model parameters.
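As an illustration of the restricted attention pattern described above, here is a sketch under the assumption that the restriction is implemented as boolean attention masks; the mask-building helpers below are not from the paper.

```python
import torch

def neighbor_view_mask(num_views: int = 6) -> torch.Tensor:
    """Boolean mask (num_views, num_views): each view attends only to itself
    and its left/right neighbors in the surround-camera ring, not to all views."""
    mask = torch.zeros(num_views, num_views, dtype=torch.bool)
    for v in range(num_views):
        for u in (v - 1, v, v + 1):
            mask[v, u % num_views] = True   # ring topology of surround cameras
    return mask

def temporal_mask(num_frames: int, key_idx: int = 0) -> torch.Tensor:
    """Boolean mask (num_frames, num_frames): each frame attends only to itself,
    the keyframe, and the immediately preceding frame."""
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for f in range(num_frames):
        mask[f, f] = True
        mask[f, key_idx] = True
        if f > 0:
            mask[f, f - 1] = True
    return mask
```

Such masks can be passed to an attention layer so that disallowed pairs are ignored, which is what keeps the cost far below full global attention across all views and frames.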
The text-description controller for the ego vehicle and the controller for other vehicles/the environment are decoupled.
Although metrics such as FID are commonly used to measure image synthesis quality, they do not fully reflect the design goals of this task, nor the synthesis quality of different semantic categories. Since the task is dedicated to generating multi-view images consistent with the 3D layout, DrivingDiffusion proposes using BEV perception metrics to measure this consistency: the official CVT and BEVFusion models are used as evaluators, images are generated conditioned on the same real 3D layouts as the nuScenes validation set, CVT and BEVFusion inference is run on each group of generated images, and the predictions are then compared with the real annotations, including the mean Intersection over Union (mIoU) of the drivable area and the NDS over all object classes. The statistics are shown in Table 2. The results show that the perception metrics on the synthetic evaluation set are very close to those on the real evaluation set, reflecting both the high consistency between the generated results and the 3D ground truth and the high fidelity of the image quality.
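A sketch of this evaluation protocol is shown below; the interfaces for the generator, the frozen BEV perception model, and the metric are hypothetical and stand in for the official CVT/BEVFusion tooling.

```python
def evaluate_generated_data(generator, bev_perceiver, nuscenes_val, metric_fn):
    """Measure layout consistency by feeding generated images to a frozen
    BEV perception model (e.g. CVT or BEVFusion) and scoring its predictions
    against the real 3D annotations the images were conditioned on."""
    scores = []
    for sample in nuscenes_val:
        layout = sample["layout"]            # projected 3D ground truth
        fake_images = generator(layout)      # generated multi-view images
        pred = bev_perceiver(fake_images, sample["camera_params"])
        scores.append(metric_fn(pred, sample["annotations"]))   # e.g. mIoU / NDS
    return sum(scores) / len(scores)
```

The closer this score on generated images is to the score on real images, the stronger the evidence that the generated data respects the 3D layout and looks realistic to a perception model.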
In addition to the above experiments, DrivingDiffusion conducted experiments on adding synthetic data to training in order to address the main problem it targets: improving the performance of downstream autonomous driving tasks. Table 3 shows the performance improvements that synthetic data augmentation brings to BEV perception tasks. The original training data suffers from long-tail distributions, especially for small targets, close-range vehicles, and vehicle orientation angles. DrivingDiffusion focuses on generating additional data for these classes with limited samples. After adding 2000 frames of data focused on improving the distribution of obstacle orientation angles, NDS improved slightly, while mAOE dropped significantly from 0.5613 to 0.5295. After using 6000 frames of more comprehensive synthetic data focused on rare scenes to assist training, a significant improvement can be observed on the nuScenes validation set: NDS increased from 0.412 to 0.434, and mAOE decreased from 0.5613 to 0.5130. This demonstrates the gains that synthetic-data augmentation can bring to perception tasks. Users can analyze the distribution of each dimension in their data according to actual needs and then supplement it with targeted synthetic data.
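A minimal sketch of such targeted rebalancing follows, assuming a hypothetical `orientation_of` accessor and a simple histogram-based notion of "rare" orientation bins; the real workflow would generate new scenes for those bins rather than filter an existing pool.

```python
import numpy as np

def build_augmented_split(real_frames, synth_frames, orientation_of, num_bins=12):
    """Combine real data with synthetic frames that target under-represented
    orientation-angle bins (a rough sketch of the long-tail rebalancing idea)."""
    # Histogram of object orientation angles in the real data.
    angles = np.array([orientation_of(f) for f in real_frames])
    counts, edges = np.histogram(angles, bins=num_bins, range=(-np.pi, np.pi))
    rare_bins = set(np.argsort(counts)[: num_bins // 3])      # least frequent third

    # Keep only synthetic frames whose orientation falls in a rare bin.
    selected = []
    for f in synth_frames:
        b = int(np.digitize(orientation_of(f), edges) - 1)
        if min(max(b, 0), num_bins - 1) in rare_bins:
            selected.append(f)
    return list(real_frames) + selected
```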
DrivingDiffusion simultaneously realizes multi-view video generation of autonomous driving scenes and prediction of the future, which is of great significance for autonomous driving tasks. The layout and the parameters are all artificially constructed, and the 3D-to-2D conversion is done through projection rather than relying on learnable model parameters, which eliminates the geometric errors of previous data production pipelines and gives the method strong practical value. At the same time, DrivingDiffusion is highly scalable: it supports new kinds of layout content and additional controllers, and generation quality can be further improved losslessly through super-resolution and video frame interpolation.
In autonomous driving simulation, there are more and more attempts based on NeRF. However, in the street-view generation task, separating dynamic and static content, reconstructing large-scale blocks, and decoupling appearance control over weather and other dimensions bring an enormous amount of work. Moreover, a NeRF usually has to be trained on a specific scene before it can support novel-view synthesis in subsequent simulation. DrivingDiffusion naturally contains a certain amount of general prior knowledge, including vision-text associations and conceptual understanding of visual content, and it can quickly create a scene on demand simply by constructing the layout. However, as mentioned above, the whole pipeline is relatively complex, and long-video generation requires post-processing model fine-tuning and extension. DrivingDiffusion will continue to explore compressing the view and time dimensions, combining NeRF for novel-view generation and conversion, and further improving generation quality and scalability.