New progress in Li Feifei's 'Spatial Intelligence' series: Wu Jiajun's team's new 'BVS' suite evaluates computer vision models
In her recent 2024 TED talk, Li Feifei explained the concept of Spatial Intelligence in detail. She is enthusiastic about the rapid progress of computer vision over the past few years and is founding a start-up company to pursue it.
The talk mentioned a research result from the Stanford team: BEHAVIOR, a dataset of behaviors and actions that they created to train computers and robots to act in a three-dimensional world. BEHAVIOR is a large dataset containing human behaviors and actions in a variety of scenarios, and its purpose is to help computers and robots better understand and imitate human behavior. By analyzing a large amount of data in BEHAVIOR, researchers can obtain
Now, a team led by Wu Jiajun has published a follow-up study, "BEHAVIOR Vision Suite (BVS)". The paper was also selected as a CVPR 2024 Highlight.
In computer vision, systematically evaluating and understanding how models perform under different conditions requires large amounts of data with comprehensive, customized labels. Real-world visual datasets often struggle to meet these needs. Synthetic data, including data produced for embodied AI tasks, offers a promising alternative, but existing efforts still fall short in asset and rendering quality, data diversity, and the realism of physical properties.
In order to solve these problems, the research team launched "BEHAVIOR Vision Suite (BVS)".
BVS is a set of tools and resources designed for the systematic evaluation of computer vision models. Built on the newly developed embodied AI benchmark BEHAVIOR-1K, BVS exposes adjustable parameters at the scene level (such as lighting and object placement), the object level (such as joint configuration and attributes), and the camera level (such as field of view and focal length). Researchers can adjust these parameters during data collection to control their experiments precisely.
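To make the three levels of control more concrete, here is a minimal sketch of how such parameters could be bundled and varied one at a time. The `RenderConfig` class, its field names, and the `sweep` helper are hypothetical and chosen only for illustration; they are not the actual BVS API.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class RenderConfig:
    """Hypothetical parameter bundle; field names are illustrative, not BVS's API."""
    # Scene level
    light_intensity: float = 1.0
    # Object level
    joint_openness: float = 0.0  # 0.0 = closed, 1.0 = fully open
    # Camera level
    fov_degrees: float = 60.0
    focal_length_mm: float = 24.0


def sweep(base: RenderConfig, name: str, start: float, stop: float,
          steps: int) -> List[RenderConfig]:
    """Return configs that vary one parameter linearly and hold the rest fixed."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if name not in {f.name for f in fields(base)}:
        raise ValueError(f"unknown parameter: {name}")
    step = (stop - start) / (steps - 1)
    return [replace(base, **{name: start + i * step}) for i in range(steps)]


# Five configurations that differ only in lighting, from dim to fully lit.
configs = sweep(RenderConfig(), "light_intensity", 0.2, 1.0, 5)
```

Each configuration in `configs` would then be handed to whatever renderer produces the images, so that any change in model performance can be attributed to the single parameter being varied.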
The paper also demonstrates the advantages of BVS in different model evaluation and training applications, including parametrically controllable evaluation of the robustness of vision models to continuous changes in environmental parameters, systematic evaluation of scene understanding models (using rich visual annotations), and model training for new vision tasks.
BVS consists of two major parts: a data part and a customizable data generator built on top of it.
##Data part
The data part of BVS builds on the assets of BEHAVIOR-1K. It includes a total of 8,841 3D object models and 51 artist-designed indoor scenes, expanded to 1,000 scene instances. These models and scenes have a realistic appearance and cover a rich set of semantic categories. The research team also provides a script that lets users automatically generate more augmented scene instances.

Figure: Asset expansion of BEHAVIOR-1K.
##Customizable data generator
The customizable data generator lets users conveniently draw on the BVS data part to generate image datasets that meet their needs, such as indoor scenes under dim lighting. BVS gives the generated datasets high semantic diversity while meeting these requirements and preserving fidelity and physical plausibility. Specifically, users can control five aspects: camera position, lighting, object properties (such as size), object states (such as on or off), and spatial relationships between objects.

The researchers demonstrated the use of BVS-generated data in three application scenarios.

##Parametrically controllable evaluation of robustness under continuously changing environmental parameters
By generating data that varies continuously along a single dimension, the researchers systematically evaluate the robustness of vision models under that variation. For example, they generate images of the same scene with gradually increasing object occlusion to evaluate how vision models perform on partially occluded objects. After evaluating different SOTA models, the researchers found that existing models still perform poorly on data outside the common distribution. Because such data is difficult to obtain or label in the real world, these conclusions are hard to draw directly from real image datasets. BVS can therefore help researchers evaluate model robustness under the conditions they care about, so that they can better develop and improve their models.

Figure: Existing SOTA models still have room for improvement in robustness under changing conditions (such as camera elevation).

Figure: Performance of different detection models as five environmental parameters change continuously.

##Evaluating scene understanding models
Another major feature of the datasets generated by BVS is that they contain multi-modal ground-truth labels, such as depth, semantic segmentation, and object bounding boxes.
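As an illustration of what multi-modal labels on a single image can look like, the following sketch stores a depth map and an instance segmentation for one frame and derives bounding boxes and per-object mean depth from them. The `FrameLabels` record, its field names, and the toy values are assumptions made for this example; the actual BVS label format may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


@dataclass
class FrameLabels:
    """Hypothetical per-image label record; field names are illustrative."""
    depth: List[List[float]]       # per-pixel distance from the camera
    segmentation: List[List[int]]  # per-pixel instance id, 0 = background


def boxes_from_segmentation(seg: List[List[int]]) -> Dict[int, Box]:
    """Derive one tight bounding box per instance id from the mask."""
    boxes: Dict[int, Box] = {}
    for y, row in enumerate(seg):
        for x, inst in enumerate(row):
            if inst == 0:
                continue
            if inst not in boxes:
                boxes[inst] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[inst]
                boxes[inst] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


def mean_depth_per_instance(frame: FrameLabels) -> Dict[int, float]:
    """Average the depth values under each instance mask."""
    totals: Dict[int, List[float]] = {}
    for seg_row, depth_row in zip(frame.segmentation, frame.depth):
        for inst, d in zip(seg_row, depth_row):
            if inst != 0:
                totals.setdefault(inst, []).append(d)
    return {inst: sum(v) / len(v) for inst, v in totals.items()}


# Toy 3x4 frame with two objects (ids 1 and 2); values are made up.
frame = FrameLabels(
    depth=[[5.0, 2.0, 2.0, 5.0],
           [5.0, 2.0, 2.0, 3.0],
           [5.0, 5.0, 5.0, 3.0]],
    segmentation=[[0, 1, 1, 0],
                  [0, 1, 1, 2],
                  [0, 0, 0, 2]],
)
boxes = boxes_from_segmentation(frame.segmentation)
depths = mean_depth_per_instance(frame)
```

Because every label is defined on the same pixel grid, predictions for different tasks can be scored against one shared image rather than against separate datasets.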
These labels allow researchers to use BVS-generated data to evaluate prediction models for different tasks on the same images. The research team evaluated SOTA models on four tasks: open-vocabulary detection, open-vocabulary segmentation, depth estimation, and point cloud reconstruction. They found that the relative ranking of the models on the BVS dataset is consistent with their ranking on the corresponding real-data benchmarks. This suggests that the high-quality data generated by BVS faithfully reflects real-world data, and the researchers hope that such datasets can promote the development of multi-task prediction models.

In the open-source code, the research team also provides a script that helps users sample camera trajectories in a scene.

Figure: The researchers collected many scene traversal videos to evaluate scene understanding models.

Holistic scene understanding dataset: the researchers generated a large number of traversal videos in representative scenes, each containing more than 10 camera trajectories. For each image, BVS generates a variety of labels (e.g., scene graph, segmentation mask, depth map).

Figure: The relative performance order of SOTA models on BVS data is consistent with that on real-task benchmarks.

##Training models for new vision tasks
BVS data generation is not limited to model evaluation. For tasks whose data is difficult to collect or label in real-world settings, BVS data can also be used for model training. The authors used BVS to generate 12.5k images and trained an object spatial relationship and state prediction model on these images alone. Without using any real data for training, this model achieved an F1 score of 0.839 in real scenes, demonstrating strong simulation-to-real transfer.
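For readers unfamiliar with the metric, the sketch below shows how an F1 score is computed for binary predictions such as "is the drawer open?" or "is the cup on the table?". The labels are made-up toy values rather than data from the paper, and the article does not say how the authors averaged scores across relation and state types, so this covers only the single-label binary case.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: 1 = state holds (e.g. "open"), 0 = it does not.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
score = f1_score(y_true, y_pred)  # 3 TP, 1 FP, 1 FN -> 0.75
```

In a simulation-to-real setting, `y_pred` would come from a model trained only on synthetic images and `y_true` from annotations on real photographs.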
Figure: Illustration of the simulation-generated training dataset and the real test dataset.

Figure: Object spatial relationship and state prediction model trained using data generated by BVS.

##Summary
BVS provides a powerful set of tools and resources that gives computer vision researchers a new way to generate customized synthetic datasets. By systematically controlling and adjusting the parameters of the data generation process, researchers can evaluate and improve computer vision models more comprehensively, laying a solid foundation for future research and applications.