Let the robot sense your 'Here you are': the Tsinghua team uses millions of scenarios to create universal human-machine handover
Researchers from the Institute for Interdisciplinary Information Sciences at Tsinghua University have proposed a framework called "GenH2R", which aims to let robots learn a generalizable, vision-based human-to-robot handover policy. The policy allows a robot to reliably receive objects of diverse shapes moving along complex trajectories, opening up new possibilities for human-robot interaction. The work is an important step for the field and gives robots greater flexibility and adaptability in real-life scenarios.
With the advent of the era of embodied intelligence (Embodied AI), we expect intelligent agents to actively interact with their environment. Integrating robots into human living spaces and having them interact with people (Human-Robot Interaction) has therefore become crucial. We need to think about how to understand human behavior and intentions, meet human needs in the way people expect, and put humans at the center of embodied intelligence (Human-Centered Embodied AI). One key skill is generalizable human-to-robot handover, which enables robots to cooperate with humans on a variety of common daily tasks such as cooking, home organization, and furniture assembly.
The explosive development of large models suggests that large-scale learning from massive amounts of high-quality data is a viable path toward general intelligence. Can a general human-to-robot handover skill likewise be obtained from massive robot data and large-scale policy imitation? Large-scale interactive learning between robots and humans in the real world, however, is dangerous and expensive, and the robot could easily harm the person.
A more practical alternative is to train in a simulation environment, using character simulation and dynamic grasping motion planning to automatically provide a large amount of diverse robot learning data, and then transfer what is learned to real robots. This learning-based approach, known as "Sim-to-Real Transfer", can significantly improve a robot's ability to collaborate and interact with humans while offering higher reliability.
The "GenH2R" framework was therefore proposed, approaching the problem from three angles: Simulation, Demonstration, and Imitation. For the first time, it lets a robot learn, in an end-to-end fashion, a handover policy that generalizes across arbitrary grasping modes, handover trajectories, and object geometries. Specifically, it 1) provides millions of easily generated, diverse, and complex simulated handover scenes in the "GenH2R-Sim" environment, 2) introduces an automated pipeline for generating expert demonstrations with vision-action correlation, and 3) uses an imitation learning method based on 4D information (point cloud sequences over time) with prediction as an auxiliary task.
Compared with the SOTA method (a CVPR 2023 Highlight), GenH2R improves the average success rate on various test sets by 14%, shortens handover time by 13%, and is more robust in real-robot experiments.
A. Simulation environment (GenH2R-Sim)
To generate high-quality, large-scale data of human hands handing over objects, the GenH2R-Sim environment models the scene in terms of both grasping poses and motion trajectories.
For grasping poses, GenH2R-Sim imports rich 3D object models from ShapeNet, selects 3,266 everyday objects suitable for handover, and uses a dexterous grasp generation method (DexGraspNet) to produce a total of one million scenes of human hands grasping objects. For motion trajectories, GenH2R-Sim uses several control points to generate multiple smooth Bézier curves and adds rotations of the hand and object to simulate the varied, complex motion trajectories that occur when handing objects over.
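As a rough illustration of how such trajectories can be synthesized, the sketch below samples a few control points, evaluates a Bézier curve with De Casteljau's algorithm, and attaches a simple rotation about a random axis. The function names, number of control points, and parameter ranges are illustrative assumptions, not the actual GenH2R-Sim implementation.

```python
import numpy as np

def bezier_point(control_points, t):
    """Evaluate a Bezier curve at parameter t via De Casteljau's algorithm."""
    pts = np.asarray(control_points, dtype=float)
    while len(pts) > 1:
        pts = (1.0 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

def sample_hand_trajectory(num_control=4, num_steps=100, workspace=0.5, seed=None):
    """Sample a smooth hand trajectory: positions from a random Bezier curve,
    orientations from a constant angular velocity about a random axis.
    Illustrative only: GenH2R-Sim composes several such segments plus object
    rotations; thresholds and ranges here are placeholders."""
    rng = np.random.default_rng(seed)
    control_points = rng.uniform(-workspace, workspace, size=(num_control, 3))
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angular_speed = rng.uniform(0.0, np.pi / 4)  # rad per step, assumed range

    positions, angles = [], []
    for i in range(num_steps):
        t = i / (num_steps - 1)
        positions.append(bezier_point(control_points, t))
        angles.append(angular_speed * i)
    return np.stack(positions), axis, np.array(angles)

positions, axis, angles = sample_hand_trajectory(seed=0)
print(positions.shape)  # (100, 3)
```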
With one million scenes, GenH2R-Sim far exceeds the latest prior work not only in the number of motion trajectories (1 million vs. 1,000) and objects (3,266 vs. 20), but also in introducing interaction behavior close to the real situation (for example, when the robot arm gets close enough to the object, the human stops moving and waits for the handover to finish) rather than simple trajectory playback. Although simulated data is not perfectly realistic, experiments show that large-scale simulated data is more conducive to learning than small-scale real data.
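The "stop and wait when the gripper is near" behavior amounts to a small piece of simulation logic. Below is a minimal sketch of that idea; the distance threshold and the step interface are assumed placeholders rather than GenH2R-Sim's actual API.

```python
import numpy as np

STOP_DISTANCE = 0.10  # metres; illustrative threshold, not GenH2R-Sim's value

def step_hand(hand_pose, trajectory, step_idx, gripper_pos):
    """Advance the simulated hand one step along its scripted trajectory,
    unless the gripper is close enough, in which case the hand pauses
    and waits for the handover to finish."""
    hand_pos = hand_pose[:3]
    if np.linalg.norm(gripper_pos - hand_pos) < STOP_DISTANCE:
        return hand_pose, step_idx          # freeze: wait for the robot
    next_idx = min(step_idx + 1, len(trajectory) - 1)
    return trajectory[next_idx], next_idx   # otherwise follow the trajectory
```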
B. Large-scale generation of distillation-friendly expert demonstrations
Based on the large-scale hand and object motion trajectory data, GenH2R automatically generates a large number of expert demonstrations. The "experts" GenH2R draws on are improved motion planners (such as OMG Planner). These methods are non-learning, optimization-based controllers that do not rely on visual point clouds and typically require privileged scene state (such as the target grasp pose on the object). To ensure that the downstream visual policy network can distill useful information, the key is to guarantee that the expert demonstrations exhibit vision-action correlation. If the final grasp location is known during planning, the arm can ignore vision, plan straight to that final position, and simply wait there, which may leave the object outside the robot camera's view; such demonstrations do not help the downstream visual policy. Conversely, if the arm replans frequently based on the object's current position, its motion may become discontinuous and contorted, making a reasonable grasp impossible.
To generate distillation-friendly expert demonstrations, GenH2R introduces Landmark Planning. The hand's motion trajectory is divided into multiple segments according to trajectory smoothness and distance, with landmarks marking the segment boundaries. Within each segment the hand trajectory is smooth and the expert plans toward the landmark point. This approach ensures both vision-action correlation and action continuity.
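A rough sketch of the segmentation idea, assuming landmarks are placed whenever the accumulated path length or the local change in direction exceeds a budget; the thresholds, function names, and the injected planner callback are hypothetical, not the values or interfaces used in GenH2R.

```python
import numpy as np

def select_landmarks(hand_positions, max_segment_length=0.15, max_turn_angle=np.pi / 6):
    """Split a hand trajectory (N, 3) into segments, returning landmark indices.
    A new landmark is placed whenever the accumulated path length exceeds a
    distance budget or the local direction change exceeds an angle budget."""
    landmarks = [0]
    accumulated = 0.0
    for i in range(1, len(hand_positions) - 1):
        prev_dir = hand_positions[i] - hand_positions[landmarks[-1]]
        next_dir = hand_positions[i + 1] - hand_positions[i]
        accumulated += np.linalg.norm(next_dir)
        cos_angle = np.dot(prev_dir, next_dir) / (
            np.linalg.norm(prev_dir) * np.linalg.norm(next_dir) + 1e-8)
        if accumulated > max_segment_length or cos_angle < np.cos(max_turn_angle):
            landmarks.append(i)
            accumulated = 0.0
    landmarks.append(len(hand_positions) - 1)
    return landmarks

def expert_plan(hand_positions, plan_to_landmark):
    """Within each segment, call a (non-learning) motion planner toward the
    segment's landmark instead of the final handover pose."""
    landmarks = select_landmarks(hand_positions)
    plans = []
    for start, end in zip(landmarks[:-1], landmarks[1:]):
        plans.append(plan_to_landmark(hand_positions[end]))  # planner is injected by the caller
    return plans
```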
C. Prediction-assisted 4D imitation learning network
Building on the large-scale expert demonstrations, GenH2R uses imitation learning to build a 4D policy network that factorizes the observed point cloud sequence into geometry and motion. For each frame, the pose transformation relative to the previous frame's point cloud is estimated with the iterative closest point (ICP) algorithm to obtain per-point flow, so that every frame's point cloud carries motion features. PointNet then encodes each frame, and the network decodes not only the 6D egocentric action that is ultimately needed but also a prediction of the object's future pose, strengthening the policy's ability to anticipate hand and object motion.
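The sketch below shows one way such a flow-augmented, PointNet-style policy with an auxiliary future-pose head could be wired up. The layer sizes, head definitions, and input layout (xyz concatenated with a 3-D flow estimate per point) are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HandoverPolicy(nn.Module):
    """Minimal PointNet-style policy sketch: each input point carries xyz plus a
    3-D flow estimate (e.g. from frame-to-frame ICP), and the network outputs a
    6-DoF egocentric action plus an auxiliary future object-pose prediction."""

    def __init__(self, point_dim=6, feat_dim=512):
        super().__init__()
        self.point_mlp = nn.Sequential(          # shared per-point MLP (PointNet-style)
            nn.Linear(point_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        self.action_head = nn.Sequential(        # 6-DoF egocentric action (translation + rotation)
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 6),
        )
        self.future_pose_head = nn.Sequential(   # auxiliary prediction of the object's future pose
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 7),  # xyz + quaternion
        )

    def forward(self, points_with_flow):
        # points_with_flow: (batch, num_points, 6) = xyz concatenated with per-point flow
        feats = self.point_mlp(points_with_flow)   # (B, N, feat_dim)
        global_feat = feats.max(dim=1).values      # symmetric max-pool over points
        return self.action_head(global_feat), self.future_pose_head(global_feat)

policy = HandoverPolicy()
obs = torch.randn(2, 1024, 6)                      # a fake batch of flow-augmented point clouds
action, future_pose = policy(obs)
print(action.shape, future_pose.shape)             # torch.Size([2, 6]) torch.Size([2, 7])
```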
Unlike more complex 4D backbones (such as Transformer-based ones), this architecture offers fast inference, which suits a low-latency human-robot interaction scenario like object handover, while still exploiting temporal information effectively, striking a balance between simplicity and effectiveness.
A. Simulation environment experiment
GenH2R was compared with the SOTA method under various settings. Relative to training on small-scale real data, training on large-scale simulated data from GenH2R-Sim yields clear advantages: across the test sets, the success rate increases by 14% on average and the handover time is shortened by 13%.
On the real-data test set s0, GenH2R successfully hands over more complex objects and adjusts its pose in advance, avoiding frequent last-moment adjustments once the gripper is already close to the object.
On the simulated test set t0 (introduced by GenH2R-Sim), GenH2R predicts the object's future pose and thus follows a more reasonable approach trajectory.
On the real-data test set t1 (introduced by GenH2R-Sim from HOI4D, about 7 times larger than the s0 test set used in previous work), GenH2R generalizes to unseen real-world objects with different geometries.
B. Real-robot experiments
GenH2R also deploys the learned policy on a real-world robot arm, completing the "sim-to-real" leap.
For more complex motion trajectories (such as rotations) and more complex geometries, GenH2R's policy shows stronger adaptability and generalizability.
GenH2R has been evaluated in real-robot tests and user studies on a variety of handover objects, demonstrating strong robustness.
For more information on experiments and methods, please refer to the paper homepage.
The paper comes from Tsinghua University's 3DVICI Lab, Shanghai Artificial Intelligence Laboratory, and Shanghai Qi Zhi Institute. Its authors are Tsinghua University students Wang Zifan (co-first author), Chen Junyu (co-first author), Chen Ziqing, and Xie Pengwei, advised by Yi Li and Chen Rui.
Tsinghua University's Three-Dimensional Vision Computing and Machine Intelligence Laboratory (3DVICI Lab) is an artificial intelligence laboratory under the Institute for Interdisciplinary Information Sciences at Tsinghua University, founded and directed by Professor Yi Li. 3DVICI Lab focuses on cutting-edge problems in general 3D vision and intelligent robot interaction; its research directions cover embodied perception, interaction planning and generation, and human-robot collaboration, with close ties to application fields such as robotics, virtual reality, and autonomous driving. The team's goal is to enable intelligent agents to understand and interact with the three-dimensional world, and its results have been published at major top computer-science conferences and in leading journals.