Wear a VR helmet to teach a robot to grasp, and the robot learns it on the spot
In recent years, robotics has seen many interesting developments, such as robot dogs that can dance and play football and bipedal robots that carry objects. Typically, these robots rely on control policies learned from sensory input. This approach avoids the challenges of developing state estimation modules, modeling object properties, and tuning controller gains, all of which require significant domain expertise. Even though much progress has been made, learning bottlenecks still make it difficult for robots to perform arbitrary tasks and achieve general-purpose goals.
A core question in robot learning is: how do we collect training data for robots? One approach is to have the robot collect data itself through a self-supervised data collection strategy. While this approach is relatively robust, it often requires thousands of hours of interaction with the real world, even for relatively simple manipulation tasks. Another is to train on simulated data and then transfer to real robots (Sim2Real). This allows robots to learn complex behaviors orders of magnitude faster. However, setting up a simulated robotic environment and specifying simulator parameters often requires extensive domain expertise.
In fact, there is a third method. To collect training data, you can also ask human teachers to provide demonstrations and then train the robot to imitate them. This imitation approach has recently shown great potential on a variety of challenging manipulation problems. However, most of these works suffer from a fundamental limitation: it is difficult to collect high-quality demonstration data for robots.
To address these issues, researchers from New York University and Meta AI proposed HOLO-DEX, a new framework for collecting demonstration data and training dexterous robots. It uses a VR headset (such as the Quest 2) to place human teachers in an immersive virtual world. In this virtual world, teachers can see what the robot "sees" through the robot's eyes and control the Allegro robot hand via the headset's built-in pose detectors.
The effect resembles a human teaching the robot the movements "step by step".
HOLO-DEX lets humans seamlessly provide high-quality demonstration data to robots through a low-latency observation feedback system. It has the following three advantages:
Paper link: https://arxiv.org/pdf/2210.06463.pdf
Project link: https://holo-dex.github.io/
Code link: https://github.com/SridharPandian/Holo-Dex
To evaluate the performance of HOLO-DEX, the study conducted experiments on six tasks requiring dexterity, including in-hand object manipulation and unscrewing a bottle cap with one hand. The study found that human teachers using HOLO-DEX collected demonstrations 1.8 times faster than with previous single-image teleoperation work. On 4 of the 6 tasks, the success rate of the policies learned with HOLO-DEX exceeds 90%. Additionally, the study found that the dexterous policies learned through HOLO-DEX can generalize to new, unseen target objects.
Overall, the contributions of this study include:
HOLO-DEX Architecture Overview
As shown in Figure 1 below, HOLO-DEX operates in two stages. In the first stage, a human teacher uses a virtual reality (VR) headset to provide demonstrations to the robot. This stage includes creating a virtual world for teaching, estimating the teacher's hand pose, retargeting the teacher's hand pose to the robot hand, and finally controlling the robot hand. After some demonstrations have been collected in the first stage, the second stage of HOLO-DEX learns visual policies to solve the demonstrated tasks.
The study used the Meta Quest 2 VR headset, which has a resolution of 1832 × 1920 and a refresh rate of 72 Hz, to place human teachers in the virtual world. The base version of the headset costs $399 and is relatively light at 503 grams, which makes giving demonstrations easier and more comfortable for teachers. In addition, the Quest 2 API allows the creation of custom mixed reality worlds that visualize the robotic system alongside diagnostic panels in VR.
Hand Pose Estimation with a VR Headset
Compared with previous work on dexterous teleoperation, using a VR headset has three benefits for estimating the human teacher's hand pose. First, since the Quest 2 uses 4 monochrome cameras, its hand pose estimator is much more robust than a single-camera estimator. Second, because the cameras are internally calibrated, they do not need the specialized calibration procedures required by previous multi-camera teleoperation frameworks. Third, since the hand pose estimator is integrated into the device, it can stream poses in real time at 72 Hz. Previous research has pointed out that a major challenge in dexterous teleoperation is acquiring hand poses with high accuracy and frequency, and HOLO-DEX significantly simplifies this problem by using a commercial-grade VR headset.
Hand Pose Retargeting
Next, the teacher's hand pose extracted from VR needs to be retargeted to the robot hand. This first involves calculating the angle of each joint of the teacher's hand. A direct retargeting method is then to "command" the robot's joints to move to the corresponding angles. In the study, this method worked for all fingers except the thumb: the shape of the Allegro robot hand does not exactly match that of a human hand, so direct angle mapping does not fully work for the thumb. To solve this problem, the study maps the spatial coordinates of the teacher's thumb tip to the robot's thumb tip and then calculates the thumb's joint angles with an inverse kinematics solver. Note that since the Allegro hand does not have a pinky finger, the study ignored the angles of the teacher's pinky finger.
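The two retargeting steps described above can be illustrated with a minimal sketch. This is not the study's implementation: the function names are hypothetical, the keypoints are assumed to be 3D points ordered from finger base to tip, and the thumb is simplified to a planar two-link chain with a closed-form solution, whereas the real Allegro thumb has more joints and the article does not say which inverse kinematics solver was used.

```python
import math

def joint_angle(p_prev, p_joint, p_next):
    """Bend angle (radians) at p_joint between the two adjacent bone segments."""
    u = [b - a for a, b in zip(p_prev, p_joint)]
    v = [b - a for a, b in zip(p_joint, p_next)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def finger_angles(keypoints):
    """Bend angles at every interior keypoint of one finger (base to tip)."""
    return [joint_angle(keypoints[i - 1], keypoints[i], keypoints[i + 1])
            for i in range(1, len(keypoints) - 1)]

def two_link_fk(q1, q2, l1, l2):
    """Tip position of a planar two-link chain (simplified thumb, an assumption)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

def two_link_ik(x, y, l1, l2):
    """Closed-form inverse kinematics: joint angles that place the tip at (x, y)."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("fingertip target is out of reach")
    q2 = math.acos(c2)  # picks the solution with a non-negative second angle
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

# Direct retargeting: send the teacher's bend angles to the robot finger joints.
index_finger = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
robot_joint_targets = finger_angles(index_finger)

# Thumb retargeting: match the tip position, then solve for the joint angles.
thumb_q1, thumb_q2 = two_link_ik(0.05, 0.04, l1=0.05, l2=0.04)
```

Feeding the solved thumb angles back through `two_link_fk` should reproduce the requested tip position, which is a quick sanity check for any inverse kinematics routine.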
The entire retargeting process requires no calibration or teacher-specific adjustment to collect demonstrations. However, the study found that thumb retargeting could be improved by finding a specific mapping from the teacher's thumb to the robot's thumb. The whole process is computationally cheap and can transmit the desired robot hand pose at 60 Hz.
The Allegro hand is controlled asynchronously through the ROS communication framework. Given the robot hand joint positions computed by the retargeting procedure, the study uses a PD controller to output the required torques at 300 Hz. To reduce steady-state error, it uses a gravity compensation module to compute offset torques. In latency tests, the study measured sub-100 millisecond latency when the VR headset was on the same local network as the robot hand. Low latency and low error are critical to HOLO-DEX because they allow a human teacher to teleoperate the robot hand intuitively.
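To see why gravity compensation reduces steady-state error, here is a minimal single-joint sketch. Everything numeric in it is an assumption made for illustration: the joint is modeled as a unit-inertia pendulum, and the gains `kp`, `kd` and the `gravity_gain` value are hypothetical, since the article gives only the 300 Hz control rate and not the study's gains or dynamics model.

```python
import math

def pd_torque(q_des, q, dq, kp, kd, tau_gravity=0.0):
    """PD control law with an optional gravity-compensation offset torque."""
    return kp * (q_des - q) - kd * dq + tau_gravity

def simulate(q_des, compensate, kp=50.0, kd=10.0, gravity_gain=2.0,
             rate_hz=300, seconds=3.0):
    """Simulate one joint (unit-inertia pendulum) and return its final angle."""
    dt = 1.0 / rate_hz
    q, dq = 0.0, 0.0
    for _ in range(int(seconds * rate_hz)):
        gravity = gravity_gain * math.sin(q)  # torque that gravity exerts against the joint
        offset = gravity if compensate else 0.0
        tau = pd_torque(q_des, q, dq, kp, kd, offset)
        ddq = tau - gravity                   # unit inertia: acceleration equals net torque
        dq += ddq * dt                        # semi-implicit Euler integration
        q += dq * dt
    return q

error_plain = abs(1.0 - simulate(1.0, compensate=False))
error_compensated = abs(1.0 - simulate(1.0, compensate=True))
```

In this toy model the plain PD controller settles slightly short of the target because the proportional term must keep pushing against gravity, while the compensated controller settles essentially on the target.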
While controlling the robot hand, human teachers can see the robot's state update in real time (60 Hz). This allows the teacher to correct execution errors of the robot hand. During teaching, the study recorded observations from three RGB-D cameras and the robot's motion information at a frequency of 5 Hz. The recording frequency had to be reduced because of the large data footprint and the bandwidth required to record multiple cameras.
Once the data has been collected, the second stage begins: HOLO-DEX trains visual policies on the data. The study adopts the nearest-neighbor imitation algorithm (INN) for learning. In previous work, INN was shown to produce dexterous state-based policies on the Allegro hand. HOLO-DEX goes a step further and demonstrates that these visual policies generalize to novel objects in a variety of dexterous manipulation tasks.
To choose a learning algorithm for obtaining low-dimensional embeddings, the study tried several state-of-the-art self-supervised learning algorithms and found that BYOL gave the best nearest-neighbor results, so BYOL was selected as the base self-supervised learning method.
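The idea of nearest-neighbor imitation over learned embeddings can be sketched as follows. This is a simplified illustration rather than the study's code: the class name is hypothetical, the embeddings are assumed to be precomputed by some image encoder (BYOL in the study; no encoder is included here), and Euclidean distance with a single nearest neighbor is an assumption.

```python
import math

class NearestNeighborPolicy:
    """Return the action recorded with the demonstration frame closest in embedding space."""

    def __init__(self, demo_embeddings, demo_actions):
        if not demo_embeddings or len(demo_embeddings) != len(demo_actions):
            raise ValueError("need one action per demonstration embedding")
        self.demo_embeddings = [tuple(e) for e in demo_embeddings]
        self.demo_actions = list(demo_actions)

    @staticmethod
    def _distance(a, b):
        # Euclidean distance between two embedding vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def act(self, embedding):
        # Index of the stored embedding nearest to the current observation.
        nearest = min(range(len(self.demo_embeddings)),
                      key=lambda i: self._distance(self.demo_embeddings[i], embedding))
        return self.demo_actions[nearest]

# Toy data: three demonstration frames with 2-D embeddings and joint-angle actions.
policy = NearestNeighborPolicy(
    demo_embeddings=[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    demo_actions=[[0.1, 0.1], [0.2, 0.5], [0.3, 0.9]],
)
action = policy.act((0.9, 0.1))  # closest to the second demonstration frame
```

Because the policy only looks up stored actions, its quality depends heavily on how well the embedding space groups visually similar states, which is why the choice of self-supervised encoder matters.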
Table 1 below shows that HOLO-DEX collects successful demonstrations 1.8 times faster than DIME, the earlier single-image teleoperation approach. For 3 of the 6 tasks, which require precise 3D motion, the study found that single-image teleoperation was not sufficient to collect even a single demonstration.
The study examined the performance of various imitation learning policies on the dexterous tasks. The success rate for each task is shown in Table 2 below.
Since the policies proposed in this study are vision-based and do not require explicit estimation of object state, they are compatible with objects not seen during training. The study evaluated its in-hand manipulation policies, trained to perform the planar rotation, object flipping, and can spinning tasks, on objects with a variety of visual appearances and geometries, as shown in Figure 5 below.
In addition, the study tested the performance of HOLO-DEX with datasets of different sizes on different tasks. The results are visualized in the figure below.