How deep learning technology solves the problem of robots handling deformable objects
Translator | Li Rui
Reviewer | Sun Shujuan
For humans, handling deformable objects is not much more difficult than handling rigid ones. People naturally learn to shape, fold, and manipulate them in different ways while still being able to recognize them.
But for robotics and artificial intelligence systems, manipulating deformable objects is a huge challenge. For example, a robot must take a series of steps to shape dough into a pizza crust. As the dough changes shape, the robot must record and track it while choosing the right tool for each step of the job. These are challenging tasks for current artificial intelligence systems, which are more stable when dealing with rigid objects whose states are more predictable.
Now, a new deep learning technique developed by researchers at MIT, Carnegie Mellon University, and UC San Diego promises to make robotic systems more stable when handling deformable objects. The technology, called DiffSkill, uses deep neural networks to learn simple skills and a planning module to combine those skills to solve tasks that require multiple steps and tools.
To manipulate an object, an artificial intelligence system must be able to detect and define its state and predict what it will look like in the future. For rigid objects this is a largely solved problem. With a good set of training examples, a deep neural network will be able to detect rigid objects from different angles. When deformable objects are involved, the space of possible states becomes far more complex.
Lin Xingyu, a doctoral student at Carnegie Mellon University and the lead author of the DiffSkill paper, said, "For a rigid object, we can use six numbers to describe its state: three numbers represent its XYZ coordinates, and another three numbers represent its orientation.
However, deformable objects such as dough or fabric have infinite degrees of freedom, making it more difficult to accurately describe their state. Furthermore, compared to rigid objects, the way they deform is also more difficult to model mathematically."
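The contrast in the quote can be made concrete with a small sketch. The class and function names below are illustrative assumptions, and the particle grid is only one common way to approximate a deformable body, not necessarily the representation used in DiffSkill.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RigidState:
    """Pose of a rigid object: three position and three orientation values."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float

    def as_vector(self) -> List[float]:
        return [self.x, self.y, self.z, self.roll, self.pitch, self.yaw]


def particle_state(particles: List[Tuple[float, float, float]]) -> List[float]:
    """Flatten a particle approximation of a deformable object into one vector."""
    return [coord for particle in particles for coord in particle]


# Hypothetical rolling pin pose (values are made up for illustration).
rolling_pin = RigidState(0.1, 0.0, 0.3, 0.0, 0.0, 1.57)

# A coarse 10 x 10 x 10 grid of particles standing in for a lump of dough.
dough = [
    (i * 0.01, j * 0.01, k * 0.01)
    for i in range(10)
    for j in range(10)
    for k in range(10)
]

rigid_dims = len(rolling_pin.as_vector())
dough_dims = len(particle_state(dough))
print(rigid_dims, dough_dims)
```

Even this coarse approximation needs thousands of numbers where the rigid object needs six, and a finer grid only widens the gap.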
The development of differentiable physics simulators has enabled the application of gradient-based methods to solve deformable object manipulation tasks. This is different from traditional reinforcement learning methods, which try to learn the dynamics of the environment and objects through pure trial-and-error interactions.
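To illustrate what a gradient-based method gains from a differentiable simulator, here is a minimal sketch. The one-dimensional "pressing" model, its constants, and the function names are all assumptions made for illustration; the derivative is written out by hand, whereas a real differentiable simulator such as PlasticineLab would supply it automatically.

```python
def simulate(force, steps=10, k=0.05, w0=1.0, w_max=3.0):
    """Toy pressing model: returns the final dough width and d(width)/d(force)."""
    w, dw = w0, 0.0
    for _ in range(steps):
        # Update rule: w_next = w + k * force * (1 - w / w_max).
        # Its derivative with respect to force follows from the chain rule.
        dw_next = dw + k * (1.0 - w / w_max) - k * force * dw / w_max
        w = w + k * force * (1.0 - w / w_max)
        dw = dw_next
    return w, dw


def optimize_force(target, force=0.0, lr=2.0, iters=200):
    """Gradient descent on the squared error between final and target width."""
    for _ in range(iters):
        w, dw = simulate(force)
        grad = 2.0 * (w - target) * dw  # d(loss)/d(force)
        force -= lr * grad
    return force


best_force = optimize_force(target=2.0)
final_width, _ = simulate(best_force)
print(best_force, final_width)
```

Each update uses the exact slope of the outcome with respect to the action, which is the information a pure trial-and-error learner has to estimate from many sampled interactions.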
DiffSkill is inspired by PlasticineLab, a differentiable physics simulator presented at the ICLR 2021 conference. PlasticineLab shows that differentiable simulators can help with short-term tasks.
PlasticineLab is a deformable object simulator based on differentiable physics. It is suitable for training gradient-based models.
But differentiable simulators still struggle with long-term problems that require multiple steps and the use of different tools. Artificial intelligence systems based on differentiable simulators also require knowledge of the complete simulation state and the relevant physical parameters of the environment. This is particularly limiting for real-world applications, where agents typically perceive the world through visual and depth-sensing data (RGB-D).
Lin Xingyu said, "We started asking if we could extract the steps required to complete a task into skills, and learn abstract concepts about skills so that we can link them to solve more complex tasks."
DiffSkill is a framework in which artificial intelligence agents learn skill abstractions using differentiable physical models and combine them to complete complex manipulation tasks.
Lin's past work has focused on using reinforcement learning to manipulate deformable objects such as cloth, rope, and liquids. For DiffSkill, he chose dough manipulation because of the challenges it presented.
He said, "Dough manipulation is particularly interesting because it is not easily accomplished with a robot gripper, but requires using different tools in sequence, which is something humans are good at but is less common for robots."
After training, DiffSkill can successfully complete a set of dough manipulation tasks using only RGB-D input.
DiffSkill trains neural networks to predict the feasibility of target states from the initial states and parameters obtained from differentiable physics simulators.
DiffSkill consists of two key components: a “neural skill abstractor” that uses neural networks to learn individual skills, and a “planner” for solving long-term tasks.
DiffSkill uses a differentiable physics simulator to generate training examples for the skill abstractor. These examples show how to use a single tool to achieve short-term goals, such as using a rolling pin to spread dough or using a spatula to move dough.
These examples are presented to the skill abstractor in the form of RGB-D videos. Given an image observation, the skill abstractor must predict whether the desired goal is feasible. The model learns and adjusts its parameters by comparing its predictions to actual results from the physics simulator.
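The training loop described here, predicting feasibility and correcting the prediction against simulator outcomes, can be sketched with a deliberately tiny model. Everything below is an assumption made for illustration: DiffSkill uses neural networks over RGB-D observations, while this sketch uses a one-feature logistic model and a hypothetical rule in place of a simulator rollout.

```python
import math
import random


def toy_simulator_reaches(start_width, goal_width, max_spread=1.0):
    """Hypothetical stand-in for a rollout: one rolling skill widens dough by at most max_spread."""
    return goal_width - start_width <= max_spread


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_feasibility(samples, lr=0.5, epochs=500):
    """Fit p(feasible | gap) = sigmoid(w * gap + b) with full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for gap, label in samples:
            error = sigmoid(w * gap + b) - label  # prediction minus simulator outcome
            grad_w += error * gap
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


random.seed(0)
samples = []
for _ in range(200):
    start = random.uniform(1.0, 2.0)
    goal = start + random.uniform(0.0, 2.0)
    label = 1.0 if toy_simulator_reaches(start, goal) else 0.0
    samples.append((goal - start, label))

w, b = train_feasibility(samples)


def predict_feasible(start_width, goal_width):
    """Estimated probability that a single skill can reach the goal."""
    return sigmoid(w * (goal_width - start_width) + b)


print(predict_feasible(1.0, 1.2), predict_feasible(1.0, 2.8))
```

The point of the sketch is the supervision signal: the labels come from simulated outcomes, so no human annotation is needed.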
Robotic manipulation of deformable objects such as dough requires long-term reasoning about the use of different tools. The DiffSkill approach leverages differentiable simulators to learn and combine skills for these challenging tasks.
Meanwhile, DiffSkill trains variational autoencoders (VAEs) to learn latent space representations of examples generated by physics simulators. VAEs retain important features and discard task-irrelevant information. By converting the high-dimensional image space into a latent space, VAEs play an important role in enabling DiffSkill to plan over longer horizons and predict outcomes from sensory observations.
One of the important challenges in training a VAE is ensuring that it learns the correct features and generalizes to the real world. In the real world, the composition of visual data is different from the data generated by a physics simulator. For example, the color of the rolling pin or cutting board is not relevant to the task, but the position and angle of the rolling pin and the position of the dough are.
Currently, the researchers are using a technique called "domain randomization," which randomizes irrelevant properties of the training environment, such as background and lighting, and preserves important features such as the position and orientation of tools. This makes the trained VAEs more stable when applied to the real world.
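A minimal sketch of domain randomization, assuming a scene described by a plain dictionary. The field names, value ranges, and the `randomize_scene` helper are hypothetical; in practice the randomization is applied inside the simulator's renderer.

```python
import random

# Fields that carry task information and must survive randomization (assumed names).
TASK_RELEVANT = ("tool_position", "tool_angle", "dough_position")


def randomize_scene(scene, rng):
    """Return a copy of the scene with only task-irrelevant appearance resampled."""
    randomized = dict(scene)
    randomized["background_rgb"] = tuple(rng.random() for _ in range(3))
    randomized["tool_rgb"] = tuple(rng.random() for _ in range(3))
    randomized["light_intensity"] = rng.uniform(0.5, 1.5)
    return randomized


base_scene = {
    "tool_position": (0.10, 0.00, 0.30),
    "tool_angle": 1.57,
    "dough_position": (0.00, 0.00, 0.05),
    "background_rgb": (1.0, 1.0, 1.0),
    "tool_rgb": (0.6, 0.4, 0.2),
    "light_intensity": 1.0,
}

rng = random.Random(0)
variants = [randomize_scene(base_scene, rng) for _ in range(5)]
for variant in variants:
    print(variant["background_rgb"], variant["light_intensity"])
```

Because the appearance fields change on every sample while the tool and dough poses stay fixed, a model trained on such variants has less reason to rely on color or lighting.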
Lin Xingyu said, "It is not easy to do this because we need to cover all possible differences between simulation and the real world (called sim2real gap). A better way is to use 3D point cloud as the scene representation, which is easier to transfer from simulation to the real world. In fact, we are developing a follow-up project using point clouds as input."
DiffSkill uses the planning module to evaluate different skill combinations and sequences that can achieve a goal.
Once the skill abstractor is trained, DiffSkill uses the planner module to solve long-term tasks. The planner must determine the number and sequence of skills required to get from the initial state to the goal.
The planner iterates through possible skill combinations and their intermediate results. Variational autoencoders come in handy here. Rather than predicting complete image results, DiffSkill uses VAEs to predict latent-space outcomes for intermediate steps toward the final goal.
The combination of skill abstraction and latent space representation makes planning trajectories from initial states to goals more computationally efficient. In fact, the researchers did not need to optimize the search procedure and instead conducted an exhaustive search across all combinations.
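The exhaustive search over skill sequences can be sketched as follows. This is a simplified illustration under stated assumptions: the two-number state, the skill functions, and the cost are invented for the example, and the real planner also optimizes intermediate latent goals with gradients rather than applying fixed skill effects.

```python
from itertools import product


def lift_to_board(state):
    """Hypothetical spatula skill: move the dough onto the cutting board."""
    on_board, width = state
    if on_board:
        return None  # infeasible: the dough is already on the board
    return (1, width)


def roll(state):
    """Hypothetical rolling-pin skill: widen the dough, only possible on the board."""
    on_board, width = state
    if not on_board:
        return None  # infeasible: nothing to roll against
    return (1, width + 0.5)


SKILLS = {"lift_to_board": lift_to_board, "roll": roll}


def rollout(state, names):
    """Apply skills in order; return None as soon as one is infeasible."""
    for name in names:
        state = SKILLS[name](state)
        if state is None:
            return None
    return state


def plan(start, goal, max_steps=3):
    """Try every skill sequence up to max_steps and keep the one closest to the goal."""
    best_names, best_cost = None, float("inf")
    for steps in range(1, max_steps + 1):
        for names in product(SKILLS, repeat=steps):
            final = rollout(start, names)
            if final is None:
                continue
            cost = abs(final[0] - goal[0]) + abs(final[1] - goal[1])
            if cost < best_cost:
                best_names, best_cost = names, cost
    return best_names, best_cost


best_plan, best_cost = plan(start=(0, 1.0), goal=(1, 2.0))
print(best_plan, best_cost)
```

With K skills and a horizon of H steps the search visits on the order of K^H sequences, which stays small when both numbers are small, as they are for skill-level planning.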
Lin Xingyu said, "Since we are planning over skills, the computation is not too heavy and the horizon is not very long. This exhaustive search eliminates the need to design a sketch for the planner, and it may lead to novel solutions that the designer did not consider, in a more general way, although we did not observe this in the limited tasks we attempted. In addition, more sophisticated search techniques can be applied."
The DiffSkill paper states, "Optimization of each skill combination can be completed efficiently in about 10 seconds on a single NVIDIA 2080Ti GPU."
The researchers tested the performance of DiffSkill against several baseline methods that have been applied to deformable objects, including two model-free reinforcement learning algorithms and a trajectory optimizer using only a physics simulator.
The models were tested on multiple tasks requiring multiple steps and tools. In one of the tasks, for example, the AI agent had to lift the dough with a spatula, place it on a cutting board, and then spread it out with a rolling pin.
Research results show that DiffSkill is significantly better than other technologies in solving long-term, multi-tool tasks using only sensory information. Experiments show that after being well trained, DiffSkill's planner can find a good intermediate state between the initial state and the target state, and find a suitable skill sequence to solve the task.
DiffSkill's planner can predict intermediate steps very accurately.
Lin Xingyu said, "One of the main points is that a set of skills can provide a very important temporal abstraction that allows us to reason over the long term. This is also similar to the way humans deal with different tasks: thinking at different levels of temporal abstraction, rather than thinking about what to do in the next second."
However, DiffSkill also has limitations. For example, DiffSkill's performance dropped significantly on one of the tasks that required three-stage planning (although it still outperformed other techniques). Lin Xingyu also mentioned that in some cases, the feasibility predictor can produce false positives. The researchers believe that learning better latent spaces can help solve this problem.
The researchers are also exploring other directions for improving DiffSkill, including a more efficient planning algorithm that can be used for longer tasks.
Lin Xingyu expressed the hope that one day he can use DiffSkill on a real pizza-making robot. He said, "We are still far from that. There are various challenges in control, sim2real transfer, and safety. But we are now more confident about trying some long-term tasks."
Original text Title: This deep learning technique solves one of the tough challenges of robotics, Author: Ben Dickson