Google robots achieve interactive language with an accuracy of 93.5%, and the amount of open-source data increases tenfold
Look carefully: the man in front of you is continuously giving natural-language instructions to a robot, such as "Push the green star between the red blocks" or "Move the blue block to the lower-left corner", and the robot completes each command in real time.
Since the 1960s, robotics researchers have been trying to make robots understand natural-language instructions and perform the corresponding actions.
Ideally, future robots will react in real time to any relevant task that users can describe in natural language.
Especially in open human environments, users may need to customize a robot's behavior as it unfolds, giving quick corrections such as "stop, move the arm up a little" or specifying constraints such as "move slowly to the right".
In addition, real-time language makes it easier for people and robots to collaborate on complex long-horizon tasks: a person can guide the robot iteratively and interactively, offering occasional verbal feedback.
Work in this area can be roughly characterized by the following three requirements:
1. The robot body needs to exist in the real world;
2. Able to respond to a large number of rich natural language commands;
3. Able to execute interactive language commands, that is, to accept new natural-language instructions during task execution.
Regarding the third point, progress on interactivity in robotics is still slow, which leaves robots without a "sense of being alive".
Recently, Google published a paper proposing a brand-new framework for producing real-world robots that execute natural-language instructions interactively in real time, and the related dataset, environment, benchmark, and policies are all publicly available.
Paper link: https://arxiv.org/pdf/2210.06407.pdf
Project homepage: https://interactive-language.github.io/
By running behavior-cloning training on a dataset of hundreds of thousands of language-annotated trajectories, the resulting policy can skillfully execute an order of magnitude more commands than previous work achieved. In the real world, the researchers estimate that the method has a 93.5% success rate across 87,000 distinct natural-language strings.
The same policy can also be guided by humans in real time via natural language to solve a wide range of precise, long-horizon rearrangement goals, such as "make a smiling face out of building blocks".
The dataset released with the paper includes nearly 600,000 language-labeled trajectories, an order of magnitude larger than previously available datasets.
Interactive Language: Real-time Conversation with the Robot
To integrate robots into the real world, the most important capability is handling open natural-language instructions; but from a machine-learning perspective, getting robots to learn open-vocabulary language is a huge challenge.
Open-vocabulary models need to handle a large number of tasks, including small corrective instructions. Existing multi-task learning setups rely on carefully designed imitation-learning datasets or complex reinforcement-learning reward functions to drive learning for each task, and a predefined task set built this way is bound to stay small.
Therefore, a key question for open-vocabulary tasks is: how do you scale robot data collection to cover thousands of behaviors in real environments, and how do you connect all of that behavior to the natural-language instructions an end user might actually give?
In Interactive Language, the key to the large-scale imitation-learning framework Google proposes is a scalable process for creating large, diverse, language-conditioned robot demonstration datasets.
Unlike previous setups, where all skills were defined up front and a curated demonstration of each skill was then collected, the researchers collected data continuously across multiple robots, without scene resets or low-level skill segmentation.
All data, including failed episodes (such as knocking blocks off the table), goes through a hindsight language relabeling process before being paired with text.
In this process, annotators watch long robot videos, identify as many behaviors as possible, mark the start and end time of each behavior, and describe each segment in free-form natural language.
Most importantly, in contrast to previous bootstrapped setups, all the skills used for training emerge bottom-up from the data itself rather than being predefined by the researchers.
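As a rough illustration, the relabeling step described above can be sketched as a simple data transformation: raw annotator marks over a long video become language-paired training episodes, with failures kept. All names here are hypothetical and not taken from the released Language-Table code.

```python
from dataclasses import dataclass

@dataclass
class LabeledEpisode:
    start_frame: int    # first frame of the behavior
    end_frame: int      # last frame of the behavior (exclusive)
    instruction: str    # free-form natural-language description

def relabel(annotations, min_frames=2):
    """Turn raw (start, end, text) marks into training episodes,
    discarding degenerate or unlabeled segments."""
    episodes = []
    for start, end, text in annotations:
        if end - start >= min_frames and text.strip():
            episodes.append(LabeledEpisode(start, end, text.strip()))
    return episodes

# Even a failed behavior is kept and paired with text describing it.
raw = [
    (0, 120, "push the green star between the red blocks"),
    (120, 121, ""),                                      # degenerate -> dropped
    (121, 300, "knock the blue block off the table"),    # a failure, still kept
]
episodes = relabel(raw)
```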
The researchers intentionally kept the learning method and architecture as simple as possible. The robot policy network is a cross-attention Transformer that maps 5 Hz video and text to 5 Hz robot actions, trained with standard supervised behavior cloning and no auxiliary losses.
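To make the cross-attention idea concrete, here is a minimal single-head numpy sketch in which visual tokens (queries) attend over language tokens (keys/values), and the fused features feed a small action head. This is purely illustrative, not the paper's actual architecture; all dimensions and weights are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_tokens, text_tokens, Wq, Wk, Wv):
    """Video tokens attend over text tokens (single head, no masking)."""
    q = video_tokens @ Wq                      # (Tv, d)
    k = text_tokens @ Wk                       # (Tt, d)
    v = text_tokens @ Wv                       # (Tt, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (Tv, Tt)
    return softmax(scores) @ v                 # (Tv, d)

rng = np.random.default_rng(0)
d = 8
video = rng.standard_normal((4, d))            # 4 visual tokens
text = rng.standard_normal((6, d))             # 6 language tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

fused = cross_attention(video, text, Wq, Wk, Wv)
# Toy action head: pool fused tokens, project to a 2-D tabletop action.
action = fused.mean(axis=0) @ rng.standard_normal((d, 2))
```

Behavior cloning then simply regresses such actions onto the demonstrated ones with a supervised loss.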
At test time, new natural-language commands can be sent to the policy network via speech-to-text at rates of up to 5 Hz.
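A hypothetical sketch of that real-time loop: at each control tick the loop checks whether the user has spoken a new instruction and, if so, switches to it immediately while execution continues. All names are illustrative; none come from the paper's released code.

```python
def control_loop(policy, get_frame, send_action, get_command, steps=6):
    """Run `steps` control ticks; a real loop would pace itself to 5 Hz."""
    instruction = None
    history = []
    for t in range(steps):
        new = get_command(t)            # None if the user said nothing
        if new is not None:
            instruction = new           # adopt the newest instruction mid-task
        if instruction is not None:
            send_action(policy(get_frame(), instruction))
        history.append(instruction)
    return history

# Stubs standing in for the camera, robot, and speech-to-text pipeline.
spoken = {0: "push the green star left", 3: "stop, move the arm up a little"}
log = control_loop(
    policy=lambda frame, text: (0.0, 0.0),   # dummy 2-D action
    get_frame=lambda: None,
    send_action=lambda action: None,
    get_command=spoken.get,
)
```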
Through this annotation process, the researchers collected the Language-Table dataset, containing more than 440,000 real-world and 180,000 simulated demonstrations of robots executing natural-language commands, along with the sequences of actions the robots took during each demonstration.
This is currently the largest language-conditioned robot demonstration dataset, a direct order-of-magnitude improvement over prior datasets.
Language-Table also ships with an imitation-learning benchmark, which can be used for model selection or to evaluate how well robots trained with different methods execute instructions.
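Benchmark scores of this kind typically reduce to a per-instruction success rate, as in the 93.5% figure reported earlier. A trivial hedged sketch, with made-up episode results:

```python
def success_rate(results):
    """Fraction of (instruction, succeeded) episodes that succeeded."""
    return sum(ok for _, ok in results) / len(results)

# Illustrative episode outcomes, not real benchmark data.
results = [
    ("push the green star between the red blocks", True),
    ("move the blue block to the lower-left corner", True),
    ("make a smiley face out of blocks", False),
]
rate = success_rate(results)
```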
In experiments, the researchers found the robot particularly capable at following natural-language instructions given in real time.
On the project website, they demonstrate that, using only natural language, users can guide the robot through complex long-horizon sequences to solve goals requiring precise, coordinated control.
For example, with many blocks on the table, commands can include "make a smiley face with green eyes" or "place them all in a vertical line".
Because the robot was trained to follow open-vocabulary language, it responds to a range of verbal corrections, such as "gently move the red star to the right".
Finally, the researchers explored a further advantage of real-time language: making robot data collection more efficient. A single human operator can control four robots at once using spoken language, which could allow robot data collection to scale in the future without assigning an operator to every robot.
Although the project is currently limited to a fixed set of objects on a tabletop, the experimental results offer initial evidence that large-scale imitation learning can indeed produce real-time interactive robots capable of following free-form end-user commands.
To promote the advancement of real-time language control of physical robots, the researchers have open-sourced Language-Table, currently the largest language-conditioned real-world robot demonstration dataset, along with an associated simulation benchmark.
The researchers believe the dataset's value may extend beyond robot control: it could also serve as a new starting point for studying language- and action-conditioned video prediction, robot-video-conditioned language modeling, and many other interesting open problems in the broader machine-learning context.