Adapting to multiple forms and tasks: Octo, the most powerful open-source robot learning system, is born

A common approach in robot learning is to collect a dataset specific to a particular robot and task and then train a policy on it. However, learning from scratch this way requires collecting sufficient data for every single task, and the resulting policies usually generalize poorly.

"In principle, experience gathered from other robots and tasks can provide possible solutions, allowing the model to see a variety of robot control problems that may be able to Improving the generalization ability and performance of robots on downstream tasks. However, even if general models that can handle a variety of natural language and computer vision tasks have emerged, it is still difficult to build a "universal robot model."

Training a unified control policy for robots is hard: it must cope with different robot embodiments, sensor configurations, action spaces, task specifications, environments, and compute budgets.

To pursue this goal, a line of work on "robot foundation models" has emerged. These models directly map robot observations to actions and generalize zero-shot to new domains or new robots. They are often referred to as "generalist robot policies" (GRPs), emphasizing the ability to perform low-level visuomotor control across a variety of tasks, environments, and robotic systems.

For example, GNM (General Navigation Model) handles a variety of robot navigation scenarios, RoboCat can operate different robot embodiments according to task goals, and RT-X can control five different robot embodiments through language. Although these models are an important advance, they suffer from several limitations: their input observations are typically predefined and restricted (for example, a single camera stream); they are difficult to fine-tune effectively to new domains; and the largest versions are not available for people to use (which matters).

Recently, the Octo Model Team, composed of 18 researchers from UC Berkeley, Stanford University, Carnegie Mellon University, and Google DeepMind, released their groundbreaking result: the Octo model. This project effectively overcomes the limitations above.

  • Paper title: Octo: An Open-Source Generalist Robot Policy
  • Paper address: https://arxiv.org/pdf/2405.12213
  • Open source project: https://octo-models.github.io/

They designed a system that allows a GRP to more easily handle the interface diversity of downstream robot applications.

At the core of this model is a Transformer architecture that maps arbitrary input tokens (created from observations and tasks) to output tokens (which are then decoded into actions), and this architecture can be trained on diverse robot and task datasets. The policy can accept different camera configurations without additional training, can control different robots, and can be guided by language commands or goal images, all by simply changing the tokens fed to the model.

Most importantly, the model can also adapt to new robot setups with different sensor inputs, action spaces, or robot morphologies; all that is required is adding the appropriate adapters and fine-tuning with a small target-domain dataset and a small compute budget.

Beyond that, Octo has been pre-trained on the largest robot manipulation dataset to date: 800,000 robot demonstrations drawn from the Open X-Embodiment dataset. Octo is not only the first GRP that can be efficiently fine-tuned to new observation and action spaces; it is also the first generalist robot manipulation policy that is fully open source (training pipeline, model checkpoints, and data). The team also highlights in the paper the novelty of how Octo's components are combined.

Octo model

Let's take a look at how Octo, the open-source generalist robot policy, is built. Overall, Octo is designed as a flexible, broadly applicable generalist robot policy that can be used by many different downstream robotics applications and research projects.

Architecture

The core of Octo is a Transformer-based policy π. It contains three key parts: input tokenizers, a Transformer backbone, and readout heads.

As shown in Figure 2, the input tokenizers convert language instructions, goals, and observation sequences into tokens; the Transformer backbone processes these tokens into embeddings; and the readout heads produce the desired outputs, namely actions.
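
To make the three stages concrete, here is a minimal, self-contained sketch of the tokenizer → backbone → readout-head pipeline in Python/JAX. It is not Octo's actual code: the toy single-layer "backbone", the token counts, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of the three-stage pipeline (tokenizers -> backbone -> readout
# head). Not the official Octo code; all shapes and the toy one-layer
# "backbone" are illustrative assumptions.
import jax
import jax.numpy as jnp

D = 64            # token embedding width (assumption)
N_OBS = 16        # observation tokens per frame (assumption)
CHUNK, DOF = 4, 7 # predict a chunk of 4 consecutive 7-DoF actions (assumption)

def tokenize(obs_image, lang_emb, params):
    """Stage 1: convert raw observations and the task embedding into tokens."""
    patches = obs_image.reshape(N_OBS, -1)           # split image into N_OBS flat chunks
    obs_tokens = patches @ params["obs_proj"]        # (N_OBS, D)
    task_tokens = lang_emb @ params["task_proj"]     # (n_lang, D)
    return jnp.concatenate([task_tokens, obs_tokens], axis=0)

def backbone(tokens, params):
    """Stage 2: stand-in for the Transformer backbone (one dense layer here)."""
    return jax.nn.gelu(tokens @ params["w_backbone"])

def readout_head(embeddings, params):
    """Stage 3: pool a readout embedding and map it to an action chunk."""
    readout = embeddings.mean(axis=0)                # stand-in for the readout token
    return (readout @ params["w_action"]).reshape(CHUNK, DOF)

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = {
    "obs_proj":   0.02 * jax.random.normal(k1, (64 * 64 * 3 // N_OBS, D)),
    "task_proj":  0.02 * jax.random.normal(k2, (768, D)),   # 768 = t5-base hidden size
    "w_backbone": 0.02 * jax.random.normal(k3, (D, D)),
    "w_action":   0.02 * jax.random.normal(k4, (D, CHUNK * DOF)),
}

obs = jnp.zeros((64, 64, 3))      # dummy camera frame
lang = jnp.zeros((8, 768))        # dummy language embedding (8 tokens)
actions = readout_head(backbone(tokenize(obs, lang, params), params), params)
print(actions.shape)              # (4, 7)
```

The point is only the data flow: whatever the tokenizers produce, the backbone sees one flat token sequence, and the readout head is the only part that knows about actions.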

Task and observation tokenizer

To convert task definitions (such as language instructions and goal images) and observations (such as camera streams) into a common tokenized format, the team uses different tokenizers for different modalities:

For language inputs, the text is first tokenized and then processed into a sequence of language embedding tokens by a pre-trained Transformer; specifically, they use t5-base (111M parameters).
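
As a rough illustration of this step (Octo's released code is JAX-based, so this is not its actual implementation), the t5-base encoder can be queried through the Hugging Face transformers API:

```python
# Encode a language instruction into a sequence of embedding tokens with the
# t5-base encoder mentioned in the paper (illustrative; not Octo's own code).
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")   # ~111M-parameter encoder

inputs = tokenizer("pick up the green block", return_tensors="pt")
lang_tokens = encoder(**inputs).last_hidden_state     # shape (1, seq_len, 768)
print(lang_tokens.shape)
```

These language embedding tokens are then treated like any other task tokens in the input sequence.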

Image observations and goal images are processed by a shallow convolutional stack and then split into a sequence of flattened patches.

Finally, the Transformer's input sequence is constructed by adding learnable positional embeddings to the task and observation tokens and arranging them in a fixed order.
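
A minimal sketch of this assembly step is shown below, assuming illustrative token counts and embedding sizes; the learned positional embeddings and the task-first, then timestep-major ordering follow the description above.

```python
# Sketch of assembling the Transformer input sequence: add learnable positional
# embeddings to task and observation tokens, then order them (task tokens
# first, observation tokens per timestep after). Shapes are assumptions.
import jax
import jax.numpy as jnp

D, N_TASK, N_OBS, T = 64, 8, 16, 2   # embed dim, task tokens, obs tokens/step, timesteps

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
pos_task = 0.02 * jax.random.normal(k1, (N_TASK, D))     # learned, one per task token
pos_obs  = 0.02 * jax.random.normal(k2, (T, N_OBS, D))   # learned, one per obs token/step

def assemble(task_tokens, obs_tokens):
    """task_tokens: (N_TASK, D); obs_tokens: (T, N_OBS, D)."""
    task_part = task_tokens + pos_task
    obs_part = (obs_tokens + pos_obs).reshape(T * N_OBS, D)  # timestep-major order
    return jnp.concatenate([task_part, obs_part], axis=0)    # final input sequence

seq = assemble(jnp.zeros((N_TASK, D)), jnp.zeros((T, N_OBS, D)))
print(seq.shape)   # (8 + 32, 64)
```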

Transformer backbone and readout head

Once the inputs have been turned into a unified token sequence, they are processed by the Transformer. This is similar to prior work that trains Transformer-based policies on sequences of observations and actions.

Octo's attention pattern is block-wise masked: observation tokens can only attend causally to tokens from the same or earlier time steps and to task tokens. Tokens corresponding to inputs that do not exist are fully masked out (for example, for datasets without language instructions). This modular design makes it easy to add or remove observations or tasks during fine-tuning.
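
The sketch below builds such a block-wise mask with NumPy. Token counts are illustrative, and the rule used for task-token self-attention is an assumption; only the observation-token behaviour (attend to task tokens and to same-or-earlier timesteps, fully mask missing inputs) follows the description above.

```python
# Toy block-wise causal attention mask: observation tokens at timestep t attend
# to task tokens and to observation tokens at timesteps <= t; tokens for
# missing modalities are masked out entirely. Counts are assumptions.
import numpy as np

N_TASK, N_OBS, T = 4, 3, 3                     # task tokens, obs tokens per step, timesteps
L = N_TASK + T * N_OBS                         # total sequence length

timestep = np.full(L, -1)                      # -1 marks task tokens
for t in range(T):
    timestep[N_TASK + t * N_OBS : N_TASK + (t + 1) * N_OBS] = t

# attend[i, j] == True  <=>  token i may attend to token j
attend = np.zeros((L, L), dtype=bool)
for i in range(L):
    for j in range(L):
        if timestep[j] == -1:                  # everyone may attend to task tokens
            attend[i, j] = True
        elif timestep[i] >= 0:                 # obs -> obs: same or earlier timestep only
            attend[i, j] = timestep[j] <= timestep[i]

# Example: a modality absent from this dataset (e.g. no language instruction)
# is blocked from attending and from being attended to.
missing = np.arange(0, 2)                      # pretend the first 2 task tokens are absent
attend[:, missing] = False
attend[missing, :] = False
```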

In addition to these input token blocks, the team also inserts learned readout tokens. A readout token attends to the observation and task tokens before it, but is not attended to by any observation or task token; it can therefore read and process the internal embeddings without affecting them. The readout token acts much like the [CLS] token in BERT, serving as a compact vector embedding of the observation sequence so far. A lightweight "action head" implementing a diffusion process is applied to the readout token's embedding and predicts a "chunk" of several consecutive actions.
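
The snippet below sketches the idea of a diffusion-style action head: starting from Gaussian noise, a small network conditioned on the readout embedding iteratively refines a chunk of actions. The network, step count, and update rule here are toy assumptions and do not reproduce Octo's actual head.

```python
# Toy diffusion-style action head: iteratively denoise an action chunk,
# conditioned on the readout embedding. Illustrative only.
import jax
import jax.numpy as jnp

CHUNK, DOF, D, STEPS = 4, 7, 64, 20

def denoise_net(noisy_actions, readout_emb, k, params):
    """Predict the noise to remove at diffusion step k (toy 2-layer MLP)."""
    x = jnp.concatenate([noisy_actions.ravel(), readout_emb, jnp.array([k / STEPS])])
    h = jax.nn.relu(x @ params["w1"])
    return (h @ params["w2"]).reshape(CHUNK, DOF)

def sample_actions(readout_emb, params, key):
    actions = jax.random.normal(key, (CHUNK, DOF))           # start from pure noise
    for k in reversed(range(STEPS)):
        eps = denoise_net(actions, readout_emb, k, params)
        actions = actions - eps / STEPS                       # toy update rule
    return actions

in_dim = CHUNK * DOF + D + 1
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = {"w1": 0.02 * jax.random.normal(k1, (in_dim, 128)),
          "w2": 0.02 * jax.random.normal(k2, (128, CHUNK * DOF))}
print(sample_actions(jnp.zeros(D), params, k3).shape)        # (4, 7)
```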

This design lets users flexibly add new task and observation inputs or action output heads during downstream fine-tuning. When adding new tasks, observations, or loss functions downstream, the Transformer's pre-trained weights can be kept intact as a whole; only new positional embeddings, a new lightweight encoder, or the new head parameters required by the changed specification need to be added. This differs from previous architectures, where adding or removing image inputs or changing the task specification required reinitializing or retraining large parts of the pre-trained model.

This flexibility is crucial to making Octo a true "generalist" model: since it is impossible to cover every robot sensor and action configuration during pre-training, being able to adjust Octo's inputs and outputs at fine-tuning time makes it a versatile tool for the robotics community. By contrast, previous designs that use a standard Transformer backbone or fuse a visual encoder with an MLP output head fix the type and order of the model's inputs, whereas switching Octo's observations or tasks does not require reinitializing most of the model.
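
A minimal sketch of this fine-tuning recipe, under the assumption of a toy model and loss: the pre-trained weights are left untouched and gradients are taken only with respect to the newly added parameters.

```python
# Toy version of the fine-tuning recipe: freeze the pre-trained backbone and
# train only newly added parameters (new positional embedding, new head).
import jax
import jax.numpy as jnp

def forward(pretrained, new, x):
    h = jax.nn.gelu(x @ pretrained["backbone"])     # frozen pre-trained backbone
    h = h + new["pos_embed"]                        # newly added positional embedding
    return h @ new["head"]                          # newly added output head

def loss_fn(new, pretrained, x, y):
    return jnp.mean((forward(pretrained, new, x) - y) ** 2)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
pretrained = {"backbone": 0.02 * jax.random.normal(k1, (32, 64))}
new = {"pos_embed": jnp.zeros(64), "head": 0.02 * jax.random.normal(k2, (64, 7))}

x, y = jax.random.normal(k3, (8, 32)), jnp.zeros((8, 7))
grads = jax.grad(loss_fn)(new, pretrained, x, y)    # gradients only for new params
new = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, new, grads)
```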

Training data

The team uses a mixture of 25 datasets from the Open X-Embodiment dataset. Figure 3 shows the composition of this mixture.

Please refer to the original paper for more details on training objectives and training hardware configuration.

Model checkpoints and code

Here's the key point: the team not only published the Octo paper but also fully open-sourced all resources, including:

  • The pre-trained Octo checkpoints include Octo-Small with 27 million parameters and Octo-Base with 93 million parameters.
  • Fine-tuning script for Octo models, based on JAX.
  • Model pre-training workflow for pre-training Octo on the Open X-Embodiment dataset, based on JAX.
  • Data loader for Open X-Embodiment data, compatible with JAX and PyTorch.

Experiments

The team also conducted an empirical analysis of Octo, evaluating its performance as a robot foundation model along several dimensions:

  1. Can Octo be used directly to control multiple robot embodiments and solve language-conditioned and goal-conditioned tasks?
  2. Can Octo's weights serve as a good initialization for data-efficient fine-tuning to new tasks and robots, and do they outperform training from scratch and commonly used pre-trained representations?
  3. Which design decisions in Octo matter most for building a generalist robot policy?

Figure 4 shows the 9 tasks for evaluating Octo.

Use Octo directly to control multiple robots

The team compared the zero-shot control capabilities of Octo, RT-1-X, and RT-2-X; the results are shown in Figure 5.

Octo's success rate is 29% higher than that of RT-1-X (35 million parameters). In the WidowX and RT-1 Robot evaluations, Octo performs on par with the 55-billion-parameter RT-2-X.

In addition, RT-1-X and RT-2-X only support language commands, whereas Octo can also be conditioned on goal images. The team also found that on the WidowX tasks, the success rate was 25% higher when conditioning on goal images than when conditioning on language, possibly because goal images provide more information about task completion.

Octo adapts data-efficiently to new domains

Table 1 presents the results of the data-efficient fine-tuning experiments.

Fine-tuning Octo gives better results than training from scratch or initializing with pre-trained VC-1 weights. Across the 6 evaluation settings, Octo's average advantage over the second-best baseline is 52%!

It is also worth noting that the same recipe and hyperparameters were used to fine-tune Octo for all of these evaluation tasks, which shows that the team found a very good default configuration.

Design decisions for generalist robot policy training

The results above show that Octo can serve both as a zero-shot multi-robot controller and as an initialization for policy fine-tuning. Next, the team analyzed how different design decisions affect the performance of the Octo policy, focusing on four aspects: model architecture, training data, training objective, and model size. To do so, they conducted ablation studies.

Table 2 presents the results of the ablation studies on model architecture, training data, and training objective.

Figure 6 shows the impact of model size on the zero-shot success rate. Larger models exhibit better visual scene perception.

Overall, these results validate the effectiveness of Octo's components.
