Is the 3D version of Sora coming? UMass, MIT and others propose 3D world models, and embodied intelligent robots achieve new milestones

In recent research, the inputs to vision-language-action (VLA) models have mostly been 2D data, without integrating the more general 3D physical world.
In addition, existing models predict actions by learning a direct mapping from perception to action, ignoring the dynamics of the world and the relationship between actions and those dynamics.
In contrast, humans think with world models: they can imagine future scenarios and plan their next actions accordingly.
To this end, researchers from the University of Massachusetts Amherst, MIT, and other institutions have proposed 3D-VLA, a new class of embodied foundation model that seamlessly connects 3D perception, reasoning, and action through a generative world model.

Project homepage: https://vis-www.cs.umass.edu/3dvla/

Paper address: https://arxiv.org/abs/2403.09631

Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM) and introduces a set of interaction tokens for engaging with embodied environments.

To inject generative capabilities into the model, the research team trained a series of embodied diffusion models and aligned them into the LLM to predict goal images and point clouds.

To train 3D-VLA, the researchers extracted a large amount of 3D-related information from existing robotics datasets and constructed a large-scale 3D embodied instruction dataset.

The results show that 3D-VLA performs well on reasoning, multi-modal generation, and planning tasks in embodied environments, highlighting its potential value in real-world applications.

3D Embodied Instruction Tuning Dataset

Thanks to billion-scale datasets from the Internet, VLMs deliver excellent performance across many tasks, and million-scale video-action datasets have likewise laid the groundwork for VLMs specialized for robot control.

However, most current datasets offer neither sufficient depth or 3D annotations nor the precise control needed for robot manipulation, which calls for datasets that include 3D spatial reasoning and interaction. Without 3D information, it is hard for a robot to understand and execute instructions that require 3D spatial reasoning, such as "Put the farthest cup in the middle drawer."

To bridge this gap, the researchers constructed a large-scale 3D instruction tuning dataset that pairs sufficient 3D-related information with corresponding text instructions for training the model.

The researchers designed a pipeline to extract 3D language-action pairs from existing embodied datasets, obtaining annotations for point clouds, depth maps, 3D bounding boxes, the robot's 7-DoF actions, and text descriptions.
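
As a concrete illustration, a single record produced by such a pipeline might look like the minimal sketch below; the class and field names are hypothetical, not the paper's actual schema:

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Embodied3DExample:
    """Hypothetical record for one 3D language-action pair.

    Field names and shapes are illustrative, not the paper's actual schema.
    """
    instruction: str             # natural-language task description
    point_cloud: np.ndarray      # (N, 6): xyz coordinates + rgb color per point
    depth_map: np.ndarray        # (H, W): per-pixel depth in meters
    bboxes_3d: List[np.ndarray]  # per-object AABBs: (xmin, ymin, zmin, xmax, ymax, zmax)
    actions_7dof: np.ndarray     # (T, 7): xyz position, rpy rotation, gripper open/close

# A dummy example, just to show the shapes involved:
example = Embodied3DExample(
    instruction="Put the farthest cup in the middle drawer.",
    point_cloud=np.zeros((4096, 6), dtype=np.float32),
    depth_map=np.zeros((256, 256), dtype=np.float32),
    bboxes_3d=[np.array([0.1, 0.2, 0.0, 0.2, 0.3, 0.1])],
    actions_7dof=np.zeros((8, 7), dtype=np.float32),
)
print(example.instruction, example.point_cloud.shape)
```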

3D-VLA base model

3D-VLA is a world model for three-dimensional reasoning, goal generation, and decision-making in embodied environments.

The backbone network is first built on top of 3D-LLM, and a series of interaction tokens is added to further enhance the model's ability to interact with the 3D world. Goal generation capability is then injected into 3D-VLA by pre-training diffusion models and using projectors to align the LLM with the diffusion models.

Backbone network

In the first stage, the researchers developed the 3D-VLA base model following the 3D-LLM approach: since the collected dataset does not reach the billion-level scale required to train a multi-modal LLM from scratch, multi-view features are used to generate 3D scene features, so that the visual features can be seamlessly integrated into a pre-trained VLM without further adaptation.
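
The general idea of lifting per-view 2D features into a shared 3D representation can be sketched as follows; this is a toy numpy version under assumed camera conventions, not the 3D-LLM implementation:

```python
import numpy as np

def lift_multiview_features(feats, depths, intrinsics, extrinsics):
    """Toy sketch: back-project per-view 2D features into one 3D point set.

    feats:      (V, H, W, C) feature maps from a frozen 2D backbone
    depths:     (V, H, W)    per-pixel z-depth for each view
    intrinsics: (3, 3)       shared camera intrinsic matrix
    extrinsics: (V, 4, 4)    camera-to-world transforms
    Returns (V*H*W, 3) world-space points and (V*H*W, C) matching features.
    """
    V, H, W, C = feats.shape
    # Homogeneous pixel grid (u, v, 1), flattened row-major to match feats.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    K_inv = np.linalg.inv(intrinsics)
    pts, ft = [], []
    for i in range(V):
        # Unproject pixels to camera space using depth, then map to world space.
        cam = (pix @ K_inv.T) * depths[i].reshape(-1, 1)
        cam_h = np.concatenate([cam, np.ones((cam.shape[0], 1))], axis=1)
        pts.append((cam_h @ extrinsics[i].T)[:, :3])
        ft.append(feats[i].reshape(-1, C))
    return np.concatenate(pts), np.concatenate(ft)

pts, ft = lift_multiview_features(
    np.zeros((2, 4, 4, 8)), np.ones((2, 4, 4)),
    np.eye(3), np.tile(np.eye(4), (2, 1, 1)))
print(pts.shape, ft.shape)  # (32, 3) (32, 8)
```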

Meanwhile, the training data of 3D-LLM mainly covers objects and indoor scenes, which do not directly match embodied robot settings, so the researchers chose BLIP2-FlanT5XL as the pre-trained model.

During training, the input and output embeddings of the tokens, along with the weights of the Q-Former, are unfrozen.
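
In PyTorch terms, this selective fine-tuning might look like the sketch below; attribute names such as `language_model` and `q_former` are illustrative of a BLIP-2-style model, not the actual 3D-VLA code:

```python
import torch.nn as nn

def freeze_except_embeddings_and_qformer(model: nn.Module) -> None:
    """Freeze all weights, then unfreeze token embeddings and the Q-Former.

    Assumes a BLIP-2-style wrapper exposing `language_model` and `q_former`;
    these attribute names are illustrative, not the actual 3D-VLA code.
    """
    for p in model.parameters():
        p.requires_grad = False
    # Newly added interaction tokens live in the LLM's input/output
    # embedding tables, so those must stay trainable.
    for p in model.language_model.get_input_embeddings().parameters():
        p.requires_grad = True
    for p in model.language_model.get_output_embeddings().parameters():
        p.requires_grad = True
    # The Q-Former bridges visual features into the LLM and is also trained.
    for p in model.q_former.parameters():
        p.requires_grad = True
```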

Interaction tokens

To enhance the model's understanding of 3D scenes and of interactions within the environment, the researchers introduced a new set of interaction tokens.

First, object tokens (<obj> </obj>) are added to the input to enclose the object nouns in parsed sentences (e.g., "<obj> a chocolate bar </obj> [loc tokens] on the table"), so that the model can better capture the objects being manipulated or referred to.

Second, to better express spatial information in language, the researchers designed a set of location tokens, using six tokens in AABB form to represent an object's three-dimensional bounding box.
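
A minimal sketch of how six AABB-style tokens could encode a 3D box is shown below; the 256-bin quantization and the <locN> spelling are assumptions for illustration:

```python
import numpy as np

def bbox_to_loc_tokens(aabb, scene_min, scene_max, n_bins=256):
    """Discretize an axis-aligned 3D box into six <loc> tokens.

    aabb: (xmin, ymin, zmin, xmax, ymax, zmax) in world coordinates.
    Each coordinate is normalized to the scene extent and quantized into
    one of n_bins bins; 256 bins is an assumed value.
    """
    aabb = np.asarray(aabb, dtype=np.float64)
    lo = np.tile(np.asarray(scene_min, dtype=np.float64), 2)
    hi = np.tile(np.asarray(scene_max, dtype=np.float64), 2)
    normed = np.clip((aabb - lo) / (hi - lo), 0.0, 1.0)
    bins = np.minimum((normed * n_bins).astype(int), n_bins - 1)
    return [f"<loc{b}>" for b in bins]

# A box inside a 1 m^3 scene maps to six discrete location tokens:
print(bbox_to_loc_tokens([0.1, 0.2, 0.0, 0.2, 0.3, 0.1], [0, 0, 0], [1, 1, 1]))
# ['<loc25>', '<loc51>', '<loc0>', '<loc51>', '<loc76>', '<loc25>']
```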

Third, to better encode dynamics, <scene> </scene> tokens are introduced into the framework to enclose the embedding of a static scene: by composing scene tokens, 3D-VLA can understand dynamic scenes and handle inputs that interleave 3D scenes and text.

The architecture is further enhanced with an extended set of specialized tokens that represent robot actions. A robot action has 7 degrees of freedom; discrete tokens such as <aloc0-255>, <arot0-255>, and <gripper0/1> represent the intended absolute location, rotation, and gripper openness of the arm, and consecutive actions are separated by an <ACT_SEP> token.
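
Putting this together, one plausible way to tokenize a single 7-DoF action is sketched below; the bin counts, value ranges, and exact token spellings are assumptions:

```python
import numpy as np

def action_to_tokens(action_7dof, pos_range=(-1.0, 1.0), n_bins=256):
    """Turn one 7-DoF action into discrete text tokens (a hedged sketch).

    action_7dof: (x, y, z, roll, pitch, yaw, gripper), gripper in {0, 1}.
    Bin counts, value ranges, and token spellings are assumptions.
    """
    a = np.asarray(action_7dof, dtype=np.float64)

    def to_bin(x, lo, hi):
        x = np.clip((x - lo) / (hi - lo), 0.0, 1.0)
        return min(int(x * n_bins), n_bins - 1)

    toks = [f"<aloc{to_bin(v, *pos_range)}>" for v in a[:3]]       # arm position
    toks += [f"<arot{to_bin(v, -np.pi, np.pi)}>" for v in a[3:6]]  # arm rotation
    toks.append(f"<gripper{int(a[6])}>")                           # gripper openness
    return toks

seq = action_to_tokens([0.1, -0.2, 0.3, 0.0, 0.5, -0.5, 1])
print(" ".join(seq) + " <ACT_SEP>")  # actions in a sequence end with <ACT_SEP>
```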

Injecting goal generation capabilities

Humans can pre-visualize the final state of a scene to improve the accuracy of action prediction or decision-making, and such foresight is also a key aspect of building a world model; in preliminary experiments, the researchers found that providing the ground-truth final state does enhance the model's reasoning and planning capabilities.

However, training an MLLM to generate images, depth maps, and point clouds is not straightforward:

First, video diffusion models are not tailor-made for embodied scenes; for example, when Runway generates future frames for "open the drawer", the scene suffers from view changes, object deformation, odd texture substitution, and layout distortion.

Moreover, how to integrate diffusion models of different modalities into a single foundation model remains a difficult problem.

The framework proposed by the researchers therefore first pre-trains modality-specific diffusion models for images, depth, and point clouds, and then, in an alignment stage, aligns the decoders of these diffusion models to the embedding space of 3D-VLA.
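
Conceptually, the alignment stage can be pictured as a small trainable projector between the LLM's hidden states and the conditioning space of a frozen diffusion decoder; the sketch below is schematic, with all dimensions and module structure assumed:

```python
import torch
import torch.nn as nn

class GoalDecoderAlignment(nn.Module):
    """Schematic alignment of a frozen diffusion decoder to the LLM space.

    Dimensions and module structure are assumptions for illustration; the
    real system pre-trains one diffusion model per modality (RGB, depth,
    point cloud) and aligns each decoder to 3D-VLA's embedding space.
    """
    def __init__(self, llm_dim=2048, cond_dim=768):
        super().__init__()
        # Trainable projector from LLM hidden states to the conditioning
        # space the (frozen) diffusion decoder expects.
        self.projector = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, llm_hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq, llm_dim) -> (batch, seq, cond_dim)
        return self.projector(llm_hidden_states)

# During alignment training, gradients flow only through the projector
# (and optionally the LLM); the pre-trained diffusion decoder stays frozen
# and consumes the projected embeddings as its condition.
proj = GoalDecoderAlignment()
print(proj(torch.randn(2, 16, 2048)).shape)  # torch.Size([2, 16, 768])
```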

Experimental results

3D-VLA is a versatile, 3D-based generative world model that can perform reasoning and grounding in the 3D world, imagine multi-modal goal content, and generate actions for robot manipulation. The researchers evaluated 3D-VLA on three fronts: 3D reasoning and localization, multi-modal goal generation, and embodied action planning.

3D Reasoning and Localization

3D-VLA outperforms all 2D VLM methods on language reasoning tasks; the researchers attribute this to its use of 3D information, which provides more accurate spatial cues for reasoning.

In addition, because the dataset contains 3D localization annotations, 3D-VLA learns to localize the relevant objects, helping the model focus on the key objects for reasoning.

The researchers found that 3D-LLM performs poorly on these robotic reasoning tasks, demonstrating the necessity of collecting and training on robotics-related 3D datasets.

3D-VLA also performs significantly better than the 2D baseline methods in localization, which provides convincing evidence that the annotation pipeline is effective and helps the model acquire strong 3D localization capabilities.

Multi-modal goal generation

Compared with existing zero-shot generation methods transferred to the robotics domain, 3D-VLA achieves better results on most metrics, confirming the importance of training world models on datasets specifically designed for robotic applications.

Even in direct comparison with Instruct-P2P*, 3D-VLA consistently performs better; the results show that integrating a large language model into 3D-VLA yields a more comprehensive and deeper understanding of robot manipulation instructions, thereby improving goal-image generation.

Additionally, a slight performance drop is observed when predicted bounding boxes are excluded from the input prompt, confirming that intermediate predicted bounding boxes help the model understand the scene as a whole: they allow the model to allocate more attention to the specific objects mentioned in a given instruction, ultimately strengthening its ability to imagine the final goal image.

In the comparison of point-cloud generation results, 3D-VLA with intermediate predicted bounding boxes performs best, again confirming the value of combining a large language model's understanding of instructions and scenes with precise object localization.

Embodied Action Planning

3D-VLA exceeds the performance of the baseline models on most RLBench action-prediction tasks, demonstrating its planning capabilities.

Notably, the baseline models require historical observations, object states, and current state information, whereas 3D-VLA executes purely through open-loop control.
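
"Open-loop" here means the model commits to an action sequence predicted from the initial observation rather than re-planning from fresh observations at every step; the sketch below contrasts the two regimes, with all model and environment APIs hypothetical:

```python
def run_open_loop(model, env, instruction, max_steps=50):
    """Open-loop control: predict the whole action sequence from the initial
    observation, then execute it without looking at new observations.
    The `model` and `env` APIs here are hypothetical."""
    obs = env.reset()
    actions = model.predict_actions(obs, instruction)  # plan once, up front
    for action in actions[:max_steps]:
        env.step(action)  # no re-planning from feedback

def run_closed_loop(model, env, instruction, max_steps=50):
    """Closed-loop baseline: re-plan from a fresh observation at every step."""
    obs = env.reset()
    for _ in range(max_steps):
        action = model.predict_next_action(obs, instruction)
        obs, done = env.step(action)
        if done:
            break
```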

In addition, the model's generalization ability was demonstrated on the pick-up-cup task, and 3D-VLA also achieved better results on CALVIN, an advantage the researchers attribute to its ability to localize objects of interest and imagine goal states, which provides rich information for inferring actions.
