Peking University's embodied intelligence team proposes demand-driven navigation to align human needs and make robots more efficient

WBOY

Jul 16, 2024 am 11:27 AM

Wouldn't it be great if a robot could understand your needs and work hard to meet them?

To get help from a robot today, you usually have to give it a precise command, and even then the result may fall short. In a real environment, the specific item the robot is asked to find may simply not exist in the scene, so the robot can never find it. Yet the scene may contain another item with a similar function that would meet the user's need just as well. This is the benefit of using demands as task instructions.

Recently, Dong Hao's team at Peking University proposed a new navigation task, Demand-driven Navigation (DDN), in a paper accepted at NeurIPS 2023. In this task, the robot must find items that satisfy a demand instruction given by the user. The team also proposed learning attribute features of items conditioned on demand instructions, which effectively improved the robot's success rate in finding items.

Paper address: https://arxiv.org/pdf/2309.08138.pdf
Project homepage: https://sites.google.com/view/demand-driven-navigation/home

The robot receives a demand instruction, such as "I'm hungry" or "I'm thirsty", and must then find an item in the scene that can satisfy that demand. Demand-driven navigation is therefore essentially an item-finding task, and a similar task already exists: Visual Object Navigation. The difference is that in the former the user tells the robot "what I need", while in the latter the user tells the robot "which item I want".

Using demands as instructions means that the robot has to reason about the content of the instruction and explore which kinds of items are present in the current scene before it can find one that meets the user's need. From this point of view, demand-driven navigation is much harder than visual object navigation. Despite the added difficulty, a robot that learns to find items from demand instructions brings several benefits. For example:

Users only need to give instructions according to their own needs, without considering what is in the scene.

Using demands as instructions raises the probability that the user's need is met. For example, when you are thirsty, asking the robot to find "items that can quench thirst" covers a much wider range of objects than asking it to find "tea".

Demands described in natural language have a larger description space and can express more precise, fine-grained requirements.

To train such a robot, a mapping between demand instructions and items must be established so that the environment can provide training signals. To reduce cost, Dong Hao's team proposed a "semi-automatic" generation method based on a large language model: first, GPT-3.5 generates demands that can be satisfied by the items present in the scene; then the generated demands that do not meet the requirements are filtered out manually.

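As a rough illustration of how such a mapping can yield a training signal, the sketch below stores a few demand-to-object entries and checks whether a found object satisfies a demand. The dictionary contents, object names, and function names are hypothetical and are not taken from the authors' dataset or code.

```python
# Hypothetical demand-to-object mapping; the entries are illustrative only
# and do not come from the authors' dataset.
DEMAND_TO_OBJECTS = {
    "I am thirsty": {"Tea", "Juice", "Water"},
    "I want to wash my clothes": {"WashingMachine", "Water"},
}


def satisfying_objects(demand, objects_in_scene):
    """Return the objects in the scene that can satisfy the demand."""
    return DEMAND_TO_OBJECTS.get(demand, set()) & set(objects_in_scene)


def episode_success(demand, found_object):
    """Training signal: 1.0 if the found object satisfies the demand, else 0.0."""
    return 1.0 if found_object in DEMAND_TO_OBJECTS.get(demand, set()) else 0.0


scene = ["Sofa", "Juice", "Television"]
print(satisfying_objects("I am thirsty", scene))
print(episode_success("I am thirsty", "Juice"))
print(episode_success("I am thirsty", "Sofa"))
```

Note that "Water" appears under both demands, and that the scene does not need to contain "Tea" for the thirsty user to be served.
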
Algorithm design

Items that satisfy the same demand tend to share attributes. If the robot can learn features of these attributes, it can use them to find suitable items. For example, for the demand "I am thirsty", the required item should have the attribute "quenches thirst", and both "juice" and "tea" have it. Note that one item may exhibit different attributes under different demands: "water" exhibits the attribute "cleans clothes" under the demand "I want to wash clothes" and the attribute "quenches thirst" under the demand "I am thirsty".

Attribute learning stage

So how can the model understand attributes such as "quenching thirst" and "cleaning clothes"? Which attributes an item exhibits under a given demand is relatively stable common sense. With the rise of large language models (LLMs) in recent years, their grasp of human common sense has proved remarkable. Dong Hao's team therefore decided to learn this common sense from an LLM. They first asked the LLM to generate many demand instructions (called Language-grounding Demand, LGD in the figure), and then asked the LLM which items can satisfy them (called Language-grounding Object, LGO in the figure).

Note that the prefix Language-grounding emphasizes that these demands and objects are obtained from the LLM and do not depend on a specific scene, whereas World-grounding in the figure emphasizes that the demands and objects are tied to a specific environment (such as ProcThor, Replica and other scene datasets).

To obtain the attributes of an LGO under an LGD, the authors encoded the LGD with BERT and the LGO with CLIP-Text-Encoder, then concatenated the two to obtain Demand-object Features. Recall that items satisfying the same demand have similar attributes; the authors used this similarity to define positive and negative samples and trained the item attributes with contrastive learning. Specifically, if the items behind two concatenated Demand-object Features can satisfy the same demand, the two features are positive samples of each other (for example, if items a and b in the figure both satisfy demand D1, then DO1-a and DO1-b are positive samples of each other); any other pair of features are negative samples of each other. The Demand-object Features are fed into an Attribute Module with a TransformerEncoder architecture, which is trained with the InfoNCE loss.
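
The positive/negative rule described above can be sketched as an InfoNCE-style loss in plain Python. This is a minimal sketch under assumptions: the cosine similarity, the temperature value, and the averaging over all anchor-positive pairs are illustrative choices and may differ from the exact formulation in the paper; the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(features, demand_ids, temperature=0.1):
    """Mean InfoNCE-style loss over all (anchor, positive) pairs.

    features[i] is the attribute feature of one demand-object pair and
    demand_ids[i] identifies the demand it was built from. Two features are
    positives when they share a demand id; every other pair is a negative.
    """
    n = len(features)
    total, pairs = 0.0, 0
    for i in range(n):
        # Similarity of the anchor to every other feature, scaled by temperature.
        logits = {j: cosine(features[i], features[j]) / temperature
                  for j in range(n) if j != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        for j, logit in logits.items():
            if demand_ids[j] == demand_ids[i]:
                total += log_denominator - logit  # -log softmax of the positive
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy features: the first two vectors are close, and so are the last two.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = info_nce(toy, [0, 0, 1, 1])     # positives are the nearby vectors
misaligned = info_nce(toy, [0, 1, 0, 1])  # positives are the distant vectors
print(aligned < misaligned)
```

The loss is small when features built from the same demand sit close together and large otherwise, which is the pressure that shapes the attribute space.
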

Navigation policy learning stage

Through contrastive learning, the Attribute Module has absorbed the common sense provided by the LLM. In the navigation policy learning stage, the Attribute Module's parameters are loaded directly, and the policy is trained by imitation learning on trajectories collected with the A* algorithm. At each time step, the authors use the DETR model to segment the items in the current field of view and obtain the World-grounding Objects, which are then encoded by CLIP-Visual-Encoder. The rest of the pipeline is similar to the attribute learning stage. Finally, the BERT features of the demand instruction, the global image features, and the attribute features are concatenated and fed into a Transformer model, which outputs an action.
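
To make the idea of collecting expert trajectories with A* concrete, here is a minimal sketch on a toy grid world. The grid, the four-action set, and the function names are assumptions made for illustration; the actual work runs in a 3D simulator with image observations, not grid cells.

```python
import heapq

# Hypothetical discrete action set for a toy grid world (not the simulator's).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def astar_actions(grid, start, goal):
    """Return the expert action sequence from start to goal, or None.

    grid is a list of strings; '#' marks an obstacle. Uses A* with the
    Manhattan distance as an admissible heuristic.
    """
    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]
    came_from = {start: None}  # cell -> (previous cell, action taken)
    cost = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            actions = []
            while came_from[cell] is not None:
                cell, action = came_from[cell]
                actions.append(action)
            return actions[::-1]
        if g > cost[cell]:
            continue  # stale queue entry
        for action, (dr, dc) in ACTIONS.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            inside = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if not inside or grid[nxt[0]][nxt[1]] == "#":
                continue
            if nxt not in cost or g + 1 < cost[nxt]:
                cost[nxt] = g + 1
                came_from[nxt] = (cell, action)
                heapq.heappush(frontier, (g + 1 + heuristic(nxt), g + 1, nxt))
    return None


def expert_dataset(grid, start, goal):
    """Pair each visited cell with the expert action, for behaviour cloning."""
    actions = astar_actions(grid, start, goal)
    if actions is None:
        return []
    pairs, cell = [], start
    for action in actions:
        pairs.append((cell, action))
        dr, dc = ACTIONS[action]
        cell = (cell[0] + dr, cell[1] + dc)
    return pairs


grid = ["..#",
        "..#",
        "..."]
for state, action in expert_dataset(grid, (0, 0), (2, 2)):
    print(state, action)
```

An imitation-learning policy would then be fit to predict the recorded action from the observation at each state; in this sketch the state is just a grid cell.
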

It is worth noting that the authors used CLIP-Text-Encoder in the attribute learning stage and CLIP-Visual-Encoder in the navigation policy learning stage. This exploits CLIP's strong vision-text alignment to transfer the textual common sense learned from the LLM to the visual observations at each time step.
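
The toy sketch below illustrates why a shared text-image embedding space allows this transfer: something fitted on text embeddings can be applied unchanged to image embeddings. All vectors are hand-written stand-ins, not real CLIP outputs, and the "attribute prototype" scorer is a hypothetical simplification of the Attribute Module.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy 3-d vectors standing in for text embeddings (hand-written, hypothetical).
TEXT = {
    "tea": [0.9, 0.1, 0.0],
    "juice": [0.8, 0.2, 0.1],
    "soap": [0.0, 0.1, 0.9],
}
# Assumption: in an aligned space, image embeddings of the same objects land
# close to the corresponding text embeddings.
IMAGE = {
    "tea": [0.85, 0.15, 0.05],
    "soap": [0.05, 0.15, 0.9],
}


def prototype(names):
    """Average the text embeddings of items known to share an attribute."""
    vectors = [TEXT[name] for name in names]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Fitted on text only: items an LLM said can quench thirst.
quench_thirst = prototype(["tea", "juice"])

# Applied to image embeddings at navigation time.
scores = {name: cosine(vec, quench_thirst) for name, vec in IMAGE.items()}
best = max(scores, key=scores.get)
print(best)
```
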

Experimental results

The experiments were conducted in the AI2Thor simulator with the ProcThor dataset. The results show that this method significantly outperforms variants of previous visual object navigation algorithms as well as algorithms backed by large language models.

VTN is a closed-vocabulary navigation algorithm that can only navigate to a preset list of items. The authors built several variants of it, but whether the BERT features of the demand instruction or GPT's parsing of the instruction is used as input, the results are not ideal. ZSON is an open-vocabulary navigation algorithm, but because CLIP aligns demand instructions with images poorly, several ZSON variants also fail to complete the demand-driven navigation task well. Algorithms that combine heuristic search with an LLM explore inefficiently because scenes in the ProcThor dataset are large, so their success rate is not high either. Pure LLM algorithms, such as GPT-3-Prompt and MiniGPT-4, reason poorly about unseen locations in the scene and therefore cannot efficiently discover items that satisfy the demand.

Ablation experiments show that the Attribute Module significantly improves the navigation success rate. The authors' t-SNE plot shows that the Attribute Module successfully learns attribute features of items through demand-conditioned contrastive learning. Replacing the Attribute Module's architecture with an MLP lowers performance, indicating that the TransformerEncoder architecture is better suited to capturing attribute features. BERT extracts the features of demand instructions well, which improves generalization to unseen instructions.

The corresponding author of this study, Dr. Dong Hao, is an assistant professor and doctoral supervisor at the Center on Frontiers of Computing Studies, Peking University, a Boya Young Fellow, and a BAAI (Zhiyuan) Scholar. He founded the Peking University Hyperplane Lab in 2019 and has led it since. He has published more than 40 papers at top international conferences and journals such as NeurIPS, ICLR, CVPR, ICCV, and ECCV, with more than 4,700 citations on Google Scholar, and has won the ACM MM Best Open Source Software Award and the OpenI Outstanding Project Award. He has repeatedly served as area chair or associate editor for top international conferences such as NeurIPS, CVPR, AAAI, and ICRA, has undertaken a number of national and provincial projects, and leads a major project under the Ministry of Science and Technology's New Generation Artificial Intelligence 2030 program.

The first author of the paper, Wang Hongcheng, is a second-year doctoral student at the School of Computer Science, Peking University. His research interests are robotics, computer vision, and psychology. He hopes to align humans and robots by starting from human behavior, cognition, and motivation.

Reference links:

[1] https://zsdonghao.github.io/

[2] https://whcpumpkin.github.io/
