Peking University's embodied intelligence team proposes demand-driven navigation to align human needs and make robots more efficient

WBOY, Original, 2024-07-16 11:27:39
Imagine a robot that could understand your needs and work hard to meet them. Wouldn't that be great?

If you want a robot to help you, you usually need to give it a precise command, but carrying out that command may not go as planned. In a real environment, the specific item the robot is asked to find may not exist in the current scene, so the robot cannot find it no matter what. But the scene may contain another item that serves a similar function to the requested one and can also meet the user's need. This is the benefit of using "demands" as task instructions.

Recently, Dong Hao's team at Peking University proposed a new navigation task, Demand-driven Navigation (DDN), in a paper that has been accepted by NeurIPS 2023. In this task, the robot is required to find items that meet the user's needs based on a demand instruction given by the user. The team also proposed learning the attribute features of items from demand instructions, which effectively improved the robot's success rate in finding items.

  • Paper address: https://arxiv.org/pdf/2309.08138.pdf

  • Project homepage: https://sites.google.com/view/demand-driven-navigation/home

The robot receives a demand instruction such as "I'm hungry" or "I'm thirsty", and then needs to find an item in the scene that can satisfy that demand. Demand-driven navigation is therefore essentially an item-finding task, and a similar task already exists: Visual Object Navigation. The difference is that demand-driven navigation tells the robot "what my need is", while visual object navigation tells the robot "what item I want".

Using demands as instructions means the robot must reason about the instruction and explore which kinds of items the current scene contains before it can find one that satisfies the user. From this point of view, demand-driven navigation is much harder than visual object navigation. Despite the added difficulty, a robot that learns to find items from demand instructions brings several benefits:

  • Users only need to state their needs, without having to consider what is in the scene.

  • Using demands as instructions raises the probability that the user's need is met. For example, when you are thirsty, asking the robot to find "something that can quench my thirst" covers a much wider range of items than asking it to find "tea".

  • Demands described in natural language have a larger description space and can express more precise requirements.

To train such a robot, a mapping between demand instructions and items must be established so that the environment can provide a training signal. To reduce cost, Dong Hao's team proposed a "semi-automatic" generation method based on a large language model: first use GPT-3.5 to generate demands that can be satisfied by items existing in the scene, then manually filter out the results that do not meet the requirements.
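The demand-to-item mapping and the training signal it enables can be sketched as follows. This is only a minimal illustration, not the authors' pipeline: the function names, the example demands, and the way manual review is represented (a set of rejected pairs) are all assumptions made for this sketch, and no real LLM is called.

```python
# Hypothetical sketch: build a demand -> items mapping from LLM suggestions
# plus manual review, then use it as a success signal for training.

def build_mapping(llm_candidates, manual_rejections):
    """Keep LLM-suggested (demand, item) pairs that a human reviewer did not reject."""
    mapping = {}
    for instruction, items in llm_candidates.items():
        kept = {item for item in items if (instruction, item) not in manual_rejections}
        if kept:  # drop demands that no remaining item can satisfy
            mapping[instruction] = kept
    return mapping


def training_signal(mapping, instruction, found_item):
    """Return 1.0 if the item the robot found satisfies the demand, else 0.0."""
    return 1.0 if found_item in mapping.get(instruction, set()) else 0.0


# Stand-in for LLM output (assumed data, for illustration only).
llm_candidates = {
    "I am thirsty": ["Tea", "Juice", "Soap"],
    "I want to wash my hands": ["Soap", "Water"],
    "I want to sit down": ["Lamp"],
}
# Pairs a human reviewer marked as wrong (assumed data).
manual_rejections = {("I am thirsty", "Soap"), ("I want to sit down", "Lamp")}

mapping = build_mapping(llm_candidates, manual_rejections)
print(training_signal(mapping, "I am thirsty", "Juice"))
print(training_signal(mapping, "I am thirsty", "Sofa"))
```

Because one demand maps to a set of items rather than a single target category, any item in the set counts as success, which is what separates this task from visual object navigation.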

Algorithm design

Items that satisfy the same demand have similar attributes, so if the features of those attributes can be learned, the robot should be able to use them to find items. For example, for the demand "I am thirsty", the required item should have the attribute "quenches thirst", and both "juice" and "tea" have this attribute. Note that an item may exhibit different attributes under different demands. For example, "water" can exhibit the attribute "cleans clothes" (under the demand "I want to wash clothes") as well as the attribute "quenches thirst" (under the demand "I am thirsty").

Attribute learning stage

So how can the model be made to understand attributes such as "quenches thirst" and "cleans clothes"? The attributes an item exhibits under a given demand are relatively stable common sense. In recent years, large language models (LLMs) have shown an impressive grasp of the common sense of human society, so Dong Hao's team decided to learn this common sense from an LLM. They first asked the LLM to generate a large number of demand instructions (called Language-grounding Demand, LGD in the figure), and then asked the LLM which items can satisfy these demand instructions (called Language-grounding Object, LGO in the figure).

Note that the prefix "Language-grounding" emphasizes that these demands and objects are obtained from the LLM and do not depend on a specific scene, whereas "World-grounding" in the figure emphasizes that the demands and objects are tied to a specific environment (such as ProcTHOR, Replica, and other scene datasets).

To obtain the attributes of an LGO under an LGD, the authors encoded the LGD with BERT and the LGO with the CLIP-Text-Encoder, then concatenated the two to obtain Demand-object Features. Recall that items satisfying the same demand have similar attributes. The authors used this similarity to define positive and negative samples and then trained the item attributes with contrastive learning. Specifically, for two concatenated Demand-object Features, if the corresponding items can satisfy the same demand, the two features are positive samples of each other (for example, if item a and item b in the figure can both satisfy demand D1, then DO1-a and DO1-b are positive samples of each other); any other pair of features are negative samples of each other. The authors fed the Demand-object Features into an Attribute Module with a TransformerEncoder architecture and trained it with the InfoNCE loss.
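The positive/negative definition and the InfoNCE objective can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the toy 2-D vectors stand in for Attribute Module outputs, cosine similarity and the temperature value are assumptions, and the loss is written in a common one-positive-versus-negatives form averaged over all anchor-positive pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(features, demand_ids, temperature=0.1):
    """Average InfoNCE loss over all (anchor, positive) pairs.

    Two features are positives if they were built from the same demand;
    features built from different demands are negatives (assumed labeling).
    """
    n = len(features)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and demand_ids[j] == demand_ids[i]]
        negatives = [k for k in range(n) if demand_ids[k] != demand_ids[i]]
        neg_sum = sum(
            math.exp(cosine(features[i], features[k]) / temperature) for k in negatives
        )
        for j in positives:
            pos = math.exp(cosine(features[i], features[j]) / temperature)
            losses.append(-math.log(pos / (pos + neg_sum)))
    if not losses:
        raise ValueError("need at least one pair of features sharing a demand")
    return sum(losses) / len(losses)


# Toy features: indices 0 and 1 share demand 0, indices 2 and 3 share demand 1.
demand_ids = [0, 0, 1, 1]
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # positives close together
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]      # positives far apart

loss_clustered = info_nce(clustered, demand_ids)
loss_mixed = info_nce(mixed, demand_ids)
print(loss_clustered < loss_mixed)
```

The loss is low when features built from the same demand lie close together and far from features of other demands, which is the geometry the contrastive stage is meant to produce.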

Navigation policy learning stage

Through contrastive learning, the Attribute Module has learned the common sense provided by the LLM. In the navigation policy learning stage, the parameters of the Attribute Module are loaded directly, and imitation learning is then used to learn from trajectories collected by the A* algorithm. At each time step, the authors use the DETR model to segment the items in the current field of view to obtain the World-grounding Objects, which are then encoded by the CLIP-Visual-Encoder. The remaining steps are similar to the attribute learning stage. Finally, the BERT features of the demand instruction, the global image features, and the attribute features are concatenated and fed into a Transformer model, which outputs an action.
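As a rough illustration of how A* can supply expert trajectories for imitation learning, the sketch below plans on a small occupancy grid and records (state, action) pairs. The grid world, the four-action set, and the function names are assumptions for illustration; in the actual work the states would be simulator observations rather than grid cells.

```python
import heapq

# Assumed discrete action set: name -> (row delta, column delta).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def astar_actions(grid, start, goal):
    """Shortest action sequence from start to goal, or None if unreachable.

    grid[r][c] == 1 marks an obstacle; the heuristic is Manhattan distance.
    """
    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]
    came_from = {start: None}  # cell -> (previous cell, action taken)
    cost = {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            break
        if g > cost[cell]:
            continue  # stale queue entry
        for name, (dr, dc) in ACTIONS.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            if g + 1 < cost.get(nxt, float("inf")):
                cost[nxt] = g + 1
                came_from[nxt] = (cell, name)
                heapq.heappush(frontier, (g + 1 + heuristic(nxt), g + 1, nxt))
    if goal not in came_from:
        return None
    actions, cell = [], goal
    while came_from[cell] is not None:
        cell, name = came_from[cell]
        actions.append(name)
    actions.reverse()
    return actions


def expert_pairs(grid, start, goal):
    """Replay the A* plan and record (state, action) pairs for imitation learning."""
    actions = astar_actions(grid, start, goal)
    if actions is None:
        return []
    pairs, cell = [], start
    for name in actions:
        pairs.append((cell, name))
        dr, dc = ACTIONS[name]
        cell = (cell[0] + dr, cell[1] + dc)
    return pairs


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
pairs = expert_pairs(grid, start=(0, 0), goal=(2, 0))
print(pairs)
```

A policy trained by imitation would then be fit to predict the recorded action from the features available at each state.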

It is worth noting that the authors used the CLIP-Text-Encoder in the attribute learning stage and the CLIP-Visual-Encoder in the navigation policy learning stage. This takes advantage of CLIP's strong vision-text alignment to transfer the textual common sense learned from the LLM to the visual input at each time step.

Experimental results

The experiments were conducted on the AI2-THOR simulator with the ProcTHOR dataset. The results show that this method performs significantly better than variants of previous visual object navigation algorithms and than algorithms supported by large language models.

VTN is a closed-vocabulary navigation algorithm that can only perform navigation tasks on preset items. The authors built several variants of it. However, whether the BERT features of the demand instruction or GPT's parse of the instruction is used as input, the results are not ideal. ZSON is an open-vocabulary navigation algorithm, but because CLIP aligns demand instructions with images poorly, several ZSON variants also fail to complete the demand-driven navigation task well. Algorithms based on heuristic search plus an LLM explore inefficiently because scenes in the ProcTHOR dataset are large, so their success rate is not high either. Pure LLM algorithms, such as GPT-3-Prompt and MiniGPT-4, show poor reasoning about unseen locations in the scene and therefore cannot efficiently discover items that satisfy the demand.

Ablation experiments show that the Attribute Module significantly improves the navigation success rate. The t-SNE visualization shows that the Attribute Module successfully learns the attribute features of items through demand-conditioned contrastive learning. Replacing the Attribute Module architecture with an MLP lowers performance, indicating that the TransformerEncoder architecture is better suited to capturing attribute features. BERT extracts the features of demand instructions well, which improves generalization to unseen instructions.

Here are some visualizations: [figures not included]

The corresponding author of this study, Dr. Dong Hao, is currently an assistant professor and doctoral supervisor at the Center on Frontiers of Computing Studies, Peking University, a Boya Young Scholar, and a Zhiyuan (BAAI) Scholar. He founded the Peking University Hyperplane Lab in 2019 and has led it since. He has published more than 40 papers in top international conferences and journals such as NeurIPS, ICLR, CVPR, ICCV, and ECCV, with more than 4,700 citations on Google Scholar, and has won the ACM MM Best Open Source Software Award and the OpenI Outstanding Project Award. He has also served many times as an area chair or associate editor for top international conferences such as NeurIPS, CVPR, AAAI, and ICRA, has undertaken a number of national and provincial projects, and has chaired a major project under the Ministry of Science and Technology's New Generation Artificial Intelligence 2030 program.

The first author of the paper, Wang Hongcheng, is currently a second-year doctoral student at the School of Computer Science, Peking University. His research interests focus on robotics, computer vision, and psychology. He hopes to align humans and robots by starting from human behavior, cognition, and motivation.

Reference links:

[1] https://zsdonghao.github.io/

[2] https://whcpumpkin.github.io/
