How many steps does it take to put an elephant in the refrigerator? NVIDIA releases ProgPrompt, letting language models generate plans for robots

For robots, task planning is an unavoidable problem.

To complete a real-world task, you must first know how many steps it takes to put an elephant in the refrigerator.

Even the relatively simple task of throwing away an apple contains multiple sub-steps: the robot must observe the position of the apple and, if it does not see the apple, keep looking for it; then it must approach the apple, grab it, and find and approach the trash can.

If the trash can is closed, the robot must open it first, then throw the apple in and close the trash can.

But the specific implementation details of every task cannot all be designed by humans. How to generate the action sequence from a single command becomes the problem.

Generating a sequence from a command? Isn't that exactly the job of a language model?

Previously, researchers used large language models (LLMs) to score candidate next actions given an input task instruction and then generate action sequences.

Instructions are described in natural language and do not contain additional domain information.

But such methods either need to enumerate all possible next actions for scoring, or they generate free-form text with no restrictions, which may contain actions that are impossible for a specific robot in the current environment.

Recently, the University of Southern California and NVIDIA jointly introduced ProgPrompt, which also uses a language model to plan tasks from input instructions. It relies on a programmatic prompt structure that enables the generated plans to work across different environments, robots with different capabilities, and different tasks.

To keep the task specification standardized, the researchers use Python-style code to tell the language model which actions are available, what objects are in the environment, and which programs are executable.

For example, given the "throw apple" command, the language model generates a corresponding program.

ProgPrompt achieved state-of-the-art (SOTA) performance on the VirtualHome tasks, and the researchers also deployed it on a physical robot arm for tabletop tasks.

Magical Language Model

Completing daily household tasks requires both a common sense understanding of the world and situational knowledge of the current environment.

To create a task plan for "cooking dinner", the minimum knowledge that the agent needs includes: the functions of objects, for example that stoves and microwave ovens can be used for heating; the logical order of actions, for example that the oven must be preheated before food is added; and the task relevance of objects and actions, for example that heating and finding ingredients are the actions first associated with "dinner".

But without state feedback, this kind of reasoning cannot be carried out.

The agent needs to know where there is food in the current environment, for example whether there is fish or chicken in the refrigerator.

Autoregressive large language models trained on large corpora can generate text sequences conditioned on input prompts and show significant multi-task generalization.

For example, if you enter "make dinner", the language model can generate subsequent sequences, such as opening the refrigerator, picking up the chicken, picking up the soda, closing the refrigerator, turning on the light switch, etc.

The generated text sequence needs to be mapped to the agent's action space. For example, if the generated instruction is "reach out and pick up a jar of pickles", the corresponding executable action may be "pick up jar"; the model then computes a probability score for the action.
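
The mapping step can be illustrated with a small sketch. It is not the method used by the systems described here: the `map_to_action` helper and the list of admissible actions are hypothetical, and a simple word-overlap (Jaccard) score stands in for the model-based probability score.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def map_to_action(generated: str, admissible: list[str]) -> tuple[str, float]:
    """Return the admissible action most similar to the generated text.

    Similarity is the Jaccard overlap of word sets, a hypothetical
    stand-in for a language-model probability score.
    """
    gen = tokens(generated)
    scored = []
    for action in admissible:
        act = tokens(action)
        scored.append((len(gen & act) / len(gen | act), action))
    score, best = max(scored)
    return best, score


# Hypothetical action space of the agent.
ADMISSIBLE = ["pick up jar", "open refrigerator", "switch on light"]

best_action, best_score = map_to_action(
    "reach out and pick up a jar of pickles", ADMISSIBLE
)
print(best_action, round(best_score, 2))
```

Note that every admissible action has to be scored for each generated step, which is the enumeration cost mentioned earlier.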

But without environmental feedback, if there is no chicken in the refrigerator and the model still chooses "pick up the chicken", the task will fail, because "make dinner" does not contain any information about the state of the world.

ProgPrompt cleverly exploits programming-language structure in task planning, because existing large language models are usually pre-trained on corpora that include programming tutorials and code documentation.

ProgPrompt provides the language model with a Pythonic program header as a prompt, importing the available action space, expected parameters, and available objects in the environment.

Functions such as make_dinner and throw_away_banana are then defined, whose bodies are sequences of actions that operate on objects. Environment state feedback is incorporated by asserting the plan's preconditions, such as being close to the refrigerator before trying to open it, and by responding to assertion failures with recovery actions.

Most importantly, the ProgPrompt program also includes comments written in natural language that explain the goals of the actions, which improves the task success rate of the generated plan programs.

ProgPrompt

With the idea in place, the overall workflow of ProgPrompt is clear. It consists of three parts: representing plans as Pythonic functions, constructing programming-language prompts, and generating and executing task plans.

1. Express the robot plan as a Pythonic function

Plan functions include API calls to action primitives, comments that summarize actions, and assertions that track execution.

Each action primitive requires an object as a parameter. For example, the "put salmon in the microwave" task includes a call to find(salmon), where find is an action primitive.

Comments in the code provide natural-language summaries of the action sequences that follow. They help break a high-level task into logical subtasks; for "put salmon in the microwave", these are "grab the salmon" and "put the salmon in the microwave".

Comments also let the language model know the current goal and reduce the chance of incoherent, inconsistent, or repeated output, much as chain-of-thought prompting generates intermediate results.

Assertions provide an environment feedback mechanism that checks whether preconditions hold and enables error recovery when they do not. For example, before a grab action the plan asserts that the agent is close to the salmon; otherwise the agent must first perform a find action.
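
The interplay of action primitives, comments, and assertions can be sketched in runnable Python. This is a toy illustration, not the authors' code: the `Env` class, the `check` helper, and the simplified plan (which skips opening and closing the microwave) are assumptions made for the example.

```python
class Env:
    """Toy environment state (hypothetical, for illustration only)."""

    def __init__(self):
        self.close_to = None   # object the agent is currently next to
        self.holding = None    # object the agent is currently holding
        self.inside = {}       # object -> container it was placed in
        self.log = []          # executed primitives, in order


env = Env()


def find(obj):
    """Action primitive: walk to an object."""
    env.close_to = obj
    env.log.append(f"find({obj})")


def grab(obj):
    """Action primitive: pick up an object the agent is close to."""
    if env.close_to != obj:
        raise RuntimeError(f"cannot grab {obj}: agent is not close to it")
    env.holding = obj
    env.log.append(f"grab({obj})")


def putin(obj, container):
    """Action primitive: place a held object into a nearby container."""
    if env.holding != obj or env.close_to != container:
        raise RuntimeError(f"cannot put {obj} in {container}")
    env.inside[obj] = container
    env.holding = None
    env.log.append(f"putin({obj}, {container})")


def check(condition, recover):
    """Assertion with recovery: run `recover` if the precondition fails."""
    if not condition():
        recover()


def microwave_salmon():
    # 0: grab the salmon
    check(lambda: env.close_to == "salmon", lambda: find("salmon"))
    grab("salmon")
    # 1: put the salmon in the microwave
    check(lambda: env.close_to == "microwave", lambda: find("microwave"))
    putin("salmon", "microwave")


microwave_salmon()
print(env.log)
```

Because the agent starts out close to nothing, both precondition checks fail and trigger a find action before the step they guard.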

2. Constructing the programming-language prompt

The prompt needs to give the language model information about the environment and the main actions, including observations, action primitives, and examples, forming a Pythonic prompt for the language model to complete.

The language model then completes the prompt with an executable function, here microwave_salmon().

In the microwave-salmon task, a reasonable first step that an LLM might generate is to take out the salmon, but the agent responsible for executing the plan may not have such an action primitive.

So that the language model knows the agent's action primitives, they are imported through an import statement in the prompt, which also limits the output to functions available in the current environment.

To change the agent's action space, you only need to update the list of imported functions.

The variable objects provides all available objects in the environment as a list of strings.

The prompt also includes a number of fully executable program plans as examples. Each example demonstrates how to complete a given task, such as throw_away_lime, using the actions and objects available in a given environment.
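
Assembling such a prompt might look roughly like the following sketch. The `build_prompt` helper, the module name `actions`, the object list, and the body of the `throw_away_lime` example are all assumptions made for illustration; only the overall layout (import line, `objects` list, example plans, new function header) follows the description above.

```python
def build_prompt(actions, objects, examples, task_name):
    """Assemble a Pythonic prompt that ends with the header of the new task."""
    parts = [
        # Import line: tells the model which action primitives exist.
        f"from actions import {', '.join(actions)}",
        "",
        # Object list: everything available in the current environment.
        f"objects = {objects!r}",
        "",
    ]
    for example in examples:
        parts += [example.strip(), ""]
    # The language model is expected to complete this function body.
    parts.append(f"def {task_name}():")
    return "\n".join(parts)


# Hypothetical example plan; the real examples may differ.
EXAMPLE = '''
def throw_away_lime():
    # 0: find the lime
    find('lime')
    # 1: grab the lime
    grab('lime')
    # 2: put the lime in the garbage can
    find('garbagecan')
    open('garbagecan')
    putin('lime', 'garbagecan')
    close('garbagecan')
'''

prompt = build_prompt(
    actions=["find", "grab", "putin", "open", "close"],
    objects=["lime", "garbagecan", "salmon", "microwave"],
    examples=[EXAMPLE],
    task_name="microwave_salmon",
)
print(prompt)
```

Changing the robot or the environment then only means passing a different `actions` or `objects` list; the rest of the prompt stays the same.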

3. Generation and execution of the task plan

Given a task, the plan is inferred entirely by the language model from the ProgPrompt prompt; the generated plan can then be executed on a virtual agent or a physical robot system. An interpreter is required to execute each action command against the environment.

During execution, assertion checks are performed in a closed-loop manner and feedback is provided based on the current environment state.
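
A minimal sketch of the parsing half of such an interpreter is shown below. It assumes the generated plan is valid Python made up of bare calls with string arguments, which may not hold for real model output; the plan text, the `extract_steps` helper, and the deliberately unsupported `fly` action are hypothetical. It uses the standard `ast` module to read the plan without executing it.

```python
import ast

# Hypothetical model output, including one action the agent does not support.
PLAN = '''
def microwave_salmon():
    # 0: grab the salmon
    find('salmon')
    grab('salmon')
    # 1: put the salmon in the microwave
    find('microwave')
    open('microwave')
    putin('salmon', 'microwave')
    close('microwave')
    fly('salmon')
'''

ALLOWED = {"find", "grab", "putin", "open", "close"}


def extract_steps(plan_source, allowed):
    """Split the calls in a plan into executable and rejected steps."""
    func = ast.parse(plan_source).body[0]
    executable, rejected = [], []
    for stmt in func.body:
        # Only bare call statements such as find('salmon') are handled here.
        if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
            continue
        call = stmt.value
        name = call.func.id if isinstance(call.func, ast.Name) else None
        args = [ast.literal_eval(arg) for arg in call.args]
        (executable if name in allowed else rejected).append((name, args))
    return executable, rejected


executable, rejected = extract_steps(PLAN, ALLOWED)
for name, args in executable:
    print("execute:", name, args)
print("rejected:", rejected)
```

A full interpreter would dispatch each executable step to the simulator or robot and evaluate assertions against the current state between steps.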

In the experiments, the researchers evaluated the method on the VirtualHome (VH) simulation platform.

The state in VH consists of a set of objects and their properties and relations, such as the salmon being inside the microwave (in) or the agent being close to an object (agent_close_to).

The action space includes grab, putin, putback, walk, find, open, close, and so on.

Finally, experiments were run in 3 VH environments, each containing 115 distinct objects. The researchers created a dataset of 70 household tasks with highly abstract instructions, such as "microwave salmon", and created ground-truth action sequences for them.

The generated programs were evaluated in VirtualHome using success rate (SR), goal conditions recall (GCR), and executability (Exec). The results show that ProgPrompt is significantly better than the baseline and LangPrompt. The table also shows how each feature improves performance.
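
The article does not define the three metrics, so the sketch below encodes one plausible reading as an assumption: Exec is the fraction of plan steps the agent can execute, GCR is the fraction of goal conditions reached in the final state, and SR is the fraction of runs that reach every goal condition. The sample data is made up, and executability is simplified to a check on action names.

```python
def executability(plan, allowed_actions):
    """Exec: fraction of plan steps whose action name the agent supports."""
    if not plan:
        return 0.0
    return sum(1 for action, _ in plan if action in allowed_actions) / len(plan)


def goal_condition_recall(goal, achieved):
    """GCR: fraction of goal conditions that hold in the final state."""
    if not goal:
        return 1.0
    return len(goal & achieved) / len(goal)


def success_rate(runs):
    """SR: fraction of runs in which every goal condition was achieved."""
    return sum(1 for goal, achieved in runs if goal <= achieved) / len(runs)


# Made-up goal conditions and final states for two runs of one task.
GOAL = {("salmon", "inside", "microwave"), ("microwave", "closed")}
runs = [
    (GOAL, {("salmon", "inside", "microwave"), ("microwave", "closed"),
            ("agent", "close_to", "microwave")}),
    (GOAL, {("salmon", "inside", "microwave")}),
]
plan = [("find", "salmon"), ("grab", "salmon"),
        ("fly", "salmon"), ("putin", "salmon")]

exec_score = executability(plan, {"find", "grab", "putin"})
gcr_scores = [goal_condition_recall(goal, achieved) for goal, achieved in runs]
sr_score = success_rate(runs)
print(exec_score, gcr_scores, sr_score)
```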

The researchers also ran real-world experiments using a Franka Emika Panda robot with a parallel-jaw gripper, assuming that a pick-and-place policy is available.

This policy takes two point clouds, one of the target object and one of the target container, as input and performs pick-and-place operations to place the object on or inside the container.

The system uses ViLD, an open-vocabulary object detection model, to identify and segment objects in the scene and to build the list of available objects in the prompt.

Unlike in the virtual environment, the object list here is a local variable of each planning function, which allows more flexibility to adapt to new objects.

The plan output by the language model contains function calls in the form of grab and putin.

Due to real-world uncertainties, the assertion-based closed-loop option was not implemented in the experimental setup.

It can be seen that in the sorting task, the robot was able to identify the banana and the strawberry as fruits and generate plan steps to place them on the plate and put the bottle in the box.

Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.