


Humans have long dreamed of robots that can help with everyday life and work. Requests like "Please turn down the air conditioner" and even "Please build me an online shopping site" have become reality in recent years thanks to home assistants and tools such as GitHub Copilot, built on OpenAI models.
The emergence of GPT-4 further demonstrated the potential of multimodal large models for visual understanding. Among open-source small and medium-sized models, LLaVA and MiniGPT-4 perform well: they can look at pictures and chat, and can even guess the recipe behind a food photo. However, these models still face an important challenge in practice: they lack precise localization ability. They cannot give the exact position of an object in an image, nor can they interpret complex human instructions to detect specific objects, so they often cannot carry out concrete tasks for humans. In real scenarios, people run into complex problems; if they could just take a photo and ask a smart assistant for the correct answer, such a "photo and ask" capability would simply be cool.
To realize this "photo and ask" function, a robot needs several abilities:
1. Language understanding: it can listen to and understand human intentions
2. Visual understanding: it can recognize the objects in the image it sees
3. Common-sense reasoning: it can convert complex human intentions into precise, locatable targets
4. Object localization: it can locate and detect the corresponding objects in the image
Currently, only a few large models (such as Google's PaLM-E) possess all four capabilities. However, researchers from the Hong Kong University of Science and Technology (HKUST) and the University of Hong Kong (HKU) have proposed DetGPT (short for DetectionGPT), a fully open-source model that needs only about three million parameters to be fine-tuned for the model to gain complex reasoning and local object-localization abilities, and it generalizes to most scenes. This means the model can interpret abstract human instructions by reasoning over its own knowledge and easily pinpoint the objects of interest in a picture. The team has turned the model into a "photo and ask" demo, which you are welcome to try online: https://detgpt.github.io/
DetGPT lets users operate everything through natural language, with no need for cumbersome commands or interfaces. At the same time, DetGPT has both intelligent reasoning and object-detection abilities, so it can accurately understand the user's needs and intentions. For example, when a human says "I want a cold drink," the robot first searches the scene for a cold drink but does not find one. It then reasons: "There is no cold drink in the scene; where would one be found?" Drawing on its powerful common-sense knowledge, it thinks of the refrigerator, scans the scene, finds the refrigerator, and successfully locks onto where the drink would be.
- Open-source code: https://www.php.cn/link/10eb6500bd1e4a3704818012a1593cc3
- Online demo: https://detgpt.github.io/
Thirsty in summer and wondering where the cold drink in the picture is? DetGPT easily reasons its way to the refrigerator:
Want to get up early tomorrow? DetGPT easily picks out the electronic alarm clock:
Have high blood pressure and tire easily? At the fruit market but unsure which fruit can help relieve high blood pressure? DetGPT acts as your nutrition coach:
Stuck in Zelda? DetGPT helps you get through the "Daughter Kingdom" quest in disguise:
Are there any dangerous items within the picture's field of view? DetGPT becomes your safety officer:
Which items in the picture are dangerous for children? DetGPT handles that too:
What are the features of DetGPT?
- Much stronger understanding of specific objects in images. Unlike previous multimodal image-text dialogue models, DetGPT can retrieve and localize target objects in a picture by understanding the user's instruction, rather than simply describing the whole picture.
- Understands complex human instructions, lowering the barrier for users to ask questions. For example, the model can understand the request "Find the foods in the picture that can alleviate high blood pressure," whereas traditional object detection requires the answer to already be known to the human and the detection category (e.g., "banana") to be preset in advance.
- DetGPT can reason over the knowledge in its underlying LLM to accurately locate the objects in an image that solve a complex task. For a task like "foods that relieve high blood pressure," DetGPT reasons step by step: relieve high blood pressure -> potassium can relieve high blood pressure -> bananas are rich in potassium -> bananas can relieve high blood pressure -> the object to identify is "banana" (see the prompt sketch after this list).
- Provides answers beyond everyday human knowledge. For uncommon questions, such as which fruits are rich in potassium, the model can answer from its existing knowledge.
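To make the reasoning chain above concrete, here is a minimal sketch in Python of what a prompt for reasoning-based detection might look like. The template wording, the `<image>` placeholder, the "Objects:" convention, and the parse_objects helper are all illustrative assumptions; the article does not disclose DetGPT's actual prompt format.

```python
# Illustrative only: DetGPT's real prompt template is not given in this
# article, so the wording and the "Objects:" convention are assumptions.
PROMPT_TEMPLATE = (
    "<image>\n"
    "User instruction: {instruction}\n"
    "Reason step by step about which objects in the image can fulfill the "
    "instruction, then end with a line of the form: Objects: obj1, obj2, ..."
)

def parse_objects(llm_answer: str) -> list[str]:
    """Extract the final object list from the model's free-form answer."""
    for line in reversed(llm_answer.strip().splitlines()):
        if line.lower().startswith("objects:"):
            return [o.strip() for o in line.split(":", 1)[1].split(",") if o.strip()]
    return []  # the model produced no parsable object list

# The high-blood-pressure example from the list above:
answer = (
    "Potassium can relieve high blood pressure, and bananas are rich in "
    "potassium. There are bananas in the image.\n"
    "Objects: banana"
)
print(PROMPT_TEMPLATE.format(instruction="Find foods that relieve high blood pressure"))
print(parse_objects(answer))  # ['banana']
```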
A new direction worth attention: using common-sense reasoning for more accurate open-set object detection
Traditional detection tasks require presetting the candidate object categories, but describing the objects to be detected accurately and exhaustively is unfriendly, even unrealistic, for humans. Specifically: (1) Limited by memory and knowledge, people cannot always name the exact targets they want detected. For example, a doctor may advise a patient with high blood pressure to eat more potassium-rich fruit, but without knowing which fruits are rich in potassium, the patient cannot give the model specific fruit names to detect. If instead the task "identify potassium-rich fruits" is handed directly to the detection model, the human only needs to take a photo, and the model itself thinks, reasons, and detects the potassium-rich fruits; the problem becomes much simpler. (2) The object categories humans can enumerate are never comprehensive. For example, when monitoring behaviors that violate public order in public places, a human might list a few scenarios such as carrying knives or smoking; but if the task "detect behaviors that violate public order" is handed directly to the detection model, the model can think for itself, infer from its own knowledge, capture more bad behaviors, and generalize to more related categories that need detecting. After all, an ordinary person's knowledge is limited, and so is the list of object types they can cite; with a ChatGPT-like brain assisting and reasoning, the instructions humans need to give become much simpler, and the answers obtained become much more accurate and comprehensive.
Motivated by the abstractness and limitations of human instructions, the researchers from HKUST and HKU proposed the new direction of "reasoning-based object detection." Simply put, a human gives an abstract task, and the model itself figures out, through understanding and reasoning, which objects in the picture could complete that task, and detects them. As a simple example, when a human asks "I want a cold drink, where can I find it?" and the model sees a photo of a kitchen, it can detect the "refrigerator." This topic requires combining the image-understanding ability of multimodal models with the rich knowledge stored in large language models, and applying both to fine-grained detection: the language model's "brain" interprets abstract human instructions and accurately localizes the objects of interest in the image, without any preset object categories.
Method introduction
"Inferential target detection" is a difficult problem, because the detector not only needs to understand and reason about the user's coarse-grained/abstract instructions, but also needs to analyze the current situation. See the visual information to locate the target object. In this direction, researchers from HKUST & HKU have conducted some preliminary explorations. Specifically, they utilize a pre-trained visual encoder (BLIP-2) to obtain image visual features and align the visual features to the text space through an alignment function. Use a large-scale language model (Robin/Vicuna) to understand user questions and combine the visual information seen to reason about the objects that the user is really interested in. The object names are then fed to a pretrained detector (Grouding-DINO) for prediction of specific locations. In this way, the model can analyze the picture according to any instructions of the user and accurately predict the location of the object of interest to the user.
It is worth noting that the main difficulty is that, for different specific tasks, the model must produce task-specific output while damaging the model's original capabilities as little as possible. To guide the language model to follow a specific pattern, reasoning over the image and the user's instruction and then generating output that conforms to the detection format, the research team used ChatGPT to generate cross-modal instruction data for fine-tuning. Specifically, based on 5,000 COCO images, they used ChatGPT to create roughly 30,000 cross-modal image-text instruction-tuning examples. To improve training efficiency, they froze all other model parameters and learned only the cross-modal linear mapping. The experiments show that even when only this linear layer is fine-tuned, the language model can understand fine-grained image features, follow the specified pattern, and perform reasoning-based image detection with excellent performance.
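The training recipe described here, freezing everything except the cross-modal linear mapping, amounts to a few lines of standard PyTorch. The sketch below assumes the hypothetical DetGPTPipeline from the previous snippet is in scope; the optimizer choice and learning rate are assumptions, not values reported by the team, and the placeholder layer sizes mean the printed parameter count only roughly matches the few million trainable parameters mentioned above.

```python
import torch

# Assumes the hypothetical DetGPTPipeline sketch above is in scope.
pipeline = DetGPTPipeline()

# Freeze every parameter, then re-enable gradients only for the
# cross-modal linear mapping.
for param in pipeline.parameters():
    param.requires_grad_(False)
for param in pipeline.align.parameters():
    param.requires_grad_(True)

trainable = [p for p in pipeline.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # only the alignment layer is counted

# The optimizer sees only the alignment layer; AdamW and the learning
# rate here are assumptions, not values reported by the authors.
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```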
This research direction has great potential. Built on this technology, home robots will shine even brighter: people at home will be able to use abstract or coarse-grained voice instructions to have robots understand, identify, and locate the items they need and provide related services. The technology will also energize industrial robotics: industrial robots can collaborate more naturally with human workers, accurately understanding their instructions and needs to achieve intelligent decision-making and operation. On the production line, workers can use coarse-grained voice instructions or text input to have a robot automatically understand, identify, and locate the items to be processed, improving production efficiency and quality.
With object-detection models that can reason for themselves, we can build smarter, more natural, and more efficient robots that provide humans with more convenient, efficient, and humane services. This is a promising area that deserves attention and further exploration from more researchers.
It is worth mentioning that DetGPT supports multiple language models and has been verified with two of them: Robin-13B and Vicuna-13B. The Robin series is a family of dialogue models trained by the LMFlow team at HKUST (https://github.com/OptimalScale/LMFlow), achieving results comparable to Vicuna on multiple language-ability benchmarks (model download: https://github.com/OptimalScale/LMFlow#model-zoo). Machine Heart previously reported that the LMFlow team can train a custom ChatGPT-style model in just 5 hours on a consumer RTX 3090 GPU. Now this team, together with the HKU NLP lab, has brought us another multimodal surprise.
