DetGPT: a model that can read images, chat, and perform cross-modal reasoning and localization, ready for complex real-world scenarios
Humans have long dreamed of robots that can help with everyday life and work. Requests such as "please turn down the air conditioner" and even "please build me an online shopping website" have become reality in recent years thanks to home assistants and tools such as Copilot, which is built on OpenAI models.
The release of GPT-4 further demonstrated the potential of multi-modal large models for visual understanding. Among open-source small and medium-sized models, LLaVA and MiniGPT-4 perform well: they can look at a picture and chat about it, and can even guess the recipe behind a food photo. However, these models still face an important obstacle to practical deployment: they lack precise localization ability, cannot give the specific position of an object in an image, and cannot interpret complex human instructions to detect particular objects, so they often fail to carry out concrete tasks for people. In real scenarios, when people run into a complicated problem, being able to take a photo, ask the smart assistant, and get the right answer would be a genuinely useful "photo-and-ask" capability.
To realize this "photo-and-ask" capability, a robot needs several abilities:
1. Language understanding: it can listen to and understand human intentions
2. Visual understanding: it can recognize the objects in the image it sees
3. Common-sense reasoning: it can convert complex human intentions into precise, locatable targets
4. Object localization: it can locate and detect the corresponding objects in the image
Currently, only a few large models (such as Google's PaLM-E) possess all four capabilities. Now, however, researchers from the Hong Kong University of Science and Technology and the University of Hong Kong have proposed a fully open-source model, DetGPT (short for DetectionGPT), in which only about three million parameters need to be fine-tuned for the model to gain complex reasoning and local object-localization abilities, and it generalizes to most scenes. This means the model can use its own knowledge to reason about abstract human instructions and readily identify the objects of interest in a picture. The team has turned the model into a "photo-and-ask" demo that anyone can try online: https://detgpt.github.io/
DetGPT lets users operate everything through natural language, without cumbersome commands or interfaces. At the same time, DetGPT has both reasoning and object-detection capabilities, so it can accurately understand the user's needs and intentions. For example, when a person says "I want a cold drink", the robot first looks for a cold drink in the scene but does not find one. It then reasons: "There is no cold drink in view, so where would one be kept?" Drawing on the common-sense knowledge of its language model, it thinks of the refrigerator, scans the scene, finds the refrigerator, and thereby locks onto where the drink is.
Thirsty in summer and wondering where the cold drinks are in the picture? DetGPT understands and finds the refrigerator:
Want to get up early tomorrow? DetGPT easily picks out the electronic alarm clock:
Have high blood pressure and tire easily? At the fruit market and unsure which fruit helps with high blood pressure? DetGPT acts as your nutrition coach:
Can't clear a Zelda game? DetGPT helps you, in a roundabout way, get past the Daughter Kingdom level:
Is there anything dangerous in the picture's field of view? DetGPT becomes your safety officer:
Which items in the picture are dangerous for children? DetGPT handles that too:
A new direction worth attention: using common-sense reasoning for more accurate open-set object detection
Traditional detection tasks require the possible object categories to be specified in advance. But describing the objects to be detected accurately and comprehensively is unfriendly, or even unrealistic, for humans. Specifically: (1) Limited by memory and knowledge, people cannot always name the exact objects they want detected. For example, a doctor may advise someone with high blood pressure to eat more fruit to supplement potassium, but if the person does not know which fruits are rich in potassium, they cannot give the model specific fruit names to detect. If instead the task of "finding potassium-rich fruit" is handed to the detection model itself, the human only needs to take a photo, and the model thinks, reasons, and detects the relevant fruit, which makes the problem far simpler. (2) The categories humans can enumerate are rarely comprehensive. For example, when monitoring behavior that violates public order in public places, a person might only list a few cases such as carrying knives or smoking; but if the task of "detecting behavior that violates public order" is handed to the detection model directly, the model can reason from its own knowledge, capture more kinds of misbehavior, and generalize to more of the relevant categories that need to be detected. After all, an ordinary person's knowledge is limited, and so is the list of objects they can name; with a ChatGPT-like brain assisting with reasoning, the instructions humans need to give become much simpler, and the answers become much more accurate and complete.
Motivated by how abstract and limited human instructions can be, the researchers from HKUST and HKU proposed the new direction of "reasoning-based object detection". Simply put, a human gives an abstract task, and the model reasons for itself about which objects in the image could fulfill that task and then detects them. As a simple example, when a person asks "I want a cold drink, where can I find one?" and the model is shown a photo of a kitchen, it should detect the refrigerator. The task demands a tight combination of the image-understanding ability of multi-modal models and the rich knowledge stored in large language models, applied to fine-grained detection: using the language model as a brain to understand abstract human instructions and precisely locate the objects of interest in the picture, without any preset object categories.
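To make the input-output contract of this task concrete, here is a minimal, hypothetical sketch of how an abstract instruction could be wrapped in a prompt and how detector-ready object names could be parsed from the model's answer. The prompt wording, the "Objects to detect:" marker, and the sample answer are illustrative assumptions, not DetGPT's actual format.

```python
import re

# Hypothetical prompt template: ask the model to reason, then end with a
# machine-parsable list of object names.
PROMPT_TEMPLATE = (
    "You are looking at an image. Answer the user's request based on what is "
    "visible, and end with a line of the form "
    "'Objects to detect: <name1>, <name2>, ...'.\n"
    "User request: {instruction}"
)

def parse_object_names(llm_answer: str) -> list[str]:
    """Extract the comma-separated object list from the model's final line."""
    match = re.search(r"Objects to detect:\s*(.+)", llm_answer)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

# Hypothetical answer text, mirroring the cold-drink example above.
answer = (
    "The user wants a cold drink. None is visible on the counter, but cold "
    "drinks are usually kept in a refrigerator, and one appears in the image.\n"
    "Objects to detect: refrigerator"
)
print(PROMPT_TEMPLATE.format(instruction="I want a cold drink, where can I find one?"))
print(parse_object_names(answer))  # -> ['refrigerator']
```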
"Inferential target detection" is a difficult problem, because the detector not only needs to understand and reason about the user's coarse-grained/abstract instructions, but also needs to analyze the current situation. See the visual information to locate the target object. In this direction, researchers from HKUST & HKU have conducted some preliminary explorations. Specifically, they utilize a pre-trained visual encoder (BLIP-2) to obtain image visual features and align the visual features to the text space through an alignment function. Use a large-scale language model (Robin/Vicuna) to understand user questions and combine the visual information seen to reason about the objects that the user is really interested in. The object names are then fed to a pretrained detector (Grouding-DINO) for prediction of specific locations. In this way, the model can analyze the picture according to any instructions of the user and accurately predict the location of the object of interest to the user.
Notably, the main difficulty is that the model must produce task-specific output for different tasks while preserving its original capabilities as much as possible. To guide the language model to follow a specific pattern, that is, to reason over the image and the user instruction and then produce output in a detection-friendly format, the team used ChatGPT to generate cross-modal instruction data for fine-tuning. Specifically, based on 5,000 COCO images, they used ChatGPT to create around 30,000 cross-modal image-text fine-tuning examples. To keep training efficient, they froze the other model parameters and learned only the cross-modal linear mapping. The experiments show that even when only this linear layer is fine-tuned, the language model can understand fine-grained image features and follow the required pattern to perform reasoning-based detection, with strong performance.
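A minimal sketch of this parameter-efficient setup might look as follows: everything except the cross-modal linear projection is frozen, and only the projection receives gradient updates. The placeholder modules, the dimensions, and the MSE loss are stand-ins for the real encoder, LLM, and language-modeling objective, chosen only so the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Stand-ins for the frozen components; in the real system these would be the
# BLIP-2 visual encoder and the Robin/Vicuna language model.
visual_encoder = nn.Linear(64, 1408)
language_model = nn.Linear(5120, 5120)
projection = nn.Linear(1408, 5120)   # the only trainable piece

for module in (visual_encoder, language_model):
    for p in module.parameters():
        p.requires_grad = False       # freeze everything except the projection

optimizer = torch.optim.AdamW(projection.parameters(), lr=2e-5)

# One dummy optimization step. A real run would iterate over the ChatGPT-built
# cross-modal instruction data and use the LLM's next-token loss; MSE on random
# targets is used here only to keep the sketch runnable.
raw = torch.randn(4, 32, 64)               # fake image patches
vis_feats = visual_encoder(raw)            # frozen encoder features
target = torch.randn(4, 32, 5120)          # placeholder training signal
loss = nn.functional.mse_loss(projection(vis_feats), target)
loss.backward()
optimizer.step()
print(f"dummy loss: {loss.item():.4f}")
```

The key point the sketch illustrates is which parameters are trainable; in the actual system the training signal comes from the language model itself rather than a regression target.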
This research direction has great potential. With it, home robots can go further: people at home can use abstract or coarse-grained voice instructions to have a robot understand, identify, and locate the items they need and provide related services. In industrial robotics, the same technology can let robots collaborate more naturally with human workers, accurately understand their instructions and needs, and make intelligent decisions and take intelligent actions. On a production line, workers can use coarse-grained voice or text instructions to have the robot automatically understand, identify, and locate the items to be processed, improving production efficiency and quality.
Building on detection models with built-in reasoning, we can develop robots that are more intelligent, natural, and efficient, and that provide people with more convenient and humane services. This is an area with broad prospects that deserves more attention and further exploration from researchers.
It is worth mentioning that DetGPT supports multiple language models and has been verified with two of them: Robin-13B and Vicuna-13B. The Robin series are dialogue models trained by the LMFlow team at the Hong Kong University of Science and Technology (https://github.com/OptimalScale/LMFlow), and they achieve results comparable to Vicuna on several language-ability benchmarks (model download: https://github.com/OptimalScale/LMFlow#model-zoo). Heart of the Machine previously reported that the LMFlow team can train a customized ChatGPT-style model in just five hours on a consumer RTX 3090 GPU. Now this team, together with the HKU NLP lab, has delivered another multi-modal surprise.