


No more prompts: play with a multimodal dialogue system using just your hands. iChat is here!
Xi Xiaoyao Technology Talk Original
Author | IQ has dropped all over the place
Recently, many teams have built on the user-friendly ChatGPT, and a number of these efforts have produced eye-catching results. InternChat emphasizes user-friendliness by letting users interact with the chatbot through means beyond language (cursors and gestures) for multimodal tasks. The name is interesting in its own right: it stands for interaction, nonverbal, and chatbots, and can be shortened to iChat. Unlike existing interactive systems that rely purely on language, iChat significantly improves the efficiency of communication between users and chatbots by adding pointing instructions. The authors also provide a large vision-language model called Husky that can perform image captioning and visual question answering, and that can impress GPT-3.5-turbo with only 7 billion parameters.
However, because the demo website proved so popular, the team has temporarily closed the trial page. For now, let's get a sense of this work through the following video.
Paper title:
InternChat: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language
Paper link:
https://www.php.cn/link/7c9966afcc510cf5a40621d1d92bdaf1
Demo address:
https://www.php.cn/link/e355ad06c5a89f911fbb0aff2de52435
Project address:
https://www.php.cn/link/2d13d901966a8eaa7f9c943eba6a540b
Main features of the system
The authors provide task screenshots on the project homepage that give an intuitive view of what this interactive system can do and how well it works:
(a) Remove obscured objects
(b) Interactive image editing
(c) Image generation
(d) Interactive visual question answering
(e) Interactive image generation
(f) Video highlight explanation
Paper quick overview
We first introduce two concepts used in the paper:
- Vision-centric tasks: tasks that require a computer to understand what it sees of the world and react accordingly.
- Communication in the form of non-verbal instructions: pointing actions such as cursors and hand gestures.
▲Figure 1 The overall architecture of iChat
iChat combines the advantages of pointing and language instructions to perform vision-centric tasks. As shown in Figure 1, this system consists of 3 main components:
- A perception unit that processes pointing instructions on images or videos;
- An LLM controller with an auxiliary control mechanism that can accurately parse language instructions;
- An open world toolkit that integrates HuggingFace's various online models, user-trained private models, and other applications (such as calculators and search engines).
The system can operate effectively at three levels:
- Basic interaction;
- Language-guided interaction;
- Pointing-language-enhanced interaction.
As a result, as shown in Figure 2, the system can still complete complex interactive tasks in cases where a purely language-based system fails.
▲Figure 2 Advantages of the pointing-and-language-driven interactive system
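To make the three components and three interaction levels more concrete, here is a minimal Python sketch. It is not iChat's actual code: the toy label-grid "image", the `Request` type, the `perceive` and `controller` functions, and the two toolkit entries are all hypothetical names invented for illustration, and a simple keyword match stands in for the LLM controller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Toy "image": a grid of object labels standing in for a segmentation mask.
# (Assumption for illustration; the real system works on pixels.)
Image = List[List[str]]


@dataclass
class Request:
    text: Optional[str] = None               # language instruction
    click: Optional[Tuple[int, int]] = None  # pointing instruction as (row, col)


def perceive(image: Image, click: Tuple[int, int]) -> str:
    """Perception unit (stub): resolve a click to the object label under it."""
    row, col = click
    return image[row][col]


def remove_object(image: Image, target: str) -> Image:
    """Toolkit entry (stub): overwrite every cell of the target with background."""
    return [["bg" if cell == target else cell for cell in row] for row in image]


def describe(image: Image, target: str) -> str:
    """Toolkit entry (stub): report how many cells the target occupies."""
    count = sum(cell == target for row in image for cell in row)
    return f"{target} covers {count} cells"


# Hypothetical toolkit: tool name -> callable taking (image, target).
TOOLKIT: Dict[str, Callable] = {"remove": remove_object, "describe": describe}


def controller(image: Image, request: Request):
    """Controller (stub): a keyword match stands in for LLM-based parsing."""
    target = perceive(image, request.click) if request.click else None

    if request.text is None:
        # Level 1, basic interaction: pointing only, return the selected object.
        return target

    words = request.text.lower().split()
    tool_name = next((name for name in TOOLKIT if name in words), None)
    if tool_name is None:
        return "unsupported instruction"

    if target is None:
        # Level 2, language-guided interaction: the text must name the target.
        labels = {cell for row in image for cell in row}
        target = next((word for word in words if word in labels), None)
    if target is None:
        return "no target found"

    # Level 3 (pointing plus language) reaches this line with the clicked target.
    return TOOLKIT[tool_name](image, target)


image = [["cat", "cat", "bg"],
         ["bg", "dog", "dog"]]

print(controller(image, Request(click=(0, 0))))                      # cat
print(controller(image, Request(text="describe the dog")))           # dog covers 2 cells
print(controller(image, Request(text="remove this", click=(1, 1))))  # dog cells become bg
print(controller(image, Request(text="remove this")))                # no target found
```

The last two calls illustrate the point made above: "remove this" is ambiguous as pure language, but becomes well-defined once a click supplies the target.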
Experiment
First, let's look at how combining language and non-language instructions improves communication with the interactive system. To demonstrate the advantages of this hybrid mode over pure language instructions, the research team conducted a user survey. Participants chatted with both Visual ChatGPT and iChat and gave feedback on their experience. The results in Tables 1 and 2 show that iChat is more efficient and user-friendly than Visual ChatGPT.
▲Table 1 User survey of “Remove something”
▲Table 2 User survey of “Replace something with something”
Summary
The system still has some limitations, including:
- The effectiveness of iChat depends heavily on the quality and accuracy of its underlying open-source models. These models may have limitations or biases that adversely affect iChat's performance.
- As user interactions become more complex or the number of instances increases, the system needs to maintain accuracy and response time, which can be challenging for iChat.
- In addition, the current vision and language models lack learnable collaboration; for example, there are no functions that can be tuned with instruction data.
- iChat may have difficulty responding to novel or unusual situations outside of the training data, causing performance to suffer.
- Achieving seamless integration across different devices and platforms can be challenging because of varying hardware capabilities, software limitations, and accessibility requirements.
The roadmap on the project homepage still lists several goals that have not yet been achieved. Among them is Chinese-language interaction, which the editor makes a point of trying on every new dialogue system. For now, the system probably does not support Chinese, and there seems to be no easy way around this: most multimodal datasets are in English, and translating between English and Chinese at runtime wastes online resources and processing time. Chinese support will likely take some time to arrive.