No more "dumb" Siri! Apple unveils a new on-device model that "substantially outperforms GPT-4": it encodes on-screen information as plain text, and even the smallest model beats the baseline system by 5%

Reposted by WBOY, 2024-04-02 21:20:21

Written by Noah

Produced by 51CTO Technology Stack (WeChat ID: blog51cto)

Siri, long criticized by users for being "a bit dim", may finally be getting help!

Siri has been one of the best-known intelligent voice assistants since its launch, but its performance has long been unsatisfactory. The latest research from Apple's artificial intelligence team may significantly change that, and it has raised expectations for the future of the field.

In a related research paper, Apple's AI researchers describe a system that lets Siri do more than identify content in images, making it smarter and more practical. The model is called ReALM. It is benchmarked against GPT-3.5 and GPT-4, and the researchers report that it outperforms GPT-4 on their benchmarks. They believe the model can make Siri more useful across a wide range of scenarios.

1. Motivation: Resolving references to different types of entities

According to Apple's research team: "It is critical for a conversational assistant to understand context, including references. Enabling users to ask questions about what they see on the screen is an important step in ensuring a true voice-operated experience."

For example, during human-computer interaction, users often mention an element or piece of content on the screen, such as when they tell a voice assistant to dial a phone number, navigate to a specific place on a map, or open a specific app or web page. If the conversational assistant cannot work out which entity the user's instruction refers to, it cannot execute the command accurately.

Moreover, ambiguous references are common in human conversation. To achieve natural human-computer interaction and to understand context accurately when users ask a voice assistant about screen content, reference resolution capabilities are crucial.

The advantage of ReALM (Reference Resolution As Language Modeling), the model described in Apple's paper, is that it considers both the content on the user's screen and the ongoing task, and uses large language models to resolve references to different types of entities (both conversational and non-conversational).

Although the traditional text modality is poorly suited to entities displayed on the screen, ReALM recasts reference resolution as a language modeling problem and successfully uses LLMs to resolve references to non-conversational, on-screen entities. This is expected to enable a more intelligent and immersive user experience.

2. Reconstruction: Breaking through the limitations of the traditional text modality

The traditional text modality is poorly suited to entities displayed on the screen because such entities usually carry rich visual information and layout structure, such as images, icons, buttons, and their relative positions. This information is difficult to express fully in a plain-text description.

To address this challenge, ReALM reconstructs the screen by parsing the on-screen entities and their positions, and generates a plain-text representation that reflects the screen content.

Entities are specially marked so that the language model knows where each entity appears and what text surrounds it. The model can thus simulate "seeing" the screen and has the context it needs to understand and resolve on-screen references. According to the paper, this is the first attempt to use a large language model to encode context from screen content, overcoming the difficulty that traditional text modalities have with screen entities.

Specifically, the ReALM system uses the following steps to enable large language models to "understand" and process entities displayed on the screen:

First, upstream data detectors extract the entities in the screen text. Each entity has a type, a bounding box, and a list of the non-entity text elements around it. In other words, for every entity on the screen, the system captures its basic information and the context in which it appears.

Then, ReALM applies an algorithm that takes the center points of the bounding boxes of the entities and their surrounding objects and sorts them stably, first vertically (top to bottom) and then horizontally (left to right). Elements whose centers lie within a set margin of each other are treated as being on the same line and are separated by tabs; elements beyond the margin are placed on the next line. Applying this rule repeatedly encodes the screen content as plain text, left to right and top to bottom, while preserving the relative spatial positions of the entities.

In this way, visual screen information that an LLM could not process directly is converted into text suitable as language model input. The LLM can then treat the problem as a sequence-to-sequence task while taking into account the location and context of each screen entity, and so correctly identify and resolve references to screen entities.

As a result, ReALM not only performs well on references to conversational entities but also shows marked performance improvements on non-conversational entities, that is, entities on the screen.
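The sort-and-group procedure described in this section can be illustrated with a minimal Python sketch. It is not Apple's implementation: the `ScreenElement` class, the `encode_screen` function, the default margin of 10 units, the choice to compare each element against the first element of the current line, and the `[[index|text]]` entity marker are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScreenElement:
    """A piece of on-screen text with its bounding box (hypothetical type)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float
    is_entity: bool = False

    @property
    def center_x(self) -> float:
        return (self.left + self.right) / 2

    @property
    def center_y(self) -> float:
        return (self.top + self.bottom) / 2


def encode_screen(elements: List[ScreenElement], margin: float = 10.0) -> str:
    """Encode screen elements as plain text, top to bottom and left to right."""
    # Python's sorted() is stable; sort by vertical center, then horizontal center.
    ordered = sorted(elements, key=lambda e: (e.center_y, e.center_x))

    # Group elements into lines: an element joins the current line when its
    # vertical center is within `margin` of the line's first element (assumption).
    lines: List[List[ScreenElement]] = []
    for element in ordered:
        if lines and abs(element.center_y - lines[-1][0].center_y) <= margin:
            lines[-1].append(element)
        else:
            lines.append([element])

    rendered_lines = []
    entity_index = 0
    for line in lines:
        cells = []
        # Within a line, order strictly left to right.
        for element in sorted(line, key=lambda e: e.center_x):
            if element.is_entity:
                entity_index += 1
                # Hypothetical marker so the model can refer to the entity by number.
                cells.append(f"[[{entity_index}|{element.text}]]")
            else:
                cells.append(element.text)
        rendered_lines.append("\t".join(cells))
    return "\n".join(rendered_lines)


screen = [
    ScreenElement("Phone:", 0, 100, 60, 120),
    ScreenElement("555-0100", 80, 102, 180, 122, is_entity=True),
    ScreenElement("Store hours", 0, 0, 100, 20),
    ScreenElement("9 AM - 5 PM", 0, 40, 100, 60),
]
print(encode_screen(screen))
```

In this example the label "Phone:" and the phone number have vertical centers only two units apart, so they end up on the same output line separated by a tab, while the two elements above them each get a line of their own.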

3. Details: Task definition and data set

Put simply, ReALM's task is to find, within a given set of entities, those relevant to the current user query, based on the task the user wants to perform.

The task is framed as a multiple-choice question for a large language model, which is expected to select one or more of the entities displayed on the user's screen as the answer. In some cases, the answer may be "none of these".
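As a rough illustration of this multiple-choice framing, the sketch below builds a numbered prompt from a query and a list of candidate entities, then parses a model's reply into the selected entity numbers. The prompt wording and the helper names `build_prompt` and `parse_answer` are hypothetical and are not taken from the paper.

```python
from typing import List


def build_prompt(query: str, entities: List[str]) -> str:
    """Format the query and candidate entities as a numbered multiple-choice prompt."""
    options = "\n".join(
        f"{i}. {entity}" for i, entity in enumerate(entities, start=1)
    )
    return (
        f"User request: {query}\n"
        f"Candidate entities:\n{options}\n"
        "Answer with the numbers of all relevant entities, or 'none'."
    )


def parse_answer(answer: str, num_entities: int) -> List[int]:
    """Turn a reply such as '1, 3' or 'none' into a sorted list of entity numbers."""
    cleaned = answer.strip().lower()
    if cleaned == "none":
        return []
    picked = set()
    for token in cleaned.replace(",", " ").split():
        # Ignore anything that is not a valid option number.
        if token.isdecimal() and 1 <= int(token) <= num_entities:
            picked.add(int(token))
    return sorted(picked)


candidates = ["Pharmacy on Main St", "555-0100", "Pharmacy on Oak Ave"]
print(build_prompt("Call the one on Main Street", candidates))
print(parse_answer("1, 2", len(candidates)))
```

Returning an empty list for "none" keeps the case where no entity is relevant distinct from a malformed reply only if the caller also checks the raw answer; a production system would likely need stricter validation.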

In fact, the research paper divides the entities involved in the task into three categories:

1. On-screen entities: entities currently visible in the user interface.

2. Conversational entities: entities relevant to the conversation. They may come from a previous user turn (for example, if the user says "Call Mom", the contact entry for "Mom" is the relevant entity) or may be provided by the virtual assistant during the conversation (such as a list of places for the user to choose from).

3. Background entities: relevant entities that originate from background processes and are not necessarily part of what the user sees on the screen or of the interaction with the virtual assistant, such as an alarm that starts ringing or music playing in the background.

As for the data set used to train and test ReALM, it consists of synthetic data and manually annotated data, which can also be divided into three categories:

First, the conversational data set: data points for entities relevant to the interaction between the user and the agent. These data were collected by showing raters screenshots containing lists of synthetic entities and asking them to write queries that unambiguously refer to a selected entity in the list.

Second, the synthetic data set: data generated from templates. This approach is particularly useful when the user query and entity type are enough to determine the reference without relying on detailed descriptions. The synthetic data set can also contain multiple entities that correspond to the same query.

Third, the on-screen data set: data about entities currently displayed on the user's screen. Each data point includes the user query, the list of entities, and the correct entity (or set of entities) for the query. The information about each entity includes its type and other properties, such as its name and other associated textual details (e.g., the label and time of an alarm).

For data points with screen-related context, the context is provided as the entity's bounding box and a list of the objects surrounding the entity, together with the type, text content, and location of each surrounding object. Each category of the data set is split into a training set and a test set.
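To make the shape of a data point concrete, here is a small Python sketch of how such a record might be represented. The class and field names (`EntityCategory`, `SurroundingObject`, `Entity`, `DataPoint`, `relevant`) are illustrative assumptions and do not reflect the actual schema of Apple's data set.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple

# (left, top, right, bottom) in screen coordinates; an assumed convention.
BoundingBox = Tuple[float, float, float, float]


class EntityCategory(Enum):
    ON_SCREEN = "on_screen"
    CONVERSATIONAL = "conversational"
    BACKGROUND = "background"


@dataclass
class SurroundingObject:
    """A non-entity object near an on-screen entity."""
    type: str
    text: str
    bounding_box: BoundingBox


@dataclass
class Entity:
    type: str
    category: EntityCategory
    # Free-form textual details, e.g. the label and time of an alarm.
    properties: Dict[str, str] = field(default_factory=dict)
    # Only populated for data points with screen-related context.
    bounding_box: Optional[BoundingBox] = None
    surrounding: List[SurroundingObject] = field(default_factory=list)


@dataclass
class DataPoint:
    query: str
    entities: List[Entity]
    # Indices into `entities`; an empty list means no entity is relevant.
    relevant: List[int]

    def relevant_entities(self) -> List[Entity]:
        return [self.entities[i] for i in self.relevant]


example = DataPoint(
    query="Turn off the 7 AM one",
    entities=[
        Entity("alarm", EntityCategory.ON_SCREEN,
               {"label": "Wake up", "time": "7:00 AM"},
               bounding_box=(0, 0, 200, 40)),
        Entity("alarm", EntityCategory.ON_SCREEN,
               {"label": "Gym", "time": "6:00 PM"},
               bounding_box=(0, 50, 200, 90)),
    ],
    relevant=[0],
)
print(example.relevant_entities()[0].properties["label"])
```

Storing the answer as a list of indices allows a single query to map to one entity, several entities, or none, which matches the multiple-choice framing of the task.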

4. Results: The smallest model also achieved a 5% performance improvement

In benchmark tests, Apple compared its system with GPT-3.5 and GPT-4. ReALM proved highly competitive across the different types of reference resolution tasks.

According to the paper, even the ReALM version with the fewest parameters achieves a performance improvement of more than 5% over the baseline system. The larger ReALM versions clearly outperform GPT-4. In particular, for entities displayed on the screen, ReALM's gains on the on-screen data set grow as the model size increases.

In addition, in zero-shot scenarios on unseen domains, ReALM's performance is quite close to that of GPT-4. On domain-specific queries, ReALM is more accurate than GPT-4 because it has been fine-tuned on user requests.

For example, for a user request to adjust the brightness, GPT-4 associates the request only with the settings and overlooks that a smart home device present in the background is also a relevant entity. Because ReALM is trained on domain-specific data, it can better understand and correctly resolve references in such domains.

"We demonstrate that ReALM outperforms previous methods and performs roughly as well as the current state-of-the-art LLM, GPT-4, despite having far fewer parameters, even for on-screen references and despite operating purely in the text domain. In addition, ReALM outperforms GPT-4 on domain-specific user utterances. ReALM is therefore an ideal choice for a practical reference resolution system that can run efficiently on-device without compromising on performance."

The researchers also note that a single large end-to-end model is often impractical in real-world settings where resources are limited, low-latency responses are required, or multiple stages, such as API calls, must be integrated.

In this context, the modular design of ReALM has advantages: an existing reference resolution module can be replaced or upgraded easily without affecting the overall architecture, while offering better potential for optimization and better interpretability.

Looking ahead, the researchers point to more complex approaches, such as dividing the screen into a grid and encoding relative spatial positions as text. Although challenging, this is a promising avenue to explore.

5. Final thoughts

Apple has always been cautious in the field of artificial intelligence, but it has been investing quietly. From the multimodal large model MM1 to the AI-driven animation tool Keyframer to today's ReALM, Apple's research teams have continued to deliver technical advances.

Competitors such as Google, Microsoft, and Amazon have all added AI to search, cloud services, and office software, flexing their muscles one after another. Apple is clearly trying not to be left behind, and as generative AI products continue to ship, it has picked up the pace. People familiar with the matter have said that Apple will focus on artificial intelligence at its Worldwide Developers Conference in June, and that a new AI strategy is likely to be at the core of the iOS 18 upgrade. It may bring some surprises.

Reference links:

https://apple.slashdot.org/story/24/04/01/1959205/apple-ai-researchers-boast-useful-on-device-model-that-substantially-outperforms-gpt-4

https://arxiv.org/pdf/2403.20329.pdf
