Help for Siri at last! Apple describes a new on-device model that is "much better than GPT-4": it uses plain text to simulate on-screen visual information, and even its smallest model is still 5% better than the baseline system.

Apr 02, 2024 pm 09:20 PM

Written by Noah

Produced by 51CTO Technology Stack (WeChat ID: blog51cto)

Siri, long criticized by users for being "a bit slow-witted", may finally be getting some help!

Siri has been one of the best-known intelligent voice assistants since its launch, but its performance has been unsatisfactory for a long time. The latest research published by Apple's artificial intelligence team may significantly change that, and it raises expectations for the future of the field.

In a related research paper, Apple's AI experts describe a system that lets Siri do more than identify content in images, making it smarter and more practical. The model is called ReALM. Apple benchmarked it against GPT-3.5 and GPT-4 and reports that it outperforms GPT-4 on these benchmarks. The researchers believe it can make Siri smarter, more practical, and better suited to a wide range of scenarios.

1. Motivation: Solving the reference resolution of different entities

According to Apple's research team: "It is critical for a conversational assistant to be able to understand context, including references. Allowing users to ask questions based on what they see on the screen is an important step in ensuring a voice-operated experience."

For example, during human-computer interaction, users often mention an element or piece of content on the screen, such as when instructing a voice assistant to dial a phone number, navigate to a specific place on a map, or open a specific app or web page. If the conversational assistant cannot work out which entity the user's instruction refers to, it cannot execute the command accurately.

Moreover, ambiguous references are common in human conversation. To achieve natural human-computer interaction and to understand the context accurately when users ask a voice assistant about screen content, reference resolution is crucial.

The advantage of the model Apple describes in the paper, ReALM (Reference Resolution As Language Modeling), is that it can consider both the content on the user's screen and the task in progress, using large language models to resolve references to different types of entities (both conversational and non-conversational).

Although a purely textual modality is not well suited to entities displayed on the screen, ReALM recasts reference resolution as a language modeling problem and successfully uses LLMs to resolve references to non-conversational, on-screen entities. This is expected to lead to a smarter and more immersive user experience.

A purely textual modality is inconvenient for on-screen entities because they usually carry rich visual information and layout structure, such as images, icons, buttons, and their positions relative to one another. This information is difficult to express fully in a plain text description.

To address this challenge, ReALM reconstructs the screen by parsing the on-screen entities and their positions, and generates a plain text representation that visually reflects the screen content.

Entities are specially marked so that the language model knows where each entity appears and what text surrounds it. The model can thus simulate "seeing" the information on the screen and has the context it needs to understand and resolve on-screen references. According to the paper, this is the first attempt to use a large language model to encode context from screen content, overcoming the difficulty that on-screen entities pose for purely textual approaches.

Specifically, the ReALM system uses the following steps to enable large language models to "understand" and process entities displayed on the screen:

First, upstream data detectors extract entities from the screen text. Each entity comes with a type, a bounding box, and a list of the non-entity text elements around it. This means that for every visual entity on the screen, the system captures its basic information and the context in which it appears.

Then, ReALM applies an algorithm that takes the center points of the bounding boxes of the entities and their surrounding objects and sorts them stably, first vertically (top to bottom) and then horizontally (left to right). If two elements are vertically close, they are considered to be on the same line and are separated by tabs; if the distance exceeds a set margin, the element is placed on the next line. Applying this method repeatedly encodes the screen content as plain text from left to right and top to bottom, effectively preserving the relative spatial positions of the entities.
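The encoding step described above can be sketched roughly as follows. This is a minimal illustration rather than Apple's implementation: the `ScreenElement` class, the `[[...]]` entity marker, and the default `margin` value are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ScreenElement:
    # Hypothetical record for one piece of on-screen text and its bounding box.
    text: str
    left: float
    top: float
    width: float
    height: float
    is_entity: bool = False

    @property
    def center(self) -> tuple[float, float]:
        return (self.left + self.width / 2, self.top + self.height / 2)


def encode_screen(elements: list[ScreenElement], margin: float = 10.0) -> str:
    """Render screen elements as plain text, top to bottom and left to right."""
    # Sort by vertical center; sorted() is stable, so ties keep their input order.
    ordered = sorted(elements, key=lambda e: e.center[1])

    lines: list[list[ScreenElement]] = []
    for element in ordered:
        # An element joins the current line if its vertical center is within
        # `margin` of the first element placed on that line.
        if lines and abs(element.center[1] - lines[-1][0].center[1]) <= margin:
            lines[-1].append(element)
        else:
            lines.append([element])

    rendered = []
    for line in lines:
        line.sort(key=lambda e: e.center[0])  # left to right within a line
        # The [[...]] marker for entities is an assumed, illustrative format.
        cells = [f"[[{e.text}]]" if e.is_entity else e.text for e in line]
        rendered.append("\t".join(cells))
    return "\n".join(rendered)
```

For instance, a heading at the top of the screen followed by a "Phone:" label with a phone number entity beside it would come out as two lines of text, with the label and the marked number separated by a tab.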

In this way, visual screen information that an LLM could not previously process directly is converted into text suitable as language model input. The LLM can then handle the problem as a sequence-to-sequence task while taking the location and context of each on-screen entity into account, so that it can identify on-screen entities correctly and resolve references to them.

As a result, ReALM not only performs well when resolving references to conversational entities, but also shows notable performance improvements on non-conversational entities, that is, entities on the screen.

3. Details: Task definition and data set

Put simply, ReALM's task is to find, within a given set of entities, the ones that are relevant to the current user query, given the task the user wants to perform.

The task is framed as a multiple-choice question for a large language model, which is expected to select one or more of the entities displayed on the user's screen as the answer. In some cases, the answer may be "none of these".
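A rough sketch of this multiple-choice framing is shown below. The prompt wording, the function names `build_prompt` and `parse_answer`, and the numeric answer format are assumptions for illustration; the paper's actual prompt may differ.

```python
from __future__ import annotations


def build_prompt(query: str, entities: list[str]) -> str:
    """Present the candidate entities as numbered options for the model."""
    options = "\n".join(f"{i}. {text}" for i, text in enumerate(entities, start=1))
    return (
        f"Candidate entities:\n{options}\n"
        f"User request: {query}\n"
        "Reply with the numbers of the relevant entities, or 'none'."
    )


def parse_answer(answer: str, num_entities: int) -> set[int]:
    """Turn a model reply such as '1, 3' or 'none' into a set of option numbers."""
    answer = answer.strip().lower()
    if answer == "none":
        return set()  # no candidate is relevant to the request
    picked = {int(tok) for tok in answer.replace(",", " ").split() if tok.isdigit()}
    # Ignore numbers that do not correspond to a listed option.
    return {i for i in picked if 1 <= i <= num_entities}
```

Returning a set keeps the three possible outcomes (one entity, several entities, or none) in a single uniform shape.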

In fact, the research paper divides the entities involved in the task into three categories:

1. On-screen entities: entities currently visible in the user interface.

2. Conversational entities: entities relevant to the conversation. They may come from a previous user turn (for example, if the user says "Call Mom", the "Mom" entry in the contact list is the relevant entity) or may be provided by the virtual assistant during the conversation (such as a list of places for the user to choose from).

3. Background entities: relevant entities that originate from background processes and are not necessarily visible on the user's screen or part of the interaction with the virtual assistant, such as an alarm that starts ringing or music playing in the background.

As for the data set used to train and test ReALM, it consists of synthetic data and manually annotated data, which can also be divided into three categories:

First, the conversational data set: data points for entities relevant to the interaction between the user and the agent. These were collected by showing raters screenshots containing lists of synthetic entities and asking them to write queries that unambiguously refer to a chosen entity in the list.

Second, the synthetic data set: data generated from templates. This approach is particularly useful when the user query and the entity type are enough to determine the reference without relying on detailed descriptions. The synthetic data set can also contain multiple entities that correspond to the same query.
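As a minimal sketch of what template-based generation could look like, consider the following. The template strings, the `type` key, and the `generate_examples` function are hypothetical and are not taken from the paper.

```python
from __future__ import annotations


def generate_examples(
    entities: list[dict[str, str]],
    templates: dict[str, list[str]],
) -> list[dict]:
    """Pair each template query with every entity of the matching type."""
    examples = []
    for entity_type, queries in templates.items():
        # The entity type alone determines the reference, so all entities of
        # that type are targets; one query may map to several entities.
        targets = [i for i, e in enumerate(entities) if e["type"] == entity_type]
        if not targets:
            continue  # no entity of this type, so the template does not apply
        for query in queries:
            examples.append({"query": query, "entities": entities, "targets": targets})
    return examples


entities = [
    {"type": "alarm", "label": "Wake up"},
    {"type": "phone_number", "label": "555-0100"},
    {"type": "alarm", "label": "Gym"},
]
templates = {
    "alarm": ["turn off the alarms"],
    "phone_number": ["call this number"],
    "address": ["take me there"],
}
examples = generate_examples(entities, templates)
```

Here the alarm template yields one example whose targets are both alarms, which mirrors the point that a single query can correspond to multiple entities.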

Third, the on-screen data set: data about entities currently displayed on the user's screen. Each data point includes the user query, the entity list, and the correct entity (or set of entities) for the query. The information for each entity includes its type and other properties, such as its name and other textual details associated with it (for example, the label and time of an alarm).

For data points with screen-related context, the context is provided as the entity's bounding box together with a list of the objects surrounding the entity, including the type, text content, and location of each surrounding object. The data set is split into training and test sets for each category.
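One way to picture a single data point as described above is the following sketch. The class and field names are assumptions chosen for readability and do not reflect the paper's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_type: str                      # e.g. "alarm" or "phone_number"
    properties: dict[str, str]            # name and other textual details
    # (left, top, width, height); only present for screen-related context.
    bounding_box: tuple[float, float, float, float] | None = None
    surrounding: list[str] = field(default_factory=list)  # nearby text


@dataclass
class DataPoint:
    query: str
    entities: list[Entity]
    relevant: list[int]                   # indices of correct entities; may be empty

    def relevant_entities(self) -> list[Entity]:
        return [self.entities[i] for i in self.relevant]


alarm = Entity("alarm", {"label": "Wake up", "time": "7:00"})
number = Entity(
    "phone_number",
    {"text": "555-0100"},
    bounding_box=(60.0, 42.0, 80.0, 20.0),
    surrounding=["Phone:"],
)
point = DataPoint(query="call that number", entities=[alarm, number], relevant=[1])
```

Storing the ground truth as a list of indices allows for one correct entity, several, or none.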

4. Results: The smallest model also achieved a 5% performance improvement

In its benchmarks, Apple compared its own system with GPT-3.5 and GPT-4. ReALM proved highly competitive across different types of reference resolution tasks.

According to the paper, even the ReALM version with the fewest parameters improves on the baseline system by more than 5%. In its larger versions, ReALM clearly outperforms GPT-4. In particular, for entities displayed on the screen, ReALM's gains on the on-screen data set become more pronounced as the model size increases.

In addition, ReALM's performance is quite close to GPT-4's in zero-shot scenarios in unseen domains. On domain-specific queries, ReALM is more accurate than GPT-4 because it has been fine-tuned on user requests.

For example, for a user request to adjust the brightness, GPT-4 associates the request only with the settings, ignoring that the smart home devices present in the background are also relevant entities. Because ReALM is trained on domain-specific data, it can better understand and correctly resolve references in such domains.

In the paper's conclusion, the authors state that ReALM outperforms previous approaches and performs roughly as well as GPT-4, the current state-of-the-art LLM, despite having far fewer parameters, even for on-screen references, although it works purely in the text domain. They add that ReALM also outperforms GPT-4 on domain-specific user utterances, which makes it a strong choice for a practical reference resolution system that can run efficiently on the device without compromising performance.

The researchers also note that in practical settings where resources are limited, low-latency responses are required, or multiple stages such as API calls must be integrated, a single large end-to-end model is often not viable.

In this context, the modular design of ReALM has an advantage: the existing reference resolution module can easily be replaced and upgraded without affecting the overall architecture, while also offering better optimization potential and interpretability.
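The benefit of such a modular design can be illustrated with a small sketch in which the resolver sits behind an interface and can be swapped without touching the rest of the assistant. All names here (`ReferenceResolver`, `KeywordResolver`, `Assistant`) are hypothetical, and the keyword matcher is only a stand-in for a real model.

```python
from __future__ import annotations

from typing import Protocol


class ReferenceResolver(Protocol):
    def resolve(self, query: str, entities: list[str]) -> list[int]:
        """Return the indices of the entities the query refers to."""
        ...


class KeywordResolver:
    """Toy stand-in: an entity matches if it shares a word with the query."""

    def resolve(self, query: str, entities: list[str]) -> list[int]:
        words = set(query.lower().split())
        return [i for i, e in enumerate(entities) if words & set(e.lower().split())]


class FirstEntityResolver:
    """Another toy resolver, used to show that the module is replaceable."""

    def resolve(self, query: str, entities: list[str]) -> list[int]:
        return [0] if entities else []


class Assistant:
    def __init__(self, resolver: ReferenceResolver) -> None:
        self.resolver = resolver  # the only place the resolver is wired in

    def handle(self, query: str, entities: list[str]) -> list[str]:
        return [entities[i] for i in self.resolver.resolve(query, entities)]
```

Replacing `KeywordResolver()` with `FirstEntityResolver()` (or, in a real system, with an upgraded model) changes only the constructor argument passed to `Assistant`.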

Looking ahead, the researchers point to more complex approaches, such as dividing the screen into a grid and encoding relative spatial positions in text. Although quite challenging, this is a promising avenue to explore.

5. Closing thoughts

Although Apple has always been relatively cautious in the field of artificial intelligence, it has been investing quietly. Whether it is the multimodal large model MM1, the AI-driven animation tool Keyframer, or now ReALM, Apple's research team has continued to deliver technical advances.

Competitors such as Google, Microsoft, and Amazon have all added AI to search, cloud services, and office software, flexing their muscles one after another. Apple is clearly trying not to be left behind, and as generative AI products continue to emerge, it has picked up its pace. People familiar with the matter have long indicated that Apple will focus on artificial intelligence at its Worldwide Developers Conference in June, and that a new AI strategy is likely to be at the core of the iOS 18 upgrade. There may well be surprises in store.

Reference links:

https://apple.slashdot.org/story/24/04/01/1959205/apple-ai-researchers-boast-useful-on-device-model-that-substantially-outperforms-gpt-4

https://arxiv.org/pdf/2403.20329.pdf

Statement

This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn to have it removed.
