The visual talent of large language models: GPT can also solve visual tasks through in-context learning

王林

Jul 14, 2023, 03:37 PM

Large language models (LLMs) have set off a wave of change in natural language processing (NLP). LLMs show strong emergent capabilities and perform well on complex language understanding, generation, and even reasoning tasks. This has inspired researchers to explore the potential of LLMs in another subfield of machine learning: computer vision (CV).

One of the most remarkable talents of LLMs is in-context learning: the model picks up a task from examples placed in its prompt. In-context learning does not update any of the LLM's parameters, yet it achieves strong results on a wide range of NLP tasks. So, can GPT solve visual tasks through in-context learning?
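To make the idea concrete, here is a minimal sketch of how a few-shot prompt might be assembled. The function name `build_few_shot_prompt`, the `Input:`/`Output:` layout, and the toy word strings standing in for images are all assumptions made for illustration; they are not taken from the paper. The point is only that the demonstrations live in the prompt text, so no model weights change.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (input, output) demonstration pairs plus a query into one prompt string."""
    lines = []
    if instruction:
        lines.append(instruction)
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # The final block leaves the output empty for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical demonstrations: each image is assumed to be already
# written out as a short string of word tokens.
demos = [
    ("round loop circle", "0"),
    ("vertical stroke line", "1"),
]
prompt = build_few_shot_prompt(
    demos, "vertical line stroke", instruction="Classify each image."
)
print(prompt)
```

The resulting string would be sent to a frozen LLM as is; changing the task only means changing the demonstrations.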

A recent paper jointly published by researchers from Google and Carnegie Mellon University (CMU) suggests that this is feasible, as long as images (or other non-linguistic modalities) can be translated into a language that the LLM understands.

Paper address: https://arxiv.org/abs/2306.17842

The paper demonstrates the ability of PaLM and GPT to solve visual tasks through in-context learning, and proposes a new method, SPAE (Semantic Pyramid AutoEncoder). The method lets an LLM perform image generation tasks without any parameter updates. According to the paper, it is also the first successful method that uses in-context learning to make an LLM generate image content.

Let's first look at what an LLM can generate through in-context learning.

For example, given 50 images of handwritten digits as context, PaLM 2 is asked to answer a complex query whose answer must be generated as a digit image.

It can also generate realistic images when given images as context.

Beyond generating images, PaLM 2 can also perform image captioning through in-context learning.

It can also perform visual question answering (VQA) on image-related questions.

It can even generate video through denoising.

Method Overview

In fact, converting an image into a language that an LLM can understand is a problem already studied in the Vision Transformer (ViT) paper. This paper from Google and CMU takes the idea a step further by using actual words to represent images.

The approach is like building a tower out of words that captures both the semantics and the fine details of an image. This word-based representation makes it easy to generate image captions, and it lets an LLM answer image-related questions and even reconstruct image pixels.

Specifically, the study proposes three steps: first, a trained encoder and a CLIP model convert the image into token space; next, an LLM generates suitable lexical tokens; finally, a trained decoder converts these tokens back into pixel space. This process turns images into a language that the LLM can understand, which makes it possible to exploit the generative power of LLMs in vision tasks.
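The step of turning image features into words can be illustrated with a toy nearest-neighbor lookup. This is only a sketch under stated assumptions, not the SPAE implementation: the three-word vocabulary `VOCAB`, its 2-D embeddings, and the helper names `quantize` and `dequantize` are hypothetical, and the trained encoder, the CLIP guidance, and the decoder are not modeled here.

```python
import math

# Hypothetical toy vocabulary: each word has a 2-D embedding. A real system
# would use high-dimensional embeddings from a language model's vocabulary.
VOCAB = {
    "sky": (0.0, 1.0),
    "grass": (1.0, 0.0),
    "dog": (1.0, 1.0),
}


def quantize(features):
    """Map each feature vector to the word with the closest embedding (Euclidean)."""
    return [min(VOCAB, key=lambda w: math.dist(VOCAB[w], f)) for f in features]


def dequantize(words):
    """Look the words back up as embeddings, the kind of input a decoder would receive."""
    return [VOCAB[w] for w in words]


# Stand-in for encoder output: one feature vector per image patch.
encoder_output = [(0.1, 0.9), (0.9, 0.2), (0.8, 0.95)]
words = quantize(encoder_output)
print(words)              # ['sky', 'grass', 'dog']
print(dequantize(words))  # [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
```

Because the intermediate representation is a list of ordinary words, it can be pasted into an LLM prompt, and any words the LLM produces can be looked up again and handed to a decoder.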

Experiments and results

The study compares SPAE experimentally with the SOTA methods Frozen and LQAE; the results are reported in Table 1 of the paper. The GPT-based variant of SPAE outperforms LQAE on all tasks while using only 2% of the tokens.

Overall, tests on the mini-ImageNet benchmark show that SPAE improves on the previous SOTA method by 25%.

To verify the effectiveness of the SPAE design, the study also ran ablation experiments. The results are shown in Table 4 and Figure 10 of the paper.

Interested readers can consult the original paper for more details on the research.

Statement

This article is reproduced from 51CTO.COM. In case of any infringement, please contact admin@php.cn to have it removed.
