Learn multi-modal commands: Google image generation AI lets you easily follow along

Google has designed a new image generation model that can draw the cat in Figure 1 in the style of Figure 2 and put a hat on it. The model uses instruction fine-tuning to generate new images accurately from text instructions and multiple reference images. The results are very good, comparable to having a Photoshop expert create the picture for you.

The importance of instruction fine-tuning is already well recognized for large language models (LLMs). With appropriate instruction fine-tuning, an LLM can perform a variety of tasks, such as composing poetry, writing code, writing scripts, assisting in scientific research, and even managing investments.

Now that large models have entered the multi-modal era, is instruction fine-tuning still effective? For example, can we achieve fine-grained control over image generation through multi-modal instructions? Unlike language generation, image generation involves multiple modalities from the start. Can we effectively enable models to grasp that complexity?

In order to solve this problem, Google DeepMind and Google Research proposed a multi-modal instruction method to achieve image generation. This method interweaves information from different modalities to express the conditions for image generation (example shown in the left panel of Figure 1).

Multi-modal instructions can enhance language instructions. For example, a user can tell the model to render an image in the style of a specified reference image. This intuitive interface makes it efficient to set multi-modal conditions for image generation tasks.

Based on this idea, the team created a multi-modal instruction image generation model: Instruct-Imagen.

Paper address: https://arxiv.org/abs/2401.01952

The model is trained in two stages: the first strengthens its ability to handle multi-modal instructions, and the second teaches it to follow multi-modal user intent faithfully.

In the first stage, the team adapted a pre-trained text-to-image model to process additional multi-modal inputs, so that it could later be fine-tuned to respond accurately to multi-modal instructions. Specifically, the pre-trained model is a diffusion model, and its training is augmented with similar (image, text) contexts retrieved from a web-scale (image, text) corpus.

In the second stage, the team fine-tuned the model on a variety of image generation tasks, each paired with corresponding multi-modal instructions that capture the key elements of the task. After these steps, the resulting model, Instruct-Imagen, can skillfully handle fused inputs from multiple modalities (such as a sketch plus a visual style described by a text instruction), generating images that fit the context accurately and are visually appealing.

As shown in Figure 1, Instruct-Imagen performs exceptionally well, being able to understand complex multimodal instructions and generate images that faithfully follow human intent, even handling combinations of instructions that have never been seen before.

Human evaluation shows that in many cases Instruct-Imagen not only matches task-specific models on their own tasks but even surpasses them. Instruct-Imagen also generalizes well and can be applied to unseen and more complex image generation tasks.

Multimodal instructions for generation

The pre-trained model used by the team is a diffusion model for which users can set input conditions. For details, please see the original paper.

To keep multi-modal instructions versatile and generalizable, the team proposed a unified multi-modal instruction format in which language states the goal of the task explicitly and the multi-modal conditions are supplied as reference information.

The proposed instruction format has two key components: (1) the payload text instruction, which describes the task goal in detail and includes markers that identify the reference information, such as [ref#?]; and (2) the multi-modal context, made up of paired (marker text, image) entries. The model then uses a shared instruction understanding model to process both the text instruction and the multi-modal context. The specific modality of the context is not restricted.
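The two components described above can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation: the class names, field names, and image paths are all hypothetical, and only the `[ref#N]` marker convention comes from the article.

```python
import re
from dataclasses import dataclass, field

# Matches reference markers such as [ref#1] in the payload text.
REF_MARKER = re.compile(r"\[ref#(\d+)\]")


@dataclass
class ContextEntry:
    """One (marker text, image) pair from the multi-modal context."""
    marker: str        # e.g. "[ref#1]"
    description: str   # hypothetical field: what the reference provides
    image_path: str    # hypothetical field: where the reference image lives


@dataclass
class MultiModalInstruction:
    """Payload text instruction plus the context entries it refers to."""
    payload: str
    context: list[ContextEntry] = field(default_factory=list)

    def referenced_markers(self) -> list[str]:
        """Markers mentioned in the payload, in order of first appearance."""
        seen = []
        for match in REF_MARKER.finditer(self.payload):
            marker = match.group(0)
            if marker not in seen:
                seen.append(marker)
        return seen

    def unresolved_markers(self) -> list[str]:
        """Markers used in the payload that have no matching context entry."""
        provided = {entry.marker for entry in self.context}
        return [m for m in self.referenced_markers() if m not in provided]


instruction = MultiModalInstruction(
    payload="Draw the cat in [ref#1] in the style of [ref#2], wearing a hat.",
    context=[
        ContextEntry("[ref#1]", "subject", "cat.png"),
        ContextEntry("[ref#2]", "style", "painting.png"),
    ],
)
print(instruction.referenced_markers())
print(instruction.unresolved_markers())
```

Keeping the markers in the text and the images in a separate context list means a new task only needs a new sentence, not a new input slot, which is the flexibility the format is meant to provide.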

Figure 2 shows how this format can represent various previous generation tasks through three examples, which shows that this format can be compatible with previous image generation tasks. More importantly, the language is flexible, so multimodal instructions can be extended for new tasks without any special design for modality and tasks.

Instruct-Imagen

Instruct-Imagen is built around multi-modal instructions. To support them, the team designed an architecture on top of a pre-trained text-to-image diffusion model, specifically a cascaded diffusion model, so that it can fully take the multi-modal instruction as its input condition.

Specifically, they used a variant of Imagen (see the paper "Photorealistic text-to-image diffusion models with deep language understanding"), pre-trained on their internal data sources. The full model contains two sub-components: (1) a text-to-image component, which generates 128×128 resolution images from text prompts alone; and (2) a text-conditioned super-resolution model, which upscales the 128-resolution images to 1024 resolution.
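The cascade described above can be pictured as two functions chained together. The sketch below only illustrates the data flow and the resolutions: both stages are stand-ins (a constant grey image and a nearest-neighbour resize) rather than diffusion models, the function names are hypothetical, and it assumes a square 1024×1024 output.

```python
BASE_RES = 128      # resolution of the text-to-image stage (from the article)
FINAL_RES = 1024    # resolution after super-resolution (from the article)


def base_stage(prompt: str) -> list[list[int]]:
    """Stand-in for the text-to-image model: returns a 128x128 grey image."""
    shade = sum(prompt.encode("utf-8")) % 256  # placeholder, not a real model
    return [[shade] * BASE_RES for _ in range(BASE_RES)]


def super_resolution_stage(image, prompt: str, target: int = FINAL_RES):
    """Stand-in for the text-conditioned upsampler: nearest-neighbour resize.

    A real super-resolution model would also condition on `prompt`;
    this placeholder accepts it but does not use it.
    """
    factor = target // len(image)
    upscaled = []
    for row in image:
        # Repeat each pixel horizontally, then repeat the row vertically.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        upscaled.extend(list(wide_row) for _ in range(factor))
    return upscaled


def generate(prompt: str):
    """Run the two-stage cascade: low-resolution generation, then upscaling."""
    low_res = base_stage(prompt)
    return super_resolution_stage(low_res, prompt)


image = generate("a cat wearing a hat")
print(len(image), len(image[0]))
```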

As for the encoding of multi-modal instructions, see Figure 3 (right), which shows the data flow of Instruct-Imagen encoding multi-modal instructions.

Training Instruct-Imagen with a two-stage method

The training process of Instruct-Imagen is divided into two stages.

The first stage is retrieval-augmented text-to-image training, which continues text-to-image training with retrieved neighboring (image, text) pairs added as context.
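As a rough picture of what "retrieved neighboring (image, text) pairs" means, the sketch below attaches the most similar corpus entries to a training example as context. It is an assumption-laden toy: the article does not say how neighbors are retrieved, so cosine similarity over hand-written embedding vectors stands in for whatever the team used, and all names and data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_neighbors(query_vec, corpus, k=2):
    """Return the k corpus entries most similar to the query embedding."""
    ranked = sorted(
        corpus,
        key=lambda item: cosine(query_vec, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_training_example(target, corpus, k=2):
    """Pair a target (image, text) with its retrieved neighbors as context."""
    neighbors = retrieve_neighbors(target["embedding"], corpus, k)
    return {
        "target_image": target["image"],
        "target_caption": target["caption"],
        "context": [(n["image"], n["caption"]) for n in neighbors],
    }


# Toy corpus with made-up three-dimensional embeddings.
corpus = [
    {"caption": "a tabby cat on a sofa", "image": "cat_sofa.png", "embedding": [0.9, 0.1, 0.0]},
    {"caption": "a kitten playing with yarn", "image": "kitten.png", "embedding": [0.8, 0.2, 0.1]},
    {"caption": "a red sports car", "image": "car.png", "embedding": [0.0, 0.1, 0.9]},
    {"caption": "a city skyline at night", "image": "skyline.png", "embedding": [0.1, 0.9, 0.2]},
]
target = {"caption": "a cat wearing a hat", "image": "cat_hat.png", "embedding": [1.0, 0.0, 0.0]}

example = build_training_example(target, corpus)
print(example["context"])
```

In this toy run the two cat images are retrieved as context for the cat caption, so the model sees related (image, text) pairs alongside the target during training.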

The second stage fine-tunes the model produced by the first stage on a mixture of diverse image generation tasks, each paired with corresponding multi-modal instructions. Specifically, the team used 11 image generation datasets across 5 task categories; see Table 1.

In both training stages, the model is optimized end-to-end.

Experimentation

The team conducted an experimental evaluation of the newly proposed method and model, and conducted an in-depth analysis of the design and failure modes of Instruct-Imagen.

Experimental Settings

The team evaluated the model in two settings, in-domain task evaluation and zero-shot task evaluation, with the latter being more challenging than the former.

Main results

Figure 4 compares Instruct-Imagen with the baseline and previous methods. The results show that it is comparable to previous methods in both in-domain and zero-shot evaluation.

This shows that training with multi-modal instructions can improve model performance on tasks with limited training data (such as stylized generation) while maintaining performance on data-rich tasks (such as generating photo-like images). Without multi-modal instruction training, the multi-task baseline tends to produce poor image quality and text alignment.

For example, in the in-context stylization example in Figure 5, the multi-task baseline has difficulty separating style from object, so the object from the reference is reproduced in the generated result. For similar reasons, it also performs poorly on style transfer tasks. These observations highlight the value of instruction fine-tuning.

Unlike current methods that rely on task-specific training, Instruct-Imagen can handle compositional tasks efficiently by using instructions that combine the goals of different tasks and performing inference in context (no fine-tuning required, 18.2 seconds per example).

As shown in Figure 6, Instruct-Imagen consistently outperforms the other models in both instruction following and output quality.

Moreover, when the multi-modal context contains multiple references, the multi-task baseline model cannot match the text instruction to the right references, so some multi-modal conditions are ignored. These results further demonstrate the effectiveness of the proposed model.

Model Analysis and Ablation Study

The team analyzed the limitations and failure modes of the model.

For example, the team found that a fine-tuned Instruct-Imagen can edit images. Table 2 compares the earlier SDXL-inpainting, Imagen fine-tuned on the MagicBrush dataset, and the fine-tuned Instruct-Imagen; the fine-tuned Instruct-Imagen is significantly better than the model designed specifically for mask-based image editing.

However, the fine-tuned Instruct-Imagen produces artifacts in the edited images, especially the high-resolution output after the super-resolution step, as shown in Figure 7. The researchers say this is because the model has not previously learned to accurately copy pixels directly from context.

The team also found that retrieval-augmented training helps improve generalization; the results are shown in Table 3.

As for failure modes, the researchers found that when the multi-modal instruction is more complex (at least 3 multi-modal conditions), Instruct-Imagen struggles to generate results that follow the instruction. Figure 8 gives two examples.

The following shows some results on complex tasks that were not seen during training.

The team also conducted ablation studies to prove the importance of its design components.

However, due to security concerns, Google has not yet released the code and API of this research.

See original paper for more details.

Statement
This article is reproduced from 机器之心. If there is any infringement, please contact admin@php.cn to have it removed.