Learn multi-modal commands: Google image generation AI lets you easily follow along
Google has designed a new image generation model that can draw the cat in Figure 1 in the style of Figure 2 and put a hat on it. The model uses instruction fine-tuning to generate new images accurately from text instructions and multiple reference images. The results are very good, comparable to having a Photoshop expert make the picture for you.
The importance of instruction fine-tuning is already well recognized for large language models (LLMs). With appropriate instruction fine-tuning, an LLM can perform a variety of tasks, such as composing poetry, writing code, writing scripts, assisting in scientific research, and even managing investments.
Now that large models have entered the multimodal era, is instruction fine-tuning still effective? For example, can we exert fine-grained control over image generation through multimodal instructions? Unlike language generation, image generation involves multiple modalities from the start. Can we effectively enable models to grasp that complexity?
In order to solve this problem, Google DeepMind and Google Research proposed a multi-modal instruction method to achieve image generation. This method interweaves information from different modalities to express the conditions for image generation (example shown in the left panel of Figure 1).
Multimodal instructions can enhance language instructions. For example, users can tell the generation model to render an image in the style of a specified reference image. This intuitive interface makes it efficient to set multimodal conditions for image generation tasks.
Based on this idea, the team created a multi-modal instruction image generation model: Instruct-Imagen.
Paper address: https://arxiv.org/abs/2401.01952
The model is trained in two stages: the first strengthens its ability to handle multimodal instructions, and the second teaches it to follow multimodal user intent faithfully.
The team first adapted a pre-trained text-to-image model to process additional multimodal inputs, and later fine-tuned it to respond accurately to multimodal instructions. Specifically, in the first stage they took a pre-trained diffusion model and continued training it with similar (image, text) contexts retrieved from a web-scale (image, text) corpus.
In the second stage, the team fine-tuned the model on a variety of image generation tasks, each paired with corresponding multimodal instructions that capture the key elements of the task. After these steps, the resulting model, Instruct-Imagen, can skillfully handle fused input from multiple modalities (such as a sketch plus a visual style described with text instructions) and generate images that fit the context accurately and are visually compelling.
As shown in Figure 1, Instruct-Imagen performs exceptionally well, being able to understand complex multimodal instructions and generate images that faithfully follow human intent, even handling combinations of instructions that have never been seen before.
Human evaluation shows that in many cases Instruct-Imagen not only matches task-specific models on their own tasks but even surpasses them. Instruct-Imagen also generalizes well and can be applied to unseen and more complex image generation tasks.
Multimodal instructions for generation
The pre-trained model used by the team is a diffusion model and users can set input conditions for it. For details, please see the original paper.
To ensure versatility and generalization, the team proposed a unified multimodal instruction format in which language states the goal of the task explicitly and the multimodal conditions are provided as reference information.
The proposed instruction format contains two key components: (1) the payload text instruction, which describes the task goal in detail and contains reference markers such as [ref#?]; (2) the multimodal context, made up of paired (marker text, image) entries. The model then uses a shared instruction-understanding model to process both the text instruction and the multimodal context, and the modality of the context is not restricted.
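The two components of the format can be pictured as a small data structure. The following is only a minimal sketch and not the paper's implementation: the class names, the field names, the `missing_references` helper, and the file names are all hypothetical, chosen to show how reference markers in the payload text line up with entries in the multimodal context.

```python
import re
from dataclasses import dataclass, field

# Hypothetical names throughout: the article only specifies a payload text
# instruction with markers such as [ref#?] plus a paired multimodal context.
REF_PATTERN = re.compile(r"\[ref#(\d+)\]")


@dataclass
class ContextItem:
    marker: str      # reference marker, e.g. "[ref#1]"
    text: str        # short text describing the role of the reference
    image_path: str  # placeholder for the reference content


@dataclass
class MultimodalInstruction:
    payload: str                                 # task goal with reference markers
    context: list = field(default_factory=list)  # list of ContextItem

    def missing_references(self):
        """Return markers used in the payload that have no context entry."""
        provided = {item.marker for item in self.context}
        used = [f"[ref#{n}]" for n in REF_PATTERN.findall(self.payload)]
        return [marker for marker in used if marker not in provided]


# Illustrative instruction modeled on the cat-and-hat example in the article.
instruction = MultimodalInstruction(
    payload="Draw the cat in [ref#1] in the style of [ref#2], wearing a hat.",
    context=[
        ContextItem("[ref#1]", "subject", "cat.png"),
        ContextItem("[ref#2]", "style", "style.png"),
    ],
)
```

Because the task is expressed in the payload text, a new task in this sketch only needs a new sentence and new context entries, not a new data type.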
Figure 2 uses three examples to show that this format can represent a range of earlier image generation tasks, so it remains compatible with them. More importantly, because language is flexible, multimodal instructions can be extended to new tasks without any special design for particular modalities or tasks.
Instruct-Imagen
Instruct-Imagen is built around multimodal instructions. The team designed its architecture on top of a pre-trained text-to-image diffusion model, specifically a cascaded diffusion model, so that it can be fully conditioned on the input multimodal instructions.
Specifically, they used a variant of Imagen (see the paper "Photorealistic text-to-image diffusion models with deep language understanding") pre-trained on their internal data sources. The complete model contains two sub-components: (1) a text-to-image component, which generates 128×128 images from text prompts alone; (2) a text-conditioned super-resolution model, which upscales the 128-resolution images to 1024 resolution.
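The data flow of such a cascade can be sketched in a few lines. This is a toy illustration under loud assumptions: both stages are replaced with deterministic stand-ins (a synthetic gray-value grid and nearest-neighbor upsampling), and the function names are hypothetical. Only the 128-to-1024 resolution hand-off comes from the article.

```python
def base_text_to_image(prompt, size=128):
    """Stand-in for the 128x128 text-to-image stage (not a real diffusion model).

    Returns a size x size grid of gray values derived from the prompt.
    """
    seed = sum(ord(c) for c in prompt) % 256
    return [[(seed + x + y) % 256 for x in range(size)] for y in range(size)]


def super_resolve(image, prompt, target=1024):
    """Stand-in for the text-conditioned super-resolution stage.

    Nearest-neighbor upsampling replaces the learned upsampler; the prompt
    is accepted only to mirror the text conditioning and is otherwise unused.
    """
    factor = target // len(image)
    return [
        [image[y // factor][x // factor] for x in range(target)]
        for y in range(target)
    ]


def cascaded_generate(prompt):
    """Run the two stages in sequence: low-resolution image, then upscaling."""
    low_res = base_text_to_image(prompt)
    return super_resolve(low_res, prompt)
```

Calling `cascaded_generate("a cat wearing a hat")` returns a 1024×1024 grid in this sketch; in the real model each stage is a diffusion model.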
As for the encoding of multi-modal instructions, see Figure 3 (right), which shows the data flow of Instruct-Imagen encoding multi-modal instructions.
Training Instruct-Imagen with a two-stage method
The training process of Instruct-Imagen is divided into two stages.
The first stage is retrieval-augmented text-to-image training, which continues text-to-image training with retrieved neighbor (image, text) pairs supplied as additional context.
The second stage fine-tunes the output model of the first stage on a mixture of diverse image generation tasks, each paired with corresponding multimodal instructions. Specifically, the team used 11 image generation datasets across 5 task categories; see Table 1.
In both training stages, the model is optimized end-to-end.
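To make the first stage more concrete, here is a minimal sketch of how a retrieval-augmented training example might be assembled. The article does not describe the retrieval mechanism, so everything here is an assumption: cosine similarity over toy embeddings, the dictionary fields, and the `build_retrieval_example` helper are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_retrieval_example(target, corpus, k=2):
    """Pair a target (image, text) with its k most similar corpus neighbors.

    The neighbors play the role of the retrieved (image, text) context;
    the target itself is excluded from its own context.
    """
    candidates = [item for item in corpus if item["id"] != target["id"]]
    ranked = sorted(
        candidates,
        key=lambda item: cosine(target["emb"], item["emb"]),
        reverse=True,
    )
    return {
        "text": target["text"],
        "image": target["image"],
        "context": [(n["text"], n["image"]) for n in ranked[:k]],
    }


# Toy corpus with made-up 2-D embeddings, for illustration only.
corpus = [
    {"id": 0, "text": "a tabby cat", "image": "cat0.png", "emb": [1.0, 0.0]},
    {"id": 1, "text": "a black cat", "image": "cat1.png", "emb": [0.9, 0.1]},
    {"id": 2, "text": "a red car", "image": "car.png", "emb": [0.0, 1.0]},
    {"id": 3, "text": "a kitten", "image": "cat2.png", "emb": [0.8, 0.3]},
]
example = build_retrieval_example(corpus[0], corpus, k=2)
```

In this toy corpus the two cat entries are retrieved as context for the tabby cat, while the car is left out. The second stage would then swap these retrieved contexts for task-specific multimodal instructions.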
Experiments
The team conducted an experimental evaluation of the newly proposed method and model, and conducted an in-depth analysis of the design and failure modes of Instruct-Imagen.
Experimental Settings
The team evaluated the model in two settings, in-domain task evaluation and zero-shot task evaluation, with the latter being more challenging than the former.
Main results
Figure 4 compares Instruct-Imagen with the baseline and previous methods. The results show that it is comparable to previous methods in both in-domain and zero-shot evaluation.
This shows that training with multimodal instructions can improve performance on tasks with limited training data (such as stylized generation) while maintaining performance on data-rich tasks (such as generating photo-like images). Without multimodal instruction training, the multi-task baseline tends to produce poor image quality and text alignment.
For example, in the in-context stylization example in Figure 5, the multi-task baseline has difficulty separating style from object, so the objects in the reference are reproduced in the generated results. For similar reasons, it also performs poorly on style transfer tasks. These observations highlight the value of instruction fine-tuning.
Unlike current methods that rely on task-specific training or fine-tuning, Instruct-Imagen handles compositional tasks efficiently by leveraging instructions that combine the goals of different tasks and performing inference in context (no fine-tuning required, 18.2 seconds per example).
As shown in Figure 6, Instruct-Imagen consistently outperforms other models in terms of instruction following and output quality.
In addition, when the multimodal context contains multiple references, the multi-task baseline model fails to match the text instructions to the references, so some multimodal conditions are ignored. These results further demonstrate the effectiveness of the newly proposed model.
Model Analysis and Ablation Study
The team analyzed the limitations and failure modes of the model.
For example, the team found that a fine-tuned Instruct-Imagen can edit images. Table 2 compares the earlier SDXL-inpainting, Imagen fine-tuned on the MagicBrush dataset, and the fine-tuned Instruct-Imagen; the fine-tuned Instruct-Imagen is significantly better than the model designed specifically for mask-based image editing.
However, the fine-tuned Instruct-Imagen produces artifacts in the edited images, especially the high-resolution output after the super-resolution step, as shown in Figure 7. The researchers say this is because the model has not previously learned to accurately copy pixels directly from context.
The team also found that retrieval-enhanced training can help improve generalization ability, and the results are shown in Table 3.
Regarding failure modes, the researchers found that when the multimodal instructions are more complex (at least 3 multimodal conditions), Instruct-Imagen struggles to generate results that follow the instructions. Figure 8 gives two examples.
The following shows some results on complex tasks that were not seen during training.
The team also conducted ablation studies to prove the importance of its design components.
However, due to safety concerns, Google has not yet released the code or an API for this research.
See original paper for more details.