


Rare! Apple's open-source image editing tool MGIE: is it coming to the iPhone?
Take a photo, type a text command, and the phone automatically retouches the photo?
This magical capability comes from Apple's newly open-sourced image editing tool, MGIE.
Remove people in the background
Add a pizza to the table
Recently, AI has made significant progress in image editing. On one hand, multimodal large language models (MLLMs) can take images as input and produce visually grounded responses, enabling more natural image editing. On the other hand, instruction-based editing no longer relies on detailed descriptions or region masks; users simply issue a command stating how and what to edit. This approach is very practical because it matches the way humans intuitively communicate. Through these innovations, AI is gradually becoming a capable assistant for image editing.
Building on these ideas, Apple proposes MGIE (MLLM-Guided Image Editing), which uses an MLLM to address the problem of insufficient instruction guidance.
- Paper title: Guiding Instruction-based Image Editing via Multimodal Large Language Models
- Paper link: https://openreview.net/pdf?id=S1RKWSyZ2Y
- Project homepage: https://mllm-ie.github.io/
MGIE (MLLM-Guided Image Editing) consists of a multimodal large language model (MLLM) and a diffusion model, as shown in Figure 2. The MLLM learns to derive concise expressive instructions and provides explicit, visually grounded guidance. The diffusion model performs image editing using the latent imagination of the intended target and is updated jointly through end-to-end training. In this way, MGIE benefits from inherent visual derivation and can resolve ambiguous human instructions to produce reasonable edits.
Guided by human instructions, MGIE can perform Photoshop-style modifications, global photo optimization, and local object editing. Take the picture below as an example: it is hard to know what "healthy" means without additional context, but MGIE correctly associates "vegetable toppings" with the pizza and edits it as a human would expect.
This recalls the "ambition" Cook expressed on a recent earnings call: "I think there is a huge opportunity for Apple in generative AI, but I don't want to go into more details." He also revealed that Apple is actively developing generative AI software features and that these features will be made available to customers later in 2024.
Combined with the series of generative AI research results Apple has released recently, there is reason to look forward to the new AI features Apple will ship next.
Paper details
The MGIE method proposed in this study edits an input image V into a target image according to a given instruction X. For imprecise instructions, the MLLM in MGIE learns to derive a concise expressive instruction ε. To bridge the language and visual modalities, the researchers append special [IMG] tokens after ε and use an edit head to transform them. The transformed information serves as the latent visual imagination from the MLLM and guides a diffusion model toward the intended editing goal. MGIE can thus understand visually aware, ambiguous commands and perform reasonable image editing (the architecture is shown in Figure 2 above).
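To make this pipeline concrete, here is a minimal Python sketch of the forward pass just described. The component names (`mllm`, `edit_head`, `diffusion_edit`) and all shapes are illustrative assumptions, not the released implementation.

```python
import torch

# Hypothetical stand-ins for the three MGIE components; names and shapes are
# placeholders chosen only to show how information flows.
def mllm(image, instruction):
    """Derive a concise expressive instruction and the [IMG] token hidden states."""
    eps = "brighten the sky and add warm tones"     # concise instruction derived from X and V
    img_token_states = torch.randn(1, 8, 768)       # hidden states of the appended [IMG] tokens
    return eps, img_token_states

def edit_head(img_token_states):
    """Map [IMG] hidden states to latent guidance U = {u_1, ..., u_L}."""
    return img_token_states @ torch.randn(768, 768)

def diffusion_edit(image, guidance):
    """Denoise in the VAE latent space, conditioned on the guidance U."""
    return image                                    # placeholder for the actual edit

def mgie(image, instruction):
    eps, img_states = mllm(image, instruction)      # 1) resolve the ambiguous instruction
    guidance = edit_head(img_states)                # 2) latent visual imagination
    return diffusion_edit(image, guidance)          # 3) guided image editing

edited = mgie(torch.randn(1, 3, 512, 512), "make it more healthy")
```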
Concise expressive instructions
Through feature alignment and instruction tuning, the MLLM can provide cross-modal perception and visually grounded responses. For image editing, the study uses the prompt "what will this image be like if [instruction]" as the language input alongside the image and derives a detailed explanation of the editing command. However, these explanations are often too lengthy and can even mislead the intent. To obtain a more concise description, the study applies a pre-trained summarizer so that the MLLM learns to generate a summarized output, which becomes the expressive instruction ε.
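A rough sketch of that derivation step, assuming a generic `generate(image, prompt)` text interface for the MLLM (the interface and prompts here are illustrative; in MGIE the model is trained so that it emits the concise version directly):

```python
def derive_expressive_instruction(generate, image, instruction):
    """Ask what the edited image should look like, then keep only a concise version.

    `generate(image, prompt)` is a hypothetical multimodal text-generation call.
    During training, MGIE uses a pre-trained summarizer to supervise the MLLM so
    that it learns to produce the short form on its own.
    """
    prompt = f"what will this image be like if {instruction}"
    detailed = generate(image, prompt)            # often verbose, may drift from intent
    concise = generate(image, f"Summarize the edit in one short sentence: {detailed}")
    return concise
```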
Image editing via latent imagination
The study uses an edit head to transform the [IMG] tokens into actual visual guidance. The edit head is a sequence-to-sequence model that maps the continuous visual tokens from the MLLM to semantically meaningful latents U = {u_1, u_2, ..., u_L}, which serve as the editing guidance. To realize image editing guided by this visual imagination, the study uses a latent diffusion model, which includes a variational autoencoder (VAE) and performs denoising diffusion in the latent space.
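Below is a minimal PyTorch sketch of these two pieces: a small sequence-to-sequence edit head that turns the [IMG] hidden states into L guidance latents, which would then condition the denoising U-Net of a latent diffusion model. The dimensions, layer counts, and query-based design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EditHead(nn.Module):
    """Sequence-to-sequence head: [IMG] hidden states -> U = {u_1, ..., u_L}."""
    def __init__(self, mllm_dim=4096, guide_dim=768, num_latents=8):
        super().__init__()
        self.proj_in = nn.Linear(mllm_dim, guide_dim)
        layer = nn.TransformerDecoderLayer(d_model=guide_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(num_latents, guide_dim))

    def forward(self, img_states):                 # (B, N_IMG, mllm_dim)
        memory = self.proj_in(img_states)          # (B, N_IMG, guide_dim)
        queries = self.queries.expand(memory.size(0), -1, -1)
        return self.decoder(queries, memory)       # (B, L, guide_dim) guidance latents U

head = EditHead()
u = head(torch.randn(2, 8, 4096))  # U would feed the diffusion U-Net's cross-attention;
                                   # the VAE encodes the input and decodes the edited result.
```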
Algorithm 1 shows the MGIE learning process. The MLLM derives the concise instruction ε via an instruction loss L_ins. Leveraging the latent imagination of the [IMG] tokens, MGIE transforms their modality and guides the synthesis of the target image; an edit loss L_edit is used for diffusion training. Since most weights can be frozen (the self-attention blocks within the MLLM), parameter-efficient end-to-end training is achieved.
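Read this way, a training step combines a text loss L_ins on the concise instruction with a latent-diffusion loss L_edit on the target image. The sketch below is a hypothetical rendering of that step; the noise schedule, loss weighting, and the exact set of trainable parameters are assumptions rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

def add_noise(z0, noise, t, num_steps=1000):
    """Toy forward process q(z_t | z_0); a real diffusion schedule differs."""
    alpha = (1.0 - t.float() / num_steps).view(-1, *([1] * (z0.dim() - 1)))
    return alpha.sqrt() * z0 + (1.0 - alpha).sqrt() * noise

def mgie_training_step(mllm, edit_head, unet, vae, batch, optimizer):
    image, instruction, target_image, concise_ids = batch

    # L_ins: teacher-forced cross-entropy so the MLLM emits the concise instruction.
    logits, img_states = mllm(image, instruction, concise_ids)
    l_ins = F.cross_entropy(logits.flatten(0, 1), concise_ids.flatten())

    # L_edit: noise prediction on the VAE latent of the target image, conditioned
    # on the guidance latents produced by the edit head.
    guidance = edit_head(img_states)
    z0 = vae.encode(target_image)
    t = torch.randint(0, 1000, (z0.size(0),), device=z0.device)
    noise = torch.randn_like(z0)
    l_edit = F.mse_loss(unet(add_noise(z0, noise, t), t, guidance), noise)

    loss = l_ins + l_edit                          # equal weighting is an assumption
    optimizer.zero_grad()
    loss.backward()                                # end-to-end; frozen blocks receive no updates
    optimizer.step()
    return loss.item()
```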
Experimental evaluation
For the same input image and the same instruction, the figure below compares the results of different methods; for example, the instruction in the first row is "turn day into night":
Table 1 shows zero-shot editing results for models trained only on the IPr2Pr dataset. For EVR and GIER, which involve Photoshop-style modifications, the expressive instructions bring the edits closer to the intended goal (e.g., LGIE reaches a higher 82.0 CVS on EVR). For global image optimization on MA5k, InsPix2Pix struggles because relevant training triplets are scarce. LGIE and MGIE can provide detailed guidance through what the LLM has learned, but LGIE remains limited to a single modality. With access to the image, MGIE can derive explicit instructions, such as which regions should be brightened or which objects should be sharper, yielding significant gains (e.g., a higher 66.3 SSIM and a lower 0.3 photo distance); similar results are observed on MagicBrush. MGIE also achieves the best performance thanks to precise visual imagination and modification of the specified targets (e.g., a higher 82.2 DINO visual similarity and a higher 30.4 CTS global caption alignment).
To study instruction-based image editing for specific purposes, Table 2 fine-tunes the models on each dataset. For EVR and GIER, all models improve when adapted to Photoshop-style editing tasks. MGIE consistently outperforms LGIE in every aspect of editing. This also shows that learning with expressive instructions effectively enhances image editing, and that visual perception plays a crucial role in obtaining the explicit guidance needed for the largest gains.
Trade-off between α_X and α_V. Image editing has two goals: manipulate the target as instructed and preserve the rest of the input image. Figure 3 shows the trade-off curve between instruction consistency (α_X) and input consistency (α_V). The study fixes α_X at 7.5 and varies α_V in the range [1.0, 2.2]. The larger α_V is, the more similar the edited result is to the input, but the less consistent it is with the instruction. The X-axis measures CLIP directional similarity, i.e., how consistent the edits are with the instruction; the Y-axis is the feature similarity to the input image under the CLIP visual encoder. With expressive instructions, MGIE outperforms InsPix2Pix in all settings. Moreover, MGIE learns from explicit visual guidance, allowing an overall improvement, whether the setting favors input similarity or edit relevance.
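The two scales behave like the dual classifier-free guidance popularized by InstructPix2Pix: α_V pulls the denoising estimate toward the input image, α_X toward the (expressive) instruction. A sketch of how such a combined noise estimate is usually formed, with a hypothetical `unet` conditioning interface:

```python
def dual_cfg_noise(unet, z_t, t, cond_text, cond_image, alpha_x=7.5, alpha_v=1.6):
    """InstructPix2Pix-style dual classifier-free guidance (illustrative only)."""
    e_uncond = unet(z_t, t, text=None, image=None)            # no conditioning
    e_image = unet(z_t, t, text=None, image=cond_image)       # input image only
    e_full = unet(z_t, t, text=cond_text, image=cond_image)   # instruction + image
    # alpha_v strengthens fidelity to the input; alpha_x strengthens instruction following.
    return e_uncond + alpha_v * (e_image - e_uncond) + alpha_x * (e_full - e_image)
```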
Ablation studies
In addition, the researchers conducted ablation experiments considering how the FZ, FT, and E2E architectures perform with expressive instructions. The results show that MGIE consistently exceeds LGIE under FZ, FT, and E2E, suggesting that expressive instructions with critical visual perception hold a consistent advantage across all ablation settings.
Why is MLLM guidance useful? Figure 5 shows the CLIP-Score between the input or ground-truth target images and the expressive instructions. A higher CLIP-S against the input image indicates that the instructions are relevant to the editing source, while better alignment with the target image provides explicit, relevant editing guidance. As shown, MGIE's expressive instructions are more consistent with the input/target, which explains why they are helpful: with a clear narrative of the expected result, MGIE achieves the largest improvements in image editing.
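CLIP-S here is the cosine similarity between the CLIP embedding of the expressive instruction and the CLIP embedding of the input (or ground-truth target) image. A minimal sketch using a Hugging Face CLIP checkpoint, where the model choice is an assumption:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(instruction: str, image_path: str) -> float:
    """Cosine similarity between an expressive instruction and an image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[instruction], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return float((text_emb * image_emb).sum())
```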
Human evaluation. In addition to automatic metrics, the researchers also performed a human evaluation. Figure 6 shows the quality of the generated expressive instructions, and Figure 7 compares the image editing results of InsPix2Pix, LGIE, and MGIE in terms of instruction following, ground-truth relevance, and overall quality.
Inference efficiency. Although MGIE relies on an MLLM to drive image editing, it only introduces concise expressive instructions (fewer than 32 tokens), so its efficiency is comparable to InsPix2Pix. Table 4 lists inference times on an NVIDIA A100 GPU. For a single input, MGIE completes an editing task in 10 seconds; with greater data parallelism the time is similar (37 seconds at a batch size of 8), and the whole process runs on a single 40 GB GPU.
Qualitative comparison. Figure 8 shows a visual comparison across all datasets used, and Figure 9 further compares the expressive instructions of LGIE and MGIE.
On the project homepage, the researchers provide more demos (https://mllm-ie.github.io/). For more research details, please refer to the original paper.