Rare! Apple's open-source image editing tool MGIE, is it going to be available on the iPhone?
Take a photo, enter a text command, and the phone will start automatically retouching the photo?
This magical feature comes from Apple’s newly open-sourced image editing tool “MGIE”.
Demo: removing people in the background
Demo: adding a pizza to the table
Recently, AI has made significant progress in image editing. On one hand, multimodal large language models (MLLMs) can take images as input and produce visually aware responses, enabling more natural image editing. On the other hand, instruction-based editing no longer relies on detailed descriptions or region masks; instead, users can directly issue commands expressing how and what to edit. This approach is practical because it matches the intuitive way humans give directions. Through these innovations, AI is gradually becoming people's right-hand assistant for image editing.
Inspired by these techniques, Apple proposed MGIE (MLLM-Guided Image Editing), which uses an MLLM to address the problem of insufficient instruction guidance.
MGIE consists of an MLLM (multimodal large language model) and a diffusion model, as shown in Figure 2. The MLLM learns to derive concise expressive instructions and provides explicit, visually grounded guidance. The diffusion model performs image editing based on the latent imagination of the intended goal and is updated jointly through end-to-end training. In this way, MGIE benefits from inherent visual derivation and can resolve ambiguous human instructions to achieve sensible edits.
Guided by human commands, MGIE can perform Photoshop-style modifications, global photo optimization, and local object edits. Take the picture below as an example: it is difficult to capture what "healthy" means without additional context, but MGIE accurately associates "vegetable toppings" with the pizza and edits the image in line with human expectations.
This recalls the "ambition" Cook expressed on the earnings call not long ago: "I think there is a huge opportunity for Apple in generative AI, but I don't want to go into more details." He also revealed that Apple is actively developing generative AI software features and that these features will be made available to Apple customers later in 2024.
Taken together with the series of generative AI research results Apple has released recently, there is good reason to look forward to the new AI features Apple will ship next.
The MGIE method proposed in this study edits the input image V into the target image according to a given instruction X. For imprecise instructions, the MLLM in MGIE learns to derive concise expressive instructions ε. To bridge the language and visual modalities, the researchers append special [IMG] tokens after ε and use an edit head to transform them. The transformed information serves as the latent visual imagination from the MLLM and guides the diffusion model toward the intended edit. MGIE is thus able to understand visually grounded yet ambiguous commands and perform reasonable image editing (the architecture is shown in Figure 2 above).
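To make this data flow concrete, here is a minimal PyTorch-style sketch of the pipeline described above. Module internals, names, and dimensions are illustrative assumptions, not Apple's released implementation.

```python
# Minimal sketch of the MGIE data flow (illustrative only, not Apple's code).
import torch
import torch.nn as nn

class EditHead(nn.Module):
    """Sequence-to-sequence head: maps the MLLM's [IMG] token states to
    latent visual guidance U = {u_1, ..., u_L} for the diffusion model."""
    def __init__(self, d_mllm: int = 4096, d_guidance: int = 768):
        super().__init__()
        self.mapper = nn.Sequential(
            nn.Linear(d_mllm, d_guidance),
            nn.GELU(),
            nn.Linear(d_guidance, d_guidance),
        )

    def forward(self, img_token_states: torch.Tensor) -> torch.Tensor:
        # (batch, num_[IMG]_tokens, d_mllm) -> (batch, num_[IMG]_tokens, d_guidance)
        return self.mapper(img_token_states)

def mgie_edit(mllm, edit_head, diffusion, image_v, instruction_x):
    """High-level editing flow: V + X -> expressive instruction + edited image."""
    # 1. The MLLM derives a concise expressive instruction and emits [IMG] tokens.
    expressive_instruction, img_token_states = mllm(image_v, instruction_x)
    # 2. The edit head turns the [IMG] hidden states into visual guidance U.
    guidance_u = edit_head(img_token_states)
    # 3. The diffusion model denoises in latent space, conditioned on V and U.
    edited_image = diffusion(image_v, guidance_u)
    return expressive_instruction, edited_image
```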
Concise expressive instructions
Through feature alignment and instruction tuning, the MLLM can provide cross-modal perception and visually grounded responses. For image editing, the study uses the prompt "what will this image be like if [instruction]" as the language input alongside the image and derives a detailed explanation of the editing command. However, these explanations are often too lengthy and can even mislead the intent. To obtain a more concise description, the study applies a pretrained summarizer and lets the MLLM learn to generate the summarized output. This process can be summarized as follows:
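In symbols, the step above can be written schematically as follows (this is a restatement of the prose, not necessarily the paper's exact formulation):

$$
\varepsilon = \mathrm{Summ}\big(\mathrm{MLLM}(\text{"what will this image be like if } X\text{"},\; V)\big)
$$

where V is the input image, X the editing instruction, and ε the concise expressive instruction.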
Image editing via latent imagination
The study uses an edit head to transform [IMG] into actual visual guidance. The edit head is a sequence-to-sequence model that maps the continuous visual tokens from the MLLM to semantically meaningful latents U = {u_1, u_2, ..., u_L}, which serve as the editing guidance:
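Schematically (the symbol for the [IMG] hidden states below is our notation, not necessarily the paper's):

$$
U = \{u_1, u_2, \dots, u_L\} = \mathrm{EditHead}\big(h_{[\mathrm{IMG}]}\big)
$$

where h_[IMG] denotes the MLLM hidden states of the [IMG] tokens.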
To guide image editing through this visual imagination, the study employs a latent diffusion model, which incorporates a variational autoencoder (VAE) and solves the denoising diffusion problem in the latent space.
Algorithm 1 shows the MGIE learning process. The MLLM derives concise expressive instructions ε via the instruction loss L_ins. Using the latent imagination of [IMG], MGIE transforms its modality and guides the synthesis of the resulting image; the edit loss L_edit is used for diffusion training. Since most weights can be frozen (the self-attention blocks within the MLLM), parameter-efficient end-to-end training is achieved.
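The following is a hedged sketch of what one such end-to-end training step could look like, combining L_ins and L_edit. The interfaces of mllm, edit_head, diffusion, and vae, and the equal loss weighting, are assumptions for illustration only.

```python
# Sketch of one training step in the spirit of Algorithm 1 (interfaces are assumed).
import torch
import torch.nn.functional as F

def training_step(mllm, edit_head, diffusion, vae, batch, optimizer):
    image_v, instruction_x, target_image, expressive_targets = batch

    # Instruction loss L_ins: teach the MLLM to emit concise expressive instructions.
    logits, img_token_states = mllm(image_v, instruction_x)
    loss_ins = F.cross_entropy(logits.flatten(0, 1), expressive_targets.flatten())

    # Edit loss L_edit: denoising objective in the VAE latent space,
    # conditioned on the guidance derived from the [IMG] tokens.
    guidance_u = edit_head(img_token_states)
    latent_target = vae.encode(target_image)
    noise = torch.randn_like(latent_target)
    t = torch.randint(0, diffusion.num_timesteps, (latent_target.size(0),),
                      device=latent_target.device)
    noisy_latent = diffusion.add_noise(latent_target, noise, t)
    noise_pred = diffusion.unet(noisy_latent, t,
                                context=guidance_u,
                                image_cond=vae.encode(image_v))
    loss_edit = F.mse_loss(noise_pred, noise)

    # Most MLLM weights stay frozen; only lightweight parts (e.g., the edit head)
    # receive gradients, giving parameter-efficient end-to-end training.
    loss = loss_ins + loss_edit   # relative weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```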
For the same input image and the same instruction, the different methods can be compared side by side; for example, the first row's instruction is "turn day into night":
Table 1 shows the zero-shot editing results of models trained only on the IPr2Pr dataset. For EVR and GIER, which involve Photoshop-style modifications, the editing results come closer to the intended edits (e.g., LGIE achieves a higher CVS of 82.0 on EVR). For global image optimization on MA5k, InsPix2Pix struggles due to the scarcity of relevant training triples. LGIE and MGIE can provide detailed explanations learned from the LLM, but LGIE remains limited to a single modality. With access to the image, MGIE can derive explicit instructions, such as which regions should be brightened or which objects should be sharpened, yielding significant performance gains (e.g., a higher SSIM of 66.3 and a lower photo distance of 0.3). Similar results are observed on MagicBrush, where MGIE also achieves the best performance by drawing on precise visual imagination and editing the specified targets (e.g., a higher DINO visual similarity of 82.2 and a higher global caption alignment CTS of 30.4).
To study instruction-based image editing for specific purposes, Table 2 reports models fine-tuned on each dataset. For EVR and GIER, all models improve after adapting to Photoshop-style editing tasks. MGIE consistently outperforms LGIE in every aspect of editing. This also illustrates that learning with expressive instructions effectively enhances image editing, and that visual perception plays a crucial role in obtaining the explicit guidance needed for maximal gains.
Trade-off between α_X and α_V. Image editing has two goals: manipulating the target as instructed and preserving the rest of the input image. Figure 3 shows the trade-off curves between instruction consistency (α_X) and input consistency (α_V). The study fixes α_X at 7.5 and varies α_V in the range [1.0, 2.2]. The larger α_V is, the more similar the edited result is to the input, but the less consistent it is with the instruction. The X-axis measures CLIP directional similarity, i.e., how consistent the edits are with the instructions; the Y-axis is the feature similarity to the input image under the CLIP visual encoder. With concise expressive instructions, MGIE outperforms InsPix2Pix in all settings. Moreover, MGIE learns from explicit visual guidance, enabling an overall improvement. This holds whether stronger input correlation or edit relevance is required.
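For readers unfamiliar with how two guidance scales interact, the two-scale classifier-free guidance introduced by InstructPix2Pix, which MGIE builds on, has the schematic form below (the MGIE paper's exact notation may differ):

$$
\tilde{\epsilon}_\theta(z_t, V, X) = \epsilon_\theta(z_t, \varnothing, \varnothing)
+ \alpha_V \big(\epsilon_\theta(z_t, V, \varnothing) - \epsilon_\theta(z_t, \varnothing, \varnothing)\big)
+ \alpha_X \big(\epsilon_\theta(z_t, V, X) - \epsilon_\theta(z_t, V, \varnothing)\big)
$$

Increasing α_V pulls the denoising trajectory toward the input image, while increasing α_X pulls it toward the instruction, which is exactly the trade-off traced in Figure 3.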
Ablation studies
In addition, the researchers conducted ablation experiments, considering the performance of the FZ, FT, and E2E variants in expressing instructions. The results show that MGIE consistently exceeds LGIE across FZ, FT, and E2E. This suggests that expressive instructions with crucial visual perception hold a consistent advantage across all ablation settings.
Why does MLLM guidance help? Figure 5 shows the CLIP-Score between the input or ground-truth target images and the expressive instructions. A higher CLIP-S score against the input image indicates that the instructions are relevant to the editing source, while better alignment with the target image provides explicit, relevant editing guidance. As shown, MGIE's expressive instructions are more consistent with the input/goal, which explains why they are helpful. With a clear narrative of the expected result, MGIE achieves the greatest improvements in image editing.
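For readers who want to reproduce a CLIP-S-style measurement, here is a small sketch using the public openai/clip-vit-base-patch32 checkpoint from Hugging Face Transformers; it illustrates the metric in general, not the authors' exact evaluation script.

```python
# Compute a CLIP-S style cosine similarity between an image and an instruction.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, text: str) -> float:
    image = Image.open(image_path)
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb * text_emb).sum(dim=-1).item()  # cosine similarity
```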
Human evaluation. In addition to automatic metrics, the researchers also performed a human evaluation. Figure 6 shows the quality of the generated expressive instructions, and Figure 7 compares the image editing results of InsPix2Pix, LGIE, and MGIE in terms of instruction following, ground-truth relevance, and overall quality.
Inference efficiency. Although MGIE relies on an MLLM to drive image editing, it only introduces concise expressive instructions (fewer than 32 tokens), so its efficiency is comparable to InsPix2Pix. Table 4 lists the inference time cost on an NVIDIA A100 GPU. For a single input, MGIE completes the editing task in 10 seconds. With greater data parallelism, the time required is similar (37 seconds with a batch size of 8). The entire process can run on a single 40GB GPU.
Qualitative comparison. Figure 8 shows a visual comparison across all datasets used, and Figure 9 further compares the expressive instructions of LGIE and MGIE.
On the project homepage, the researchers also provide more demos (https://mllm-ie.github.io/). For more research details, please refer to the original paper.