


New SOTA for text-to-image! Pika, Peking University, and Stanford jointly launch RPG, using multimodal LLMs to tackle two major problems in text-to-image generation
Recently, Peking University, Stanford, and the popular Pika Labs jointly published a study that takes the text-to-image capabilities of large models to a new level.
Paper address: https://arxiv.org/pdf/2401.11708.pdf
Code Address: https://github.com/YangLing0818/RPG-DiffusionMaster
The authors propose an innovative method that uses the reasoning capabilities of multimodal large language models (MLLMs) to improve text-to-image generation and editing.
In other words, the method aims to improve the performance of text-to-image models on complex text prompts that contain multiple objects, attributes, and relationships.
Without further ado, here’s the picture:
A green twintail girl in orange dress is sitting on the sofa while a messy desk under a big window on the left, a lively aquarium is on the top right of the sofa, realistic style.
Faced with multiple objects in complex relationships, the model produces a very reasonable overall composition and plausible relationships between the people and objects, and the result is striking.
And for the same prompt, let’s take a look at the performance of the current state-of-the-art SDXL and DALL·E 3:
Let’s take a look at the performance of the new framework when binding multiple attributes to multiple objects:
From left to right, a blonde ponytail European girl in white shirt, a brown curly hair African girl in blue shirt printed with a bird, an Asian young man with black short hair in suit are walking in the campus happily.
The researchers named the framework RPG (Recaption, Plan and Generate). It uses an MLLM as a global planner to decompose the complex image generation process into multiple simpler generation tasks within sub-regions.
The paper proposes complementary regional diffusion to achieve region-wise compositional generation, and also integrates text-guided image generation and editing into the RPG framework in a closed-loop manner, which enhances its generalization ability.
Experiments show that the RPG framework outperforms current state-of-the-art text-to-image diffusion models, including DALL·E 3 and SDXL, especially in multi-category object composition and text-image semantic alignment.
It is worth noting that the RPG framework is widely compatible with various MLLM architectures (such as MiniGPT-4) and diffusion backbone networks (such as ControlNet).
RPG
Current text-to-image models have two main problems: 1. Layout-based or attention-based methods provide only coarse spatial guidance and struggle with overlapping objects; 2. Feedback-based methods require collecting high-quality feedback data and incur additional training costs.
In order to solve these problems, researchers proposed three core strategies of RPG, as shown in the figure below:
Given a complex text prompt containing multiple entities and relationships, the MLLM is first used to decompose it into a base prompt and highly descriptive sub-prompts; next, the chain-of-thought (CoT) planning of the multimodal model divides the image space into complementary sub-regions; finally, complementary regional diffusion generates the image of each sub-region independently and aggregates the results at each sampling step.
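The three stages can be wired together roughly as follows. This is only a minimal sketch of the control flow, not the authors' code: `Plan`, `rpg_pipeline`, `toy_recaption`, and `toy_plan` are hypothetical names, and the toy stand-ins replace the MLLM with simple string splitting and equal-width strips.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical; they only mirror the three stages in the text.
Region = Tuple[int, int, int, int]  # (top, left, height, width) in pixels


@dataclass
class Plan:
    base_prompt: str        # the original prompt, kept for global coherence
    subprompts: List[str]   # one recaptioned sub-prompt per region
    regions: List[Region]   # complementary rectangles covering the canvas


def rpg_pipeline(
    prompt: str,
    recaption: Callable[[str], List[str]],
    plan_regions: Callable[[List[str], int, int], List[Region]],
    diffuse: Callable[[Plan], object],
    height: int = 1024,
    width: int = 1024,
):
    """Run recaption -> plan -> generate with caller-supplied stages."""
    subprompts = recaption(prompt)                     # stage 1: recaptioning
    regions = plan_regions(subprompts, height, width)  # stage 2: region planning
    if len(regions) != len(subprompts):
        raise ValueError("planner must return one region per sub-prompt")
    plan = Plan(prompt, subprompts, regions)
    return plan, diffuse(plan)                         # stage 3: regional diffusion


def toy_recaption(prompt: str) -> List[str]:
    """Stand-in for the MLLM: split the prompt on commas."""
    return [part.strip() for part in prompt.split(",") if part.strip()]


def toy_plan(subprompts: List[str], height: int, width: int) -> List[Region]:
    """Stand-in for CoT planning: equal-width vertical strips."""
    n = len(subprompts)
    edges = [round(i * width / n) for i in range(n + 1)]
    return [(0, edges[i], height, edges[i + 1] - edges[i]) for i in range(n)]
```

In a real system the `recaption` and `plan_regions` callables would query an MLLM and `diffuse` would run a diffusion backbone; passing them in as parameters keeps the sketch independent of any particular model.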
Multimodal recaptioning
Convert the text prompt into highly descriptive prompts, which gives the diffusion model information-enriched prompt understanding and better semantic alignment.
Use the MLLM to identify key phrases in the user prompt y and obtain the sub-items:
Use the LLM to decompose the text prompt into different sub-prompts and re-describe each in more detail:
In this way, denser fine-grained details can be generated for each sub-prompt, which effectively increases the fidelity of the generated images and reduces the semantic gap between prompts and images.
Chain-of-thought planning
Divide the image space into complementary sub-regions and assign a different sub-prompt to each, breaking the generation task into multiple simpler sub-tasks.
Specifically, the image space H × W is divided into several complementary regions, and each augmented sub-prompt is assigned to a specific region R:
The MLLM’s strong chain-of-thought reasoning is used to carry out the region division effectively. By analyzing the intermediate results it produces, detailed rationales and precise instructions can be generated for the subsequent image synthesis.
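To make "complementary" concrete, the sketch below turns a planner's split ratios into pixel rectangles and checks that they tile the H × W canvas without gaps or overlaps. The layout format (a list of row weights, each with its own column weights) and the function names are assumptions made for illustration; the source does not specify how the planner's output is encoded.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int, int, int]  # (top, left, height, width)
Layout = List[Tuple[float, List[float]]]  # hypothetical: (row_weight, column_weights)


def _split(total: int, weights: Sequence[float]) -> List[Tuple[int, int]]:
    """Split `total` pixels into consecutive (start, length) spans by weight."""
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and positive")
    scale = total / sum(weights)
    edges, acc = [0], 0.0
    for w in weights:
        acc += w
        edges.append(round(acc * scale))
    edges[-1] = total  # guard against float rounding on the last edge
    return [(a, b - a) for a, b in zip(edges, edges[1:])]


def partition(height: int, width: int, layout: Layout) -> List[Region]:
    """Rows are listed top to bottom; each row is split into columns."""
    rows = _split(height, [row_weight for row_weight, _ in layout])
    regions = []
    for (top, h), (_, col_weights) in zip(rows, layout):
        for left, w in _split(width, col_weights):
            regions.append((top, left, h, w))
    return regions


def is_complementary(regions: List[Region], height: int, width: int) -> bool:
    """True if the regions cover every pixel of the canvas exactly once."""
    covered = [[0] * width for _ in range(height)]
    for top, left, h, w in regions:
        if top < 0 or left < 0 or top + h > height or left + w > width:
            return False
        for r in range(top, top + h):
            for c in range(left, left + w):
                covered[r][c] += 1
    return all(v == 1 for row in covered for v in row)
```

For example, `partition(64, 96, [(1, [1]), (2, [1, 1, 1])])` gives one full-width strip on top and three equal columns below it.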
Complementary regional diffusion
In each rectangular sub-region, content guided by its sub-prompt is generated independently; the sub-regions are then resized and concatenated to merge them spatially.
According to the paper, this approach effectively addresses the difficulty large models have with overlapping objects. Furthermore, the paper extends the framework to editing tasks, employing contour-based regional diffusion to operate precisely on the inconsistent regions that need modification.
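The resize-and-concatenate step performed at each sampling step can be sketched as below. Several things here are assumptions for illustration only: latents are plain single-channel 2-D lists rather than tensors, resizing is nearest-neighbour, and the concatenated latent is blended with a latent from the base prompt using a hypothetical weight `beta`. The denoising network itself is out of scope.

```python
from typing import List, Tuple

Latent = List[List[float]]          # toy single-channel latent
Region = Tuple[int, int, int, int]  # (top, left, height, width)


def resize_nearest(latent: Latent, out_h: int, out_w: int) -> Latent:
    """Nearest-neighbour resize of a 2-D latent to out_h x out_w."""
    in_h, in_w = len(latent), len(latent[0])
    return [
        [latent[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def concat_regions(
    region_latents: List[Latent], regions: List[Region], height: int, width: int
) -> Latent:
    """Resize each region latent to its rectangle and paste it into one canvas."""
    canvas = [[0.0] * width for _ in range(height)]
    for latent, (top, left, h, w) in zip(region_latents, regions):
        patch = resize_nearest(latent, h, w)
        for r in range(h):
            canvas[top + r][left:left + w] = patch[r]
    return canvas


def aggregate(
    base: Latent, region_latents: List[Latent], regions: List[Region], beta: float = 0.3
) -> Latent:
    """Blend the base latent with the concatenated region latents.

    `beta` is a hypothetical weight: 1.0 keeps only the base latent,
    0.0 keeps only the concatenated regions.
    """
    height, width = len(base), len(base[0])
    cat = concat_regions(region_latents, regions, height, width)
    return [
        [beta * b + (1.0 - beta) * c for b, c in zip(base_row, cat_row)]
        for base_row, cat_row in zip(base, cat)
    ]
```

A sampler would call `aggregate` once per denoising step, after each region has been denoised with its own sub-prompt. Because the regions are complementary, every position in the canvas is written by exactly one region.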
Text-guided image editing
As shown in the image above, in the recaptioning stage RPG uses the MLLM as a captioner to recaption the source image, and uses its strong reasoning capabilities to identify fine-grained semantic differences between the image and the target prompt, directly analyzing how well the input image aligns with the target prompt.
An MLLM (GPT-4, Gemini Pro, etc.) checks the differences between the input and the target with respect to numerical accuracy, attribute binding, and object relationships. The resulting multimodal understanding feedback is then passed to the MLLM for reasoning-based edit planning.
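The closed loop of checking, planning, and editing can be illustrated with a deliberately simplified toy. Here a set of symbolic facts stands in for the image, and a set difference stands in for the MLLM's comparison of counts, attributes, and relationships. Every name below is hypothetical, and nothing here reflects how the actual system represents images or edits.

```python
from typing import List, Set, Tuple

# A fact is a tuple of strings, e.g. ("count", "cat", "2"),
# ("attr", "hat", "red") or ("rel", "cat", "on", "sofa").
Fact = Tuple[str, ...]
Edit = Tuple[str, Fact]  # ("add" | "remove", fact)


def feedback(scene: Set[Fact], target: Set[Fact]) -> List[Edit]:
    """Stand-in for the MLLM check: list the edits needed to match the target."""
    edits: List[Edit] = [("remove", fact) for fact in sorted(scene - target)]
    edits += [("add", fact) for fact in sorted(target - scene)]
    return edits


def apply_edit(scene: Set[Fact], edit: Edit) -> Set[Fact]:
    """Stand-in for one regional edit of the image."""
    op, fact = edit
    return scene - {fact} if op == "remove" else scene | {fact}


def closed_loop_edit(scene: Set[Fact], target: Set[Fact], max_rounds: int = 10):
    """Repeat check -> plan -> edit until the scene matches or rounds run out."""
    history: List[Edit] = []
    for _ in range(max_rounds):
        pending = feedback(scene, target)
        if not pending:
            break  # the scene is aligned with the target prompt
        scene = apply_edit(scene, pending[0])  # one edit per round
        history.append(pending[0])
    return scene, history
```

The point of the loop is that each round re-checks the current result against the target, so an edit that fails or introduces a new inconsistency is caught in the next round instead of being assumed correct.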
Let’s look at generation quality in these three aspects. The first is attribute binding, comparing against SDXL, DALL·E 3, and LMD:
Across all three tests, only RPG accurately reflects what the prompts describe.
Next is numerical accuracy, shown in the same order as above (SDXL, DALL·E 3, LMD, RPG):
Surprisingly, counting turns out to be quite difficult for large text-to-image models, and RPG easily beats the competition.
The last item is reproducing the complex relationships in the prompt:
In addition, regional diffusion can be extended to a hierarchical format, in which a specific sub-region is further divided into smaller sub-regions.
As shown in the figure below, adding a hierarchy of region divisions lets RPG achieve significant improvements in text-to-image generation. This offers a new perspective on handling complex generation tasks and makes it possible to generate images of arbitrary composition.
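One way to picture the hierarchical format is as a tree whose leaves are the final sub-regions. The sketch below is an assumption-laden illustration, not the paper's data structure: `Node`, its fields, and `leaves` are hypothetical names, and each node splits its rectangle along a single axis by weight.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (top, left, height, width)


@dataclass
class Node:
    weight: float = 1.0   # share of the parent along the parent's split axis
    prompt: str = ""      # sub-prompt, used only when the node is a leaf
    axis: str = "cols"    # how this node's children divide it: "rows" or "cols"
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node, top: int, left: int, height: int, width: int) -> List[Tuple[str, Region]]:
    """Recursively split a rectangle and return (prompt, region) for each leaf."""
    if not node.children:
        return [(node.prompt, (top, left, height, width))]
    total = height if node.axis == "rows" else width
    weight_sum = sum(child.weight for child in node.children)
    out: List[Tuple[str, Region]] = []
    acc, start = 0.0, 0
    for i, child in enumerate(node.children):
        acc += child.weight
        # The last child always ends at the parent's edge, so no pixel is lost.
        last = i == len(node.children) - 1
        end = total if last else round(total * acc / weight_sum)
        if node.axis == "rows":
            out += leaves(child, top + start, left, end - start, width)
        else:
            out += leaves(child, top, left + start, height, end - start)
        start = end
    return out
```

For instance, a root split into two columns, where the right column is itself split into two rows, yields three leaf regions that still tile the whole canvas.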