# TextDiffuser: No fear of text in images, providing higher quality text rendering
The field of text-to-image generation has made tremendous progress over the past few years, especially in the era of AI-generated content (AIGC). Since the rise of the DALL-E model, more and more text-to-image models have emerged in the academic community, such as Imagen, Stable Diffusion, and ControlNet. However, despite this rapid development, existing models still struggle to reliably generate images that contain text.
After trying existing state-of-the-art (SOTA) text-to-image models, we find that the text they generate is essentially unreadable, resembling garbled characters, which greatly detracts from the overall aesthetics of the image.
Text generated by existing SOTA text-to-image models is poorly readable
A survey of the literature shows that academia has paid little attention to this problem. Yet images containing text are very common in daily life, such as posters, book covers, and street signs. If AI can effectively generate such images, it can assist designers in their work, inspire design ideas, and reduce the design burden. In addition, a user may wish to modify only the text portion of a text-to-image model's output while keeping all non-text regions unchanged.
## The three functions of TextDiffuser
The TextDiffuser model proposed in this article consists of two stages: the first stage generates the layout, and the second stage generates the image.
TextDiffuser framework diagram
The model accepts a text prompt and then determines the layout (i.e., the coordinate box) of each keyword in the prompt. The researchers use a Layout Transformer in encoder-decoder form to autoregressively output the coordinate boxes of the keywords, and render the text with Python's Pillow library. In this process, Pillow's ready-made APIs can also return the coordinate box of each character, which amounts to obtaining a character-level segmentation mask. Based on this information, the researchers fine-tune Stable Diffusion.
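To make the rendering step concrete, below is a minimal sketch of how text can be rendered and approximate per-character boxes recovered with Pillow. The font path, font size, and the advance-width accumulation are illustrative assumptions, not the paper's exact procedure.

```python
from PIL import Image, ImageDraw, ImageFont

def render_with_char_boxes(text, origin, font_path="arial.ttf", font_size=48,
                           canvas_size=(512, 512)):
    """Render `text` at `origin` and return an approximate bounding box
    for each character, using only standard Pillow calls."""
    font = ImageFont.truetype(font_path, font_size)
    image = Image.new("RGB", canvas_size, "white")
    draw = ImageDraw.Draw(image)
    draw.text(origin, text, font=font, fill="black")

    # Walk the string, accumulating each glyph's horizontal advance.
    # (Kerning is ignored, so the boxes are approximate.)
    char_boxes, cursor_x = [], origin[0]
    for ch in text:
        left, top, right, bottom = font.getbbox(ch)  # bounds relative to the glyph origin
        char_boxes.append((cursor_x + left, origin[1] + top,
                           cursor_x + right, origin[1] + bottom))
        cursor_x += font.getlength(ch)               # advance width of this glyph
    return image, char_boxes

image, boxes = render_with_char_boxes("TextDiffuser", (64, 200))
```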
They considered two scenarios. In the first (whole-image generation), the user directly generates the entire image. In the second (part-image generation, also called text inpainting in the paper), the user provides an image and modifies certain text regions in it.
To support both scenarios, the researchers redesigned the input features, increasing the channel dimension from the original 4 to 17. These comprise 4 channels of noisy image features, 8 channels of character information, a 1-channel image mask, and 4 channels of unmasked image features. For whole-image generation, the mask covers the entire image; conversely, for part-image generation, only part of the image is masked. The training process of the diffusion model is similar to LDM; interested readers can refer to the method section of the original paper.
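The channel bookkeeping can be illustrated with a short PyTorch sketch. The 64×64 latent resolution, the function name, and the way the character information is encoded here are assumptions for illustration; only the 4 + 8 + 1 + 4 = 17 channel split comes from the text above.

```python
import torch

def build_denoiser_input(noisy_latents, char_features, mask, unmasked_latents):
    """Concatenate the 17-channel input described above:
    4 (noisy latent) + 8 (character information) + 1 (mask)
    + 4 (latent of the unmasked image)."""
    assert noisy_latents.shape[1] == 4
    assert char_features.shape[1] == 8
    assert mask.shape[1] == 1
    assert unmasked_latents.shape[1] == 4
    return torch.cat([noisy_latents, char_features, mask, unmasked_latents], dim=1)

b, h, w = 2, 64, 64
# Whole-image generation: the mask covers the entire image,
# so no original image content is preserved.
x = build_denoiser_input(
    torch.randn(b, 4, h, w),   # noisy latent from the diffusion forward process
    torch.randn(b, 8, h, w),   # character-level layout features
    torch.ones(b, 1, h, w),    # mask == 1 everywhere
    torch.zeros(b, 4, h, w),   # unmasked-image features are empty
)
print(x.shape)  # torch.Size([2, 17, 64, 64])
```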
In the inference stage, TextDiffuser can be used very flexibly and supports three modes of use.
The constructed MARIO-10M dataset
To train TextDiffuser, the researchers collected 10 million text images, as shown in the figure above, spanning three subsets: MARIO-LAION, MARIO-TMDB, and MARIO-OpenLibrary.
The researchers considered several aspects when filtering the data. For example, after running OCR on each image, only images whose text count falls in [1, 8] are retained. Images with more than 8 texts are filtered out because they often contain large amounts of dense text, for which OCR results are generally less accurate, such as newspapers or complex design drawings. In addition, the total text area must exceed 10% of the image, a rule that prevents the text region from being too small.
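The two rules above can be expressed as a small filter function. This is a sketch under assumptions: the OCR output format shown here is hypothetical, and the actual MARIO-10M pipeline may apply the rules differently.

```python
def keep_image(ocr_boxes, image_width, image_height):
    """Apply the two filtering rules described above. `ocr_boxes` is assumed
    to be a list of (text, x, y, w, h) tuples from an OCR engine."""
    # Rule 1: keep only images with 1 to 8 detected text instances;
    # denser text (newspapers, complex designs) yields unreliable OCR.
    if not 1 <= len(ocr_boxes) <= 8:
        return False
    # Rule 2: the total text area must exceed 10% of the image,
    # so that the text is not negligibly small.
    text_area = sum(w * h for _text, _x, _y, w, h in ocr_boxes)
    return text_area > 0.10 * image_width * image_height
```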
After training on the MARIO-10M dataset, the researchers compared TextDiffuser quantitatively and qualitatively with existing methods. In the whole-image generation task, for example, the images generated by this method contain clearer, more readable text that blends better with the background, as shown in the figure below.
Comparison of text rendering performance with existing work
The researchers also conducted a series of quantitative experiments, with results shown in Table 1. The evaluation metrics include FID, CLIPScore, and OCR. On the OCR metric in particular, this method improves significantly over the compared methods.
Table 1: Quantitative experimental results
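As a rough illustration of what an OCR-based score can measure, the sketch below computes the fraction of prompt keywords recovered by an OCR engine from a generated image. The function and the exact definition are hypothetical; the paper's OCR metrics may be defined differently.

```python
def keyword_recovery_rate(prompt_keywords, ocr_words):
    """Fraction of prompt keywords that OCR recovers from the generated image."""
    expected = {w.lower() for w in prompt_keywords}
    recovered = {w.lower() for w in ocr_words}
    return len(expected & recovered) / max(len(expected), 1)

print(keyword_recovery_rate(["hello", "world"], ["Hello", "wrld"]))  # 0.5
```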
For the part-image generation task, the researchers tried adding or modifying characters in a given image. The experiments show that the results generated by TextDiffuser are very natural.
Visualization of the text inpainting function
In general, the TextDiffuser model proposed in this article makes significant progress in text rendering, generating high-quality images containing readable text. In the future, the researchers will further improve TextDiffuser.