For the first time, AI edits pictures from a single sentence without relying on a generative model!

2022 was the breakout year for artificial intelligence-generated content (AIGC). One popular direction is editing pictures through text descriptions (text prompts). Existing methods usually rely on generative models trained on large-scale datasets, which entails high data acquisition and training costs and also leads to large model sizes. These factors raise the barrier to developing and applying the technology in practice, limiting the growth and creativity of AIGC.

To address these pain points, NetEase Interactive Entertainment AI Lab collaborated with Shanghai Jiao Tong University and proposed CLIPVG, a solution based on a differentiable vector renderer that, for the first time, achieves text-guided image editing without relying on any generative model. The solution uses the characteristics of vector elements to constrain the optimization process, so it avoids massive data requirements and high training overhead while still achieving state-of-the-art generation results. The corresponding paper, "CLIPVG: Text-Guided Image Manipulation Using Differentiable Vector Graphics", has been accepted to AAAI 2023.


  • Paper address: https://arxiv.org/abs/2212.02122
  • Open source code: https://github.com/NetEase-GameAI/clipvg

Some of the effects are as follows (in order: face editing, car model modification, building generation, color change, pattern modification, and font modification).


In terms of generation quality, CLIPVG improves semantic accuracy by 15.9% and generation quality by 23.6% compared with other solutions known in the industry; that is, the edited images it outputs match the text semantics more closely and contain fewer errors. In terms of framework flexibility, since CLIPVG automatically converts pixel images into vector graphics, it can edit image colors, shapes, sub-regions, and so on independently and more flexibly than research frameworks based on pixel images. In terms of application scenarios, since CLIPVG does not rely on generative models at all, it can be applied to a wider range of scenarios, such as portrait stylization, cartoon editing, font design, and automatic coloring. It can even edit different parts of a picture simultaneously, with each part matched one-to-one to its own text condition.

Ideas and technical background

In terms of the overall process, CLIPVG first proposes a multi-round vectorization method that robustly converts pixel images into the vector domain and suits subsequent image editing. An ROI CLIP loss is then defined as the loss function, supporting guidance with a different text for each region of interest (ROI). The entire optimization process uses a differentiable vector renderer to compute gradients with respect to the vector parameters (such as fill colors and control points).

CLIPVG combines technologies from two fields: text-guided image editing in the pixel domain, and the generation of vector images. The relevant technical background is introduced in turn below.

Text-guided image editing

The typical way to let AI "understand" text guidance during image editing is to use the Contrastive Language-Image Pre-Training (CLIP) model. CLIP encodes text and images into a shared latent space where they can be compared, and provides cross-modal similarity information about "whether the image conforms to the text description", thereby establishing a semantic connection between text and images. In practice, however, it is difficult to guide image editing effectively with the CLIP model alone. This is because CLIP mainly focuses on the high-level semantic information of the image and lacks constraints on pixel-level details, so the optimization process easily falls into a local optimum or an adversarial solution.
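As a rough illustration of what "cross-modal similarity" means here, the sketch below compares toy embedding vectors with cosine similarity, the usual way CLIP embeddings are compared. The three-dimensional vectors are made-up stand-ins for illustration only; real CLIP embeddings come from the model's encoders and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, not produced by CLIP.
image_embedding = [0.9, 0.1, 0.3]
matching_text = [0.8, 0.2, 0.4]
unrelated_text = [-0.1, 0.9, -0.2]

sim_match = cosine_similarity(image_embedding, matching_text)
sim_unrelated = cosine_similarity(image_embedding, unrelated_text)
print(sim_match > sim_unrelated)  # the matching text scores higher
```

A single scalar like this says nothing about individual pixels, which is one way to see why optimizing it directly leaves image details unconstrained.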

A common existing approach is to combine CLIP with a pixel-domain generative model based on a GAN or diffusion, such as StyleCLIP (Patashnik et al. 2021), StyleGAN-NADA (Gal et al. 2022), Disco Diffusion (alembics 2022), DiffusionCLIP (Kim, Kwon, and Ye 2022), and DALL·E 2 (Ramesh et al. 2022). These schemes use the generative model to constrain image details, making up for the shortcomings of using CLIP alone. At the same time, these generative models rely heavily on training data and computing resources, and they restrict the effective range of image editing to what the training set covers. Limited by the capacity of their generative models, methods such as StyleCLIP, StyleGAN-NADA, and DiffusionCLIP can only confine a single model to a specific domain, such as face images. Although methods such as Disco Diffusion and DALL·E 2 can edit arbitrary images, they require massive data and computing resources to train their generative models.

Very few existing solutions do not rely on generative models; one example is CLIPstyler (Kwon and Ye 2022). During optimization, CLIPstyler divides the image to be edited into random patches and applies CLIP guidance to each patch to strengthen the constraints on image details. The problem is that each patch independently reflects the semantics defined by the input text. As a result, this solution can only perform style transfer and cannot perform high-level semantic editing of the image as a whole.
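The patch sampling step described above can be sketched as follows. This is only an illustration of drawing random square crop boxes inside an image; the function name, the seed parameter, and the box format are assumptions made here, not CLIPstyler's actual code.

```python
import random

def random_patches(width, height, patch_size, count, seed=0):
    """Return `count` random square boxes (left, top, right, bottom) inside the image."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(count):
        # randint is inclusive on both ends, so the patch can touch the border.
        left = rng.randint(0, width - patch_size)
        top = rng.randint(0, height - patch_size)
        boxes.append((left, top, left + patch_size, top + patch_size))
    return boxes

patches = random_patches(width=512, height=512, patch_size=128, count=4)
for box in patches:
    print(box)
```

Each box would then be cropped from the image and scored against the same text, which is why every patch ends up reflecting the text semantics on its own.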

Unlike the above pixel-domain methods, CLIPVG, proposed by NetEase Interactive Entertainment AI Lab, uses the characteristics of vector graphics to constrain image details, replacing the generative model. CLIPVG supports any input image and can perform general-purpose image editing. Its output is a vector graphic in standard SVG format, which is not limited by resolution.

Vector image generation

Some existing works consider text-guided vector graphics generation, such as CLIPdraw (Frans, Soros, and Witkowski 2021) and StyleCLIPdraw (Schaldenbrand, Liu, and Oh 2022). A typical approach is to combine CLIP with a differentiable vector renderer, start from randomly initialized vector graphics, and gradually approach the semantics expressed by the text. The differentiable vector renderer used is Diffvg (Li et al. 2020), which rasterizes vector graphics into pixel images through differentiable rendering. CLIPVG also uses Diffvg to connect vector images and pixel images. Unlike existing methods, CLIPVG focuses on editing existing images rather than generating them from scratch.

Since most existing images are pixel images, they need to be vectorized before they can be edited using the characteristics of vector graphics. Existing vectorization methods include Adobe Image Trace (AIT) and LIVE (Ma et al. 2022), but these methods do not consider subsequent editing needs. CLIPVG builds a multi-round vectorization enhancement on top of existing methods, specifically to improve the robustness of image editing.

Technical implementation

The overall process of CLIPVG is shown in the figure below. First, the input pixel image undergoes multi-round vectorization at different precisions, where the set of vector elements obtained in the i-th round is denoted $\Theta_i$. The results of all rounds are superimposed to form the optimization object, which is converted back to the pixel domain through differentiable rasterization. The initial state of the output image is the vectorized reconstruction of the input image, which is then iteratively optimized in the direction described by the text. The optimization process computes the ROI CLIP loss based on the region and associated text of each ROI, and optimizes each vector element, including its color and shape parameters, according to the gradient.


The entire iterative optimization process can be seen in the following example, in which the guiding text is "Jocker, Heath Ledger" (the Joker as played by Heath Ledger).


Vectorization

A vector graphic can be defined as a collection of vector elements, where each vector element is controlled by a series of parameters that depend on its type. Taking a filled curve as an example, its parameters consist of the control point parameters together with the parameters for RGB color and opacity. Optimizing vector elements comes with some natural constraints. For example, the color inside an element is always uniform, and the topological relationship between its control points is fixed. These properties make up for CLIP's lack of constraints on details and can greatly enhance the robustness of the optimization process.
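To make the parameterization concrete, here is a minimal sketch of a filled curve as a data structure, serialized to an SVG path element. The class name, field names, and the choice of cubic Bezier segments are assumptions for illustration; they are not taken from the CLIPVG code base.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FilledCurve:
    # Closed path of cubic Bezier segments: one start point,
    # then three control points per segment (illustrative layout).
    control_points: List[Tuple[float, float]]
    rgb: Tuple[float, float, float]  # each channel in [0, 1]
    opacity: float                   # in [0, 1]

    def to_svg_path(self) -> str:
        start, rest = self.control_points[0], self.control_points[1:]
        if len(rest) % 3 != 0:
            raise ValueError("expected three control points per cubic segment")
        d = f"M {start[0]:g} {start[1]:g}"
        for i in range(0, len(rest), 3):
            segment = " ".join(f"{x:g} {y:g}" for x, y in rest[i:i + 3])
            d += f" C {segment}"
        d += " Z"
        r, g, b = (round(255 * c) for c in self.rgb)
        return f'<path d="{d}" fill="rgb({r},{g},{b})" fill-opacity="{self.opacity:g}"/>'

curve = FilledCurve(
    control_points=[(0, 0), (10, 0), (10, 10), (0, 10)],
    rgb=(1.0, 0.0, 0.0),
    opacity=0.5,
)
print(curve.to_svg_path())
```

Note that the element carries a single fill color and a fixed ordering of control points, which is exactly the kind of built-in constraint the paragraph above refers to.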

In theory, CLIPVG can use any existing method for vectorization. But the research found that doing so leads to several problems in subsequent image editing. First, a typical vectorization method ensures that adjacent vector elements are perfectly aligned in the initial state, but each element moves during optimization, causing "cracks" to appear between elements. Second, the input image is sometimes relatively simple and needs only a small number of vector elements to fit, while the effect described by the text requires more complex details to express, so image editing lacks the necessary raw material (vector elements).

To address these problems, CLIPVG proposes a multi-round vectorization strategy. Each round calls an existing method to obtain a vectorized result, and the results are superimposed in sequence. Each round uses a higher precision than the previous one, i.e. it vectorizes with smaller vector elements. The figure below shows the difference between precisions during vectorization.
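The stacking logic of this strategy can be sketched in a few lines. The `vectorize` callable and the `element_counts` schedule below are hypothetical placeholders: a real pipeline would call an existing vectorizer, and the toy function here only returns labels so the layering order is visible.

```python
def multi_round_vectorize(image, vectorize, element_counts):
    """Stack the results of several vectorization rounds, coarse to fine.

    Later (finer) rounds are appended after earlier ones, so they are
    drawn on top when the stack is rendered in order.
    """
    stacked = []
    for n_elements in element_counts:
        stacked.extend(vectorize(image, n_elements))
    return stacked

def toy_vectorize(image, n_elements):
    # Hypothetical stand-in for a real vectorizer: returns labels only.
    return [f"round{n_elements}-element{k}" for k in range(n_elements)]

stack = multi_round_vectorize(image=None, vectorize=toy_vectorize, element_counts=[2, 4])
print(len(stack))
print(stack[0], stack[-1])
```

Because coarse elements sit underneath finer ones, a fine element that drifts during optimization exposes a coarse element of similar color rather than an empty gap, and the extra rounds also supply more elements to express new details.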


The set of vector elements obtained by the i-th round of vectorization is denoted $\Theta_i$, and the set of vector elements obtained by superimposing the results of all rounds is denoted $\Theta$, which is the overall optimization object of CLIPVG.

Loss function

Similar to StyleGAN-NADA and CLIPstyler, CLIPVG uses a directional CLIP loss to measure the correspondence between the generated image and the description text, which is defined as follows:

$$\Delta T = E_T(t) - E_T(t_{ref}), \qquad \Delta I = E_I(I_{gen}) - E_I(I_{init})$$

$$\mathcal{L}_{dir}(I_{gen}, I_{init}, t) = 1 - \frac{\Delta I \cdot \Delta T}{|\Delta I|\,|\Delta T|}$$

Here $t$ represents the input text description, $t_{ref}$ is a fixed reference text, set to "photo" in CLIPVG, $I_{gen}$ is the generated image (the object to be optimized), and $I_{init}$ is the original image. $E_T$ and $E_I$ are the text and image encoders of CLIP respectively. $\Delta T$ and $\Delta I$ represent the latent space directions of text and image respectively. The purpose of optimizing this loss function is to make the direction of semantic change in the edited image conform to the text description. The fixed $t_{ref}$ is omitted from subsequent formulas. In CLIPVG, the generated image is the result of differentiable rendering of vector graphics. In addition, CLIPVG supports assigning a different text description to each ROI. In that case, the directional CLIP loss becomes the following ROI CLIP loss:

$$\mathcal{L}_{ROI}^{i}(\Theta) = \mathcal{L}_{dir}\big(C(R(\Theta), A_i),\; C(I_{init}, A_i),\; t_i\big)$$

where $A_i$ is the i-th ROI and $t_i$ is its associated text description. $R$ is the differentiable vector renderer, and $R(\Theta)$ is the entire rendered image. $I_{init}$ is the entire input image. $C(I, A_i)$ represents a cropping operation, which crops the area $A_i$ from image $I$. CLIPVG also supports a patch-based enhancement scheme similar to that in CLIPstyler: multiple patches can be further randomly cropped from each ROI, and the CLIP loss is calculated for each patch based on the text description corresponding to the ROI.

The total loss is the weighted sum of the ROI CLIP losses over all regions, that is,

$$\mathcal{L}(\Theta) = \sum_i w_i \, \mathcal{L}_{ROI}^{i}(\Theta)$$

Here a region can be an ROI or a patch cropped from an ROI, and $w_i$ is the loss weight corresponding to each region.
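A minimal numeric sketch of the directional loss and the weighted total may help. It assumes the standard cosine-based form of the directional CLIP loss and operates on plain lists standing in for CLIP embeddings; the function names and the small `eps` safeguard are choices made for this illustration, not part of the paper.

```python
import math

def directional_clip_loss(e_img_gen, e_img_init, e_text, e_text_ref, eps=1e-8):
    """1 - cosine similarity between the image and text change directions."""
    delta_i = [a - b for a, b in zip(e_img_gen, e_img_init)]
    delta_t = [a - b for a, b in zip(e_text, e_text_ref)]
    dot = sum(x * y for x, y in zip(delta_i, delta_t))
    norm = math.sqrt(sum(x * x for x in delta_i)) * math.sqrt(sum(y * y for y in delta_t))
    # eps avoids division by zero when the image has not changed yet.
    return 1.0 - dot / (norm + eps)

def total_loss(region_losses, weights):
    """Weighted sum of per-region (ROI or patch) losses."""
    return sum(w * loss for w, loss in zip(weights, region_losses))

# Toy embeddings: the image moved in the same direction as the text.
aligned = directional_clip_loss([1.0, 0.0], [0.0, 0.0], [2.0, 0.0], [0.0, 0.0])
# The image moved in the opposite direction.
opposed = directional_clip_loss([-1.0, 0.0], [0.0, 0.0], [2.0, 0.0], [0.0, 0.0])
print(aligned < opposed)
print(total_loss([aligned, opposed], weights=[1.0, 0.5]))
```

In a real setup each region loss would be computed on crops of the rendered and original images, with gradients flowing back through the renderer to the vector parameters.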

CLIPVG optimizes the vector parameter set Θ based on the above loss function. The optimization can also target only a subset of Θ, such as the shape parameters, the color parameters, or the vector elements corresponding to a specific region.
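Restricting optimization to a subset of parameters can be pictured as a gradient step that skips frozen groups. The sketch below uses plain gradient descent on dictionaries of lists; the group names "shape" and "color", the learning rate, and the gradient values are illustrative assumptions, and a real implementation would rely on an autodiff framework and its optimizer.

```python
def gradient_step(params, grads, trainable, lr=0.1):
    """Apply one gradient descent step to the groups named in `trainable` only."""
    updated = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen group: copied unchanged
    return updated

params = {"shape": [10.0, 20.0], "color": [0.5, 0.5, 0.5]}
grads = {"shape": [1.0, -1.0], "color": [0.2, 0.0, -0.2]}

# Recoloring: lock the shape parameters and update colors only.
recolored = gradient_step(params, grads, trainable={"color"})
print(recolored["shape"] == params["shape"])
print(recolored["color"])
```

Swapping the trainable set to `{"shape"}` gives the opposite behavior, where colors stay locked and only the outlines move.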

Experimental results

In the experiments, the paper first verifies the effectiveness of the multi-round vectorization strategy and vector-domain optimization through ablation experiments, then compares CLIPVG with existing baselines, and finally demonstrates its unique application scenarios.

Ablation experiment

The study first compares the multi-round vectorization (Multi-round) strategy with single-round vectorization (One-shot). The first row in the figure below is the initial result after vectorization, and the second row is the edited result, where Nc represents the precision of vectorization. It can be seen that multi-round vectorization not only improves the reconstruction accuracy of the initial state, but also effectively eliminates the cracks between vector elements after editing and enhances the rendering of details.


To further study the characteristics of vector-domain optimization, the paper compares the effect of patch enhancement with different patch sizes for CLIPVG (a vector-domain method) and CLIPstyler (a pixel-domain method). The first row in the figure below shows CLIPVG with different patch sizes, and the second row shows CLIPstyler. The text description is "Doctor Strange", and the resolution of the entire image is 512x512. When the patch size is small (128x128 or 224x224), both CLIPVG and CLIPstyler display the representative red and blue colors of "Doctor Strange" in small local areas, but the semantics of the face as a whole do not change significantly. This is because the CLIP guidance at this point is not applied to the entire image. When CLIPVG increases the patch size to 410x410, there are obvious changes in character identity, including the hairstyle and facial features, which are effectively edited according to the text description. If patch enhancement is removed, the semantic editing effect and detail clarity are reduced, indicating that patch enhancement still has a positive effect.

Unlike CLIPVG, CLIPstyler still cannot change the character's identity when the patch is enlarged or patch enhancement is removed, and only changes the overall color and some local textures. The reason is that enlarging the patch size in the pixel domain loses the underlying constraints on details, so the optimization falls into a local optimum. This comparison shows that CLIPVG can effectively exploit the constraints on details in the vector domain and, combined with a larger CLIP scope (patch size), achieve high-level semantic editing that is difficult for pixel-domain methods.
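For a sense of scale, the short calculation below works out what fraction of the 512x512 image each patch size from the comparison covers. It is simple arithmetic added for illustration, not a result from the paper.

```python
def patch_coverage(patch_side, image_side):
    """Fraction of a square image's area covered by one square patch."""
    return patch_side ** 2 / image_side ** 2

for side in (128, 224, 410):
    print(side, patch_coverage(side, 512))
```

The three sizes cover roughly 6%, 19%, and 64% of the image, so only the 410x410 patch lets CLIP see most of the face at once, which fits the observation that identity changes appear only at that size.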

Comparative experiment

In the comparative experiments, the study first compares CLIPVG with two pixel-domain methods that can edit arbitrary pictures, Disco Diffusion and CLIPstyler. As the figure below shows, for the example "Self-Portrait of Vincent van Gogh", CLIPVG can edit the character identity and painting style at the same time, while the pixel-domain methods achieve only one of the two. For "Gypsophila", CLIPVG edits the number and shape of petals more accurately than the baseline methods. In the examples "Jocker, Heath Ledger" and "A Ford Mustang", CLIPVG also robustly changes the overall semantics. By comparison, Disco Diffusion is prone to local flaws, while CLIPstyler generally adjusts only texture and color.


(Top to bottom: Van Gogh painting, gypsophila, Heath Ledger Joker, Ford Mustang)

The researchers then compared CLIPVG with pixel-domain methods for images in a specific domain (taking human faces as an example), including StyleCLIP, DiffusionCLIP, and StyleGAN-NADA. Because their scope of use is restricted, the generation quality of these baseline methods is generally more stable. In this set of comparisons, CLIPVG still shows results that are not inferior to existing methods, and its consistency with the target text in particular is often higher.


(Top to bottom: Doctor Strange, White Walkers, Zombies)

More applications

Using the characteristics of vector graphics and ROI-level loss functions, CLIPVG can support a series of novel applications that are difficult to achieve with existing methods. For example, the editing effect of the multi-person picture shown at the beginning of this article is achieved by defining different ROI-level text descriptions for different characters. In the picture below, the left side is the input, the middle is the editing result with ROI-level text descriptions, and the right side is the result when the entire picture has only one overall text description. The descriptions corresponding to regions A1 to A7 are: 1. "Justice League Six", 2. "Aquaman", 3. "Superman", 4. "Wonder Woman", 5. "Cyborg", 6. "Flash, DC Superhero", and 7. "Batman". It can be seen that ROI-level descriptions allow each character to be edited separately, whereas the overall description cannot generate effective individual identity features. Since the ROIs overlap with each other, it is difficult for existing methods to achieve the overall coordination of CLIPVG even if each character is edited individually.


CLIPVG can also achieve a variety of special editing effects by optimizing only some of the vector parameters. The first row in the image below shows the effect of editing only a partial area. The second row shows font generation with the color parameters locked and only the shape parameters optimized. The third row is the opposite of the second, recoloring the image by optimizing only the color parameters.


(Top to bottom: sub-area editing, font stylization, image color change)


Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn for removal.