Midjourney's rival is here! Google's StyleDrop ace "Customization Master" detonates the AI art circle
As soon as Google's StyleDrop came out, it took the internet by storm.
Given Van Gogh's Starry Night, the AI turns into a master of Van Gogh's style: after grasping this abstract style at a high level, it produces countless paintings in the same vein.
Switch to a cartoon style, and the objects you want to draw come out far cuter.
It can even control details precisely and design an original logo in a chosen style.
The charm of StyleDrop is that only one picture is needed as a reference: no matter how complex the artistic style, it can be deconstructed and recreated.
Netizens say this is yet another AI tool that will put designers out of work.
The much-discussed StyleDrop is the latest work from a Google research team.
Paper address: https://arxiv.org/pdf/2306.00983.pdf
Now, with tools like StyleDrop, not only can you draw with more control, but you can also complete previously unimaginable fine work, such as drawing a logo.
Even NVIDIA scientists called it a "phenomenal" result.
"Customization" MasterThe author of the paper introduced that the source of inspiration for StyleDrop is Eyedropper (color absorption) /color picker tool).
Similarly, StyleDrop also hopes that everyone can quickly and effortlessly "pick" a style from a single/few reference images to generate an image of that style.
##A sloth can have 18 styles:A panda has 24 styles:
A watercolor painting drawn by a child, StyleDrop perfectly controls it, even the paper The folds have been restored.
I have to say, it’s too strong.
StyleDrop can also design English letters in different reference styles:
Here are letters in Van Gogh's style.
There are also line drawings. Line drawing is a highly abstract form that demands very tight control of composition, and past methods have struggled with it.
The shading strokes on the cheese in the original image are carried over to the objects in each generated image.
There are also creations referencing the Android logo.
In addition, the researchers extended StyleDrop's capabilities: combined with DreamBooth, it can customize not only the style but also the content.
For example, still in Van Gogh's style, it can generate a painting of a little corgi in that style:
Here's another one: the corgi below looks like the Sphinx beside the Egyptian pyramids.
StyleDrop is built on Muse and consists of two key parts:
One is parameter-efficient fine-tuning of the generative vision transformer, and the other is iterative training with feedback.
The researchers then synthesized images from the two fine-tuned models.
Muse is the latest text-to-image synthesis model based on a masked generative image transformer. It contains two synthesis modules, one for base image generation (256 × 256) and one for super-resolution (512 × 512 or 1024 × 1024).
Each module consists of a text encoder T, a transformer G, a sampler S, an image encoder E, and a decoder D.
T maps the text prompt t ∈ T to a continuous embedding space E. G processes a text embedding e ∈ E to generate the logits l ∈ L of a visual token sequence. S draws a sequence of visual tokens v ∈ V from the logits via iterative decoding, which runs several steps of transformer inference conditioned on the text embedding e and the visual tokens decoded in previous steps.
Finally, D maps the discrete token sequence to pixel space I. In summary, given a text prompt t, an image I is synthesized as follows:
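The equation image is not reproduced on this page; a reconstruction consistent with the definitions above (a sketch of what the paper's Equation (1) expresses, not its exact notation) is:

I = D(S(e, G)), where e = T(t)

In words: the prompt is encoded by T, the sampler S iteratively decodes visual tokens under the transformer G, and the decoder D maps those tokens back to pixels.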
Figure 2 shows a simplified Muse transformer layer architecture, partially modified to support parameter-efficient fine-tuning (PEFT) with adapters.
A transformer with L layers processes the visual token sequence (shown in green) conditioned on the text embedding e. The learned parameters θ are used to construct the weights for adapter tuning.
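As a rough illustration of adapter tuning (a minimal PyTorch sketch, not Google's Muse code; the layer sizes, module names, and attention wiring are assumptions), the idea is to freeze the pretrained transformer weights and learn only a small bottleneck adapter θ in each layer:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: the only trainable parameters (theta) in the layer."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual bottleneck transform, close to the identity early in training.
        return x + self.up(self.act(self.down(x)))

class AdaptedTransformerLayer(nn.Module):
    """A frozen transformer layer with a trainable adapter appended."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.adapter = Adapter(dim)
        # Freeze everything except the adapter parameters.
        for name, p in self.named_parameters():
            p.requires_grad = name.startswith("adapter")

    def forward(self, visual_tokens, text_embedding):
        # Visual tokens attend to the text embedding (a stand-in for Muse's
        # text conditioning), then pass through the MLP and the adapter.
        h = visual_tokens + self.attn(self.norm1(visual_tokens),
                                      text_embedding, text_embedding)[0]
        h = h + self.mlp(self.norm2(h))
        return self.adapter(h)
```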
To train θ, in many cases only a few images are given as style references, without any text.
So the researchers attach text prompts manually. They propose a simple, templated approach to constructing them: the prompt consists of a description of the content followed by a phrase describing the style.
For example, in Table 1 the researchers describe an object with "cat" and append "watercolor painting" as the style description.
Including descriptions of both content and style in the text prompt is critical, because it helps disentangle content from style, which is the researchers' main goal.
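A minimal sketch of that templated prompt construction (the exact wording of the template is an assumption for illustration):

```python
def build_prompt(content: str, style: str) -> str:
    """Content description followed by a style phrase, e.g.
    build_prompt('a cat', 'watercolor painting')
    -> 'a cat in watercolor painting style'."""
    return f"{content} in {style} style"

# The style phrase stays fixed for a given reference image;
# only the content description changes with the object being drawn.
train_prompt = build_prompt("a cat", "watercolor painting")
sample_prompt = build_prompt("a sloth", "watercolor painting")
```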
Figure 3 shows iterative training with feedback.
When training on a single style reference image (orange box), some images generated by StyleDrop exhibit content leaked from the style reference (red box: the background contains a house similar to the one in the style image).
Other images (blue boxes) separate the style from the content better. Iteratively training StyleDrop on these good samples (blue box) yields a better balance between style and text fidelity (green box).
The researchers also used two methods here:
-CLIP score
This method measures the alignment between images and text, so it can evaluate the quality of generated images via their CLIP score (i.e., the cosine similarity between the visual and textual CLIP embeddings).
The researchers can then select the images with the highest CLIP scores. They call this method iterative training with CLIP feedback (CF).
In experiments, the researchers found that using CLIP scores to assess the quality of synthetic images is an effective way to improve recall (i.e., text fidelity) without excessive loss of style fidelity.
On the other hand, however, CLIP scores may not fully align with human intent, nor capture subtle stylistic attributes.
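A hedged sketch of the CLIP-score computation and the CF selection step, using the Hugging Face CLIP implementation (the checkpoint name and the top-k selection detail are assumptions, not the paper's exact setup):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

def select_for_next_round(images, prompt, k=10):
    """CF: keep the k synthetic images best aligned with the text prompt."""
    return sorted(images, key=lambda im: clip_score(im, prompt), reverse=True)[:k]
```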
-HF
Human feedback (HF) injects user intent into the assessment of synthetic image quality in a more direct way.
In reinforcement learning fine-tuning of LLMs, HF has already proven its power and effectiveness.
HF can be used to compensate for the inability of CLIP scores to capture subtle style attributes.
Currently, a large amount of research focuses on personalizing text-to-image diffusion models to synthesize images containing multiple personal styles.
Researchers show how DreamBooth and StyleDrop can be combined in a simple way to personalize both style and content.
This is accomplished by sampling from two modified generative distributions, guided by adapter parameters θs for style and θc for content, trained independently on the style and content reference images, respectively.
Unlike existing offerings, the team's approach does not require jointly training learnable parameters on multiple concepts, which gives it greater compositional power, because the pre-trained adapters are trained separately on individual subjects and styles.
The researchers' overall sampling process follows the iterative decoding of Equation (1), but with the logits sampled differently at each decoding step.
Suppose t is the text prompt and c is the same text prompt without the style descriptor; the logits at step k are computed as follows:
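The equation image is not reproduced on this page; a reconstruction consistent with the description below (a sketch, not the paper's exact formula) is:

lk = (1 − γ) · G(vk, T(t); θs) + γ · G(vk, T(c); θc)

that is, at each decoding step the logits from the style-tuned adapter θs (conditioned on the full prompt t) are blended with the logits from the content-tuned adapter θc (conditioned on the style-free prompt c).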
where γ balances StyleDrop and DreamBooth: if γ is 0 we get StyleDrop, and if it is 1 we get DreamBooth.
By setting γ appropriately, we can get an image that balances the two.
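A minimal code sketch of that per-step blending (the function and model names are placeholders for illustration, not the paper's API):

```python
def blended_logits(v_k, t_embedding, c_embedding,
                   style_model, content_model, gamma=0.3):
    """Blend the decoding-step logits of the style-tuned and content-tuned
    adapters; gamma = 0 reduces to StyleDrop, gamma = 1 to DreamBooth."""
    l_style = style_model(v_k, t_embedding)      # G(v_k, T(t); theta_s)
    l_content = content_model(v_k, c_embedding)  # G(v_k, T(c); theta_c)
    return (1.0 - gamma) * l_style + gamma * l_content
```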
Experimental settings
So far, style tuning of text-to-image generative models has not been studied extensively.
Therefore, the researchers proposed a new experimental protocol:
-Data collection
The researchers collected dozens of images in different styles, ranging from watercolor and oil painting, flat illustration, and 3D rendering to sculpture in different materials.
-Model Configuration
The researchers tune Muse-based StyleDrop with adapters. For all experiments, the Adam optimizer was used to update the adapter weights for 1,000 steps with a learning rate of 0.00003. Unless otherwise stated, StyleDrop refers to the second-round model, trained on more than 10 synthetic images with human feedback.
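A rough sketch of that fine-tuning loop under the stated hyperparameters (Adam, learning rate 3e-5, 1,000 steps); the model and data objects here are placeholders, and the loss is assumed to be the model's masked-token prediction loss:

```python
import torch

def tune_adapter(model, dataloader, steps=1000, lr=3e-5):
    """Update only the adapter weights; the base Muse weights stay frozen."""
    adapter_params = [p for n, p in model.named_parameters()
                      if "adapter" in n and p.requires_grad]
    optimizer = torch.optim.Adam(adapter_params, lr=lr)
    batches = iter(dataloader)
    for _ in range(steps):
        try:
            batch = next(batches)
        except StopIteration:
            batches = iter(dataloader)
            batch = next(batches)
        loss = model(**batch)      # placeholder: masked-token prediction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```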
- Evaluation
The reported quantitative evaluation is based on CLIP, measuring style consistency and text alignment. In addition, the researchers conducted user preference studies to assess style consistency and text alignment.
As shown in the figure, these are the results of StyleDrop on 18 images of different styles collected by the researchers.
As you can see, StyleDrop captures the nuances of texture, shading, and structure across a wide range of styles, giving far better control over style than before.
For comparison, the researchers also include the results of DreamBooth on Imagen, DreamBooth (LoRA implementation) on Stable Diffusion, and textual inversion.
The specific results are shown in the table: evaluation metrics for image-text alignment (Text) and visual style alignment (Style), with human scores on top and CLIP scores on the bottom.
Qualitative comparison of (a) DreamBooth, (b) StyleDrop, and (c) DreamBooth + StyleDrop:
Here, the researchers applied the two CLIP metrics mentioned above: the text score and the style score.
For the text score, they measure the cosine similarity between the image and text embeddings. For the style score, they measure the cosine similarity between the style reference image and the synthetic image embeddings.
The researchers generated a total of 1,520 images for 190 text prompts. While they would have liked the final scores to be higher, the metrics are not perfect.
Iterative training (IT) improved the text scores, in line with the researchers' goals.
However, as a trade-off, the style scores drop relative to the first-round model, because the later rounds are trained on synthetic images, and the style may drift due to selection bias.
DreamBooth on Imagen does not match StyleDrop on the style score (0.644 vs. 0.694 with HF).
The researchers also noted that the improvement in DreamBooth's style score on Imagen is modest (0.569 → 0.644), while the improvement of StyleDrop on Muse is more pronounced (0.556 → 0.694).
Their analysis is that style fine-tuning is more effective on Muse than on Imagen.
In addition, for fine-grained control, StyleDrop captures subtle stylistic differences, such as color shifts, gradations, or sharp angles.
If designers have StyleDrop, their work will be 10 times faster; it has already taken off.
One day in AI is like ten years in the human world. AIGC is developing at light speed, the kind of light speed that dazzles the eye!
Tools simply follow the trend; those that deserved to be replaced were replaced long ago.
This tool is much easier to use than Midjourney for making logos.