# AltDiffusion-m18, a versatile tool for generating multilingual texts and images
Currently, few text-to-image generation models support languages other than English, so users often have to translate their prompts into English before feeding them to a model. This adds operational burden, and linguistic and cultural errors introduced during translation degrade the accuracy of the generated images.
The FlagAI team at the Zhiyuan Research Institute (BAAI) pioneered an efficient training method that combines a multilingual pre-trained model with Stable Diffusion to train AltDiffusion-m18, a text-to-image generation model supporting 18 languages: Chinese, English, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, and Russian.
Huggingface: https://huggingface.co/BAAI/AltDiffusion-m18
GitHub: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/AltDiffusion-m18
In objective evaluations (FID, IS, and CLIP score), AltDiffusion-m18 achieves 95-99% of Stable Diffusion's performance in English, reaches state-of-the-art levels in Chinese and Japanese, and fills the gap in text-to-image generation for the remaining 15 languages, addressing the industry's strong demand for multilingual text-to-image generation. Special thanks to the Stable Diffusion research team for their advice on this work.
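Of the metrics above, the CLIP score is the simplest to sketch: it is the cosine similarity between the CLIP image embedding of a generated image and the CLIP text embedding of its prompt (implementations often scale it by 100). The function below is an illustrative sketch with toy vectors, not the team's actual evaluation code:

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """Cosine similarity between a CLIP image embedding and a CLIP text
    embedding; higher means the image matches the prompt better."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

print(clip_score([3.0, 4.0], [3.0, 4.0]))  # 1.0 (perfect alignment)
print(clip_score([1.0, 0.0], [0.0, 2.0]))  # 0.0 (unrelated)
```

In the evaluation referenced here, scores from real CLIP encoders are averaged over a benchmark prompt set; a 95-99% ratio against Stable Diffusion's score indicates near-parity in English.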
In addition, the technical report on AltDiffusion-m18's core innovation, "AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities", has been accepted to Findings of ACL 2023.
AltDiffusion-m9, released last year, was based on Stable Diffusion v1.4. The Zhiyuan team replaced the language tower with the multilingual tower AltCLIP and fine-tuned on data in nine languages, extending the original English-only Stable Diffusion to support nine languages.
AltCLIP: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/AltCLIP-m18
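The "tower replacement" in AltCLIP rests on knowledge distillation: a multilingual student text encoder is trained so that its output for a prompt matches the teacher CLIP text encoder's output for the parallel English prompt. The following is a minimal numpy sketch of such an objective; the function name and toy embeddings are illustrative assumptions, not the project's code:

```python
import numpy as np

def distillation_loss(student_emb, teacher_emb):
    """Mean-squared error between the multilingual student encoder's
    output and the (frozen) teacher CLIP text encoder's output."""
    student_emb = np.asarray(student_emb, dtype=np.float64)
    teacher_emb = np.asarray(teacher_emb, dtype=np.float64)
    return float(np.mean((student_emb - teacher_emb) ** 2))

# Toy 4-dimensional "embeddings" for one prompt pair.
teacher = np.array([[1.0, 0.0, 0.0, 0.0]])  # e.g. encoding of "a cat"
student = np.array([[0.8, 0.1, 0.0, 0.0]])  # multilingual encoding of the translation
print(distillation_loss(student, teacher))  # 0.0125
```

Minimizing this loss pulls the student's embedding space onto the teacher's, which is why the resulting encoder can be dropped into Stable Diffusion in place of the original CLIP text tower.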
AltDiffusion-m18, by contrast, is trained on Stable Diffusion v2.1, whose language tower is the penultimate layer of OpenCLIP. The new AltCLIP was therefore retrained with the penultimate layer of OpenCLIP as the distillation target. Building on the m9 recipe, which fine-tuned only the K and V matrices of the CrossAttention layers in the UNet, training was expanded into a two-stage method, as shown in the figure below:
- Stage 1: Earlier experiments on m9 showed that fine-tuning the K and V matrices mainly learns the conceptual alignment between text and images, so the first stage of m18 training continues to fine-tune the K and V matrices on data in 18 languages. Experiments also showed that reducing image resolution from 512×512 to 256×256 does not lose the semantic information of the image, so the first stage, which learns text-image concept alignment, trains at 256×256 to speed up training.
- Stage 2: To further improve the quality of the generated images, the full UNet parameters are trained at 512×512 resolution on the 18-language data. In addition, 10% of the text prompts are dropped for unconditional training, to support classifier-free guidance at inference time.
- Classifier-free guidance is then applied at inference to further improve generation quality.
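The two classifier-free-guidance details above can be sketched in a few lines: during training, a fraction of prompts is replaced with the empty string so the model also learns the unconditional distribution; at inference, the unconditional prediction is extrapolated toward the conditional one. The function names and toy arrays below are illustrative assumptions, not AltDiffusion's actual code:

```python
import numpy as np

def drop_text(prompts, drop_prob=0.1, rng=None):
    """Training-time prompt dropout: replace ~drop_prob of the prompts
    with the empty string for unconditional training."""
    rng = rng or np.random.default_rng(0)
    return ["" if rng.random() < drop_prob else p for p in prompts]

def cfg_combine(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Toy 2-dimensional "noise predictions" for one denoising step.
eps_uncond = np.array([0.0, 0.0])
eps_cond = np.array([1.0, -1.0])
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # [ 7.5 -7.5]
```

A guidance scale of 1.0 reduces to the plain conditional prediction; larger scales (commonly 7-8 for Stable Diffusion-family models) trade diversity for prompt adherence.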
The latest evaluation results show that AltCLIP-m18 surpasses CLIP and reaches state-of-the-art performance on Chinese and English zero-shot retrieval tasks.
On multilingual image-classification benchmarks, AltCLIP-m9 (the earlier version, supporting 9 languages) and AltCLIP-m18 reach state-of-the-art performance.
Similarly, thanks to AltCLIP's innovative tower-swap design, AltDiffusion-m18 connects seamlessly to all Stable Diffusion models and ecosystem tools built on the original CLIP. Any tool that supports Stable Diffusion, such as Stable Diffusion WebUI or DreamBooth, can be used with AltDiffusion-m18 directly, making it painless to get started and highly flexible.
Backed by the new AltCLIP, AltDiffusion-m18 achieves 95-99% of the original Stable Diffusion's results on the English FID, IS, and CLIP-score evaluations, and reaches state-of-the-art performance in 17 languages including Chinese and Japanese. The performance of AltDiffusion-m18 is shown in the following table:
## Superior, more accurate generation in English, Chinese, and Japanese

In English, Chinese, and Japanese, AltDiffusion-m18 produces better and more accurate results than other models. In (a) above, AltDiffusion-m18 generates results highly consistent with the original Stable Diffusion and understands prompts better than other domestic Chinese-English bilingual models: concepts such as "A stuffed bear", "A black and white photo", and "cat", which those models fail to generate, are generated successfully by AltDiffusion. The same holds for Chinese and Japanese: only AltDiffusion-m18 correctly generates the "black sofa, wooden floor" in (b), and for the "bears" in (c), Japanese Stable Diffusion incorrectly generates a human while AltDiffusion-m18 correctly generates a bear.

In addition, the Zhiyuan FlagEval team developed ImageEval, an evaluation tool for text-to-image generation models. In its evaluation, AltDiffusion-m18's accuracy on the entity-object and entity-quantity dimensions exceeds domestic peer models by 11% and 10% respectively. (Note: the ImageEval methodology and results will be released publicly in the near future.)

## A savior for low-resource languages: a reference system for multilingual text-to-image models

AltDiffusion-m18 learned the biases of different languages from multilingual data. It lets users skip the translation step and avoids the loss of the cultural information carried by each language. As shown in the figure below, the facial features of the little boy generated from Chinese and Japanese prompts look more "Asian", while the boy generated from English and other European-language prompts looks more "European or American". More interestingly, the details of images generated from animal prompts also differ across languages.
As shown in the figure below, although the images generated in different languages are highly consistent overall, there are subtle differences in the background and in the details of the Corgi's facial features. Overall, AltDiffusion-m18 provides a basic reference system for multilingual text-to-image generation models. Users whose native language is Spanish, German, French, or another supported language can enjoy AIGC without mentally translating their prompts into English, and AI practitioners can further fine-tune AltDiffusion-m18 with DreamBooth, ControlNet, or LoRA, or with corpora in other languages, to obtain better generation results. Meanwhile, FlagAI (github.com/FlagAI-Open/FlagAI), a one-stop open-source project for large-model algorithms, models, and tools, provides training and inference tools and APIs for quickly downloading and using AltDiffusion-m18.