
This article introduces Chinese-CLIP, a large-scale pre-trained image-text representation model recently open-sourced by Alibaba DAMO Academy on the ModelScope community. It better understands Chinese text and images from the Chinese Internet, and performs tasks such as image-text retrieval and zero-shot image classification with state-of-the-art results. The code and models are fully open-sourced, so users can get started quickly on ModelScope.

Is CLIP out of touch? You need a model that understands Chinese better

  • Model page: https://modelscope.cn/models/damo/multi-modal_clip-vit-base-patch16_zh/summary
  • GitHub: https://github.com/OFA-Sys/Chinese-CLIP
  • Paper: https://arxiv.org/pdf/2211.01335.pdf
  • Image-text retrieval demo: https://modelscope.cn/studios/damo/chinese_clip_applications/summary

1. Introduction

Today's Internet ecosystem is full of multi-modal tasks and scenarios, such as image-text retrieval, image classification, and understanding video and image-text content. Image generation, which has recently swept the Internet, has become especially popular. Behind all these tasks, a powerful image-text understanding model is clearly necessary. Most readers will be familiar with the CLIP model released by OpenAI in 2021: through simple two-tower image-text contrastive learning over a large image-text corpus, the model acquires strong image-text feature alignment, achieves outstanding results in zero-shot image classification and cross-modal retrieval, and also serves as a key module in image generation models such as DALL-E 2 and Stable Diffusion.
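The two-tower contrastive objective behind CLIP can be sketched in a few lines of plain Python. This is an illustrative toy with hand-written softmax cross-entropy over a tiny batch, not OpenAI's actual implementation (which uses learned encoders and a trainable temperature):

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    The i-th image and i-th text form a positive pair; all other pairings
    in the batch serve as negatives.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Similarity matrix scaled by temperature: logits[i][j] = sim(img_i, txt_j) / T
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]

    def xent_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    # Average the image->text and text->image directions.
    cols = [[logits[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (xent_diag(logits) + xent_diag(cols))

# Toy batch: matched pairs point the same way, so the loss is low.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9]]
print(clip_contrastive_loss(images, texts))
```

Minimizing this loss pulls each image embedding toward its paired text embedding and pushes it away from the other texts in the batch, which is what makes the shared feature space usable for retrieval and zero-shot classification.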

Unfortunately, OpenAI CLIP's pre-training mainly uses image-text data from the English-speaking world, so it does not naturally support Chinese. Even though community researchers have distilled a multilingual version, Multilingual-CLIP (mCLIP), through translated texts, it still cannot meet the needs of the Chinese world: its understanding of Chinese-domain text is weak. For example, searching for "Spring Festival couplets" returns Christmas-related content:


mCLIP retrieval demo: results returned for the query "Spring Festival couplets"

This shows that we need a CLIP that understands Chinese better, one that understands not only our language but also the images of the Chinese world.

2. Method

Researchers at DAMO Academy collected large-scale Chinese image-text pair data (approximately 200 million pairs), including the Chinese subset of LAION-5B, the Chinese data of Wukong, and translated image-text data from COCO, Visual Genome, and others. Most of the training images and texts come from public datasets, which greatly lowers the barrier to reproduction. On the training side, to effectively improve training efficiency and model quality, the researchers designed a two-stage training process:


Chinese-CLIP method diagram

As the figure shows, in the first stage the model initializes the two towers of Chinese-CLIP from existing pre-trained image and text models, respectively, and freezes the image-side parameters, letting the language model align with the existing pre-trained image representation space while reducing training overhead. In the second stage, the image-side parameters are unfrozen, so that the image model and language model adapt jointly while modeling the data distribution characteristic of Chinese. The researchers found that, compared with pre-training from scratch, this method yields significantly better results on multiple downstream tasks, and its markedly higher convergence efficiency also means lower training overhead. Compared with training only the text side in a single stage, adding the second stage further improves performance on downstream image-text tasks, especially those native to Chinese (rather than translated from English datasets).
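The two-stage schedule above boils down to a freezing policy over the two towers. A minimal, framework-agnostic sketch of that policy (the real Chinese-CLIP implementation operates on PyTorch parameters; the parameter names here are hypothetical):

```python
def trainable_params(stage, param_names):
    """Return which parameters are updated in each pre-training stage.

    Stage 1: the image tower (initialized from an existing vision model) is
    frozen; only the text tower learns to align with its representation space.
    Stage 2: both towers are unfrozen and trained jointly on Chinese data.
    """
    if stage == 1:
        return [n for n in param_names if not n.startswith("image_tower.")]
    if stage == 2:
        return list(param_names)
    raise ValueError("stage must be 1 or 2")

# Hypothetical parameter names for the two towers.
params = ["image_tower.layer0.weight", "image_tower.layer1.weight",
          "text_tower.layer0.weight", "text_tower.layer1.weight"]

print(trainable_params(1, params))  # only the text-tower parameters
print(trainable_params(2, params))  # everything
```

In a framework like PyTorch, stage 1 would correspond to setting `requires_grad = False` on the image tower's parameters and restoring it at the start of stage 2.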


Zero-shot performance trends over the course of pre-training on two datasets: MUGE (Chinese e-commerce image-text retrieval) and Flickr30K-CN (translated general image-text retrieval)

Using this strategy, the researchers trained models at multiple scales, from the smallest ResNet-50, through ViT-Base and ViT-Large, up to ViT-Huge. All are now open, and users can choose the model that best suits their scenario:


3. Experiment

Extensive experimental results show that Chinese-CLIP achieves state-of-the-art performance on Chinese cross-modal retrieval. On the Chinese-native e-commerce image retrieval dataset MUGE, Chinese-CLIP models of multiple scales each achieve the best performance at their scale. On datasets that originate in English, such as Flickr30K-CN, Chinese-CLIP significantly outperforms domestic baseline models such as Wukong, Taiyi, and R2D2 in both the zero-shot and fine-tuned settings. This is largely due to Chinese-CLIP's larger Chinese pre-training image-text corpus, and to the fact that, unlike some existing domestic image-text representation models that freeze the entire image side to minimize training cost, Chinese-CLIP uses the two-stage training strategy to better adapt to the Chinese domain:


Experimental results on the MUGE Chinese e-commerce image-text retrieval dataset


Experimental results on the Flickr30K-CN Chinese image-text retrieval dataset

The researchers also verified the performance of Chinese-CLIP on zero-shot image classification datasets. Since there are few authoritative zero-shot image classification benchmarks in the Chinese domain, they currently test on English datasets with translated labels. Using Chinese prompts and category labels, Chinese-CLIP achieves performance comparable to CLIP on these tasks:
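Zero-shot classification with a CLIP-style model amounts to filling each class label into a Chinese prompt template, encoding the prompts with the text tower, and picking the class whose text embedding is most similar to the image embedding. A toy sketch with made-up embeddings (a real run would use the model's encoders; the template below is one common choice, not necessarily the exact one used in the paper):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, labels, encode_text):
    """Pick the label whose prompted text embedding best matches the image."""
    template = "一张{}的照片"  # "a photo of a {}" -- one common Chinese prompt
    prompts = [template.format(label) for label in labels]
    sims = [cosine(image_emb, encode_text(p)) for p in prompts]
    best = max(range(len(labels)), key=lambda i: sims[i])
    return labels[best], sims

# Stand-in text encoder: a lookup table of toy embeddings.
toy_text_embs = {
    "一张猫的照片": [0.9, 0.1, 0.0],
    "一张狗的照片": [0.1, 0.9, 0.0],
    "一张飞机的照片": [0.0, 0.1, 0.9],
}
label, _ = zero_shot_classify([0.8, 0.2, 0.1], ["猫", "狗", "飞机"],
                              toy_text_embs.__getitem__)
print(label)  # 猫
```

No classifier head is trained: any label list can be classified at inference time just by writing it into the prompt, which is why translated Chinese labels are enough to evaluate on English benchmarks.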


Zero-shot classification experiment results


Zero-shot image classification example

4. Quick use

How can you use Chinese-CLIP? It is very simple: click the links at the beginning of the article to visit the ModelScope community or use the open-source code, and you can complete image and text feature extraction and similarity calculation in just a few lines. For a quick trial, the ModelScope community provides a Notebook with a pre-configured environment, which you can launch from the upper right corner of the model page.
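Those few lines go through the ModelScope pipeline or the open-source repository; see the linked model page for the exact calls. The similarity step itself is just cosine similarity over normalized features. A self-contained sketch of text-to-image retrieval ranking, with made-up feature vectors standing in for real encoder outputs:

```python
import math

def normalize(v):
    """Unit-normalize a feature vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rank_images(text_feature, image_features, top_k=2):
    """Rank gallery images by cosine similarity to a text query feature."""
    q = normalize(text_feature)
    scored = []
    for idx, feat in enumerate(image_features):
        f = normalize(feat)
        scored.append((sum(a * b for a, b in zip(q, f)), idx))
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:top_k]]

# Toy gallery: three image features; the query points toward image 2.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.1, 0.2, 1.0]]
query = [0.0, 0.1, 0.9]
print(rank_images(query, gallery))  # [2, 1]
```

In practice the gallery features are pre-computed offline with the image tower, so each query only costs one text-tower forward pass plus this ranking step.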


Chinese-CLIP also supports fine-tuning on users' own data, and provides an image-text retrieval demo where everyone can experience the effects of Chinese-CLIP models at various scales first-hand:

5. Conclusion

With the launch of the Chinese-CLIP project, the ModelScope community provides an excellent pre-trained image-text understanding model for Chinese multi-modal researchers and industry users, helping everyone quickly get started with image-text feature extraction and similarity calculation, image-text retrieval, and zero-shot classification with no barrier to entry. It can also be used to build more complex multi-modal applications such as image generation. Friends who want to show their skills in the Chinese multi-modal field should not miss it. And this is just one application in the ModelScope community: ModelScope lets many foundation models in the AI field serve as a base for applications, supporting the creation of more innovative models, applications, and even products.


Statement
This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn for deletion.