How to perform image search efficiently and accurately? Take a look at the lightweight vision pre-trained model

Have you ever had trouble with image retrieval?

Either it is hard to find the image you need among massive image collections, or text-based retrieval returns unsatisfying results. To address this problem, researchers from Microsoft Research Asia and Microsoft's Cloud Computing and Artificial Intelligence division conducted in-depth research on lightweight visual models and proposed a series of design and compression methods for visual pre-training models, meeting the requirements for lightweight deployment of visual Transformers.

Currently, these methods and models have been successfully applied to Microsoft's Bing search engine, enabling accurate and fast inference and retrieval over tens of billions of images. This article provides an in-depth look at the development of lightweight visual pre-training models, their key technologies, applications, and potential, as well as future opportunities and challenges, in the hope of helping readers better understand the field of lightweight visual pre-training and jointly advance the related technologies.

Recently, Transformer-based visual pre-training models have achieved superior performance on many computer vision tasks and received widespread attention. However, visual Transformer pre-training models usually have many parameters and high computational complexity, which restricts their deployment in practical applications, especially on resource-constrained devices or in scenarios with strict real-time requirements. As a result, "lightweighting" large visual pre-training models has become a new hot topic in academia and industry.

To this end, researchers from Microsoft Research Asia and Microsoft's Cloud Computing and Artificial Intelligence division explored the structural design, training, and inference of large visual models in depth, and made innovative applications in the lightweighting, real-time execution, and cloud deployment of large models. This article starts from the development of lightweight visual pre-training models, examines the key technologies in model-lightweighting research and the applications and potential of lightweight visual Transformer models in real products, and finally looks ahead to the future opportunities and challenges of lightweight visual models.

Large visual models emerge one after another, while lightweight pre-trained models receive little attention

In recent years, progress in deep learning on the ImageNet image classification task has largely come from substantial growth in visual model capacity. As shown in Figure 1, in just a few years the capacity of visual pre-training models has grown more than 300-fold, from the ResNet-101 model with 44.5 million parameters to the V-MoE model with 15 billion parameters. These large-scale visual pre-training models have made great strides in tasks such as image understanding and visual content generation.


Figure 1: Trend in the parameter counts of visual pre-training models

Whether it is Microsoft's 3-billion-parameter SwinV2 model or Google's 1.8-billion-parameter ViT-G/14 model, large visual models have demonstrated superior performance on many tasks. In particular, their powerful few-shot and even zero-shot generalization ability is critical for achieving general intelligence.

However, in many real scenarios, limited storage and computing resources make large models difficult to deploy directly or unable to meet real-time requirements, so research on lightweight visual pre-training models has become increasingly important and has strong practical value. Although some existing work explores lightweight models, most of these methods are designed for specific tasks and specific structures; versatility is not considered during design and training, so generalization across data domains and tasks is limited.

Research on key technologies of lightweight visual models

To achieve lightweight visual pre-training models, Microsoft researchers identified two key technical questions: 1) How to design a more versatile lightweight model structure? 2) Given the limited capacity of lightweight visual pre-training models, how to design efficient pre-training methods so that small models can learn effectively from large-scale data? Facing these questions, researchers have achieved initial results through persistent research and exploration.

The core of improving the versatility of lightweight pre-training models lies in strengthening the model's learning ability under constrained resources (parameter count, latency, etc.), so that it can better learn general features from large-scale data. Researchers have therefore explored the following three directions in depth:

1. Lightweight module design

Lightweight, low-latency modules are an important component of lightweight models. In convolutional neural networks, representative lightweight modules include MobileNet's Inverted Residual Block and ShuffleNet's channel shuffle unit (Shuffle Unit). For the visual Transformer structure, since the attention computation between image patches does not properly account for relative position information, the researchers designed iRPE [1], a plug-and-play lightweight two-dimensional relative position encoding method for images. It improves model performance without modifying any training hyperparameters. In addition, to address parameter redundancy in visual Transformers, the researchers designed a weight multiplexing module [2]. As shown in Figure 2, this method reduces parameter redundancy by reusing weights across multiple layers and introduces unshared linear transformations to increase parameter diversity.


Figure 2: Weight multiplexing module in Transformer
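To make the parameter saving from weight multiplexing concrete, here is a toy back-of-the-envelope sketch. The layer count, width, and the choice of a diagonal per-layer transform are illustrative assumptions for this sketch, not the actual design in [2].

```python
# Toy parameter-count comparison for weight multiplexing:
# a standard stack stores one d x d weight per layer, while a
# multiplexed stack stores a single shared d x d weight plus a
# small unshared d-dimensional (diagonal) transform per layer.
# Sizes and the diagonal-transform choice are illustrative only.

def standard_params(num_layers: int, d: int) -> int:
    """Each layer owns its full d x d weight matrix."""
    return num_layers * d * d

def multiplexed_params(num_layers: int, d: int) -> int:
    """One shared d x d matrix, plus an unshared diagonal
    transform (d parameters) per layer for diversity."""
    return d * d + num_layers * d

if __name__ == "__main__":
    L, d = 12, 384
    std = standard_params(L, d)     # 12 * 384 * 384 = 1,769,472
    mux = multiplexed_params(L, d)  # 384*384 + 12*384 = 152,064
    print(f"standard: {std}, multiplexed: {mux}, ratio: {std / mux:.1f}x")
```

Even this crude count shows an order-of-magnitude reduction; the unshared transforms keep the layers from collapsing into identical functions.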

2. Neural architecture search

Neural architecture search can automatically find lighter and better-performing model structures within a design space [3]. In convolutional neural networks, representative works include NASNet and EfficientNet. For visual Transformer structure search, the researchers successively proposed AutoFormer [4] and S3 [5], covering multiple dimensions of the visual model such as channel width, network depth, and number of heads, and realizing dynamically scalable training and structure search for visual models. At the same model accuracy, the searched models have fewer parameters and less computation. Notably, in S3 the researchers used the E-T Error [5] and a weight-sharing supernet to guide and improve the search space; while obtaining a more efficient model structure, they also analyzed the evolution of the search space, as shown in Figure 3. The structure-search process also provides useful design experience and reference for the design of lightweight models.


Figure 3: Lightweight model search space evolution process
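As a simplified illustration of structure search, the sketch below runs a random search over a small Transformer-style space of depth, width, and head count. The search space, the proxy scoring function, and plain random search are stand-ins for the supernet training and E-T Error guidance used in AutoFormer and S3; none of the names or numbers come from those papers.

```python
import random

# Toy random search over a Transformer-style search space
# (depth, width, number of heads). Real systems evaluate
# candidate sub-networks with weights shared from a supernet;
# here a dummy proxy score stands in for that evaluation.

SEARCH_SPACE = {
    "depth": [10, 12, 14],
    "width": [192, 256, 320],
    "heads": [3, 4, 5],
}

def sample_architecture(rng: random.Random) -> dict:
    """Draw one candidate by picking a value per dimension."""
    return {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}

def proxy_score(arch: dict) -> float:
    """Dummy accuracy proxy minus a parameter-count penalty
    (purely illustrative, not the E-T Error from S3)."""
    params = arch["depth"] * arch["width"] ** 2
    capacity = arch["depth"] * arch["width"] * arch["heads"]
    return capacity / 1e4 - params / 1e7

def random_search(trials: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        arch = sample_architecture(rng)
        score = proxy_score(arch)
        if score > best_score:
            best, best_score = arch, score
    return best

if __name__ == "__main__":
    print(random_search(trials=50))
```

Swapping random sampling for an evolutionary or gradient-based controller, and the proxy score for a shared-weight evaluation, turns this skeleton into the kind of search the papers describe.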

3. Knowledge transfer by compressing large visual models

Another problem with lightweight pre-training models is that, due to their limited capacity, it is hard for them to directly learn the rich information and knowledge contained in large-scale data. To solve this, the researchers proposed a fast pre-training distillation scheme that transfers the knowledge of large models to lightweight small models [6]. As shown in Figure 4, unlike traditional single-stage knowledge distillation, fast pre-training distillation proceeds in two stages: 1) compress and save the data-augmentation information and prediction information produced during large-model training; 2) load and restore the large model's predictions and data augmentations, then use the large model as a teacher to guide the learning of the lightweight student model through pre-training distillation. Unlike pruning and quantization, this method builds on the weight multiplexing mentioned above [2]: by introducing lightweight weight transformations and distillation on top of weight sharing, it compresses large visual pre-training models into more general and robust lightweight models. The method can compress the original large model by dozens of times without sacrificing performance.

Figure 4: Rapid pre-training knowledge distillation

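The two-stage flow can be sketched with toy one-parameter "models": stage 1 caches the teacher's soft predictions to disk, and stage 2 trains the student purely from that cache, so the expensive teacher never runs during student training. Every model, function, and hyperparameter below is a hypothetical simplification of the method in [6], not its implementation.

```python
import math, json, os, tempfile

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def teacher_logits(x):
    """Fixed 'pre-trained teacher': a 1-parameter 2-class model."""
    return [2.0 * x, -2.0 * x]

def stage1_save(inputs, path):
    """Stage 1: run the teacher once and cache its soft labels
    (the real system also caches data-augmentation state)."""
    cache = {str(x): softmax(teacher_logits(x)) for x in inputs}
    with open(path, "w") as f:
        json.dump(cache, f)

def stage2_train(inputs, path, lr=0.1, steps=200):
    """Stage 2: fit a 1-parameter student to the cached targets
    by gradient descent on cross-entropy; no teacher needed."""
    with open(path) as f:
        cache = json.load(f)
    w = 0.0
    for _ in range(steps):
        for x in inputs:
            p = softmax([w * x, -w * x])
            t = cache[str(x)]
            # d(cross-entropy)/dw for this symmetric 2-class toy
            grad = (p[0] - t[0]) * 2 * x
            w -= lr * grad
    return w

if __name__ == "__main__":
    inputs = [0.5, 1.0, 1.5]
    path = os.path.join(tempfile.gettempdir(), "teacher_cache.json")
    stage1_save(inputs, path)
    w = stage2_train(inputs, path)
    print(f"student weight after distillation: {w:.3f}")  # drifts toward 2.0
```

The point of the split is visible even here: stage 2 touches only the cached file, which is what makes distillation cheap when the teacher is a multi-billion-parameter model.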

This series of research results has not only been published in many papers at top computer vision conferences (CVPR, ICCV, ECCV, NeurIPS, etc.) [1-6]; through cooperation with Microsoft Bing, the lightweight pre-training models have also been successfully applied to image search products, improving the understanding of image and video content in real business.

Application of lightweight visual pre-training model

Lightweight visual pre-training models have many practical uses, especially in scenarios with strict real-time requirements or limited resources, such as real-time rendering and enhancement of cloud video, on-device image detection, and video content understanding. Lightweight visual models have shown broad application prospects in fields such as smart retail and advanced manufacturing, and will play an important role in emerging industries such as the metaverse and autonomous driving. Taking image content search in Microsoft Bing as an example, the following shows the practical application and deployment of lightweight visual models.

At present, content-based image search is relatively mature at understanding the category attributes of images, but understanding the content of complex scenes remains a major challenge. Pictures of complex scenes usually feature large depth of field, cluttered backgrounds, many people, and complex object relationships, which significantly increase the difficulty of content understanding and thus place higher demands on the robustness and generalization of pre-training models.

For example, the search quality for anime pictures had long resisted improvement. The main challenges include: lines and colors that are more exaggerated than in real-scene photos, a wider range of actions and scenes, and styles and content that vary greatly between different comics. Figures 5 to 7 show characters and behaviors from three different works, "Slam Dunk", "Pikachu", and "Captain Tsubasa", whose comic styles and contents differ greatly. Effectively understanding the content of comic pictures thus places higher demands on visual pre-training models.


Figure 5: In the Microsoft Bing search engine, understanding of actions in Slam Dunk, including dunking, dribbling, stealing, shooting, etc.


Figure 6: In the Microsoft Bing search engine, understanding of Pikachu's behaviors, such as eating an apple, eating watermelon, and eating ice cream


Figure 7: Close-up of the young football player’s shooting action in Microsoft’s Bing search engine

The lightweight general-purpose visual model and fast pre-training distillation algorithm described above have been successfully deployed in Microsoft's Bing search engine. With the visual-language multi-modal pre-training model provided by Microsoft Research Asia, Bing image search better understands comic content and can return images that more closely match user needs.

At the same time, the huge index of the Bing search engine places very high demands on retrieval efficiency. The fast pre-training distillation method provided by Microsoft Research Asia effectively transfers the indexing capability of the pre-trained large model to a lightweight model, improving the recognition accuracy of the existing model by 14% while greatly optimizing computational efficiency and achieving fast inference over tens of billions of images.
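The retrieval side can be pictured as embedding lookup: offline, the lightweight model maps every indexed image to a vector; online, candidates are ranked by cosine similarity to the query embedding. The vectors and file names below are made up for illustration, and a real index over tens of billions of images would use approximate nearest-neighbor search rather than this linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, top_k=2):
    """Rank indexed image IDs by similarity to the query embedding."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [img_id for img_id, _ in ranked[:top_k]]

if __name__ == "__main__":
    # Fabricated 3-dim embeddings; real ones have hundreds of dims.
    index = {
        "dunk.jpg":    [0.9, 0.1, 0.0],
        "dribble.jpg": [0.7, 0.6, 0.1],
        "pikachu.jpg": [0.0, 0.2, 0.9],
    }
    print(search([1.0, 0.0, 0.0], index))  # → ['dunk.jpg', 'dribble.jpg']
```

Because ranking only needs the stored vectors, the distilled lightweight model pays its cost once at indexing time, which is what makes billion-scale retrieval latency tractable.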

Future opportunities and challenges

Model lightweighting is central to the future application of artificial intelligence. As vision technology, algorithms, computing power, and data continue to advance, model complexity has risen sharply and the energy cost of neural-network computation has become increasingly expensive. The high computational efficiency and low deployment cost of lightweight visual models will give them a major advantage in more real products in the future. Moreover, on-device lightweight pre-trained visual models can better protect user data and privacy while supporting more services: users' data need no longer leave the device, and functions such as model services can be upgraded remotely.

Of course, researchers are also aware of the challenges lightweight pre-trained visual models face. On the one hand, in model structure design, how to achieve optimal learning ability under constraints on parameter count and inference latency has long been a close concern of academia and industry. Although many effective model structures have been accumulated, and great progress has been made in areas such as the universal approximation theorem (UAT) and neural architecture search (NAS), gaps remain between existing lightweight pre-trained visual models and large visual models that need further optimization and improvement. On the other hand, regarding training methods, academia and industry have proposed a variety of training approaches for large visual models, such as self-supervision, image classification, and multi-modality, which have significantly improved models' general capabilities; how to design more effective training methods for lightweight models with limited capacity still requires further research and exploration. Researchers at Microsoft Research Asia will continue to advance research on lightweight pre-trained visual models, and welcome more colleagues in the field to exchange ideas and explore the related technologies.


Statement: This article is reproduced from 51CTO.COM. If there is any infringement, please contact admin@php.cn for removal.