search
HomeTechnology peripheralsAICambridge team's open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever

Cambridge teams open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever



  • ##Paper link : https://arxiv.org/abs/2402.08327
  • DEMO link: https://u60544-b8d4-53eaa55d.westx.seetacloud.com:8443 /
  • Project homepage link: https://preflmr.github.io/
  • # #Paper title: PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers
  • ##Background

Although large multi-modal models (such as GPT4-Vision, Gemini, etc.) demonstrate powerful general image and text understanding capabilities, their performance is not satisfactory when dealing with problems that require professional knowledge. Even GPT4-Vision cannot effectively answer knowledge-intensive questions (as shown in Figure 1), which poses challenges to many enterprise-level applications.

Cambridge teams open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retrieverGPT4-Vision can obtain relevant knowledge through the PreFLMR multi-modal knowledge retriever and generate accurate answers. The figure shows the actual output of the model.

Retrieval-Augmented Generation (RAG) provides a simple and effective way to solve this problem, making multi-modal large models become like "domains" in a certain field Experts". Its working principle is as follows: first, use a lightweight knowledge retriever (Knowledge Retriever) to retrieve relevant professional knowledge from professional databases (such as Wikipedia or enterprise knowledge bases); then, the large-scale model takes this knowledge and questions as input and outputs Accurate answer. The knowledge "recall ability" of multi-modal knowledge extractors directly affects whether large-scale models can obtain accurate professional knowledge when answering reasoning questions.

Recently,

The Artificial Intelligence Laboratory of the Department of Information Engineering, University of Cambridge has completely open sourced the first pre-trained, universal multi-modal post-interactive knowledge retrieval PreFLMR

(Pre- trained Fine-grained Late-interaction Multi-modal Retriever). Compared with previous common models, PreFLMR has the following characteristics:

PreFLMR is a general pre-training model that can effectively solve multiple sub-tasks such as text retrieval, image retrieval and knowledge retrieval. Pre-trained on millions of levels of multi-modal data, the model performs well in multiple downstream retrieval tasks. In addition, as an excellent basic model, PreFLMR can quickly develop into an excellent domain-specific model after fine-tuning for private data.

Cambridge teams open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever

Figure 2: The PreFLMR model achieves excellent multi-modal retrieval performance on multiple tasks at the same time and is an extremely strong pre-training base. Model.

2. Traditional Dense Passage Retrieval (DPR) only uses one vector to represent the query (Query) or document (Document). The FLMR model published by the Cambridge team at NeurIPS 2023 proved that the single-vector representation design of DPR can lead to fine-grained information loss, causing DPR to perform poorly on retrieval tasks that require fine information matching. Especially in multi-modal tasks, the user's query contains complex scene information, and compressing it into a one-dimensional vector greatly inhibits the expressive ability of features. PreFLMR inherits and improves the structure of FLMR, giving it unique advantages in multi-modal knowledge retrieval.

Cambridge teams open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever

Figure 3: PreFLMR encodes the query (Query, 1, 2 on the left) at the character level (Token level) , 3) and Document (Document, 4 on the right), compared with the DPR system that compresses all information into one-dimensional vectors, it has the advantage of fine-grained information.

3. PreFLMR can extract documents related to the items in the picture according to the instructions entered by the user (such as "Extract documents that can be used to answer the following questions" or "Extract documents related to the items in the picture"). Relevant documents are extracted from the knowledge base to help multi-modal large models significantly improve the performance of professional knowledge question and answer tasks.

Cambridge teams open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever

Cambridge teams open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever


Cambridge teams open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever

Figure 4: PreFLMR can simultaneously process multi-modal queries that extract documents from pictures, extract documents based on questions, and extract documents based on questions and pictures together. Task.

The Cambridge University team has open sourced three models of different sizes. The parameters of the models from small to large are: PreFLMR_ViT-B (207M), PreFLMR_ViT-L (422M) ), PreFLMR_ViT-G (2B), for users to choose according to actual conditions.

In addition to the open source model PreFLMR itself, this project has also made two important contributions in this research direction:

  1. The project At the same time, a large-scale data set for training and evaluating general knowledge retrieval, Multi-task Multi-modal Knowledge Retrieval Benchmark (M2KR), is open sourced, including 10 retrieval subtasks that have been widely studied in the academic community and a total of more than one million retrievals. right.
  2. In the paper, the Cambridge University team compared image encoders and text encoders of different sizes and performances, and summarized the expanded parameter and pre-training multi-modal post-interactive knowledge retrieval system best practices to provide empirical guidance for future general retrieval models.

The following will briefly introduce the M2KR data set, PreFLMR model and experimental result analysis.

M2KR Dataset

To pretrain and evaluate general multi-modal retrieval models at scale, the authors compiled ten publicly available datasets and convert it into a unified question-document retrieval format. The original tasks of these data sets include image captioning, multi-modal dialogue, etc. The figure below shows the questions (first row) and corresponding documents (second row) for five of the tasks.

Cambridge teams open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever

Figure 5: Part of the knowledge extraction task in the M2KR dataset

PreFLMR retrieval model

Cambridge teams open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever

# Figure 6: Model structure of PreFLMR. Query is encoded as a Token-level feature. For each vector in the query matrix, PreFLMR finds the nearest vector in the document matrix and calculates the dot product, and then sums these maximum dot products to obtain the final relevance.

The PreFLMR model is based on the Fine-grained Late-interaction Multi-modal Retriever (FLMR) published in NeurIPS 2023 and undergoes model improvements and large-scale pre-training on M2KR. Compared with DPR, FLMR and PreFLMR use a matrix composed of all token vectors to characterize documents and queries. Tokens include text tokens and image tokens projected into the text space. Late interaction is an algorithm for efficiently calculating the correlation between two representation matrices. The specific method is: for each vector in the query matrix, find the nearest vector in the document matrix and calculate the dot product. These maximum dot products are then summed to obtain the final correlation. In this way, each token's representation can explicitly affect the final correlation, thus preserving token-level fine-grained information. Thanks to a dedicated post-interactive retrieval engine, PreFLMR can extract 100 relevant documents out of 400,000 documents in just 0.2 seconds, which greatly improves usability in RAG scenarios.

The pre-training of PreFLMR consists of the following four stages:

  • Text encoder pre-training: First , pre-trained a late interactive text retrieval model on MSMAARCO (a pure text knowledge retrieval data set) as the text encoder of PreFLMR.
  • Image-text projection layer pre-training: Second, train the image-text projection layer on M2KR and freeze the other parts. This stage only uses projected image vectors for retrieval, aiming to prevent the model from overly relying on text information.
  • Continuous pre-training: Then, continue training on a high-quality knowledge-intensive visual question answering task in E-VQA, M2KR Text Encoder and Image - Text projection layer. This stage aims to improve PreFLMR’s fine knowledge retrieval capabilities.
  • Universal Retrieval Training: Finally, train all weights on the entire M2KR dataset, freezing only the image encoder. At the same time, the parameters of the query text encoder and the document text encoder are unlocked and trained separately. This stage aims to improve the general retrieval capabilities of PreFLMR.

At the same time, the authors show that PreFLMR can be further fine-tuned on sub-datasets (such as OK-VQA, Infoseek) to obtain better retrieval performance on specific tasks.

Experimental results and vertical expansion

Best retrieval results: The best performing PreFLMR model uses ViT-G as the image encoder and ColBERT -base-v2 as text encoder, two billion parameters in total. It achieves performance beyond baseline models on 7 M2KR retrieval subtasks (WIT, OVEN, Infoseek, E-VQA, OKVQA, etc.).

Extended visual encoding is more effective: The author found that upgrading the image encoder ViT from ViT-B (86M) to ViT-L (307M) brought significant performance improvements, but upgrading the text encoder ColBERT from Expanding base (110M) to large (345M) resulted in performance degradation and training instability. Experimental results show that for later interactive multi-modal retrieval systems, increasing the parameters of the visual encoder brings greater returns. At the same time, using multiple layers of Cross-attention for image-text projection has the same effect as using a single layer, so the design of the image-text projection network does not need to be too complicated.

PreFLMR makes RAG more effective: On knowledge-intensive visual question answering tasks, retrieval enhancement using PreFLMR greatly improves the performance of the final system: reaching 94% on Infoseek and EVQA respectively And an effect improvement of 275%. After simple fine-tuning, the model based on BLIP-2 can beat the PALI-X model with hundreds of billions of parameters and the PaLM-Bison Lens system enhanced with Google API.

Conclusion

The PreFLMR model proposed by Cambridge Artificial Intelligence Laboratory is the first open source general late interactive multi-modal retrieval model. After pre-training on millions of data on M2KR, PreFLMR shows strong performance in multiple retrieval subtasks. The M2KR dataset, PreFLMR model weights and code are available on the project homepage https://preflmr.github.io/.

Expand resources

  • ##FLMR paper (NeurIPS 2023): https: //proceedings.neurips.cc/paper_files/paper/2023/hash/47393e8594c82ce8fd83adc672cf9872-Abstract-Conference.html
  • Code base: https://github.com/LinWeizheDragon/Retrieval- Augmented-Visual-Question-Answering
  • English version blog: https://www.jinghong-chen.net/preflmr-sota-open-sourced-multi/
  • FLMR Introduction: https://www.jinghong-chen.net/fined-grained-late-interaction-multimodal-retrieval-flmr/

The above is the detailed content of Cambridge team's open source: empowering multi-modal large model RAG applications, the first pre-trained universal multi-modal post-interactive knowledge retriever. For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete
How to Run LLM Locally Using LM Studio? - Analytics VidhyaHow to Run LLM Locally Using LM Studio? - Analytics VidhyaApr 19, 2025 am 11:38 AM

Running large language models at home with ease: LM Studio User Guide In recent years, advances in software and hardware have made it possible to run large language models (LLMs) on personal computers. LM Studio is an excellent tool to make this process easy and convenient. This article will dive into how to run LLM locally using LM Studio, covering key steps, potential challenges, and the benefits of having LLM locally. Whether you are a tech enthusiast or are curious about the latest AI technologies, this guide will provide valuable insights and practical tips. Let's get started! Overview Understand the basic requirements for running LLM locally. Set up LM Studi on your computer

Guy Peri Helps Flavor McCormick's Future Through Data TransformationGuy Peri Helps Flavor McCormick's Future Through Data TransformationApr 19, 2025 am 11:35 AM

Guy Peri is McCormick’s Chief Information and Digital Officer. Though only seven months into his role, Peri is rapidly advancing a comprehensive transformation of the company’s digital capabilities. His career-long focus on data and analytics informs

What is the Chain of Emotion in Prompt Engineering? - Analytics VidhyaWhat is the Chain of Emotion in Prompt Engineering? - Analytics VidhyaApr 19, 2025 am 11:33 AM

Introduction Artificial intelligence (AI) is evolving to understand not just words, but also emotions, responding with a human touch. This sophisticated interaction is crucial in the rapidly advancing field of AI and natural language processing. Th

12 Best AI Tools for Data Science Workflow - Analytics Vidhya12 Best AI Tools for Data Science Workflow - Analytics VidhyaApr 19, 2025 am 11:31 AM

Introduction In today's data-centric world, leveraging advanced AI technologies is crucial for businesses seeking a competitive edge and enhanced efficiency. A range of powerful tools empowers data scientists, analysts, and developers to build, depl

AV Byte: OpenAI's GPT-4o Mini and Other AI InnovationsAV Byte: OpenAI's GPT-4o Mini and Other AI InnovationsApr 19, 2025 am 11:30 AM

This week's AI landscape exploded with groundbreaking releases from industry giants like OpenAI, Mistral AI, NVIDIA, DeepSeek, and Hugging Face. These new models promise increased power, affordability, and accessibility, fueled by advancements in tr

Perplexity's Android App Is Infested With Security Flaws, Report FindsPerplexity's Android App Is Infested With Security Flaws, Report FindsApr 19, 2025 am 11:24 AM

But the company’s Android app, which offers not only search capabilities but also acts as an AI assistant, is riddled with a host of security issues that could expose its users to data theft, account takeovers and impersonation attacks from malicious

Everyone's Getting Better At Using AI: Thoughts On Vibe CodingEveryone's Getting Better At Using AI: Thoughts On Vibe CodingApr 19, 2025 am 11:17 AM

You can look at what’s happening in conferences and at trade shows. You can ask engineers what they’re doing, or consult with a CEO. Everywhere you look, things are changing at breakneck speed. Engineers, and Non-Engineers What’s the difference be

Rocket Launch Simulation and Analysis using RocketPy - Analytics VidhyaRocket Launch Simulation and Analysis using RocketPy - Analytics VidhyaApr 19, 2025 am 11:12 AM

Simulate Rocket Launches with RocketPy: A Comprehensive Guide This article guides you through simulating high-power rocket launches using RocketPy, a powerful Python library. We'll cover everything from defining rocket components to analyzing simula

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

WebStorm Mac version

WebStorm Mac version

Useful JavaScript development tools

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment