Xiaohongshu interprets information retrieval from the perspective of the memory mechanism and proposes a new paradigm, earning an EACL Oral
Recently, the paper "Generative Dense Retrieval: Memory Can Be a Burden" from the Xiaohongshu search algorithm team was accepted as an Oral presentation at EACL 2024, an international conference in the field of natural language processing. The Oral acceptance rate was 11.32% (144/1271).
In the paper, the team proposes a novel information retrieval paradigm, Generative Dense Retrieval (GDR). Inspired by the memory mechanism, this paradigm addresses the challenges that traditional generative retrieval (GR) faces when dealing with large-scale datasets.
In past practice, GR relied on its unique memory mechanism to achieve deep interaction between queries and the document corpus. However, this approach, which relies on a language model's autoregressive generation, has clear limitations when handling large-scale data: fine-grained document features become blurred, the size of the candidate corpus is limited, and the index is difficult to update.
The GDR proposed by Xiaohongshu adopts a coarse-to-fine, two-stage retrieval approach. It first uses the limited memory capacity of the language model to map queries to document clusters, and then completes the precise mapping from document clusters to documents through a vector matching mechanism. By introducing the vector matching mechanism of dense retrieval, GDR effectively alleviates the inherent drawbacks of GR.
In addition, the team designed a memory-friendly document cluster identifier construction strategy and a document cluster adaptive negative sampling strategy, which improve the retrieval performance of the two stages respectively. Under multiple settings of the Natural Questions dataset, GDR not only achieves state-of-the-art Recall@k but also retains the advantage of deep interaction while gaining good scalability, opening up new possibilities for future research on information retrieval.
Text retrieval tools have important research and application value. Traditional retrieval paradigms, such as sparse retrieval (SR) based on term matching and dense retrieval (DR) based on semantic vector matching, each have their own merits: DR, for example, maps queries and documents into the same semantic space and turns retrieval into a problem of measuring vector similarity. With the rise of pre-trained language models, the generative retrieval (GR) paradigm has begun to emerge. This paradigm takes advantage of pre-trained language models and brings new opportunities to the field of text retrieval. However, the generative retrieval paradigm still faces challenges, which are rooted in its memory mechanism.
During training, the model takes a given query as context and autoregressively generates the identifiers of its relevant documents; this process lets the model memorize the candidate corpus. At inference time, the query enters the model, interacts with the model parameters, and is autoregressively decoded, which implicitly produces deep interaction between the query and the candidate corpus. This deep interaction is exactly what SR and DR lack. Therefore, GR can deliver excellent retrieval performance when the model can accurately memorize the candidate documents.
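To make the memory mechanism concrete, the following is a minimal, self-contained sketch (not the paper's implementation) of GR-style constrained decoding: a trie built over the valid document identifiers restricts each autoregressive step so that only identifiers that actually exist in the corpus can be generated. The scoring function here is a stand-in for the language model.

```python
# Toy sketch of GR-style constrained autoregressive decoding of document identifiers.
# The scorer below is a placeholder; in practice a seq2seq language model produces
# the per-step token scores conditioned on the query.

from collections import defaultdict

# Hypothetical corpus: each document has a hierarchical identifier (docid).
DOCIDS = [(0, 1, 2), (0, 1, 3), (0, 2, 1), (1, 0, 0)]

def build_trie(docids):
    """Map each identifier prefix to the set of tokens that may follow it."""
    trie = defaultdict(set)
    for docid in docids:
        for i, tok in enumerate(docid):
            trie[docid[:i]].add(tok)
    return trie

def toy_step_scores(query, prefix):
    """Placeholder for the language model: score every candidate next token."""
    return {tok: (hash((query, prefix, tok)) % 100) / 100.0 for tok in range(4)}

def decode_docid(query, trie, max_len=3):
    """Greedy constrained decoding: at each step only trie-allowed tokens survive."""
    prefix = ()
    for _ in range(max_len):
        allowed = trie.get(prefix, set())
        if not allowed:
            break
        scores = toy_step_scores(query, prefix)
        next_tok = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        prefix += (next_tok,)
    return prefix

trie = build_trie(DOCIDS)
print(decode_docid("what is generative retrieval?", trie))  # always a valid docid
```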
However, GR's memory mechanism is not impeccable. Through comparative experiments between a classic DR model (AR2) and a GR model (NCI), we confirmed that the memory mechanism brings at least three major challenges:
First, fine-grained document features are blurred. We separately calculated the error probability of NCI and AR2 when decoding each position of the document identifier from coarse to fine. For AR2, we found the identifier of the most relevant document for a given query through vector matching and then recorded the step at which that identifier first diverged, yielding a step-by-step decoding error rate comparable to NCI's. As shown in Table 1, NCI performs well in the first half of decoding but its error rate rises in the second half, and the opposite holds for AR2. This shows that, by memorizing the corpus as a whole, NCI handles the coarse-grained mapping of the candidate documents' semantic space well; however, its memory of fine-grained document features is blurry, so it performs poorly on the fine-grained part of the mapping.
Second, the size of the candidate corpus is limited by memory capacity. As shown in Table 2, we trained and tested NCI with a 334K-document candidate corpus (first row) and a 1M-document candidate corpus (second row), measuring R@k. The results show that NCI drops 11 points on R@100, compared with only 2.8 points for AR2. To explore why NCI's performance degrades so sharply as the candidate corpus grows, we further tested the NCI model trained on the 1M corpus using the 334K corpus as the candidate set (third row). Compared with the first row, the burden of memorizing more documents leads to a marked decrease in recall, indicating that the model's limited memory capacity constrains its ability to memorize a large-scale candidate corpus.
Third, the index is difficult to update. When a new document needs to be added to the candidate corpus, the document identifiers need to be updated and the model needs to be retrained to re-memorize all documents. Otherwise, outdated mappings (query to document identifier, and document identifier to document) significantly reduce retrieval performance.
The above problems hinder the application of GR in real scenarios. After analysis, we believe the matching mechanism of DR is complementary to the memory mechanism, so we introduce it into GR to retain the memory mechanism while suppressing its disadvantages. We propose a new paradigm, Generative Dense Retrieval (GDR):
In the first stage (inter-cluster matching), taking the query as input, the language model that has memorized the candidate corpus autoregressively generates the identifiers of k relevant document clusters (CIDs), completing the coarse-grained mapping from the query to document clusters. The generation probability of a CID is the product of its per-step token probabilities, conditioned on the full set of query embeddings generated by the encoder and on the one-dimensional query representation the encoder also produces. This probability is stored as the inter-cluster matching score and participates in the subsequent ranking. The model is trained for this stage with a standard cross-entropy loss over the ground-truth CID tokens.
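As an illustration (the exact formulation and shapes are defined in the paper; this sketch only assumes that the decoder emits one token distribution per CID position), the generation probability can be computed as a product of per-step softmax probabilities, with the cross-entropy loss summed over the same steps:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Assumed shapes: the decoder produces one logit vector per CID position,
# already conditioned on the query embeddings from the encoder.
step_logits = np.random.randn(3, 16)      # 3 decoding steps, 16 possible tokens per step
target_cid = np.array([4, 9, 1])          # ground-truth cluster identifier tokens

step_probs = softmax(step_logits, axis=-1)                          # per-step token distributions
token_probs = step_probs[np.arange(len(target_cid)), target_cid]    # probability of each target token

inter_cluster_score = token_probs.prod()          # generation probability of the CID,
                                                  # reused later as the inter-cluster score
cross_entropy_loss = -np.log(token_probs).sum()   # standard CE training loss for stage one

print(inter_cluster_score, cross_entropy_loss)
```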
In the second stage (intra-cluster matching), we retrieve candidate documents from the generated document clusters. A document encoder extracts representations of the candidate documents, and this step is completed offline. Based on these representations, the similarity between each document in a cluster and the query is computed as the intra-cluster matching score. The model is trained for this stage with a negative log-likelihood (NLL) loss.
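A minimal sketch of this stage, under the common assumption that the intra-cluster score is a dot product between the query representation and the offline document embeddings, and that the NLL loss contrasts the positive document against the other in-cluster documents:

```python
import numpy as np

rng = np.random.default_rng(0)

query_repr = rng.standard_normal(128)        # one-dimensional query representation
doc_embs = rng.standard_normal((5, 128))     # offline embeddings of 5 in-cluster documents
positive_idx = 2                             # index of the relevant document

intra_cluster_scores = doc_embs @ query_repr  # similarity of each in-cluster document to the query

# NLL (softmax contrastive) loss: push the positive document above the in-cluster negatives.
log_probs = intra_cluster_scores - np.log(np.exp(intra_cluster_scores).sum())
nll_loss = -log_probs[positive_idx]

print(intra_cluster_scores, nll_loss)
```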
Finally, we compute a weighted sum of each document's inter-cluster matching score and intra-cluster matching score, sort by this value, and select the top K documents as the retrieved results. The weight beta is set to 1 in our experiments.
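Putting the two stages together, a sketch of the final ranking; the additive combination with beta = 1 follows the description above, while the assumption that the two scores are on comparable scales is made only for this toy example:

```python
import numpy as np

# Hypothetical scores for 6 candidate documents drawn from the generated clusters.
inter_scores = np.array([0.30, 0.30, 0.12, 0.12, 0.12, 0.05])  # score of each doc's cluster
intra_scores = np.array([1.8, 0.4, 2.5, 0.9, 1.1, 2.0])        # query-document similarity
beta = 1.0                                                      # weight used in the experiments

final_scores = beta * inter_scores + intra_scores
top_k = np.argsort(-final_scores)[:3]   # indices of the top-3 retrieved documents
print(top_k, final_scores[top_k])
```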
To make full use of the model's limited memory capacity for deep interaction between the query and the candidate corpus, we propose a memory-friendly document cluster identifier construction strategy. The strategy first takes the model's memory capacity as a benchmark to compute an upper limit on the number of documents per cluster, and then constructs the document cluster identifiers with the K-means algorithm under this limit, ensuring that the model's memory burden does not exceed its memory capacity.
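A rough sketch of such a construction; the recursive binary splitting, the concrete capacity value, and the use of scikit-learn's KMeans are illustrative assumptions, and the paper defines the exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cids(doc_embs, max_docs_per_cluster):
    """Recursively split the corpus with K-means until every cluster fits the memory budget.
    Returns a dict: document index -> hierarchical cluster identifier (a tuple of branch ids)."""
    return _split(doc_embs, np.arange(len(doc_embs)), max_docs_per_cluster, prefix=())

def _split(doc_embs, idx, cap, prefix):
    if len(idx) <= cap:
        return {int(i): prefix for i in idx}
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_embs[idx])
    cids = {}
    for branch in (0, 1):
        sub = idx[labels == branch]
        if len(sub) in (0, len(idx)):          # degenerate split: stop recursing here
            return {int(i): prefix for i in idx}
        cids.update(_split(doc_embs, sub, cap, prefix + (branch,)))
    return cids

rng = np.random.default_rng(0)
doc_embs = rng.standard_normal((200, 64))      # toy document embeddings
cids = build_cids(doc_embs, max_docs_per_cluster=30)
print(cids[0], cids[199])                      # every document receives a hierarchical CID
```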
The two-stage framework of GDR means that, during intra-cluster matching, negative samples from within the same cluster account for a larger share of the candidates. Therefore, in the second training stage, we take the document cluster partition as the basis and explicitly increase the weight of in-cluster negative samples to obtain a better intra-cluster matching effect.
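As a hedged illustration (the sampling ratio and the exact weighting scheme below are invented for the example; the paper specifies its own), in-cluster negatives can simply be drawn with higher probability than out-of-cluster ones when building the training batch:

```python
import random

def sample_negatives(positive_doc, cluster_of, docs, n_neg=4, in_cluster_ratio=0.75):
    """Draw negatives for one training example, favoring documents that share the
    positive document's cluster so the intra-cluster signal is emphasized."""
    pos_cluster = cluster_of[positive_doc]
    in_cluster = [d for d in docs if d != positive_doc and cluster_of[d] == pos_cluster]
    out_cluster = [d for d in docs if cluster_of[d] != pos_cluster]

    n_in = min(len(in_cluster), round(n_neg * in_cluster_ratio))
    negatives = random.sample(in_cluster, n_in)
    negatives += random.sample(out_cluster, min(len(out_cluster), n_neg - n_in))
    return negatives

docs = [f"doc{i}" for i in range(10)]
cluster_of = {d: i // 5 for i, d in enumerate(docs)}   # two toy clusters of 5 docs each
print(sample_negatives("doc2", cluster_of, docs))
```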
The dataset used in the experiments is Natural Questions (NQ). It contains 58K training pairs (query and relevant documents) and 6K validation pairs, along with a corpus of 21M candidate documents. Each query has multiple relevant documents, which places higher demands on the recall performance of the model. To evaluate GDR on corpora of different sizes, we constructed NQ334K, NQ1M, NQ2M, and NQ4M settings by adding the remaining passages from the full 21M corpus to NQ334K. GDR generates CIDs on each dataset separately to prevent the semantic information of a larger candidate corpus from leaking into a smaller one. We adopt BM25 (Anserini implementation) as the SR baseline, DPR and AR2 as the DR baselines, and NCI as the GR baseline. Evaluation metrics include R@k and Acc@k.
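For reference, a sketch of the two metrics under their common definitions, R@k as the fraction of a query's relevant documents found in the top k and Acc@k as whether at least one relevant document appears in the top k; the paper's exact definitions may differ:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of this query's relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def acc_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant document appears in the top-k results, else 0.0."""
    return float(any(doc in set(relevant) for doc in retrieved[:k]))

retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = ["d1", "d2"]
print(recall_at_k(retrieved, relevant, k=5), acc_at_k(retrieved, relevant, k=5))  # 0.5 1.0
```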
On the NQ datasets, GDR improves the R@k metric by an average of 3.0 points and ranks second on the Acc@k metric. This shows that, through its coarse-to-fine retrieval process, GDR combines the advantage of the memory mechanism in deep interaction with the advantage of the matching mechanism in fine-grained feature discrimination.
We also noticed that when the candidate corpus is expanded to larger scales, the R@100 drop rate of SR and DR stays below 4.06%, whereas the drop rate of GR exceeds 15.25% in all three expansion directions. In contrast, by focusing its memory on the coarse-grained features of a fixed volume of the corpus, GDR keeps the average R@100 drop rate at 3.50%, similar to SR and DR.
In Table 3, GDR-bert and GDR-ours denote the model's performance under the traditional CID construction strategy and under ours, respectively. The experiments show that the memory-friendly document cluster identifier construction strategy significantly reduces the memory burden and thereby leads to better retrieval performance. In addition, Table 4 shows that the document cluster adaptive negative sampling strategy used in GDR training enhances fine-grained matching by providing more discriminative signals within document clusters.
When a new document is added to the candidate corpus, GDR assigns it to the nearest document cluster center and gives it the corresponding identifier; at the same time, the document encoder extracts its vector representation and the vector index is updated, completing a fast expansion to new documents. As shown in Table 6, in the setting where new documents are added to the candidate corpus, NCI's R@100 drops by 18.3 percentage points, while GDR's drops by only 1.9 percentage points. This shows that by introducing the matching mechanism, GDR alleviates the poor scalability of the memory mechanism and maintains good recall without retraining the model.
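A minimal sketch of this update path; the flat per-cluster index and the nearest-centroid assignment by dot-product similarity are simplifications for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
cluster_centers = rng.standard_normal((8, 64))   # centers of the existing document clusters
doc_index = {c: [] for c in range(8)}            # per-cluster vector index: (doc id, embedding)

def add_document(doc_id, doc_embedding):
    """Assign a new document to its most similar cluster and append it to that cluster's
    vector index, with no retraining of the generative model."""
    sims = cluster_centers @ doc_embedding
    cid = int(np.argmax(sims))                   # identifier of the nearest cluster
    doc_index[cid].append((doc_id, doc_embedding))
    return cid

new_emb = rng.standard_normal(64)                # embedding from the (frozen) document encoder
print(add_document("new_doc_001", new_emb))
```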
Limited by the autoregressive generation of the language model, GDR, although it introduces a vector matching mechanism in the second stage and thereby improves retrieval efficiency significantly over GR, still has considerable room for improvement compared with DR and SR. We look forward to future research that alleviates the latency introduced by bringing the memory mechanism into the retrieval framework.
In this study, we explored in depth the double-edged nature of the memory mechanism in information retrieval. On the one hand, the mechanism enables deep interaction between the query and the candidate corpus, making up for a shortcoming of dense retrieval; on the other hand, the model's limited memory capacity and the complexity of updating the index make it difficult to handle large-scale and dynamically changing candidate corpora. To solve this problem, we combine the memory mechanism and the vector matching mechanism in a hierarchical manner so that the two complement each other, drawing on their strengths and avoiding their weaknesses.
We propose a new text retrieval paradigm, Generative Dense Retrieval (GDR). For a given query, GDR performs a coarse-to-fine, two-stage retrieval: first, the memory mechanism autoregressively generates document cluster identifiers to map the query to document clusters; then, the vector matching mechanism computes query-document similarity to complete the mapping from document clusters to documents.
The memory-friendly document cluster identifier construction strategy ensures that the model's memory burden does not exceed its memory capacity and improves inter-cluster matching. The document cluster adaptive negative sampling strategy strengthens the training signal for distinguishing negative samples within a cluster and improves intra-cluster matching. Extensive experiments show that GDR achieves excellent retrieval performance on large-scale candidate corpora and can respond efficiently to corpus updates.
As a successful attempt to integrate the advantages of traditional retrieval methods, the generative dense retrieval paradigm shows strong recall performance, good scalability, and robustness in scenarios with massive candidate corpora. As large language models continue to improve in understanding and generation, the performance of generative dense retrieval will improve further, opening up a broader horizon for information retrieval.
Paper address: https://www.php.cn/link/9e69fd6d1c5d1cef75ffbe159c1f322e