
Xiaohongshu Interprets Information Retrieval Through the Memory Mechanism, Proposing a New Paradigm Accepted as an EACL Oral

WBOY · 2024-04-29 16:16:07


Recently, the paper "Generative Dense Retrieval: Memory Can Be a Burden" from the Xiaohongshu search algorithm team was accepted as an Oral presentation at EACL 2024, an international conference in the field of natural language processing; the Oral acceptance rate was 11.32% (144/1271).

In the paper, the team proposes a novel information retrieval paradigm, Generative Dense Retrieval (GDR). Inspired by the memory mechanism, this paradigm addresses the challenges that traditional generative retrieval (GR) faces on large-scale datasets.

In past practice, GR relied on its unique memory mechanism to achieve deep interaction between the query and the document corpus. However, this approach, which depends on autoregressive decoding by a language model, has clear limitations on large-scale data: fuzzy fine-grained document features, a limited corpus size, and difficulty updating the index.

The GDR proposed by Xiaohongshu adopts a coarse-to-fine, two-stage retrieval scheme: it first uses the limited memory capacity of the language model to map the query to document clusters, and then completes the precise mapping from document clusters to documents through a vector matching mechanism. By introducing the vector matching mechanism of dense retrieval, GDR effectively alleviates the inherent drawbacks of GR.

In addition, the team designed a memory-friendly document cluster identifier construction strategy and a document-cluster-adaptive negative sampling strategy, which improve the retrieval performance of the two stages respectively. Under multiple settings of the Natural Questions dataset, GDR not only achieves SOTA Recall@k performance, but also retains the advantages of deep interaction while gaining good scalability, opening up new possibilities for future research on information retrieval.

1. Background

Text retrieval tools have important research and application value. Traditional retrieval paradigms each have their merits: sparse retrieval (SR) matches queries and documents at the word level, while dense retrieval (DR) maps queries and documents into the same semantic space and turns retrieval into a problem of vector matching. With the rise of pre-trained language models, the generative retrieval (GR) paradigm has begun to emerge: instead of matching vectors, it memorizes the candidate corpus in the model's parameters and directly generates the identifiers of relevant documents for a given query. This groundbreaking paradigm takes full advantage of pre-trained language models and brings new opportunities to the field of text retrieval. However, it still faces challenges.

During training, the model takes the given query as context and autoregressively generates the identifiers of relevant documents, which forces it to memorize the candidate corpus. At inference time, the query interacts with the model parameters as it is autoregressively decoded, implicitly producing a deep interaction between the query and the candidate corpus; this deep interaction is exactly what SR and DR lack. GR can therefore deliver excellent retrieval performance as long as the model can accurately memorize the candidate documents.

However, GR's memory mechanism is not impeccable. Through comparative experiments between a classic DR model (AR2) and a GR model (NCI), we confirmed that the memory mechanism brings at least three major challenges:

1) Fuzzy fine-grained document features:

We computed, for both NCI and AR2, the probability of a first error at each position when decoding a document identifier from coarse to fine. For AR2, we find the identifier of the most relevant document for a given query through vector matching and then record the position of the identifier's first erroneous step, yielding a step-by-step decoding error rate comparable to NCI's. As shown in Table 1, NCI performs well in the first half of decoding but has a higher error rate in the second half, while the opposite holds for AR2. This shows that NCI, by memorizing the corpus as a whole, handles the coarse-grained mapping of the candidate documents' semantic space well, but fine-grained document features are hard to memorize accurately, so it performs poorly on the fine-grained mapping.

[Table 1: step-by-step decoding error rates of NCI and AR2]
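To make this analysis concrete, the sketch below (our illustration, not the paper's code) computes such a step-by-step error rate: for each query it records the decoding step at which the predicted identifier first diverges from the gold one.

```python
from collections import Counter

def stepwise_first_error_rate(pred_ids, gold_ids):
    """For aligned lists of predicted/gold identifier sequences, return the
    fraction of queries whose first decoding error occurs at each step."""
    first_errors = Counter()
    for pred, gold in zip(pred_ids, gold_ids):
        for step, (p, g) in enumerate(zip(pred, gold)):
            if p != g:
                first_errors[step] += 1
                break
    n = len(pred_ids)
    return {step: count / n for step, count in sorted(first_errors.items())}

# e.g. stepwise_first_error_rate([[3, 1, 4]], [[3, 2, 4]]) -> {1: 1.0}
```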

2) Limited document corpus size:

As shown in Table 2, we trained and tested NCI with a 334K candidate corpus (first row) and a 1M candidate corpus (second row), measured by R@k. NCI drops 11 points on R@100, whereas AR2 drops only 2.8 points. To explore why NCI's performance degrades so sharply as the candidate corpus grows, we further evaluated the NCI model trained on the 1M corpus using the 334K corpus as candidates (third row). Compared with the first row, the burden of memorizing more documents significantly reduces NCI's recall performance, indicating that the model's limited memory capacity restricts its ability to memorize a large-scale candidate corpus.

[Table 2: R@k of NCI and AR2 under different candidate corpus sizes]


3) Index update difficulty:

When a new document needs to be added to the candidate corpus, its document identifier must be constructed and the model must be retrained to re-memorize all documents. Otherwise, outdated mappings (query to document identifier, and document identifier to document) significantly reduce retrieval performance.

The above problems hinder the application of GR in real-world scenarios. Our analysis suggests that DR's matching mechanism is complementary to the memory mechanism, so we introduce it into GR to retain the memory mechanism's strengths while suppressing its drawbacks. We propose a new paradigm, Generative Dense Retrieval (GDR):

  • We design an overall coarse-to-fine two-stage retrieval framework: the memory mechanism handles inter-cluster matching (mapping queries to document clusters), and the vector matching mechanism handles intra-cluster matching (mapping document clusters to documents); a code sketch of this flow follows the list.
  • To help the model memorize the candidate corpus, we construct a memory-friendly document cluster identifier strategy that takes the model's memory capacity as the benchmark to control the granularity of document cluster partitioning, improving the inter-cluster matching effect.
  • In the training phase, we propose a document-cluster-adaptive negative sampling strategy tailored to two-stage retrieval, which increases the weight of in-cluster negative samples and improves the intra-cluster matching effect.
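To make the coarse-to-fine flow concrete, here is a minimal sketch of the GDR retrieval loop. It is an illustration under assumed interfaces: `lm.generate_cids`, `doc_encoder.encode_query`, and the `cluster_index` mapping are hypothetical stand-ins, not the paper's released code.

```python
import numpy as np

def gdr_retrieve(query, lm, doc_encoder, cluster_index,
                 k_clusters=10, top_k=100, beta=1.0):
    """Coarse-to-fine two-stage retrieval (sketch).

    Stage 1 (memory): the language model autoregressively generates k
    document cluster identifiers (CIDs); each sequence log-probability is
    kept as the inter-cluster matching score.
    Stage 2 (matching): documents inside the generated clusters are scored
    against the query by dense vector matching.
    """
    # Stage 1: hypothetical beam search over CIDs -> (cids, log-probs).
    cids, inter_scores = lm.generate_cids(query, num_beams=k_clusters)

    q_vec = doc_encoder.encode_query(query)        # pooled query vector
    scored = []
    for cid, s_inter in zip(cids, inter_scores):
        for doc_id, d_vec in cluster_index[cid]:   # embeddings built offline
            s_intra = float(np.dot(q_vec, d_vec))  # intra-cluster score
            scored.append((doc_id, s_inter + beta * s_intra))

    # Fuse the two scores and keep the top-K documents.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```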

2. Method

2.1 Inter-cluster matching based on the memory mechanism

Taking the query as input, we use the language model to memorize the candidate corpus and autoregressively generate the identifiers of k relevant document clusters (CIDs), completing the mapping

$$q \;\longrightarrow\; \{\mathrm{CID}_1, \ldots, \mathrm{CID}_k\}$$

In this process, the generation probability of a CID is

$$P(\mathrm{CID} \mid q) = \prod_{t=1}^{T} p\left(c_t \mid E_q, \bar{e}_q, c_{<t}\right),$$

where $E_q$ denotes all query token embeddings generated by the encoder and $\bar{e}_q$ is the one-dimensional (pooled) query representation produced by the encoder. This probability is also stored as the inter-cluster matching score and participates in subsequent computation. Based on this, we train the model with the standard cross-entropy loss:

$$\mathcal{L}_{\text{inter}} = -\sum_{t=1}^{T} \log p\left(c_t \mid E_q, \bar{e}_q, c_{<t}\right)$$
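As a minimal PyTorch-style illustration of this objective, assuming the decoder's per-step logits are already conditioned on $E_q$, $\bar{e}_q$, and the previously generated identifier tokens (a sketch, not the paper's code):

```python
import torch
import torch.nn.functional as F

def inter_cluster_loss(logits, target_cid):
    """Teacher-forced cross-entropy over the CID sequence (sketch).

    logits:     (T, vocab) decoder outputs for each identifier step.
    target_cid: (T,) gold cluster-identifier tokens.
    """
    return F.cross_entropy(logits, target_cid)

# e.g. a 3-step identifier over a vocabulary of 30 cluster branches:
loss = inter_cluster_loss(torch.randn(3, 30), torch.tensor([4, 17, 2]))
```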

2.2 Intra-cluster matching based on the vector matching mechanism

We further retrieve candidate documents from the generated document clusters, completing the intra-cluster mapping

$$\{\mathrm{CID}_1, \ldots, \mathrm{CID}_k\} \;\longrightarrow\; \{d_1, \ldots, d_m\}$$

We introduce a document encoder to extract representations of the candidate documents; this step is completed offline. On this basis, the similarity between each in-cluster document and the query is computed as the intra-cluster matching score:

$$s_{\text{intra}}(q, d) = \bar{e}_q^{\top}\, \bar{e}_d,$$

where $\bar{e}_d$ is the pooled document representation.

In this process, an NLL loss over the positive document and sampled negatives is used to train the model:

$$\mathcal{L}_{\text{intra}} = -\log \frac{\exp\left(s_{\text{intra}}(q, d^{+})\right)}{\exp\left(s_{\text{intra}}(q, d^{+})\right) + \sum_{d^{-}} \exp\left(s_{\text{intra}}(q, d^{-})\right)}$$
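A minimal sketch of this loss, assuming dot-product similarity between pooled query and document representations, with one positive document and n sampled negatives:

```python
import torch
import torch.nn.functional as F

def intra_cluster_loss(q_vec, pos_vec, neg_vecs):
    """NLL of the positive document against sampled negatives (sketch).

    q_vec:    (d,)   pooled query representation
    pos_vec:  (d,)   pooled representation of the relevant document
    neg_vecs: (n, d) pooled representations of negative documents
    """
    pos_score = (q_vec @ pos_vec).unsqueeze(0)                # (1,)
    neg_scores = neg_vecs @ q_vec                             # (n,)
    scores = torch.cat([pos_score, neg_scores]).unsqueeze(0)  # (1, n+1)
    # Index 0 is the positive; cross-entropy gives -log softmax(positive).
    return F.cross_entropy(scores, torch.tensor([0]))
```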

Finally, we compute a weighted combination of each document's inter-cluster and intra-cluster matching scores, sort the documents by it, and select the top K as the retrieved relevant documents:

$$s(q, d) = s_{\text{inter}}(q, \mathrm{CID}_d) + \beta \cdot s_{\text{intra}}(q, d),$$

where $\beta$ is set to 1 in our experiments.

2.3 Memory-friendly document cluster identifier construction strategy

To make full use of the model's limited memory capacity for deep interaction between the query and the candidate corpus, we propose a memory-friendly document cluster identifier construction strategy. The strategy first takes the model's memory capacity as the benchmark and computes an upper bound on the number of documents per cluster:

[Equation: upper bound on documents per cluster, derived from the model's measured memory capacity]

On this basis, document cluster identifiers are constructed with the K-means algorithm so that the model's memory burden never exceeds its memory capacity:

[Algorithm: hierarchical K-means construction of document cluster identifiers under the per-cluster document bound]
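One plausible realization of this construction is a recursive K-means split that stops once a cluster fits under the document bound. The branching factor and the `max_docs` cap here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cids(doc_vecs, max_docs, branch=10, prefix=()):
    """Recursively split documents with K-means until every leaf cluster
    holds at most `max_docs` documents; the path of branch indices taken
    to reach a leaf serves as its cluster identifier (CID)."""
    if len(doc_vecs) <= max_docs:
        return {prefix: np.arange(len(doc_vecs))}
    labels = KMeans(n_clusters=branch, n_init=10).fit_predict(doc_vecs)
    cids = {}
    for b in range(branch):
        idx = np.where(labels == b)[0]
        if len(idx) == 0:
            continue
        sub = build_cids(doc_vecs[idx], max_docs, branch, prefix + (b,))
        # Map the leaf-local indices back to this level's document indices.
        cids.update({cid: idx[local] for cid, local in sub.items()})
    return cids

# e.g. build_cids(np.random.rand(5000, 768), max_docs=200)
```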

2.4 Document-cluster-adaptive negative sampling strategy

GDR's two-stage retrieval framework means that, during intra-cluster matching, negatives drawn from the same cluster account for a larger share of the candidates. Therefore, in the second training stage, we take the document cluster partition as the benchmark and explicitly increase the weight of in-cluster negative samples to obtain a better intra-cluster matching effect:

[Equation: intra-cluster matching loss with explicitly up-weighted in-cluster negative samples]
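One way to read this strategy in code (a sketch; the scalar `alpha` and the exact form of the weighting are our illustrative assumptions): negatives from the positive document's own cluster are up-weighted in the contrastive denominator.

```python
import torch

def adaptive_intra_loss(q_vec, pos_vec, neg_vecs, same_cluster_mask, alpha=2.0):
    """Contrastive NLL with in-cluster negatives up-weighted (sketch).

    same_cluster_mask: (n,) bool, True where the negative shares the
    positive document's cluster; alpha > 1 boosts those terms.
    """
    pos = torch.exp(q_vec @ pos_vec)
    negs = torch.exp(neg_vecs @ q_vec)                     # (n,)
    weights = torch.where(same_cluster_mask,
                          torch.full_like(negs, alpha),
                          torch.ones_like(negs))
    return -torch.log(pos / (pos + (weights * negs).sum()))
```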



3. Experiment

The dataset used in the experiments is Natural Questions (NQ), which contains 58K training pairs (query and relevant documents), 6K validation pairs, and a corpus of 21M candidate documents. Each query has multiple relevant documents, which places higher demands on the model's recall performance. To evaluate GDR on corpora of different sizes, we constructed the NQ334K, NQ1M, NQ2M, and NQ4M settings by adding the remaining passages from the full 21M corpus to NQ334K. GDR generates CIDs on each dataset separately to prevent semantic information from a larger candidate corpus leaking into a smaller one. We adopt BM25 (Anserini implementation) as the SR baseline, DPR and AR2 as DR baselines, and NCI as the GR baseline. Evaluation metrics include R@k and Acc@k.
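For reference, here is a minimal sketch of the two metrics as commonly defined for multi-relevant-document retrieval (our reading; exact averaging details in the paper may differ):

```python
def recall_at_k(retrieved, relevant, k):
    """R@k: fraction of this query's relevant documents in the top-k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def acc_at_k(retrieved, relevant, k):
    """Acc@k: 1.0 if any relevant document appears in the top-k."""
    return float(bool(set(retrieved[:k]) & set(relevant)))

# Averaged over all queries:
# mean_r100 = sum(recall_at_k(r, g, 100) for r, g in pairs) / len(pairs)
```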

3.1 Main Experimental Results

On the NQ dataset, GDR improves R@k by 3.0 points on average and ranks second on Acc@k. This shows that, through the coarse-to-fine retrieval process, GDR combines the memory mechanism's advantage in deep interaction with the matching mechanism's advantage in discriminating fine-grained features.

[Table: main results of BM25, DPR, AR2, NCI, and GDR on NQ (R@k and Acc@k)]

3.2 Expansion to a larger corpus

We observe that when the candidate corpus is scaled up, the R@100 drop rate of SR and DR stays below 4.06%, while that of GR exceeds 15.25% in all three expansion directions. In contrast, by focusing its memorized content on coarse-grained features of a corpus of fixed volume, GDR achieves an average R@100 drop rate of 3.50%, on par with SR and DR.

[Table: R@100 drop rates of SR, DR, GR, and GDR as the candidate corpus scales from NQ334K to NQ1M/2M/4M]


3.3 Ablation experiment

In Table 3, GDR-bert and GDR-ours denote model performance under the traditional CID construction strategy and under ours, respectively. The experiments show that the memory-friendly document cluster identifier construction strategy significantly reduces the memory burden and thereby yields better retrieval performance. In addition, Table 4 shows that the document-cluster-adaptive negative sampling strategy used in GDR training enhances fine-grained matching by providing more discriminative signals within document clusters.

[Tables 3 and 4: ablations on the CID construction strategy and the adaptive negative sampling strategy]


3.4 Adding new documents

When a new document is added to the candidate corpus, GDR assigns it to the document cluster with the nearest cluster center and gives it the corresponding identifier; meanwhile, the document encoder extracts its vector representation and updates the vector index, completing a rapid expansion. As shown in Table 6, when new documents are added to the candidate corpus, NCI's R@100 drops by 18.3 percentage points, while GDR's drops by only 1.9 percentage points. This shows that by introducing the matching mechanism, GDR alleviates the poor scalability of the memory mechanism and maintains good recall without retraining the model.

[Table 6: R@100 before and after adding new documents to the candidate corpus]
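A minimal sketch of this expansion step, under assumed interfaces: `doc_encoder.encode_doc`, the `centroids` dict of leaf-cluster centers, and `index.add` are hypothetical stand-ins, not the paper's actual API.

```python
import numpy as np

def add_document(doc_text, doc_encoder, centroids, index):
    """Attach a new document to the nearest existing cluster and extend the
    dense index, without retraining the language model (sketch).

    centroids: dict mapping CID -> cluster-center vector
    """
    d_vec = doc_encoder.encode_doc(doc_text)              # offline embedding
    cid_list = list(centroids)
    dists = [np.linalg.norm(d_vec - centroids[c]) for c in cid_list]
    cid = cid_list[int(np.argmin(dists))]                 # nearest cluster center
    index.add(cid, d_vec)                                 # update vector index
    return cid
```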


3.5 Limitations

Constrained by autoregressive generation in language models, GDR's retrieval efficiency, although significantly better than GR's thanks to the vector matching mechanism introduced in the second stage, still leaves considerable room for improvement compared with DR and SR. We look forward to future research that alleviates the latency introduced by bringing the memory mechanism into the retrieval framework.

[Table: retrieval efficiency comparison across paradigms]

4. Conclusion

In this study, we explored in depth the double-edged effect of the memory mechanism in information retrieval: on the one hand, it enables deep interaction between the query and the candidate corpus, making up for a shortcoming of dense retrieval; on the other hand, the model's limited memory capacity and the complexity of index updates make it hard to handle large-scale, dynamically changing candidate corpora. To solve this problem, we combine the memory mechanism and the vector matching mechanism hierarchically, letting the two complement each other and each play to its strengths.

We propose a new text retrieval paradigm, Generative Dense Retrieval (GDR). For a given query, GDR performs coarse-to-fine two-stage retrieval: the memory mechanism first autoregressively generates document cluster identifiers, mapping the query to document clusters, and the vector matching mechanism then computes query-document similarities to complete the mapping from document clusters to documents.

The memory-friendly document cluster identifier construction strategy ensures that the model's memory burden does not exceed its capacity, improving inter-cluster matching; the document-cluster-adaptive negative sampling strategy strengthens the training signal for distinguishing in-cluster negatives, improving intra-cluster matching. Extensive experiments show that GDR achieves excellent retrieval performance on large-scale candidate corpora and responds efficiently to corpus updates.

As a successful attempt to integrate the advantages of traditional retrieval methods, the generative dense retrieval paradigm offers good recall performance, strong scalability, and robustness in scenarios with massive candidate corpora. As large language models continue to improve in understanding and generation, the performance of generative dense retrieval will improve further, opening up broader prospects for information retrieval.

Paper address: https://www.php.cn/link/9e69fd6d1c5d1cef75ffbe159c1f322e

5. Author introduction

  • Yuan Peiwen
    A Ph.D. student at Beijing Institute of Technology and an intern on the Xiaohongshu community search team. He has published several first-author papers at NeurIPS, ICLR, AAAI, EACL, etc. His main research directions are large language model reasoning and evaluation, and information retrieval.
  • Wang Xinglin
    A Ph.D. student at Beijing Institute of Technology and an intern on the Xiaohongshu community search team. He has published several papers at EACL, NeurIPS, ICLR, etc., and won second place in the evaluation track of the International Dialogue Technology Challenge DSTC11. His main research directions are large language model reasoning and evaluation, and information retrieval.
  • Feng Shaoxiong
    Responsible for vector recall in Xiaohongshu community search. He holds a Ph.D. from Beijing Institute of Technology and has published several papers at top conferences/journals in machine learning and natural language processing such as ICLR, AAAI, ACL, EMNLP, NAACL, EACL, and KBS. His main research directions include large language model evaluation, inference distillation, generative retrieval, and open-domain dialogue generation.
  • Daoxuan
    Head of the Xiaohongshu transaction search team. He holds a Ph.D. from Zhejiang University, has published several first-author papers at top machine learning conferences such as NeurIPS and ICML, and has long served as a reviewer for many top conferences/journals. His work covers content search, e-commerce search, live-stream search, and more.
  • Zeng Shu
    Graduated from the Department of Electronic Engineering of Tsinghua University with a master's degree. He has worked on algorithms for natural language processing, recommendation, and search in the Internet industry, and is currently responsible for technical directions such as recall and vertical search in Xiaohongshu community search.
