
Build a high-speed retrieval engine using python and xapian

高洛峰
2016-10-18 10:03:17

First, a few concepts.

Documents, terms and postings

In information retrieval (IR), the items we are trying to retrieve are called "documents", and each document is described by a set of terms. Both words are IR jargon inherited from library science. A document is usually thought of as a piece of text, most likely in machine-readable form, and a term is a word or phrase that describes the document and usually appears in it. Most documents have multiple terms. For example, a document about oral hygiene might have the terms "tooth", "teeth", "toothbrush", "decay", "cavity", "plaque" and "diet".

If an IR system contains a document D, and D is described by a term t, then t is said to index D, written t -> D. In practice an IR system holds many documents D1, D2, D3, ... and many terms t1, t2, t3, ..., so the relationship takes the general form ti -> Dj.

When a particular term indexes a particular document, that pairing is called a posting. A posting can also carry position information, that is, where the term occurs in the document, which is useful for phrase searches and relevance ranking.
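To make the idea of a posting with positions concrete, here is a minimal pure-Python sketch. It is not how Xapian stores postings internally; the function names and sample texts are made up for illustration. In Xapian itself, a posting with a position is added through `Document.add_posting(term, position)`.

```python
from collections import defaultdict


def build_positional_postings(docs):
    """Map each term to {doc_id: [positions]}; each (term, doc_id) pair is one posting."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            postings[term].setdefault(doc_id, []).append(position)
    return postings


def phrase_match(postings, first, second):
    """Return ids of documents in which `second` directly follows `first`."""
    hits = []
    for doc_id, positions in postings.get(first, {}).items():
        following = set(postings.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "brush your teeth to prevent tooth decay",
    2: "tooth decay and plaque",
    3: "decay of a tooth",
}
postings = build_positional_postings(docs)
print(phrase_match(postings, "tooth", "decay"))
```

Document 3 contains both words but not next to each other, so only the stored positions can tell it apart from documents 1 and 2.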

Given a document D, the list of terms that index it is called D's term list.

Given a term t, the list of documents it indexes is called t's posting list ("document list" might be a more consistent name, but it sounds too vague).
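The two lists can be sketched with plain dictionaries. This is only an illustration of the concepts, using made-up sample documents and hypothetical function names, not Xapian's API.

```python
def build_index(docs):
    """Build term lists (per document) and posting lists (per term)."""
    term_lists = {}     # doc_id -> set of terms: the document's term list
    posting_lists = {}  # term -> sorted doc ids: the term's posting list
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        term_lists[doc_id] = terms
        for term in terms:
            posting_lists.setdefault(term, []).append(doc_id)
    for ids in posting_lists.values():
        ids.sort()
    return term_lists, posting_lists


def search_all(posting_lists, *terms):
    """Ids of documents indexed by every given term (an AND query)."""
    result = None
    for term in terms:
        ids = set(posting_lists.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "tooth decay cavity",
    2: "toothbrush plaque tooth",
    3: "diet plaque",
}
term_lists, posting_lists = build_index(docs)
print(posting_lists["plaque"])
print(search_all(posting_lists, "tooth", "plaque"))
```

A query then only has to intersect a few posting lists instead of scanning every document.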

In a computer-based IR system, terms are stored in index files, and a term can be used to look up its posting list efficiently. In the posting list each document is represented by a short identifier, the document id. Put simply, a posting list can be thought of as a collection of document ids, while a term list is a collection of strings. Some IR systems represent terms internally as numbers, so in those systems a term list is a collection of numbers. Xapian does not: it stores the terms themselves and uses prefix compression to save space.
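The general idea behind prefix compression can be sketched as follows: in a sorted list of terms, each term is stored as the number of leading characters it shares with the previous term plus the remaining suffix. This is a generic illustration under that assumption, not Xapian's actual on-disk format, and the function names are made up.

```python
def front_encode(sorted_terms):
    """Encode each term as (shared_prefix_length, remaining_suffix)."""
    encoded, previous = [], ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(term), len(previous))
        while shared < limit and term[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def front_decode(encoded):
    """Rebuild the original terms from (shared_prefix_length, suffix) pairs."""
    terms, previous = [], ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        terms.append(previous)
    return terms


terms = sorted(["connects", "connection", "connect", "connected"])
encoded = front_encode(terms)
print(encoded)
```

Because neighbouring terms in a sorted list tend to share long prefixes, only a few characters per term need to be stored.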

Terms do not have to be words that appear verbatim in the document. They are usually converted to lowercase and are often run through a stemming algorithm, so a single term such as "connect" can retrieve a whole family of words: "connect", "connects", "connection", "connected" and so on. One word may also produce several terms; for example, you might index both the stemmed and the unstemmed form. Of course, this mostly applies to European and American languages such as English, French or Latin; Chinese word segmentation is very different. In general, tokenizing European and American languages differs from tokenizing Chinese in the following ways:
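As a rough sketch of indexing both the stemmed and the unstemmed form, the following uses a deliberately crude toy stemmer. `toy_stem` and `terms_for` are hypothetical helpers; a real application would use a proper stemmer such as `xapian.Stem("english")`. Marking stemmed terms with a "Z" prefix follows the convention used by Xapian's TermGenerator.

```python
SUFFIXES = ("ion", "ed", "s")  # toy rules, far cruder than a real stemmer


def toy_stem(word):
    """Strip one known suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def terms_for(text):
    """Return both the lowercased words and their prefixed stems."""
    terms = set()
    for word in text.lower().split():
        terms.add(word)                   # unstemmed form
        terms.add("Z" + toy_stem(word))   # stemmed form, marked with a prefix
    return terms


print(sorted(terms_for("Connects connected Connection")))
```

A query for any of the three words can then be stemmed in the same way and matched against the single term "Zconnect".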

1. Take English as an example. English words are normally separated by spaces, but Chinese words are not. In the extreme case an entire article may contain no spaces or punctuation at all.
2. As mentioned above, "connect", "connects", "connection" and "connected" are respectively the verb, its third-person singular form, the noun and the past tense. In Chinese a single word covers all of these, so stemming is almost never needed. In other words, English inflection is largely rule-based, whereas Chinese parts of speech follow no such surface rules.
3. The second point is only a small part of what makes Chinese word segmentation difficult; correctly identifying the meaning of a sentence is very hard. For example, from the sentence "The People's Republic of China was established" a segmenter can pick out words such as "China", "Chinese", "people", "republic" and "founded", yet "Chinese" actually has little to do with the sentence. It looks simple at first glance, but a machine cannot easily tell the difference.
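One simple dictionary-based approach to segmentation is forward maximum matching: at each position, take the longest dictionary word that starts there. The sketch below is an illustration only. The Chinese sentence and the tiny dictionary are assumptions chosen to mirror the translated example above, and real segmenters are considerably more sophisticated.

```python
def forward_max_match(text, dictionary, max_len=7):
    """Greedily take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def all_candidates(text, dictionary):
    """Every dictionary word found anywhere in the text, overlaps included."""
    return [text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in dictionary]


# Assumed dictionary and sentence for illustration.
DICTIONARY = {"中华", "华人", "人民", "共和国", "中华人民共和国", "成立"}
sentence = "中华人民共和国成立了"

print(all_candidates(sentence, DICTIONARY))
print(forward_max_match(sentence, DICTIONARY))
```

The candidate list contains "华人" ("Chinese"), which overlaps two neighbouring words and is unrelated to the sentence, while longest-match segmentation avoids it in this particular case.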

Values

Values are a kind of metadata attached to a document. Each document can have multiple values, each identified by a different number (its slot). Values are designed to be accessed quickly during the matching process, and can be used for sorting, collapsing duplicate documents, range searches and similar purposes. Although there is no length limit on a value, it is best to keep values as short as possible. If you only want to store a field so that it can be displayed with the results, save it in the document's data instead.
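Values are stored as strings and compared as strings, so numbers need an encoding whose string order matches their numeric order. The pure-Python sketch below uses zero-padding as a stand-in for that idea. The slot numbers, field names and sample documents are hypothetical. In Xapian the corresponding calls are `Document.add_value(slot, value)` and, for numbers, `xapian.sortable_serialise()`.

```python
def encode_number(n, width=10):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    text = str(n)
    if n < 0 or len(text) > width:
        raise ValueError("number does not fit the fixed-width encoding")
    return text.zfill(width)


YEAR_SLOT = 0   # hypothetical slot numbers chosen for this sketch
PAGES_SLOT = 1

documents = [
    {"data": "Guide A", "values": {YEAR_SLOT: encode_number(2016), PAGES_SLOT: encode_number(120)}},
    {"data": "Guide B", "values": {YEAR_SLOT: encode_number(2009), PAGES_SLOT: encode_number(95)}},
    {"data": "Guide C", "values": {YEAR_SLOT: encode_number(2012), PAGES_SLOT: encode_number(1040)}},
]


def sort_by_value(docs, slot):
    """Sort documents by the string stored in the given slot."""
    return sorted(docs, key=lambda d: d["values"][slot])


def value_range(docs, slot, low, high):
    """Keep documents whose value in `slot` lies between low and high inclusive."""
    lo, hi = encode_number(low), encode_number(high)
    return [d for d in docs if lo <= d["values"][slot] <= hi]


print([d["data"] for d in sort_by_value(documents, PAGES_SLOT)])
print([d["data"] for d in value_range(documents, YEAR_SLOT, 2010, 2016)])
```

Without the padding, "95" would sort after "120" because strings are compared character by character.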

Document data

Each document has exactly one data field, which can hold data in any format, as long as it is converted to a string before being stored. This may sound odd, but in practice it works like this: text can be stored directly, while arbitrary objects must be serialized to a byte string before they are saved and deserialized when they are read back.
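A minimal sketch of that round trip, assuming JSON as the serialization format (pickle or any other format would work as well). `pack_data` and `unpack_data` are hypothetical helpers. The packed string is what would be passed to `Document.set_data()`, and the result of `Document.get_data()` would be passed to `unpack_data()`.

```python
import json


def pack_data(fields):
    """Serialize a dict of display fields into a single string."""
    return json.dumps(fields, ensure_ascii=False, sort_keys=True)


def unpack_data(data):
    """Restore the dict; accept bytes as well, in case the binding returns them."""
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return json.loads(data)


# Hypothetical record to keep alongside an indexed document.
record = {"title": "abc test python1", "author": "高洛峰", "views": 1186}
packed = pack_data(record)
print(packed)
print(unpack_data(packed) == record)
```

Keeping display-only fields in the data field, rather than in values, keeps the values small and quick to read during matching.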

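Posting

A posting is an instance of a term indexing a document, optionally with the position at which the term occurs. The following script builds a small index in which each word of the test strings is added as a term: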

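# -*- coding: gb18030 -*-
import xapian

# Sample texts to index; each whitespace-separated word becomes a term.
testdatas = [u'abc test python1', u'abcd testing python2']


def buildtest():
    # Create the database under indexes/ or open it if it already exists.
    database = xapian.WritableDatabase('indexes/', xapian.DB_CREATE_OR_OPEN)
    for data in testdatas:
        doc = xapian.Document()
        doc.set_data(data)  # stored text, returned with search results
        for term in data.split():
            doc.add_term(term)  # index the raw word, without stemming
        database.add_document(doc)
    # Write pending changes to disk explicitly.
    database.commit()


if __name__ == '__main__':
    buildtest()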

After the script runs, an index database is created in the indexes/ directory under the current directory:

[ec2-user@ip-10-167-6-221 indexes]$ ll
total 52
-rw-rw-r-- 1 ec2-user ec2-user    0 Jul 28 16:06 flintlock
-rw-rw-r-- 1 ec2-user ec2-user   28 Jul 28 16:06 iamchert
-rw-rw-r-- 1 ec2-user ec2-user   13 Jul 28 16:06 postlist.baseA
-rw-rw-r-- 1 ec2-user ec2-user   14 Jul 28 16:06 postlist.baseB
-rw-rw-r-- 1 ec2-user ec2-user 8192 Jul 28 16:06 postlist.DB
-rw-rw-r-- 1 ec2-user ec2-user   13 Jul 28 16:06 record.baseA
-rw-rw-r-- 1 ec2-user ec2-user   14 Jul 28 16:06 record.baseB
-rw-rw-r-- 1 ec2-user ec2-user 8192 Jul 28 16:06 record.DB
-rw-rw-r-- 1 ec2-user ec2-user   13 Jul 28 16:06 termlist.baseA
-rw-rw-r-- 1 ec2-user ec2-user   14 Jul 28 16:06 termlist.baseB
-rw-rw-r-- 1 ec2-user ec2-user 8192 Jul 28 16:06 termlist.DB

The next article will show how to query the index.

