在專案中為了支持搜尋服務,我們使用xapian作為後端的搜尋引擎.其因性能良好以及易用受到大家歡迎.下面是基本代碼:
import xapian import posixpath def get_db_path(): XAPIAN_ROOT = '/tmp/' xapian_user_database_path = posixpath.join(XAPIAN_ROOT, u'user_index') return xapian_user_database_path def add_document(database, words): doc = xapian.Document() for w in words: doc.add_term(w) database.add_document(doc) def build_index(): user_database = xapian.WritableDatabase(get_db_path(), xapian.DB_CREATE_OR_OPEN) words = ['a', 'b', 'c'] add_document(user_database, words) def search(words, offset=0, length=10): user_database = xapian.Database(get_db_path()) enquire = xapian.Enquire(user_database) query = xapian.Query(xapian.Query.OP_AND, words) enquire.set_query(query) return enquire.get_mset(int(offset), int(length)) def _show_q_results(matches): print '%i results found.' % matches.get_matches_estimated() print 'Results 1 - %i:' % matches.size() for match in matches: print '%i: %i%% docid=%i [%s]' % (match.rank + 1, match.percent, match.docid, match.document.get_data() ) if __name__ == '__main__': #index build_index() #search _show_q_results(search(['a','b']))
雖然使用起來很簡單,但是我一直對於他的存儲引擎有點好奇,所以看了一下最新的儲存引擎brass的實現.下面是整個資料目錄的層次結構:
/tmp/user_index
flintlock
iamchert
postlist.baseA
postlist.baseB t
postlist.baseA
postlist.baseB
所有post term 到docid的映射.
record.baseA
record.baseB
record.DB //儲存所有docid 到對應的資料的對應
termlist.baseA
termlist.baseB termlist.baseB
termlist.baseA
termlist.baseB termlist的映射.
brass存儲引擎採用的數據結構是BTree.所以上面每個*.DB都是存儲一個BTree的.*.baseA/B則是存儲相應的.DB的meta信息.包括這個大的數據文件有哪些資料塊已經被使用,哪些空閒的bitmap,以及版本號等等相關信息.
BTree在xapian中表示為N Level.每個level對應於BTree的一層.並且維護這一層的一個cursor .用於指向目前正在存取的某一個資料區塊,以及資料區塊中的某一個位置.其中每個資料區塊的資料結構如下:
from @brass_table.cc /* A B-tree comprises (a) a base file, containing essential information (Block size, number of the B-tree root block etc), (b) a bitmap, the Nth bit of the bitmap being set if the Nth block of the B-tree file is in use, and (c) a file DB containing the B-tree proper. The DB file is divided into a sequence of equal sized blocks, numbered 0, 1, 2 ... some of which are free, some in use. Those in use are arranged in a tree. Each block, b, has a structure like this: R L M T D o1 o2 o3 ... oN <gap> [item] .. [item] .. [item] ... <---------- D ----------> <-M-> And then, R = REVISION(b) is the revision number the B-tree had when the block was written into the DB file. L = GET_LEVEL(b) is the level of the block, which is the number of levels towards the root of the B-tree structure. So leaf blocks have level 0 and the one root block has the highest level equal to the number of levels in the B-tree. M = MAX_FREE(b) is the size of the gap between the end of the directory and the first item of data. (It is not necessarily the maximum size among the bits of space that are free, but I can't think of a better name.) T = TOTAL_FREE(b)is the total amount of free space left in b. D = DIR_END(b) gives the offset to the end of the directory. o1, o2 ... oN are a directory of offsets to the N items held in the block. The items are key-tag pairs, and as they occur in the directory are ordered by the keys. An item has this form: I K key x C tag <--K--> <------I------> A long tag presented through the API is split up into C tags small enough to be accommodated in the blocks of the B-tree. The key is extended to include a counter, x, which runs from 1 to C. The key is preceded by a length, K, and the whole item with a length, I, as depicted above.