
Whoosh: A lightweight search tool for Python

PHPz (reposted)
2023-04-14 21:07:01


Whoosh Introduction

Whoosh was created by Matt Chaput. It began as a simple, fast search tool for the online documentation of the Houdini 3D animation software package, and has since grown into a mature, open-source search library.

Whoosh is written purely in Python. It is a flexible, convenient and lightweight search engine library that supports both Python 2 and Python 3. Its advantages are as follows:

  • Whoosh is pure Python yet fast. It needs only a Python environment and does not require a compiler;
  • It ranks results with the Okapi BM25F algorithm by default, and other ranking algorithms are also supported;
  • Compared with many other search libraries, Whoosh creates fairly small index files;
  • All text indexed by Whoosh must be Unicode;
  • Whoosh can store arbitrary Python objects alongside the indexed documents.

The official introduction to Whoosh is at https://whoosh.readthedocs.io/en/latest/intro.html. Compared with mature search engines such as Elasticsearch or Solr, Whoosh is lighter and simpler to operate, and is worth considering for small search projects.

Index & query

For those familiar with Elasticsearch (ES), the two key aspects of search are mapping and query, that is, building the index and querying it. Behind them lie index storage, query parsing, ranking algorithms, and so on. If you have experience with ES, Whoosh is very easy to pick up.
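To make the split between index construction and querying concrete, here is a minimal pure-Python sketch of an inverted index. It is only an illustration of the idea, not how Whoosh stores its index; the function names (tokenize, build_index, query) and the sample sentences are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Indexing step: map each term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def query(index, text):
    """Query step: return the ids of documents that contain every query term."""
    terms = tokenize(text)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents
docs = {
    1: "Bright moon over the mountain pass",
    2: "Autumn wind on the river",
    3: "The moon rises over the river",
}
index = build_index(docs)
print(sorted(query(index, "moon")))        # [1, 3]
print(sorted(query(index, "moon river")))  # [3]
```

The index is built once and then reused for every query, which is the same division of labor as between the index-creation and search code later in this article.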

As the author understands it, and according to Whoosh's official documentation, the two core tasks when getting started with Whoosh are indexing and querying. A key strength of a search engine is full-text search, which depends both on the ranking algorithm (such as BM25) and on how the fields are stored. As a noun, "index" refers to the index built over the fields; as a verb, it means building that index. A query then uses the ranking algorithm to return relevant, ordered results for the search terms we supply.
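As a rough illustration of how a BM25-style ranking orders documents, the sketch below scores tokenized documents with the plain single-field BM25 formula. It is not Whoosh's BM25F implementation; the function name bm25_scores, the sample documents, and the parameter values k1=1.2 and b=0.75 (common textbook defaults) are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with plain BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each query term occurs
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Longer-than-average documents are penalized through b
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical, already tokenized documents
docs = [
    "bright moon bright frost".split(),
    "moon over the quiet river at night".split(),
    "autumn wind and falling leaves".split(),
]
scores = bm25_scores(["moon"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]
```

Both of the first two documents contain "moon" once, but the shorter one ranks higher because of length normalization; the third document does not contain the term and scores zero.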

The official documentation covers the use of Whoosh in detail. Here the author gives only a simple example to show how easily Whoosh can improve a search experience.

Sample code

Data

The sample data for this project is poem.csv. The original article shows its first ten rows in a screenshot:

[Figure: first ten rows of poem.csv]

Fields

Based on the structure of the data set, we create four fields: title, dynasty, poet, and content. The code is as follows:

# -*- coding: utf-8 -*-
import json
import os

from jieba.analyse import ChineseAnalyzer
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in

# Create the schema; stored=True keeps the field's value in the index
# so that it can be returned with search results
schema = Schema(
    title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
    dynasty=ID(stored=True),
    poet=ID(stored=True),
    content=TEXT(stored=True, analyzer=ChineseAnalyzer()),
)

Here, an ID field indexes its whole value as a single unit and does not split it into words. It is typically used for file paths, URLs, dates, and categories.

A TEXT field holds body text: it indexes the text (and stores it when stored=True) and supports searching by term. The analyzer chosen here is jieba's Chinese word segmenter, ChineseAnalyzer.
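The difference between the two field types can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Whoosh code: id_terms and text_terms are hypothetical helpers, and a simple regular expression stands in for a real analyzer such as ChineseAnalyzer.

```python
import re


def id_terms(value):
    """ID-style field: the whole value is indexed as one term."""
    return [value]


def text_terms(value):
    """TEXT-style field: the value is split into lowercase word terms."""
    return re.findall(r"\w+", value.lower())


print(id_terms("Tang Dynasty"))    # ['Tang Dynasty']
print(text_terms("Tang Dynasty"))  # ['tang', 'dynasty']

# A search for the single word "dynasty" matches the TEXT-style terms
# but not the ID-style term, which only matches the exact full value.
print("dynasty" in text_terms("Tang Dynasty"))  # True
print("dynasty" in id_terms("Tang Dynasty"))    # False
```

This is why dynasty and poet, which are looked up as exact values, are declared as ID fields, while title and content are declared as TEXT fields.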

Create index file

Next, we create the index. The program first parses poem.csv, converts its rows into index documents, and writes them to the indexdir directory. The Python code is as follows:

# Parse poem.csv, keeping only the rows that have exactly four columns
with open('poem.csv', 'r', encoding='utf-8') as f:
    rows = [line.strip().split(',') for line in f]
texts = [row for row in rows if len(row) == 4]
# Create the indexdir directory and store the schema in it
indexdir = 'indexdir/'
os.makedirs(indexdir, exist_ok=True)
ix = create_in(indexdir, schema)
# Add one document per data row, following the schema
# (texts[0] is skipped because it is the header row)
writer = ix.writer()
for title, dynasty, poet, content in texts[1:]:
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()

After the index is successfully created, the indexdir directory will be generated, which contains the index files for each field of the above poem.csv data.

Query

With the index in place, we can query it.

For example, to find the poems whose content contains 明月 ("bright moon"), we can run the following code:

# Create a searcher
searcher = ix.searcher()
# Find the documents whose content contains '明月'
results = searcher.find("content", "明月")
print('Found %d documents in total.' % len(results))
print('The first 10 documents are:')
for i in range(min(10, len(results))):
    print(json.dumps(results[i].fields(), ensure_ascii=False))

The output is as follows:

Found 44 documents in total.
The first 10 documents are:
{"content": "床前明月光,疑是地上霜。举头望明月,低头思故乡。", "dynasty": "唐代", "poet": "李白 ", "title": "静夜思"}
{"content": "边草,边草,边草尽来兵老。山南山北雪晴,千里万里月明。明月,明月,胡笳一声愁绝。", "dynasty": "唐代", "poet": "戴叔伦 ", "title": "调笑令·边草"}
{"content": "独坐幽篁里,弹琴复长啸。深林人不知,明月来相照。", "dynasty": "唐代", "poet": "王维 ", "title": "竹里馆"}
{"content": "汉江明月照归人,万里秋风一叶身。休把客衣轻浣濯,此中犹有帝京尘。", "dynasty": "明代", "poet": "边贡 ", "title": "重赠吴国宾"}
{"content": "秦时明月汉时关,万里长征人未还。但使龙城飞将在,不教胡马度阴山。", "dynasty": "唐代", "poet": "王昌龄 ", "title": "出塞二首·其一"}
{"content": "京口瓜洲一水间,钟山只隔数重山。春风又绿江南岸,明月何时照我还?", "dynasty": "宋代", "poet": "王安石 ", "title": "泊船瓜洲"}
{"content": "四顾山光接水光,凭栏十里芰荷香。清风明月无人管,并作南楼一味凉。", "dynasty": "宋代", "poet": "黄庭坚 ", "title": "鄂州南楼书事"}
{"content": "青山隐隐水迢迢,秋尽江南草未凋。二十四桥明月夜,玉人何处教吹箫?", "dynasty": "唐代", "poet": "杜牧 ", "title": "寄扬州韩绰判官"}
{"content": "露气寒光集,微阳下楚丘。猿啼洞庭树,人在木兰舟。广泽生明月,苍山夹乱流。云中君不见,竟夕自悲秋。", "dynasty": "唐代", "poet": "马戴 ", "title": "楚江怀古三首·其一"}
{"content": "海上生明月,天涯共此时。情人怨遥夜,竟夕起相思。灭烛怜光满,披衣觉露滋。不堪盈手赠,
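The same search can also be written with Whoosh's QueryParser and a searcher used as a context manager, which is the pattern shown in the Whoosh quickstart. The sketch below is one possible way to wrap it; the function name search_content and its parameters are hypothetical, and Whoosh is imported inside the function only so that the snippet can be loaded on a machine where Whoosh is not installed.

```python
def search_content(indexdir, text, limit=10):
    """Return the stored fields of the top `limit` hits for `text`."""
    # Imported here so this sketch can be loaded without Whoosh installed
    from whoosh.index import open_dir
    from whoosh.qparser import QueryParser

    # Open the index that was created earlier in this directory
    ix = open_dir(indexdir)
    # Parse the query string against the "content" field
    query = QueryParser("content", ix.schema).parse(text)
    # The context manager closes the searcher when the block ends
    with ix.searcher() as searcher:
        results = searcher.search(query, limit=limit)
        # Copy the stored fields out before the searcher is closed
        return [hit.fields() for hit in results]
```

Assuming the index built above exists, search_content('indexdir/', '明月') should return a list of dicts with the title, dynasty, poet, and content of each hit. Because the schema uses ChineseAnalyzer, jieba must be installed when the index is opened.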


Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it removed.