
This article briefly introduces Whoosh, a lightweight search library for Python, and walks through example code that shows how to use it.

Using Whoosh, a lightweight search tool for Python (summary)



Whoosh Introduction

Whoosh was created by Matt Chaput. It started as a simple, fast search service for the online documentation of the Houdini 3D animation software package, gradually matured into a general search solution, and was then open sourced.

Whoosh is written in pure Python. It is a flexible, convenient and lightweight search engine library that supports both Python 2 and Python 3. Its advantages are as follows:

  • Whoosh is pure Python, yet fast for a pure-Python library. It needs only a Python environment, with no compiler required;
  • It uses the Okapi BM25F ranking function by default and supports other ranking functions as well;
  • Compared with many other search libraries, Whoosh creates fairly small index files;
  • All text indexed by Whoosh must be unicode;
  • Whoosh can store arbitrary Python objects alongside indexed documents.

Whoosh’s official introduction is at https://whoosh.readthedocs.io/en/latest/intro.html. Compared with mature search engines such as Elasticsearch or Solr, Whoosh is lighter and simpler to operate, which makes it worth considering for small search projects.

Index & query

For readers familiar with Elasticsearch (ES), the two key aspects of search are mapping and query, that is, building the index and querying it. Behind these sit index storage, query parsing, ranking algorithms and more. If you have experience with ES, Whoosh is easy to pick up.

Based on the author's understanding and the official Whoosh documentation, the two entry-level uses of Whoosh are indexing and querying. A key strength of a search engine is full-text search, which depends both on the ranking algorithm (such as BM25) and on how the fields are stored. As a noun, "index" refers to the index built over the fields; as a verb, it means building that index. A query then uses the ranking algorithm to return the most relevant results for the search terms.
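
To make the two ideas concrete, here is a minimal pure-Python sketch of an inverted index (the noun) being built (the verb) and then queried with a textbook single-field BM25 formula. It is not Whoosh's implementation; the documents, the whitespace tokenizer and the function name bm25_search are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Made-up documents (illustrative only).
docs = {
    "d1": "bright moon before my bed",
    "d2": "spring sleep unaware of dawn",
    "d3": "the bright moon rises over the sea and the moon is shared",
}

# Index (verb): build the index (noun), mapping term -> {doc_id: term frequency}.
index = defaultdict(dict)
lengths = {}
for doc_id, text in docs.items():
    tokens = text.lower().split()
    lengths[doc_id] = len(tokens)
    for term, tf in Counter(tokens).items():
        index[term][doc_id] = tf

avg_len = sum(lengths.values()) / len(lengths)


def bm25_search(query, k1=1.2, b=0.75):
    """Rank documents for a whitespace-tokenized query with textbook BM25."""
    scores = defaultdict(float)
    n_docs = len(docs)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # Inverse document frequency: rarer terms contribute more.
        idf = math.log((n_docs - len(postings) + 0.5) / (len(postings) + 0.5) + 1)
        for doc_id, tf in postings.items():
            # Length normalization: long documents need more occurrences.
            norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = bm25_search("moon")
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

Only documents that contain the query term appear in the ranking, which is what the inverted index buys: the query never has to scan documents that cannot match.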

The official documentation covers the use of Whoosh in detail. The author gives only a simple example here to show how easily Whoosh can improve the search experience.

Sample code

Data

The sample data for this project is poem.csv. [Figure: the first ten rows of the data set]

Fields

Based on the structure of the data set, we create four fields: title, dynasty, poet and content. The code is as follows:

# -*- coding: utf-8 -*-
import os
import json
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer

# Create the schema. stored=True keeps the field's value in the index
# so that it is returned with the search results.
schema = Schema(title=TEXT(stored=True, analyzer=ChineseAnalyzer()),
                dynasty=ID(stored=True),
                poet=ID(stored=True),
                content=TEXT(stored=True, analyzer=ChineseAnalyzer()))

Here, an ID field indexes its whole value as a single unit and does not split it into words. It is often used for file paths, URLs, dates and categories.

A TEXT field holds body text. The text is indexed (and, with stored=True, stored) and can be searched by individual words. The analyzer chosen here is ChineseAnalyzer from jieba, a Chinese word segmenter.

Create index file

Next, we create the index. The program first parses the poem.csv file, indexes each row, and writes the result to the indexdir directory. The Python code is as follows:

# Parse poem.csv, keeping only lines that split into exactly four columns
with open('poem.csv', 'r', encoding='utf-8') as f:
    rows = (line.strip().split(',') for line in f)
    texts = [row for row in rows if len(row) == 4]

# Create the index in the indexdir directory, using the schema defined above
indexdir = 'indexdir/'
os.makedirs(indexdir, exist_ok=True)
ix = create_in(indexdir, schema)

# Add one document per row, following the schema; texts[0] is the header row
writer = ix.writer()
for title, dynasty, poet, content in texts[1:]:
    writer.add_document(title=title, dynasty=dynasty, poet=poet, content=content)
writer.commit()

Once the index has been created, the indexdir directory contains the index files for the fields of the poem.csv data above.
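
Splitting each line on "," silently drops any row in which a field itself contains an ASCII comma. As a possible alternative, the sketch below reads the same four-column layout with the standard csv module, which understands quoted fields. The sample rows and the header names are assumed for illustration and are not taken from the real poem.csv; read_poems is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample in the same four-column layout as poem.csv.
sample = io.StringIO(
    'title,dynasty,poet,content\n'
    'Quiet Night Thoughts,Tang,Li Bai,"Bright moonlight before my bed, like frost on the ground."\n'
    'Broken row,Tang\n'
)


def read_poems(handle):
    """Return well-formed rows as dicts, skipping rows with missing or extra columns."""
    rows = []
    for row in csv.DictReader(handle):
        # DictReader fills missing columns with None and files extras under the key None.
        if None in row or any(value is None for value in row.values()):
            continue
        rows.append(row)
    return rows


poems = read_poems(sample)
print(len(poems), poems[0]["poet"])
```

For the real file, the same function could be called as read_poems(open('poem.csv', newline='', encoding='utf-8')) inside a with block, and each returned dict passed to writer.add_document(**row), provided the header names match the schema fields.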

Query

With the index in place, we can query it.

For example, to find the poems whose content contains 明月 (bright moon), we can use the following code:

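# Create a searcher; the with block closes it when we are done
with ix.searcher() as searcher:
    # Find the documents whose content contains '明月'
    results = searcher.find("content", "明月")
    print('A total of %d documents were found.' % len(results))
    print('The first 10 documents are as follows:')
    for i in range(min(10, len(results))):
        print(json.dumps(results[i].fields(), ensure_ascii=False))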

The output is as follows:

A total of 44 documents were found.
The first 10 documents are as follows:
{"content": "There is bright moonlight in front of the bed, which is suspected to be frost on the ground. Look up at the bright moon and lower your head to think about your hometown.", "dynasty": "Tang Dynasty", "poet": "Li Bai", "title": "Quiet Night Thoughts"}
{"content": "The grass on the edge, the grass on the edge, the grass on the edge are all here. The snow is clear in the south of the mountain and in the north, and the moon is bright for thousands of miles. The bright moon, the bright moon, the Hujia screamed with sorrow.", "dynasty": "Tang Dynasty", "poet": "Dai Shulun", "title": "Tiao Xiaoling·Biancao"}
{"content": "Sitting alone inside the quiet bamboo, I play the piano and whistle loudly. People in the deep forest don't know that the bright moon comes to shine.", "dynasty": "Tang Dynasty", "poet": "Wang Wei", "title": "Zhuli Pavilion"}
{"content": "The bright moon of the Han River shines on people returning home, and the autumn wind spreads across thousands of miles. Don't wash your guest clothes lightly, there are still dust from the imperial capital.", "dynasty": "Ming Dynasty", "poet": "Bian Gong", "title": "A heavy gift to Wu Guobin"}
{"content": "The bright moon of the Qin Dynasty and the Pass of the Han Dynasty, and the people who marched thousands of miles have not returned. But the flying generals of Dragon City are here, and they will not teach Hu Ma to cross the Yin Mountains.", "dynasty": "Tang Dynasty", "poet": "Wang Changling", "title": "Two poems out of the fortress·One"}
{"content": "Between Jingkou and Guazhou, there is only one water, and Zhongshan is only a few mountains away. The spring breeze turns green to the south bank of the river. When will the bright moon shine on me again?", "dynasty": "Song Dynasty", "poet": "Wang Anshi", "title": "Mooring at Guazhou"}
{"content": "Looking around, you can see the light of the mountains and the light of the water, and you can lean on the railing and smell the fragrance of wild flowers. There is no one to care about the clear breeze and the bright moon, and it is always cool as the south building.", "dynasty": "Song Dynasty", "poet": "Huang Tingjian", "title": "Ezhou Nanlou Calligraphy"}
{"content": "The green mountains are faint and the water is far away, and the grass in the south of the Yangtze River has not withered after autumn. On the moonlit night of the Twenty-Four Bridge, where can the beauty teach the flute?", "dynasty": "Tang Dynasty", "poet": "Du Mu", "title": "To Judge Han Chuo of Yangzhou"}
{"content": "The dew air is cold and the light is gathering, and the sun is shining under the Chuqiu. The apes cry in the Dongting trees; people are in the Mulan boat. The bright moon shines in Guangze, and the turbulent currents in the Cangshan Mountains. I don’t see you in the clouds, but I feel sad at night.", "dynasty": "Tang Dynasty", "poet": "Ma Dai", "title": "One of three nostalgic poems about the Chu River"}
{"content": "The bright moon rises on the sea, and we share this moment at the end of the world. Lovers complain about the distant night, but they miss each other at night. The candles are extinguished and the light is full of pity, and the clothes are covered with dew. Nourishing. I can't bear to give it away, but I still have a good night's sleep.", "dynasty": "Tang Dynasty", "poet": "Zhang Jiuling", "title": "Looking at the Moon and Huaiyuan / Looking at the Moon and Nostalgic for the Past"}


Statement
This article is reproduced from 脚本之家.