
Python implementation of a search engine (Pylucene) example tutorial

零下一度 (Original)
2017-07-18 10:12:52

1. What is a search engine?

A search engine is "a system that collects and organizes network information resources and provides information query services", consisting of three parts: information collection, information organization, and user query. Figure 1 shows the general structure of a search engine. The information collection module gathers information from the Internet into a web information database (generally using a crawler); the information organization module then performs word segmentation, stop-word removal, weighting, and similar operations on the collected information and builds an index table (usually an inverted index) to form the index library; finally, the user query module interprets the user's retrieval needs and provides the retrieval service.

Figure 1 General structure of search engine

2. Use python to implement a simple search engine

2.1 Problem Analysis

As Figure 1 shows, a complete search engine architecture starts by collecting information from the Internet. A crawler can be written in Python, which is one of Python's strengths.

Next comes the information processing module. Word segmentation? Stop words? Inverted index? What is all this? Don't worry, our predecessors have already built the wheel for us: Pylucene (a Python wrapper for Lucene; Lucene is a set of open-source libraries for full-text indexing and search that helps developers add search features to software and systems). Pylucene makes it easy to process the collected information, including building the index and searching it.
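To make the concepts Pylucene hides less mysterious, here is a toy inverted index in plain Python. This is purely illustrative (the function names and the tiny stop-word list are made up for this sketch, not Pylucene API); Pylucene builds a far more sophisticated version of the same structure.

```python
# Toy inverted index: maps each term to the set of document ids that contain it.
def build_inverted_index(docs, stop_words=frozenset({"the", "a", "of"})):
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():   # crude "word segmentation"
            if term in stop_words:
                continue                    # drop stop words before indexing
            index.setdefault(term, set()).add(doc_id)
    return index

def lookup(index, term):
    # searching is just a dictionary lookup on the normalized term
    return index.get(term.lower(), set())

docs = {1: "Python search engine", 2: "the Python tutorial"}
idx = build_inverted_index(docs)
print(lookup(idx, "Python"))  # {1, 2}
```

Searching an inverted index never scans document text at query time, which is why it is the standard structure for full-text search.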

Finally, so that our search engine can be used from a web page, we use Flask, a lightweight web application framework, to build a small page that accepts search queries and returns the search results.

2.2 Crawler design

The crawler mainly collects the following: the title of the target page, the main text of the target page, and the URLs of the other pages the target page links to. The workflow of the web crawler is shown in Figure 2. The crawler's main data structure is a queue. First the initial seed node enters the queue; then a node is taken from the queue and visited, the target information on that page is captured, and the URLs of the pages it links to are put into the queue; then the next node is taken from the queue and visited, and so on until the queue is empty. The "first in, first out" property of the queue yields a breadth-first traversal that visits each page of the site in turn.

Pylucene's main classes for building an index include Directory, Analyzer, IndexWriter, Document, and Field.

Directory is Pylucene's class for file operations. It has 11 subclasses, including SimpleFSDirectory, RAMDirectory, CompoundFileDirectory, and FileSwitchDirectory; the four listed here are the subclasses related to storing the index. SimpleFSDirectory saves the built index to the file system; RAMDirectory keeps the index in RAM; CompoundFileDirectory is a compound index storage format; and FileSwitchDirectory allows switching between storage methods on the fly, to combine the advantages of each.

Analyzer, the analyzer class. It processes the crawled text that is to be indexed, performing operations such as word segmentation, stop-word removal, and case conversion. Pylucene ships with several analyzers, and third-party or self-written analyzers can also be used when building the index. The quality of the analyzer affects the quality of the index and the accuracy and speed the search service can provide.
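Conceptually, an analyzer is a pipeline of text transformations. The toy English analyzer below is an illustration only (the stop-word list and function name are invented for this sketch); CJKAnalyzer does the same kind of work but additionally segments CJK text into overlapping bigrams.

```python
import re

STOP_WORDS = frozenset({"and", "or", "the", "is", "a"})

def analyze(text):
    """Toy analyzer: tokenize, case-fold, then drop stop words."""
    tokens = re.findall(r"\w+", text.lower())        # tokenize + lowercase
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word filter

print(analyze("The Index AND the Search"))  # ['index', 'search']
```

Because the same analyzer normalizes text both at index time and at query time, a query for "Search" can match a document containing "search".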

IndexWriter, index writing class. In the storage space opened by Directory, IndexWriter can perform operations such as writing, modifying, adding, and deleting indexes, but it cannot read or search the index.

Document, document class. The basic unit of indexing in Pylucene is "Document". A Document may be a web page, an article, or an email. Document is the unit used to build the index and is also the result unit when searching. Proper design of it can provide personalized search services.

Field, the field class. A Document can contain multiple fields (Field). A Field is a component of a Document, just as an article may be composed of multiple fields such as title, body, author, and publication date.

We treat a page as a Document containing three fields: the URL address of the page (url), the title of the page (title), and the main text content of the page (content). For index storage we use the SimpleFSDirectory class to save the index to files. For the analyzer we choose the CJKAnalyzer that ships with Pylucene; it has good support for Chinese and is well suited to processing Chinese text.

The specific steps for using Pylucene to build an index are as follows:

lucene.initVM()
INDEXIDR = self.__index_dir
indexdir = SimpleFSDirectory(File(INDEXIDR))  # ①
analyzer = CJKAnalyzer(Version.LUCENE_30)  # ②
index_writer = IndexWriter(indexdir, analyzer, True, IndexWriter.MaxFieldLength(512))  # ③
document = Document()  # ④
document.add(Field("content", str(page_info["content"]), Field.Store.NO, Field.Index.ANALYZED))  # ⑤
document.add(Field("url", visiting, Field.Store.YES, Field.Index.NOT_ANALYZED))  # ⑥
document.add(Field("title", str(page_info["title"]), Field.Store.YES, Field.Index.ANALYZED))  # ⑦
index_writer.addDocument(document)  # ⑧
index_writer.optimize()  # ⑨
index_writer.close()  # ⑩

There are 10 main steps to build an index:

① Instantiate a SimpleFSDirectory object and save the index to a local file. The saved path is the customized path "INDEXIDR".

② Instantiate a CJKAnalyzer analyzer. The parameter Version.LUCENE_30 during instantiation is the version number of Pylucene.

③ Instantiate an IndexWriter object. Its four parameters are the SimpleFSDirectory object instantiated earlier, the CJKAnalyzer analyzer, the Boolean value True indicating that a new index should be created, and IndexWriter.MaxFieldLength(512), which limits the number of terms indexed per field (Field).

④ Instantiate a Document object and name it document.

⑤ Add a field named "content" to the document; its content is the main text of a web page obtained by the crawler. The parameter of this operation is a Field object instantiated in place. The Field object's four parameters are:

(1) "content", the name of the domain.

(2) page_info["content"], the main text content of the web page collected by the crawler.

(3) Field.Store, which indicates whether the value of this field can be recovered as the original text. Field.Store.YES means the content stored in the field can be restored to the original text; Field.Store.NO means it cannot.

(4) Field.Index, which indicates whether the field's content is processed by the analyzer. Field.Index.ANALYZED means the field's text is run through the analyzer; Field.Index.NOT_ANALYZED means it is not.

⑥ Add a field named "url" to save the page address.

⑦ Add a field named "title" to save the page title.

⑧ Use the IndexWriter object to write the document into the index file.

⑨Optimize the index library files and merge small files in the index library into large files.

⑩Close the IndexWriter object after the index building operation is completed in a single cycle.

Pylucene’s main classes for index search include IndexSearcher, Query, and QueryParser[16].

IndexSearcher, index search class. Used to perform search operations in the index library built by IndexWriter.

Query, the class that describes a query request. It submits the query request to IndexSearcher, which performs the search. Query has many subclasses for different kinds of query. For example, TermQuery searches by term and is the most basic and simple query type, matching documents that contain a specific term in a specified field; RangeQuery matches documents whose values in a specified field fall within a given range; and FuzzyQuery performs fuzzy matching, matching terms whose spelling is similar to the query keyword.

QueryParser, the query parser. Implementing different query requirements requires different Query subclasses, which makes Query easy to misuse, so Pylucene also provides the query parser QueryParser. QueryParser parses the submitted query string and, according to the query syntax, selects the appropriate Query subclass to perform the query; developers need not care which Query implementation class is used underneath. For example, QueryParser parses the query "keyword1 AND keyword2" into a query for documents matching both keyword1 and keyword2; it parses "id:[123 TO 456]" into a query for documents whose "id" field value lies in the range "123" to "456"; and it parses "keyword site:www.web.com" into a query for documents that both have the value "www.web.com" in the "site" field and match "keyword".
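The core idea of the field-prefix syntax can be shown with a tiny parser in plain Python. This is not the Lucene grammar (real QueryParser handles AND/OR, ranges, wildcards, and more); the function name and default fields here are assumptions for illustration only.

```python
def parse_simple_query(query_str, default_fields=("title", "content")):
    """Minimal illustration of "field:term" query syntax.

    If a field prefix is given (as in "site:www.web.com"), search only
    that field; otherwise search every default field for the term.
    """
    if ":" in query_str:
        field, term = query_str.split(":", 1)   # split off the field prefix
        return [(field.strip(), term.strip())]
    return [(f, query_str.strip()) for f in default_fields]

print(parse_simple_query("title:python"))  # [('title', 'python')]
print(parse_simple_query("python"))        # [('title', 'python'), ('content', 'python')]
```

The search class below applies exactly this idea in its first step: if no field is named, the query is rewritten to search both "title" and "content".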

Index search is one of Pylucene's core functions. We write a class named query to implement index search. It has the following main steps:

lucene.initVM()
# if no field is specified (neither the ASCII ":" nor the full-width "："),
# search both the title and content fields by default
if query_str.find(":") == -1 and query_str.find("：") == -1:
    query_str = "title:" + query_str + " OR content:" + query_str  # ①
indir = SimpleFSDirectory(File(self.__indexDir))  # ②
lucene_analyzer = CJKAnalyzer(Version.LUCENE_CURRENT)  # ③
lucene_searcher = IndexSearcher(indir)  # ④
my_query = QueryParser(Version.LUCENE_CURRENT, "title", lucene_analyzer).parse(query_str)  # ⑤
total_hits = lucene_searcher.search(my_query, MAX)  # ⑥  MAX is the maximum number of hits to return
result_urls = []
result_titles = []
for hit in total_hits.scoreDocs:  # ⑦
    print "Hit Score: ", hit.score
    doc = lucene_searcher.doc(hit.doc)
    result_urls.append(doc.get("url").encode("utf-8"))
    result_titles.append(doc.get("title").encode("utf-8"))
    print doc.get("title").encode("utf-8")

result = {"Hits": total_hits.totalHits, "url": tuple(result_urls), "title": tuple(result_titles)}
return result

Index search has 7 main steps:

① First examine the search string. If it is not a single-field query on the title or the article content, i.e. it contains neither the keyword "title:" nor "content:", then the title and content fields are both searched by default.

②Instantiate a SimpleFSDirectory object and specify its working path as the path where the index was previously created.

③ Instantiate a CJKAnalyzer analyzer. The analyzer used for searching should match, in both type and version, the analyzer used when the index was built.

④ Instantiate an IndexSearcher object lucene_searcher; its parameter is the SimpleFSDirectory object from step ②.

⑤ Instantiate a QueryParser object my_query, which describes the query request and parses the query string. The parameter Version.LUCENE_CURRENT is the Pylucene version, "title" is the default search field, lucene_analyzer specifies the analyzer to use, and query_str is the query string. Before the QueryParser is instantiated, the user's search request is preprocessed: if the user specified a field, only that field is searched; otherwise both the "title" and "content" fields are searched.

⑥ lucene_searcher performs the search and returns the result set total_hits. total_hits contains the total number of results, totalHits, and the document result set scoreDocs, which holds the matched documents together with each document's relevance score against the query.

⑦ The result set produced by lucene_searcher cannot be processed by Python directly, so before the search operation returns, the results are converted from Pylucene objects into ordinary Python data structures. A for loop processes each hit in descending order of relevance score, appending the value of its address field "url" to the Python list result_urls and the value of its title field "title" to the list result_titles. Finally, the lists of URLs and titles and the total result count are combined into a Python dict, which is returned as the result of the whole search operation.

 

The user types a search term into the browser's search box and clicks search; the browser issues a GET request, and Flask's route directs the result function to handle it. The result function first instantiates an object infoso of the search class query and passes the search term to it; infoso performs the search and returns the results to result. The result function then passes the matched pages and the total number of results to the template result.html, which renders the results.

The following Python code uses the Flask module to handle search requests:

 

app = Flask(__name__)  # create the Flask instance

@app.route('/')  # default search home page
def index():
    return render_template('index.html')

@app.route("/result", methods=['GET', 'POST'])  # register the route; allow GET and POST
def result():
    if request.method == "GET":  # handle the GET request
        key_word = request.args.get('word')  # get the search string
        if key_word:
            infoso = query("./glxy")  # create an instance of the query class
            re = infoso.search(key_word)  # perform the search and get the result set
            so_result = []
            n = 0
            for item in re["url"]:  # repackage the results for the template
                temp_result = {"url": item, "title": re["title"][n]}
                so_result.append(temp_result)
                n = n + 1
            return render_template('result.html', key_word=key_word, result_sum=re["Hits"], result=so_result)
    return render_template('result.html')

if __name__ == '__main__':
    app.debug = True
    app.run()  # run the web service


