search
HomeBackend DevelopmentPython TutorialTeach you step by step how to use Flask to build an ES search engine (preparatory part)

/1 Introduction/

## Elasticsearch is an open source search engine built on a full-text search engine library Apache Lucene™ Based on.


Teach you step by step how to use Flask to build an ES search engine (preparatory part)

# So how to realize the connection between

Elasticsearch and Python has become our concern The problem is (why should everything be related to Python?).

##/2 Python Interaction/

So, Python also provides a way to connect Elasticsearch's dependent libraries.


pip install elasticsearch


Initialize a connection

Elasticsearch Operation object.

def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"):

    # self.es = Elasticsearch([ip], http_auth=('username', 'password'), port=9200)
    self.es = Elasticsearch("localhost:9200")
    self.index_type = index_type
    self.index_name = index_name
Default port

9200, please ensure that the local environment of Elasticsearch has been set up before initialization.

Get document data based on ID


##

def get_doc(self, uid):
    return self.es.get(index=self.index_name, id=uid)

插入文档数据


def insert_one(self, doc: dict):
    self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

def insert_array(self, docs: list):
    for doc in docs:
        self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)


搜索文档数据


def search(self, query, count: int = 30):
    dsl = {
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title", "content", "link"]
            }
        },
        "highlight": {
            "fields": {
                "title": {}
            }
        }
    }
    match_data = self.es.search(index=self.index_name, body=dsl, size=count)
    return match_data

def __search(self, query: dict, count: int = 20): # count: 返回的数据大小
    results = []
    params = {
        'size': count
    }
    match_data = self.es.search(index=self.index_name, body=query, params=params)
    for hit in match_data['hits']['hits']:
        results.append(hit['_source'])

    return results

删除文档数据


def delete_index(self):
    try:
        self.es.indices.delete(index=self.index_name)
    except:
        pass

好啊,封装 search 类也是为了方便调用,整体贴一下。

from elasticsearch import Elasticsearch


class elasticSearch():

    def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"):

        # self.es = Elasticsearch([ip], http_auth=('elastic', 'password'), port=9200)
        self.es = Elasticsearch("localhost:9200")
        self.index_type = index_type
        self.index_name = index_name

    def create_index(self):
        if self.es.indices.exists(index=self.index_name) is True:
            self.es.indices.delete(index=self.index_name)
        self.es.indices.create(index=self.index_name, ignore=400)

    def delete_index(self):
        try:
            self.es.indices.delete(index=self.index_name)
        except:
            pass

    def get_doc(self, uid):
        return self.es.get(index=self.index_name, id=uid)

    def insert_one(self, doc: dict):
        self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

    def insert_array(self, docs: list):
        for doc in docs:
            self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

    def search(self, query, count: int = 30):
        dsl = {
            "query": {
                "multi_match": {
                    "query": query,
                    "fields": ["title", "content", "link"]
                }
            },
            "highlight": {
                "fields": {
                    "title": {}
                }
            }
        }
        match_data = self.es.search(index=self.index_name, body=dsl, size=count)
        return match_data

尝试一下把 Mongodb 中的数据插入到 ES 中。

import json
from datetime import datetime
import pymongo
from app.elasticsearchClass import elasticSearch

client = pymongo.MongoClient('127.0.0.1', 27017)
db = client['spider']
sheet = db.get_collection('Spider').find({}, {'_id': 0, })

es = elasticSearch(index_type="spider_data",index_name="spider")
es.create_index()

for i in sheet:
    data = {
            'title': i["title"],
            'content':i["data"],
            'link': i["link"],
            'create_time':datetime.now()
        }

    es.insert_one(doc=data)

到 ES 中查看一下,启动 elasticsearch-head 插件。

如果是 npm 安装的那么 cd 到根目录之后直接 npm run start 就跑起来了。

本地访问  http://localhost:9100/

Teach you step by step how to use Flask to build an ES search engine (preparatory part)

发现新加的 spider 数据文档确实已经进去了。

/3 爬虫入库/

要想实现 ES 搜索,首先要有数据支持,而海量的数据往往来自爬虫。

为了节省时间,编写一个最简单的爬虫,抓取 百度百科

简单粗暴一点,先 递归获取 很多很多的 url 链接


import requests
import re
import time

exist_urls = []
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
}

def get_link(url):
    try:
        response = requests.get(url=url, headers=headers)
        response.encoding = 'UTF-8'
        html = response.text
        link_lists = re.findall(&#39;.*?<a target=_blank href="/item/([^:#=<>]*?)".*?</a>&#39;, html)
        return link_lists
    except Exception as e:
        pass
    finally:
        exist_urls.append(url)


# 当爬取深度小于10层时,递归调用主函数,继续爬取第二层的所有链接
def main(start_url, depth=1):
    link_lists = get_link(start_url)
    if link_lists:
        unique_lists = list(set(link_lists) - set(exist_urls))
        for unique_url in unique_lists:
            unique_url = &#39;https://baike.baidu.com/item/&#39; + unique_url

            with open(&#39;url.txt&#39;, &#39;a+&#39;) as f:
                f.write(unique_url + &#39;\n&#39;)
                f.close()
        if depth < 10:
            main(unique_url, depth + 1)

if __name__ == &#39;__main__&#39;:
    start_url = &#39;https://baike.baidu.com/item/%E7%99%BE%E5%BA%A6%E7%99%BE%E7%A7%91&#39;
    main(start_url)


把全部 url 存到 url.txt 文件中之后,然后启动任务。


# parse.py
from celery import Celery
import requests
from lxml import etree
import pymongo
app = Celery(&#39;tasks&#39;, broker=&#39;redis://localhost:6379/2&#39;)
client = pymongo.MongoClient(&#39;localhost&#39;,27017)
db = client[&#39;baike&#39;]
@app.task
def get_url(link):
    item = {}
    headers = {&#39;User-Agent&#39;:&#39;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36&#39;}
    res = requests.get(link,headers=headers)
    res.encoding = &#39;UTF-8&#39;
    doc = etree.HTML(res.text)
    content = doc.xpath("//div[@class=&#39;lemma-summary&#39;]/div[@class=&#39;para&#39;]//text()")
    print(res.status_code)
    print(link,&#39;\t&#39;,&#39;++++++++++++++++++++&#39;)
    item[&#39;link&#39;] = link
    data = &#39;&#39;.join(content).replace(&#39; &#39;, &#39;&#39;).replace(&#39;\t&#39;, &#39;&#39;).replace(&#39;\n&#39;, &#39;&#39;).replace(&#39;\r&#39;, &#39;&#39;)
    item[&#39;data&#39;] = data
    if db[&#39;Baike&#39;].insert(dict(item)):
        print("is OK ...")
    else:
        print(&#39;Fail&#39;)

run.py 飞起来


from parse import get_url

def main(url):
    result = get_url.delay(url)
    return result

def run():
    with open(&#39;./url.txt&#39;, &#39;r&#39;) as f:
        for url in f.readlines():
            main(url.strip(&#39;\n&#39;))

if __name__ == &#39;__main__&#39;:
    run()


黑窗口键入


celery -A parse worker -l info -P gevent -c 10

哦豁 !!   你居然使用了 Celery 任务队列,gevent 模式,-c 就是10个线程刷刷刷就干起来了,速度杠杠的 !!

啥?分布式? 那就加多几台机器啦,直接把代码拷贝到目标服务器,通过 redis 共享队列协同多机抓取。

这里是先将数据存储到了 MongoDB 上(个人习惯),你也可以直接存到 ES 中,但是单条单条的插入速度堪忧(接下来会讲到优化,哈哈)。

使用前面的例子将 Mongo 中的数据批量导入到 ES 中,OK !!!

Teach you step by step how to use Flask to build an ES search engine (preparatory part)

到这一个简单的数据抓取就已经完毕了。

好啦,现在 ES 中已经有了数据啦,接下来就应该是 Flask web 的操作啦,当然,DjangoFastAPI 也很优秀。嘿嘿,你喜欢 !!

关于FastAPI 的文章可以看这个系列文章:

1、(入门篇)简析Python web框架FastAPI——一个比Flask和Tornada更高性能的API 框架

2、(进阶篇)Python web框架FastAPI——一个比Flask和Tornada更高性能的API 框架

3、(完结篇)Python web框架FastAPI——一个比Flask和Tornada更高性能的API 框架

/4 Flask 项目结构/

Teach you step by step how to use Flask to build an ES search engine (preparatory part)


这样一来前期工作就差不多了,接下来剩下的工作主要集中于 Flask 的实际开发中,蓄力中 !!

The above is the detailed content of Teach you step by step how to use Flask to build an ES search engine (preparatory part). For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced at:Go语言进阶学习. If there is any infringement, please contact admin@php.cn delete
Merging Lists in Python: Choosing the Right MethodMerging Lists in Python: Choosing the Right MethodMay 14, 2025 am 12:11 AM

TomergelistsinPython,youcanusethe operator,extendmethod,listcomprehension,oritertools.chain,eachwithspecificadvantages:1)The operatorissimplebutlessefficientforlargelists;2)extendismemory-efficientbutmodifiestheoriginallist;3)listcomprehensionoffersf

How to concatenate two lists in python 3?How to concatenate two lists in python 3?May 14, 2025 am 12:09 AM

In Python 3, two lists can be connected through a variety of methods: 1) Use operator, which is suitable for small lists, but is inefficient for large lists; 2) Use extend method, which is suitable for large lists, with high memory efficiency, but will modify the original list; 3) Use * operator, which is suitable for merging multiple lists, without modifying the original list; 4) Use itertools.chain, which is suitable for large data sets, with high memory efficiency.

Python concatenate list stringsPython concatenate list stringsMay 14, 2025 am 12:08 AM

Using the join() method is the most efficient way to connect strings from lists in Python. 1) Use the join() method to be efficient and easy to read. 2) The cycle uses operators inefficiently for large lists. 3) The combination of list comprehension and join() is suitable for scenarios that require conversion. 4) The reduce() method is suitable for other types of reductions, but is inefficient for string concatenation. The complete sentence ends.

Python execution, what is that?Python execution, what is that?May 14, 2025 am 12:06 AM

PythonexecutionistheprocessoftransformingPythoncodeintoexecutableinstructions.1)Theinterpreterreadsthecode,convertingitintobytecode,whichthePythonVirtualMachine(PVM)executes.2)TheGlobalInterpreterLock(GIL)managesthreadexecution,potentiallylimitingmul

Python: what are the key featuresPython: what are the key featuresMay 14, 2025 am 12:02 AM

Key features of Python include: 1. The syntax is concise and easy to understand, suitable for beginners; 2. Dynamic type system, improving development speed; 3. Rich standard library, supporting multiple tasks; 4. Strong community and ecosystem, providing extensive support; 5. Interpretation, suitable for scripting and rapid prototyping; 6. Multi-paradigm support, suitable for various programming styles.

Python: compiler or Interpreter?Python: compiler or Interpreter?May 13, 2025 am 12:10 AM

Python is an interpreted language, but it also includes the compilation process. 1) Python code is first compiled into bytecode. 2) Bytecode is interpreted and executed by Python virtual machine. 3) This hybrid mechanism makes Python both flexible and efficient, but not as fast as a fully compiled language.

Python For Loop vs While Loop: When to Use Which?Python For Loop vs While Loop: When to Use Which?May 13, 2025 am 12:07 AM

Useaforloopwheniteratingoverasequenceorforaspecificnumberoftimes;useawhileloopwhencontinuinguntilaconditionismet.Forloopsareidealforknownsequences,whilewhileloopssuitsituationswithundeterminediterations.

Python loops: The most common errorsPython loops: The most common errorsMay 13, 2025 am 12:07 AM

Pythonloopscanleadtoerrorslikeinfiniteloops,modifyinglistsduringiteration,off-by-oneerrors,zero-indexingissues,andnestedloopinefficiencies.Toavoidthese:1)Use'i

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

WebStorm Mac version

WebStorm Mac version

Useful JavaScript development tools

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools