A Hands-On Guide to Building an ES Search Engine with Flask (Preparation)

Go语言进阶学习

Jul 25, 2023, 05:27 PM

/1 Introduction/

Elasticsearch is an open-source search engine built on top of Apache Lucene™, a full-text search library.

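So the question we care about is how to connect Elasticsearch with Python (why does everything have to be tied to Python?).

/2 Interacting from Python/

Fortunately, there is a Python client library for Elasticsearch.

pip install elasticsearch

Initialize an Elasticsearch client object.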

def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"):
    # With authentication enabled (7.x client), pass credentials instead:
    # self.es = Elasticsearch([ip], http_auth=('username', 'password'), port=9200)
    self.es = Elasticsearch(f"http://{ip}:9200")  # use the ip argument rather than a hard-coded host
    self.index_type = index_type
    self.index_name = index_name

The default port is 9200. Before initializing, make sure Elasticsearch is installed and running locally.

Get a document by ID

def get_doc(self, uid):
    return self.es.get(index=self.index_name, id=uid)

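Insert documents

def insert_one(self, doc: dict):
    # doc_type is deprecated in Elasticsearch 7.x and removed in 8.x
    self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

def insert_array(self, docs: list):
    # One request per document: simple, but slow for large batches
    for doc in docs:
        self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

Search documents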

def search(self, query, count: int = 30):
    dsl = {
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title", "content", "link"]
            }
        },
        "highlight": {
            "fields": {
                "title": {}
            }
        }
    }
    match_data = self.es.search(index=self.index_name, body=dsl, size=count)
    return match_data

def __search(self, query: dict, count: int = 20):  # count: number of hits to return
    match_data = self.es.search(index=self.index_name, body=query, size=count)
    # Keep only the stored document of each hit
    return [hit['_source'] for hit in match_data['hits']['hits']]
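To make the shape of the request and the response more concrete, here is a small offline sketch. `build_dsl` and `extract_hits` are hypothetical helper names that are not part of the original class, the `title_highlight` key is made up for illustration, and `sample_response` is hand-written to mimic the structure of a search response rather than captured from a real cluster.

```python
def build_dsl(query: str, fields=("title", "content", "link")) -> dict:
    """Build the same multi_match + highlight query body used in search()."""
    return {
        "query": {"multi_match": {"query": query, "fields": list(fields)}},
        "highlight": {"fields": {"title": {}}},
    }


def extract_hits(response: dict) -> list:
    """Flatten a search response into a list of plain dicts.

    Each result is the hit's _source plus, when present, the highlighted
    title fragments under the (hypothetical) key "title_highlight".
    """
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        item = dict(hit.get("_source", {}))
        fragments = hit.get("highlight", {}).get("title")
        if fragments:
            item["title_highlight"] = fragments
        results.append(item)
    return results


# Hand-written sample that mimics the response structure (not real data).
sample_response = {
    "hits": {
        "hits": [
            {
                "_source": {"title": "Python", "link": "https://example.com/python"},
                "highlight": {"title": ["<em>Python</em>"]},
            },
            {"_source": {"title": "Flask", "link": "https://example.com/flask"}},
        ]
    }
}

dsl = build_dsl("python")
rows = extract_hits(sample_response)
print(dsl["query"]["multi_match"]["fields"])
print(rows)
```

Keeping the query builder and the response flattening as plain functions means they can be checked without a running cluster; only the actual `es.search(...)` call needs Elasticsearch.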

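Delete the index (and with it all of its documents)

def delete_index(self):
    try:
        self.es.indices.delete(index=self.index_name)
    except Exception:
        # Ignore failures such as the index not existing
        pass

Wrapping these calls in a search class makes them easier to use. Here is the whole thing.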

from elasticsearch import Elasticsearch


class elasticSearch():

    def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"):
        # With authentication enabled (7.x client), pass credentials instead:
        # self.es = Elasticsearch([ip], http_auth=('elastic', 'password'), port=9200)
        self.es = Elasticsearch(f"http://{ip}:9200")  # use the ip argument rather than a hard-coded host
        self.index_type = index_type
        self.index_name = index_name

    def create_index(self):
        if self.es.indices.exists(index=self.index_name) is True:
            self.es.indices.delete(index=self.index_name)
        self.es.indices.create(index=self.index_name, ignore=400)

    def delete_index(self):
        try:
            self.es.indices.delete(index=self.index_name)
        except Exception:
            # Ignore failures such as the index not existing
            pass

    def get_doc(self, uid):
        return self.es.get(index=self.index_name, id=uid)

    def insert_one(self, doc: dict):
        self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

    def insert_array(self, docs: list):
        for doc in docs:
            self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

    def search(self, query, count: int = 30):
        dsl = {
            "query": {
                "multi_match": {
                    "query": query,
                    "fields": ["title", "content", "link"]
                }
            },
            "highlight": {
                "fields": {
                    "title": {}
                }
            }
        }
        match_data = self.es.search(index=self.index_name, body=dsl, size=count)
        return match_data

Now let's try inserting the data stored in MongoDB into ES.

from datetime import datetime
import pymongo
from app.elasticsearchClass import elasticSearch

client = pymongo.MongoClient('127.0.0.1', 27017)
db = client['spider']
# Exclude MongoDB's _id field from the results
sheet = db.get_collection('Spider').find({}, {'_id': 0})

es = elasticSearch(index_type="spider_data", index_name="spider")
es.create_index()

for i in sheet:
    data = {
        'title': i["title"],
        'content': i["data"],
        'link': i["link"],
        'create_time': datetime.now()
    }

    es.insert_one(doc=data)

To check the result in ES, start the elasticsearch-head plugin.

If you installed it with npm, cd into its root directory and run npm run start.

Then open http://localhost:9100/ in a local browser.

The newly added spider documents are indeed there.

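/3 Crawling data into storage/

ES search needs data to work with, and large volumes of data usually come from crawlers.

To save time, we write the simplest possible crawler and scrape Baidu Baike.

Keeping it simple and blunt, first recursively collect a large number of URLs.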

import re
import requests

exist_urls = set()  # full URLs that have already been recorded
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
}


def get_link(url):
    """Return the entry names linked from url, or an empty list on failure."""
    try:
        response = requests.get(url=url, headers=headers, timeout=10)
        response.encoding = 'UTF-8'
        html = response.text
        return re.findall(r'.*?<a target=_blank href="/item/([^:#=<>]*?)".*?</a>', html)
    except requests.RequestException:
        return []


# While the crawl depth is below 10, recurse into every newly found link
def main(start_url, depth=1):
    exist_urls.add(start_url)
    for name in set(get_link(start_url)):
        unique_url = 'https://baike.baidu.com/item/' + name
        if unique_url in exist_urls:
            continue  # already written to url.txt
        exist_urls.add(unique_url)
        with open('url.txt', 'a+') as f:
            f.write(unique_url + '\n')
        if depth < 10:
            main(unique_url, depth + 1)


if __name__ == '__main__':
    start_url = 'https://baike.baidu.com/item/%E7%99%BE%E5%BA%A6%E7%99%BE%E7%A7%91'
    main(start_url)
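The link extraction and the depth limit can be tried out without touching the network. The sketch below is an assumption-laden variant, not the author's code: it uses a simplified form of the regular expression above, swaps the recursion for a breadth-first queue, and takes a `fetch` callable so that fake pages can stand in for real HTTP responses. `extract_links`, `crawl`, and the entry names `A`, `B`, `C` are made up for illustration.

```python
import re
from collections import deque

# Simplified form of the crawler's pattern: capture the entry name after /item/.
LINK_RE = re.compile(r'<a target=_blank href="/item/([^:#=<>]*?)"')
BASE = "https://baike.baidu.com/item/"


def extract_links(html: str) -> list:
    """Return absolute entry URLs found in html, de-duplicated, in page order."""
    seen, urls = set(), []
    for name in LINK_RE.findall(html):
        url = BASE + name
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def crawl(start_url: str, fetch, max_depth: int = 2) -> list:
    """Breadth-first crawl; fetch(url) must return the page HTML as a str."""
    visited = {start_url}
    order = []
    queue = deque([(start_url, 1)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # record the page but do not expand it further
        for link in extract_links(fetch(url)):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Fake pages standing in for real HTTP responses (made-up entry names).
fake_pages = {
    BASE + "A": '<a target=_blank href="/item/B">B</a> <a target=_blank href="/item/C">C</a>',
    BASE + "B": '<a target=_blank href="/item/A">A</a> <a target=_blank href="/item/C">C</a>',
    BASE + "C": "",
}
found = crawl(BASE + "A", lambda url: fake_pages.get(url, ""), max_depth=3)
print(found)
```

A queue with a `visited` set visits each URL at most once and avoids deep call stacks, which is one reason to prefer it over recursion once the number of pages grows.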

Once all the URLs have been saved to url.txt, start the tasks.

# parse.py
from celery import Celery
import requests
from lxml import etree
import pymongo

app = Celery('tasks', broker='redis://localhost:6379/2')
client = pymongo.MongoClient('localhost', 27017)
db = client['baike']


@app.task
def get_url(link):
    item = {}
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'}
    res = requests.get(link, headers=headers)
    res.encoding = 'UTF-8'
    doc = etree.HTML(res.text)
    # Text nodes of the entry's summary paragraphs
    content = doc.xpath("//div[@class='lemma-summary']/div[@class='para']//text()")
    print(res.status_code)
    print(link, '\t', '++++++++++++++++++++')
    item['link'] = link
    # Join the text nodes and strip spaces, tabs and line breaks
    data = ''.join(content).replace(' ', '').replace('\t', '').replace('\n', '').replace('\r', '')
    item['data'] = data
    # insert_one replaces Collection.insert, which was removed in PyMongo 4
    if db['Baike'].insert_one(dict(item)).acknowledged:
        print("is OK ...")
    else:
        print('Fail')
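The chain of replace() calls in parse.py removes every space, tab and line break from the extracted text. As a rough sketch of the same cleanup in one step, the hypothetical helper `clean_text` below uses a regular expression instead; note that `\s` also matches other Unicode whitespace such as non-breaking spaces, so it is slightly broader than the original chain. The sample fragments are made up.

```python
import re


def clean_text(parts) -> str:
    """Join text nodes and drop all whitespace between and inside them."""
    return re.sub(r"\s+", "", "".join(parts))


# Made-up fragments, shaped like the list an XPath //text() query returns.
fragments = ["Elastic search\n", "\t is ", "\r\nfast\xa0"]
cleaned = clean_text(fragments)
print(cleaned)
```

Removing all whitespace suits Chinese text, where words are not separated by spaces; for English content a gentler normalization (collapsing runs of whitespace to one space) would probably be the better choice.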

run.py gets things going

from parse import get_url


def main(url):
    # Queue the task; a Celery worker will pick it up
    result = get_url.delay(url)
    return result


def run():
    with open('./url.txt', 'r') as f:
        for url in f:
            main(url.strip('\n'))


if __name__ == '__main__':
    run()

In a terminal window, run:

celery -A parse worker -l info -P gevent -c 10

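Oh ho! You are now running a Celery task queue with the gevent pool. -c 10 sets the concurrency to 10, so ten greenlets (lightweight coroutines rather than OS threads) get to work at once, and the crawl moves along quickly!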
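Celery needs a running broker, so it is awkward to demo in isolation. To illustrate just the fan-out idea (many links handled by a fixed number of concurrent workers), here is a stand-in sketch using the standard library's ThreadPoolExecutor instead of Celery and gevent. It is explicitly not how Celery works internally; `fake_task` and `run_all` are hypothetical names, and `fake_task` returns a made-up item instead of fetching a page.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_task(link: str) -> dict:
    """Stand-in for get_url: build an item dict instead of fetching the page."""
    return {"link": link, "data": "content of " + link.rsplit("/", 1)[-1]}


def run_all(links, workers: int = 10) -> list:
    """Run fake_task over all links with `workers` concurrent workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order even though tasks run concurrently
        return list(pool.map(fake_task, links))


links = ["https://baike.baidu.com/item/" + name for name in ("A", "B", "C")]
items = run_all(links)
print(items)
```

The difference that matters in practice: a thread pool lives inside one process, while a Celery queue lets workers on other machines pull from the same broker.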

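What about distributed crawling? Just add more machines: copy the code to each target server and let them share the queue through redis, so that several machines crawl together.

Here the data is first stored in MongoDB (a personal habit). You could also write straight to ES, but inserting documents one at a time is worryingly slow (optimizations will be covered later, haha).

Use the earlier example to import the data from Mongo into ES in bulk, and that's it!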
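One common way to avoid a round trip per document is the bulk helper that ships with the Python client (`elasticsearch.helpers.bulk`), which accepts an iterable of action dicts. The article does not show such code, so the sketch below is an assumption about how the import could look: `to_actions` is a hypothetical helper, the sample records are made up, and the actual bulk call is left as a comment because it needs a live cluster.

```python
from datetime import datetime


def to_actions(records, index_name: str):
    """Yield one bulk action per record, in the dict format the bulk helper accepts."""
    for record in records:
        yield {
            "_index": index_name,
            "_source": {
                # .get() keeps records without a given field from raising KeyError
                "title": record.get("title", ""),
                "content": record.get("data", ""),
                "link": record.get("link", ""),
                "create_time": datetime.now().isoformat(),
            },
        }


# Made-up records shaped like the documents the crawler stores in MongoDB.
sample_records = [
    {"title": "Python", "data": "summary text", "link": "https://example.com/python"},
    {"data": "no title here", "link": "https://example.com/untitled"},
]
actions = list(to_actions(sample_records, "spider"))
print(len(actions), actions[0]["_index"])

# With a live cluster, the actions would be sent roughly like this:
#   from elasticsearch.helpers import bulk
#   bulk(es_client, to_actions(mongo_cursor, "spider"))
```

Because `to_actions` is a generator, a MongoDB cursor can be streamed through it without loading the whole collection into memory.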

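That completes a simple data crawl.

Now that ES has data in it, the next step is the Flask web side. Of course, Django and FastAPI are excellent too. Pick whichever you like!

For FastAPI, see this series of articles:

1. (Beginner) A brief look at the Python web framework FastAPI: an API framework with higher performance than Flask and Tornado

2. (Intermediate) The Python web framework FastAPI: an API framework with higher performance than Flask and Tornado

3. (Final) The Python web framework FastAPI: an API framework with higher performance than Flask and Tornado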

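/4 Flask project structure/

[Image: Flask project structure]

That wraps up most of the preliminary work. What remains is mainly the actual Flask development, so stay tuned for the next part!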

Notice

This article is reprinted from Go语言进阶学习. In case of infringement, please contact admin@php.cn to have it removed.
