How to make a simple search engine
Sometimes, for work or for our own needs, we browse many different websites to collect the data we want, which is why crawlers came into being. Below is my process of developing a simple crawler and the problems I ran into along the way.
Last time, Xiaobai ended up with a hard-working web crawler. It seemed a waste not to build a few small things on top of it, so Xiaobai dug through material from various experts and put together a simple search engine based on the basic principle of the inverted index.
The previous crawler only fetched the source code of a web page and did no processing on it; it was a one-shot crawler. So Xiaobai used regular expressions to match the URLs in the page content, and with those the crawler can keep crawling from page to page until it dies. BeautifulSoup and regular expressions both deserve a mention here. The beautifulsoup module is said to be a powerful tool for crawling and extracting web content; unfortunately Xiaobai only heard of it after finishing, and regrets not having been able to try it. Regular expressions, however, Xiaobai has studied first-hand, and once you are proficient (forcibly proficient) they are very handy too. For example, to extract the URLs from the source code of a web page:
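As a rough illustration of the inverted index principle mentioned above, here is a minimal sketch. The function names (tokenize, build_index, search) and the sample documents are made up for this example, and the tokenizer assumes English-like text split into alphanumeric words; it is not the author's implementation.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents, keyed by a page id.
docs = {
    "page1": "A simple web crawler in Python",
    "page2": "Regular expressions in Python",
    "page3": "A simple search engine",
}
index = build_index(docs)
print(search(index, "simple python"))  # {'page1'}
```

The index is "inverted" because it goes from word to pages rather than from page to words, so a query only has to intersect a few small sets instead of scanning every page.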
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", html)
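To show how extracted links can keep a crawler going, here is a minimal sketch that reuses the same pattern inside a breadth-first crawl loop. The names extract_links, crawl, fetch and FAKE_WEB are hypothetical, and the page limit and the in-memory "web" are assumptions added so the example runs without network access; a real crawler would pass in a function that downloads the page.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Same pattern as the line above: the value of href="..." or href='...'.
HREF_RE = re.compile(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')")

def extract_links(html, base_url):
    """Return absolute http(s) URLs found in the href attributes of html."""
    links = []
    for href in HREF_RE.findall(html):
        url = urljoin(base_url, href)  # resolve relative links
        if url.startswith(("http://", "https://")):
            links.append(url)
    return links

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) -> html is supplied by the caller."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return pages

# Tiny in-memory "web" (an assumption for this sketch).
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href=\'b.html\'>B</a>',
    "http://example.com/a.html": '<a href="http://example.com/">home</a>',
    "http://example.com/b.html": '<a href="mailto:x@example.com">mail</a>',
}

pages = crawl("http://example.com/", lambda url: FAKE_WEB.get(url, ""))
print(sorted(pages))
```

The seen set and the max_pages limit are what stop the crawler from looping forever on pages that link back to each other.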
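That single re.findall line pulls out most of the links. With a pattern this crude, some junk inevitably slips through, but it is still very handy. It looks complicated, but once you have mastered the (? […] The title, links and so on also have their corresponding tags, and in theory regular expressions can separate them out. In practice, however, Xiaobai found that matching only once works very poorly. The match does include the wanted content, but because tags are usually nested, and because Xiaobai's skills are limited and the regular expression itself may be flawed, the content always comes back still wrapped in tags. Here is a rather clumsy method for reference: since one pass cannot get the content, match again on what was matched. After three layers of matching plus a few small tricks, the content was finally, just barely, extracted. The code is too ugly to post here.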
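Since the original code is not shown, the following is only a sketch of what matching in several passes might look like. It assumes the body text lives in <p> tags and the title in <title>; those tags, the function names and the sample page are assumptions for this example, not the author's code.

```python
import re

def extract_title(html):
    """Take whatever sits between <title> and </title>."""
    match = re.search(r"<title\b[^>]*>(.*?)</title>", html, re.I | re.S)
    return match.group(1).strip() if match else ""

def extract_paragraphs(html):
    """Return the plain text of every <p> block, in three passes."""
    # Pass 1: grab the raw inner HTML of every <p> block.
    blocks = re.findall(r"<p\b[^>]*>(.*?)</p>", html, re.I | re.S)
    texts = []
    for block in blocks:
        # Pass 2: remove the nested tags that came back with the content.
        text = re.sub(r"<[^>]+>", "", block)
        # Pass 3: collapse leftover runs of whitespace.
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            texts.append(text)
    return texts

# Hypothetical sample page.
sample = """<html><head><title> Demo page </title></head>
<body><p>Hello <b>nested</b> <a href="/x">world</a></p>
<p><span></span></p></body></html>"""

print(extract_title(sample))
print(extract_paragraphs(sample))
```

Regular expressions cannot handle arbitrarily nested HTML reliably, so an HTML parser such as the beautifulsoup module mentioned earlier is usually the sturdier choice for anything beyond simple pages.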