Detailed explanation of the method of crawling to the Encyclopedia of Embarrassing Things using Python's crawler technology-Python Tutorial-php.cn

Detailed explanation of the method of crawling to the Encyclopedia of Embarrassing Things using Python's crawler technology

高洛峰

Mar 20, 2017 am 09:25 AM

python crawler

It was my first time to learn crawler technology. I read a joke on Zhihu about how to crawl to the Encyclopedia of Embarrassing Things, so I decided to make one myself.

Achieve goals: 1. Crawling to the jokes in the Encyclopedia of Embarrassing Things

2. Crawling one paragraph every time and crawling to the next page every time you press Enter

Technical implementation: Based on the implementation of python, using the Requests library, re library, and the BeautifulSoup method of the bs4 library to implement

Main content: First, we need to clarify the ideas for crawling implementation , let’s build the main framework. In the first step, we first write a method to obtain web pages using the Requests library. In the second step, we use the BeautifulSoup method of the bs4 library to analyze the obtained web page information and use regular expressions to match relevant paragraph information. . The third step is to print out the obtained information. We all execute the above methods through a main function .

First, import the relevant libraries

import requests
from bs4 import BeautifulSoup
import bs4
import  re

Second, first obtain the web page information

def getHTMLText(url):
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url,headers = headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

Third, put the information into r and then analyze it

soup = BeautifulSoup(html,"html.parser")

What we need is the content and publisher of the joke. By viewing the source code on the web page, we know that the publisher of the joke is:

'p', attrs={'class': 'content'}中

The content of the joke is in

'p', attrs={'class': 'author clearfix'}中

, so we pass bs4 Library method to extract the specific content of these two tags

def fillUnivlist(lis,li,html,count):
    soup = BeautifulSoup(html,"html.parser")
    try:
        a = soup.find_all('p', attrs={'class': 'content'})
        ll = soup.find_all('p', attrs={'class': 'author clearfix'})

Then obtain the information through specific regular expressions

for sp in a:
    patten = re.compile(r'<span>(.*?)</span>',re.S)
    Info = re.findall(patten,str(sp))
    lis.append(Info)
    count = count + 1
for mc in ll:
    namePatten = re.compile(r'<h2 id="">(.*?)</h2>', re.S)
    d = re.findall(namePatten, str(mc))
    li.append(d)

What we need to pay attention to is the return of find_all and re’s findall method They are all a list. When using regular expressions, we only roughly extract and do not remove the line breaks in the tags

Next, we only need to combine the contents of the two lists and output them

def printUnivlist(lis,li,count):
    for i in range(count):
        a = li[i][0]
        b = lis[i][0]
        print ("%s:"%a+"%s"%b)

Then I make an input control function, enter Q to return an error, exit, enter Enter to return correct, and load the next page of paragraphs

def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True

We realize the input control through the main function. If If the control function returns an error, the output will not be performed. If the return value is correct, the output will continue. We load the next page through a for loop.

def main():
    passage = 0
    enable = True
    for i in range(20):
        mc = input_enter()
        if mc==True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = getHTMLText(url)
            fillUnivlist(lit, li, a, count)
            number = fillUnivlist(lit, li, a, count)
            printUnivlist(lit, li, number)
        else:
            break

Here we need to note that every for loop will refresh lis[] and li[], so that the paragraph content of the webpage can be correctly output every time

Here is the source code :

import requests
from bs4 import BeautifulSoup
import bs4
import  re
def getHTMLText(url):
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url,headers = headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
def fillUnivlist(lis,li,html,count):
    soup = BeautifulSoup(html,"html.parser")
    try:
        a = soup.find_all('p', attrs={'class': 'content'})
        ll = soup.find_all('p', attrs={'class': 'author clearfix'})
        for sp in a:
            patten = re.compile(r'(.*?)',re.S)
            Info = re.findall(patten,str(sp))
            lis.append(Info)
            count = count + 1
        for mc in ll:
            namePatten = re.compile(r'(.*?)', re.S)
            d = re.findall(namePatten, str(mc))
            li.append(d)
    except:
        return ""
    return count
def printUnivlist(lis,li,count):
    for i in range(count):
        a = li[i][0]
        b = lis[i][0]
        print ("%s:"%a+"%s"%b)
def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True
def main():
    passage = 0
    enable = True
    for i in range(20):
        mc = input_enter()
        if mc==True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = getHTMLText(url)
            fillUnivlist(lit, li, a, count)
            number = fillUnivlist(lit, li, a, count)
            printUnivlist(lit, li, number)
        else:
            break
main()

This is my first time doing it and there are still many areas that can be optimized. I hope everyone can point it out.

The above is the detailed content of Detailed explanation of the method of crawling to the Encyclopedia of Embarrassing Things using Python's crawler technology. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

What is Python Switch Statement?Apr 30, 2025 pm 02:08 PM

The article discusses Python's new "match" statement introduced in version 3.10, which serves as an equivalent to switch statements in other languages. It enhances code readability and offers performance benefits over traditional if-elif-el

What are Exception Groups in Python?Apr 30, 2025 pm 02:07 PM

Exception Groups in Python 3.11 allow handling multiple exceptions simultaneously, improving error management in concurrent scenarios and complex operations.

What are Function Annotations in Python?Apr 30, 2025 pm 02:06 PM

Function annotations in Python add metadata to functions for type checking, documentation, and IDE support. They enhance code readability, maintenance, and are crucial in API development, data science, and library creation.

What are unit tests in Python?Apr 30, 2025 pm 02:05 PM

The article discusses unit tests in Python, their benefits, and how to write them effectively. It highlights tools like unittest and pytest for testing.

What are Access Specifiers in Python?Apr 30, 2025 pm 02:03 PM

Article discusses access specifiers in Python, which use naming conventions to indicate visibility of class members, rather than strict enforcement.

What is __init__() in Python and how does self play a role in it?Apr 30, 2025 pm 02:02 PM

Article discusses Python's \_\_init\_\_() method and self's role in initializing object attributes. Other class methods and inheritance's impact on \_\_init\_\_() are also covered.

What is the difference between @classmethod, @staticmethod and instance methods in Python?Apr 30, 2025 pm 02:01 PM

The article discusses the differences between @classmethod, @staticmethod, and instance methods in Python, detailing their properties, use cases, and benefits. It explains how to choose the right method type based on the required functionality and da

How do you append elements to a Python array?Apr 30, 2025 am 12:19 AM

InPython,youappendelementstoalistusingtheappend()method.1)Useappend()forsingleelements:my_list.append(4).2)Useextend()or =formultipleelements:my_list.extend(another_list)ormy_list =[4,5,6].3)Useinsert()forspecificpositions:my_list.insert(1,5).Beaware

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

What's New in Windows 11 KB5054979 & How to Fix Update Issues

3 weeks agoByDDD

How to fix KB5055523 fails to install in Windows 11?

2 weeks agoByDDD

InZoi: How To Apply To School And University

4 weeks agoByDDD

How to fix KB5055518 fails to install in Windows 10?

2 weeks agoByDDD

Roblox: Dead Rails – How To Summon And Defeat Nikola Tesla

1 months agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment

MantisBT

Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

Notepad++7.3.1

Easy-to-use and free code editor

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

Hot Topics

Where is the login entrance for gmail email?

7861

1649

1404

1300

1242