


Detailed explanation of crawling the Encyclopedia of Embarrassing Things (Qiushibaike) with Python crawler techniques
This was my first time learning crawler techniques. I read a post on Zhihu about crawling jokes from the Encyclopedia of Embarrassing Things, so I decided to build a crawler myself.
Goals: 1. Crawl the jokes from the Encyclopedia of Embarrassing Things.
2. Print one batch of jokes at a time, loading the next page each time Enter is pressed.
Technical implementation: written in Python, using the Requests library, the re library, and the BeautifulSoup class from the bs4 library.
Main content: First we need a clear plan for the crawler, so let's build the main framework. Step one: write a method that fetches the web page with the Requests library. Step two: parse the fetched page with the BeautifulSoup class from the bs4 library and match the relevant joke information with regular expressions. Step three: print out the extracted information. A main function drives all of the above; a skeleton of this framework is sketched below.
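As a quick orientation (this skeleton is my own summary, not part of the original tutorial), the framework boils down to three functions plus a driver; the real bodies are filled in step by step below.

# Skeleton of the three-step framework (bodies are filled in below).
def getHTMLText(url):                     # step 1: fetch the page
    ...

def fillUnivlist(lis, li, html, count):   # step 2: parse and match jokes
    ...

def printUnivlist(lis, li, count):        # step 3: print the results
    ...

def main():                               # driver
    ...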
First, import the relevant libraries
import requests
from bs4 import BeautifulSoup
import bs4
import re
Second, fetch the web page information:
def getHTMLText(url):
    try:
        # pretend to be a browser so the site does not reject the request
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
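As an aside, the call above can hang if the site responds slowly; requests.get also accepts a timeout argument. A slightly hardened variant might look like this (a sketch; the 10-second value and the function name are my own assumptions, not part of the original code):

import requests

# Hedged variant of getHTMLText with an explicit timeout.
# The 10-second default is an arbitrary assumption.
def getHTMLTextSafe(url, timeout=10):
    try:
        headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""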
Third, take the page text returned above and parse it:
soup = BeautifulSoup(html,"html.parser")
What we need are the joke's text and its publisher. Inspecting the page source shows that the joke text sits inside
'p', attrs={'class': 'content'}
tags, and the publisher sits inside
'p', attrs={'class': 'author clearfix'}
tags, so we use the bs4 library's methods to extract the contents of these two tags:
def fillUnivlist(lis, li, html, count):
    soup = BeautifulSoup(html, "html.parser")
    try:
        # joke text lives in <p class="content">
        a = soup.find_all('p', attrs={'class': 'content'})
        # publisher info lives in <p class="author clearfix">
        ll = soup.find_all('p', attrs={'class': 'author clearfix'})
Then we extract the information with specific regular expressions:
        for sp in a:
            patten = re.compile(r'<span>(.*?)</span>', re.S)
            Info = re.findall(patten, str(sp))
            lis.append(Info)
            count = count + 1
        for mc in ll:
            namePatten = re.compile(r'<h2>(.*?)</h2>', re.S)
            d = re.findall(namePatten, str(mc))
            li.append(d)
    except:
        return ""
    return count
Note that find_all and re's findall both return lists. Also, the regular expressions here only do a rough extraction and do not remove the line breaks inside the tags, so the captured strings may still contain stray whitespace.
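Incidentally, bs4 can also extract the text without regular expressions: every Tag has a get_text() method, and strip() removes those line breaks. A minimal sketch of that alternative (assuming the same page structure; this is not the method used in this article):

from bs4 import BeautifulSoup

# Alternative extraction without regular expressions (a sketch):
# get_text() pulls the visible text, strip() removes the line breaks.
def fillUnivlistBs4(lis, li, html):
    soup = BeautifulSoup(html, "html.parser")
    count = 0
    for sp in soup.find_all('p', attrs={'class': 'content'}):
        lis.append([sp.get_text().strip()])
        count = count + 1
    for mc in soup.find_all('p', attrs={'class': 'author clearfix'}):
        h2 = mc.find('h2')
        li.append([h2.get_text().strip() if h2 is not None else ''])
    return count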
Next, we only need to pair up the contents of the two lists and print them out:
def printUnivlist(lis, li, count):
    for i in range(count):
        a = li[i][0]   # publisher name
        b = lis[i][0]  # joke text
        print("%s:" % a + "%s" % b)
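A quick way to verify the "publisher:joke" output format is to feed printUnivlist some dummy data (made-up strings, just for illustration):

# Dummy data to check the output format; prints "TestUser:This is a test joke."
lis = [['This is a test joke.']]
li = [['TestUser']]
printUnivlist(lis, li, 1)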
Next I write an input-control function: entering Q returns False and exits, while pressing Enter returns True and loads the next page of jokes:
def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True
The main function wires up the input control: if the control function returns False, nothing more is printed; if it returns True, output continues, and a for loop loads the next page:
def main():
    passage = 0
    for i in range(20):
        mc = input_enter()
        if mc == True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = getHTMLText(url)
            # one call is enough; calling fillUnivlist twice would append
            # every joke to the lists twice
            number = fillUnivlist(lit, li, a, count)
            printUnivlist(lit, li, number)
        else:
            break
Note that every pass of the for loop resets lit[] and li[], so each page's jokes are printed correctly without leftovers from the previous page.
Here is the full source code:
import requests
from bs4 import BeautifulSoup
import bs4
import re

def getHTMLText(url):
    try:
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivlist(lis, li, html, count):
    soup = BeautifulSoup(html, "html.parser")
    try:
        a = soup.find_all('p', attrs={'class': 'content'})
        ll = soup.find_all('p', attrs={'class': 'author clearfix'})
        for sp in a:
            patten = re.compile(r'<span>(.*?)</span>', re.S)
            Info = re.findall(patten, str(sp))
            lis.append(Info)
            count = count + 1
        for mc in ll:
            namePatten = re.compile(r'<h2>(.*?)</h2>', re.S)
            d = re.findall(namePatten, str(mc))
            li.append(d)
    except:
        return ""
    return count

def printUnivlist(lis, li, count):
    for i in range(count):
        a = li[i][0]
        b = lis[i][0]
        print("%s:" % a + "%s" % b)

def input_enter():
    input1 = input()
    if input1 == 'Q':
        return False
    else:
        return True

def main():
    passage = 0
    for i in range(20):
        mc = input_enter()
        if mc == True:
            lit = []
            li = []
            count = 0
            passage = passage + 1
            qbpassage = passage
            print(qbpassage)
            url = 'http://www.qiushibaike.com/8hr/page/' + str(qbpassage) + '/?s=4966318'
            a = getHTMLText(url)
            # a single call is enough; calling fillUnivlist twice would
            # append every joke to the lists twice
            number = fillUnivlist(lit, li, a, count)
            printUnivlist(lit, li, number)
        else:
            break

main()
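One small idiomatic touch the original omits: guarding the entry point with if __name__ == '__main__' lets the file be imported without immediately starting the crawl. A sketch of the change:

# Optional entry-point guard (not in the original source): replace the bare
# main() call at the bottom with this, so importing the module does not crawl.
if __name__ == '__main__':
    main()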
This is my first attempt and there are still many areas that could be optimized; I hope everyone will point them out.