Practical use of crawlers in Python: Douban book crawler
Python is one of today's most popular programming languages and is widely used in fields such as data science, artificial intelligence, and network security. Python is particularly well suited to web crawling, and many companies and individuals use it for data collection and analysis. This article shows how to use Python to crawl Douban book information and gives readers a first look at the methods and techniques behind Python web crawlers.
For the Douban book crawler we need two important Python libraries: urllib and beautifulsoup4. The urllib library handles network requests and reading the response data, while beautifulsoup4 parses structured documents such as HTML and XML so that we can extract the information we need. urllib ships with Python's standard library; beautifulsoup4 must be installed first, which can be done with the pip command (pip install beautifulsoup4). Once the installation is complete, we can start the hands-on part.
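A quick way to confirm the environment is ready (a minimal check of my own, assuming beautifulsoup4 was installed with the pip command just mentioned):

# urllib ships with Python; bs4 comes from the beautifulsoup4 package (pip install beautifulsoup4)
import urllib.request
from bs4 import BeautifulSoup

print(BeautifulSoup('<p>ok</p>', 'html.parser').p.string)  # prints: ok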
When building a crawler with Python, you first need to define the crawling target. In this article, our goal is to collect the basic information of Douban books: title, author, publisher, publication date, rating, and so on. We also need to crawl multiple pages of the book list.
After determining the crawling target, we need to analyze the HTML structure of Douban Books to locate the required information. We can use the developer tools built into browsers such as Chrome or Firefox to view the page source code. By observing the HTML structure, we can find the tags and attributes that hold the data we want, and then write Python code to extract them.
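To make this concrete, here is a minimal, self-contained sketch. The HTML string is a simplified mock of the kind of structure the crawler below assumes (one table per book inside a div with class "article"); the real Douban page contains far more markup, so always verify the selectors against the live source with your browser's developer tools.

from bs4 import BeautifulSoup

# Simplified mock of the assumed page structure; not the real Douban markup
mock_html = '''
<div class="article">
  <table>
    <tr><td>
      <div class="pl2"><a href="https://book.douban.com/subject/1/" title="示例书名">示例书名</a></div>
      <p class="pl">作者 / 出版社 / 2006-5 / 29.00元</p>
      <span class="rating_nums">9.0</span>
    </td></tr>
  </table>
</div>
'''

soup = BeautifulSoup(mock_html, 'html.parser')
for table in soup.find('div', attrs={'class': 'article'}).find_all('table'):
    link = table.find('div', attrs={'class': 'pl2'}).find('a')
    print(link.get('title'), link.get('href'))                        # 示例书名 https://book.douban.com/subject/1/
    print(table.find('span', attrs={'class': 'rating_nums'}).string)  # 9.0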
Next, we write the Douban book crawler in Python. The core of the code consists of three steps: requesting the page, parsing the HTML, and saving the extracted data. The following is the complete code:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://book.douban.com/top250'
books = []  # accumulates one dict per book


def get_html(url):
    # Send the request with a browser-like User-Agent and return the page as a utf-8 string
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'}
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    return html


def parse_html(html):
    # Each book on the list page sits in its own table inside the div with class "article"
    soup = BeautifulSoup(html, 'html.parser')
    book_list_soup = soup.find('div', attrs={'class': 'article'})
    for book_soup in book_list_soup.find_all('table'):
        book_title_soup = book_soup.find('div', attrs={'class': 'pl2'})
        book_title_link = book_title_soup.find('a')
        book_title = book_title_link.get('title')
        book_url = book_title_link.get('href')
        book_info_soup = book_soup.find('p', attrs={'class': 'pl'})
        book_info = book_info_soup.string.strip()
        book_rating_num_soup = book_soup.find('span', attrs={'class': 'rating_nums'})
        book_rating_num = book_rating_num_soup.string.strip()
        book_rating_people_num_span_soup = book_soup.find('span', attrs={'class': 'pl'})
        book_rating_people_num = book_rating_people_num_span_soup.string.strip()[1:-4]
        # The author/publisher/year information is slash-separated
        book_author_and_publish_soup = book_soup.find('p', attrs={'class': 'pl'}).next_sibling.string.strip()
        book_author_and_publish = book_author_and_publish_soup.split('/')
        book_author = book_author_and_publish[0]
        book_publish = book_author_and_publish[-3]
        book_year = book_author_and_publish[-2]
        books.append({
            'title': book_title,
            'url': book_url,
            'info': book_info,
            'author': book_author,
            'publish': book_publish,
            'year': book_year,
            'rating_num': book_rating_num,
            'rating_people_num': book_rating_people_num
        })


def save_data():
    # Write every collected book to a local text file
    with open('douban_top250.txt', 'w', encoding='utf-8') as f:
        for book in books:
            f.write('书名:{0}\n'.format(book['title']))
            f.write('链接:{0}\n'.format(book['url']))
            f.write('信息:{0}\n'.format(book['info']))
            f.write('作者:{0}\n'.format(book['author']))
            f.write('出版社:{0}\n'.format(book['publish']))
            f.write('出版年份:{0}\n'.format(book['year']))
            f.write('评分:{0}\n'.format(book['rating_num']))
            f.write('评分人数:{0}\n'.format(book['rating_people_num']))


if __name__ == '__main__':
    # The list spans 10 pages of 25 books each; the start parameter controls the offset
    for i in range(10):
        start = i * 25
        url = 'https://book.douban.com/top250?start={0}'.format(start)
        html = get_html(url)
        parse_html(html)
    save_data()
Code analysis:
First, we define the main website url and an empty list books (used to store the book information). Next, we write the get_html function to send a request and obtain the HTML page. In this function, we set a request header that simulates a browser, to reduce the chance of being blocked by the website. We use urllib's Request class to bundle the header and URL into a request object, send it with urlopen to retrieve the page, and finally convert the response body into a utf-8 string with the read and decode methods.
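One optional refinement (my own addition, not part of the original script) is to pass a timeout to urlopen, so that a stalled request raises an error instead of hanging indefinitely:

import urllib.request

def get_html_with_timeout(url, timeout=10):
    # Same request pattern as get_html, but give up after `timeout` seconds (raises URLError)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'}
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req, timeout=timeout)
    return response.read().decode('utf-8')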
We then write the parse_html function to parse the HTML document and extract the required information. In this function, we use the find and find_all methods of beautifulsoup4 to locate the tags and attributes we need in the HTML page. Specifically, by inspecting the HTML structure of Douban Books, we find the table tag that wraps each book, along with the elements holding the title, link, info line, and rating, and write code to extract those values. We also use the strip and split string methods to remove surrounding whitespace and to split the combined author/publisher/year string.
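As a small illustration of that string handling (the sample line below is invented for demonstration, not taken from the live page):

# strip removes surrounding whitespace; split('/') separates author, publisher, year, price
line = '  作者 / 出版社 / 2006-5 / 29.00元  '
parts = [p.strip() for p in line.strip().split('/')]
author, publisher, year = parts[0], parts[-3], parts[-2]
print(author, publisher, year)   # 作者 出版社 2006-5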
Finally, we write the save_data function to store the extracted book information in a local file. It uses Python's built-in open function to open a text file in write mode, formats each book's fields into strings with the format method, and writes them to the file. Note that we pass encoding='utf-8' when opening the file so that the content is not garbled.
In the main program, we use a for loop to crawl the first 250 books on Douban Books: each page lists 25 books, so we crawl 10 pages in total. In each iteration we compute the URL for the current page, call get_html to fetch the page, and pass the result to parse_html, which extracts the required information. After the loop finishes, we call save_data to write all the book information to a local file.
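Concretely, the loop produces these page URLs, with the start parameter advancing by 25 per page:

for i in range(10):
    print('https://book.douban.com/top250?start={0}'.format(i * 25))
# https://book.douban.com/top250?start=0
# https://book.douban.com/top250?start=25
# ...
# https://book.douban.com/top250?start=225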
After the code is written, open a command line (Windows) or terminal (macOS or Linux), change into the directory containing the script, and run it with python3 <script name>.py, for example python3 douban_crawler.py if that is what you named the file. While the program runs, watch its output to confirm it is executing correctly; when it finishes, check the local file douban_top250.txt to confirm the data was saved successfully.
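A quick way to spot-check the saved file from Python (a small helper of my own for illustration, not part of the original script):

# Print the first couple of saved records to confirm the crawl worked
with open('douban_top250.txt', encoding='utf-8') as f:
    for line in f.readlines()[:16]:   # each book occupies 8 lines in the output file
        print(line.rstrip())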
Summary
Through this article, we have gained a first look at how Python web crawlers are implemented. Using the urllib and beautifulsoup4 libraries, we wrote a program that crawls Douban Books information based on the HTML structure of the site and successfully collected and stored the data. In practice, you should also keep common crawler precautions in mind, such as not sending requests to the same website too frequently, to avoid having your IP address blocked.
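For example, a common courtesy measure (my own addition, not part of the script above) is to pause between page requests with time.sleep:

import time

for i in range(10):
    start = i * 25
    url = 'https://book.douban.com/top250?start={0}'.format(start)
    html = get_html(url)
    parse_html(html)
    time.sleep(2)   # wait a couple of seconds between pages so we do not hit the site too often
save_data()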