Home >Backend Development >Python Tutorial >Python crawls Douban movie data and extracts value xpath and lxml modules (code)
The content this article brings to you is about Python crawling Douban movie data and extracting value xpath and lxml modules (code). It has certain reference value. Friends in need can refer to it. I hope it will be useful to you. help.
Tools: Python 3.6.5, PyCharm development tools, Windows 10 operating system, Google Chrome
Purpose: crawl the title, link address of the movie in the Douban movie rankings, Pictures, number of reviewers, ratings, etc.
Website: https://movie.douban.com/chart
Grammar points:
xpath syntax:
Google Chrome installs the xpath helper plug-in: Help us locate data from elements
1. Select the node (label)
(1),/html/ head/meta: Can select all meta tags under html
(2), //li: All li tags on the current page
(3), /html/head//link: All link tags under head
##2, //: Can be selected from any node
(1)、//li:All li tags on the current page
## (2)、/html/head//link:head All link tags under3. The purpose of the @ symbol
(1) Select a specific element: //p[ @class='feed']/ul/li, select li under ul under p of
class='feed'
(2), a/@href: Select the href value of a
4. Get the text## ( 1), /a/text(): Get the text under a
(2), /a//text(): Get all the text under a Text
Example
:
##lxml syntax:
1. Installation: pip install lxml
2. Use
from lxml import etree
## element = etree.HTML("html string ")
element.xpath("")
Code:
from lxml import etree import requests url = "https://movie.douban.com/chart" headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36" } response = requests.get(url,headers=headers) html_str = response.content.decode() #print(html_str) html = etree.HTML(html_str) print(html) #1.获取所有的电影的URL地址 #url_list = html.xpath("//div[@class='indent']/div/table//div[@class='pl2']/a/@href") #print(url_list) #2.所有图片的地址 #img_list = html.xpath("//div[@class='indent']/div/table//a[@class='nbg']/img/@src") #print(img_list) ret1 = html.xpath("//div[@class='indent']/div/table") print(ret1) for table in ret1: item = {} item["title"] = table.xpath(".//div[@class='pl2']/a/text()")[0].replace("/","").strip() item["href"] = table.xpath(".//div[@class='pl2']/a/@href")[0] item["img"] = table.xpath(".//a[@class='nbg']/img/@src")[0] item["comment_num"] = table.xpath(".//span[@class='pl']/text()")[0] item["rating_num"] = table.xpath(".//span[@class='rating_nums']/text()")[0] print(item)
Running effect:
The above is the detailed content of Python crawls Douban movie data and extracts value xpath and lxml modules (code). For more information, please follow other related articles on the PHP Chinese website!