
Crawling Douban movie data with Python and extracting values with XPath and the lxml module (code)

不言
2018-09-28 14:45:34

This article shows how to crawl Douban movie data with Python and extract values using XPath and the lxml module, with code. It may be a useful reference for anyone working on a similar task.

Tools: Python 3.6.5, PyCharm development tools, Windows 10 operating system, Google Chrome

Purpose: crawl the title, link address, picture, number of reviewers, rating, etc. of each movie in the Douban movie rankings.

Website: https://movie.douban.com/chart

Syntax points:

XPath syntax:

Install the XPath Helper extension in Google Chrome: it helps locate data within page elements.

1. Select a node (tag)

(1) /html/head/meta: selects all meta tags directly under html/head

2. //: selects from any node

(1) //li: all li tags on the current page

(2) /html/head//link: all link tags under head, at any depth
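The difference between an absolute path and // can be checked on a small page. The following is a minimal sketch; SAMPLE_HTML is a made-up document for illustration, not part of the Douban page.

```python
from lxml import etree

# Hypothetical sample page, invented for illustration only.
SAMPLE_HTML = """
<html>
  <head>
    <meta name="author" content="demo">
    <meta name="keywords" content="xpath">
    <link rel="stylesheet" href="a.css">
  </head>
  <body>
    <ul><li>one</li><li>two</li></ul>
    <ol><li>three</li></ol>
  </body>
</html>
"""

doc = etree.HTML(SAMPLE_HTML)

# Absolute path: walk down from the root, one level per step.
metas = doc.xpath("/html/head/meta")

# // at the start: match li elements anywhere in the document.
items = doc.xpath("//li")

# // in the middle: match link elements at any depth below head.
links = doc.xpath("/html/head//link")

print(len(metas), len(items), len(links))  # 2 3 1
```

Note that //li picks up the li elements from both the ul and the ol, because // ignores where in the tree the tag sits.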

3. The purpose of the @ symbol

(1) Select specific elements: //div[@class='feed']/ul/li selects the li tags under ul under the div whose class is 'feed'

(2) a/@href: selects the href value of a
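Both uses of @ can be tried on a small fragment. This is only a sketch: the markup below is hypothetical, and it uses a div as the container because an HTML parser does not keep a ul nested inside a p.

```python
from lxml import etree

# Hypothetical fragment, invented for illustration only.
SAMPLE_HTML = """
<div class="feed">
  <ul>
    <li><a href="/movie/1">First</a></li>
    <li><a href="/movie/2">Second</a></li>
  </ul>
</div>
<div class="sidebar">
  <ul><li><a href="/ad">Ad</a></li></ul>
</div>
"""

doc = etree.HTML(SAMPLE_HTML)

# @ inside [...] filters elements by attribute value.
feed_items = doc.xpath("//div[@class='feed']/ul/li")

# @ as the last step returns the attribute values themselves (strings).
hrefs = doc.xpath("//div[@class='feed']/ul/li/a/@href")

print(len(feed_items))  # 2
print(hrefs)            # ['/movie/1', '/movie/2']
```

The link inside the sidebar div is skipped because its container does not satisfy the class predicate.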

4. Get the text

(1) /a/text(): gets the text directly under a

(2) /a//text(): gets all the text under a, including text inside nested tags
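The difference between text() and //text() shows up as soon as the a tag contains a nested tag. A minimal sketch with a made-up fragment:

```python
from lxml import etree

# Hypothetical fragment: the a tag contains a nested span.
SAMPLE_HTML = '<div><a href="/m/1">Title <span>(2018)</span> end</a></div>'

doc = etree.HTML(SAMPLE_HTML)

# /text(): only text nodes that are direct children of a.
direct = doc.xpath("//a/text()")

# //text(): text nodes at any depth below a, in document order.
nested = doc.xpath("//a//text()")

print(direct)                     # ['Title ', ' end']
print(nested)                     # ['Title ', '(2018)', ' end']
print("".join(nested).strip())    # Title (2018) end
```

Joining the result of //text() is a common way to recover the full visible text of a tag.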


lxml syntax:

1. Installation: pip install lxml

2. Usage

from lxml import etree

element = etree.HTML("html string")

element.xpath("")
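The three steps above can be sketched end to end on an in-memory string. The markup and the variable names here are assumptions made for illustration; the point is that xpath() returns a list, and that a path starting with "." is evaluated relative to the element it is called on.

```python
from lxml import etree

# Hypothetical HTML string, invented for illustration only.
html_string = (
    "<html><body>"
    "<table><tr><td class='name'>A</td></tr></table>"
    "<table><tr><td class='name'>B</td></tr></table>"
    "</body></html>"
)

# etree.HTML parses the string and returns the root element.
element = etree.HTML(html_string)

# xpath() on elements returns a list of matching elements.
tables = element.xpath("//table")

# A leading "." restricts the search to the current table.
names = [t.xpath(".//td[@class='name']/text()")[0] for t in tables]

# No match gives an empty list, not an exception.
missing = element.xpath("//video")

print(names)    # ['A', 'B']
print(missing)  # []
```

Without the leading ".", a path such as //td would search the whole document again on every iteration and return the same first result each time.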

Code:

from lxml import etree
import requests

url = "https://movie.douban.com/chart"

headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36"
}
response = requests.get(url, headers=headers)
# decode() with no argument assumes the response body is UTF-8
html_str = response.content.decode()

#print(html_str)

html = etree.HTML(html_str)
print(html)

# 1. Get the URL of every movie
#url_list = html.xpath("//div[@class='indent']/div/table//div[@class='pl2']/a/@href")
#print(url_list)

# 2. Get the address of every image
#img_list = html.xpath("//div[@class='indent']/div/table//a[@class='nbg']/img/@src")
#print(img_list)

# 3. Select one table per movie, then extract each field relative to that table
ret1 = html.xpath("//div[@class='indent']/div/table")
print(ret1)
for table in ret1:
    item = {}
    item["title"] = table.xpath(".//div[@class='pl2']/a/text()")[0].replace("/","").strip()
    item["href"] = table.xpath(".//div[@class='pl2']/a/@href")[0]
    item["img"] = table.xpath(".//a[@class='nbg']/img/@src")[0]
    item["comment_num"] = table.xpath(".//span[@class='pl']/text()")[0]
    item["rating_num"] = table.xpath(".//span[@class='rating_nums']/text()")[0]
    print(item)
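One caveat with the loop above: indexing with [0] raises IndexError whenever a field is missing from a table, for example a movie that has no rating yet. A possible way around this is sketched below. The helper names first and parse_movies are hypothetical, and SAMPLE_HTML is made-up markup modeled on the selectors used above, not a copy of the real Douban page.

```python
from lxml import etree

# Hypothetical markup modeled on the selectors used in the article's code;
# it is NOT a copy of the real Douban page.
SAMPLE_HTML = """
<div class="indent"><div>
  <table><tr>
    <td><a class="nbg" href="https://example.com/subject/1/"><img src="https://example.com/1.jpg"></a></td>
    <td><div class="pl2">
      <a href="https://example.com/subject/1/">Movie One / <span>Alt title</span></a>
      <div><span class="rating_nums">8.5</span><span class="pl">(1000 ratings)</span></div>
    </div></td>
  </tr></table>
  <table><tr>
    <td><a class="nbg" href="https://example.com/subject/2/"><img src="https://example.com/2.jpg"></a></td>
    <td><div class="pl2"><a href="https://example.com/subject/2/">Movie Two</a></div></td>
  </tr></table>
</div></div>
"""


def first(node, path, default=None):
    """Return the first XPath match under node, or default if nothing matches."""
    matches = node.xpath(path)
    return matches[0] if matches else default


def parse_movies(html_str):
    """Build one dict per movie table, tolerating missing fields."""
    root = etree.HTML(html_str)
    movies = []
    for table in root.xpath("//div[@class='indent']/div/table"):
        title = first(table, ".//div[@class='pl2']/a/text()", "")
        movies.append({
            "title": title.replace("/", "").strip(),
            "href": first(table, ".//div[@class='pl2']/a/@href"),
            "img": first(table, ".//a[@class='nbg']/img/@src"),
            "comment_num": first(table, ".//span[@class='pl']/text()"),
            "rating_num": first(table, ".//span[@class='rating_nums']/text()"),
        })
    return movies


movies = parse_movies(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

With this sample, the second movie has no rating block, so its comment_num and rating_num come back as None instead of stopping the whole crawl.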


Statement:
This article is reproduced from cnblogs.com. If there is any infringement, please contact admin@php.cn to have it removed.