Installing and using PyQuery, a helper module for Python crawlers
This article introduces how to install and use PyQuery, a helper module for Python crawlers. PyQuery makes it easy to parse HTML content, which has made it a favorite of many crawler developers. Readers who need it can use this article as a reference.
Installation under Windows:
Download address: https://pypi.python.org/pypi/pyquery/#downloads
After downloading, install it:
C:\Python27>easy_install E:\python\pyquery-1.2.4.zip
You can also install it online directly:
C:\Python27>easy_install pyquery
Note that easy_install is deprecated; on current Python versions, pip install pyquery does the same job.
pyquery is a Python library similar to jQuery. It lets you use jQuery-like syntax to extract any data from a web page, which makes it a very good third-party library for extracting and mining data from HTML pages. Let's look at how pyquery is used.
Extract information from an HTML string
#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq

html = '''
<html>
    <head>
        <title>this is title</title>
    </head>
    <body>
        <p id="hi">Hello, World</p>
        <p id="hi2">Nihao</p>
        <div class="class1">
            <img src="1.jpg" />
        </div>
        <ul>
            <li>list1</li>
            <li>list2</li>
        </ul>
    </body>
</html>
'''

d = pq(html)
print(d('title'))                                  # like a CSS tag selector: select elements by tag name
print(d('title').text())                           # text() returns the text of the current selection
print(d('#hi').text())                             # like an id selector: select an element by its id
print(d('p').filter('#hi2').text())                # narrow a selection down by id or class
print(d('.class1'))                                # like a class selector
print(d('.class1').html())                         # html() returns the inner HTML of the current selection
print(d('.class1').find('img').attr('src'))        # find a nested element and read one of its attributes
print(d('ul').find('li').eq(0).text())             # pick one of several matching elements by index
print(d('ul').children())                          # get all child elements
print(d('ul').children().eq(0))                    # get a child element by index
print(d('img').parents())                          # get the ancestor elements
print(d('#hi').next())                             # get the next sibling element
print(d('#hi').nextAll())                          # get all following sibling elements
print(d('p').not_('#hi2'))                         # return the elements that do not match the selector

# Iterate over all matching elements
for i in d.items('li'):
    print(i.text())

print([i.text() for i in d.items('li')])           # items() also works in a list comprehension
print(d.make_links_absolute(base_url='http://www.baidu.com'))  # turn relative links in the document into absolute ones
The code above covers the most commonly used pyquery operations. We first define a piece of HTML and then apply a series of pyquery methods to it, mainly to select specific elements and read their text. pyquery can do more than select elements: it can also set element attributes, add elements, and so on. Since the methods shown above are the ones used most often, the others are not covered in detail here.
Extract information from a URL or a local HTML file
pyquery can parse more than HTML strings like the one above. It can also load a page directly:
d = pq(url='http://www.baidu.com/')
Here we load a URL directly; everything else works the same way as above. By default this uses the urllib module to make the HTTP request, but if requests is installed on your system, requests is used instead, which means you can pass any requests parameter, for example:
pq('http://www.baidu.com/', headers={'user-agent': 'pyquery'})
Or, if you already have the HTML file locally, you can do this:
d = pq(filename=path_to_html_file)
This points pyquery directly at a local HTML file; the operations are again the same as above.
As you can see, pyquery makes it convenient to select any element, just as jQuery does.
Use pyquery to scrape the Douban Top 250 movies
Now that we have covered pyquery's syntax, let's look at an example: scraping the Douban Top 250 movies.
Douban's anti-crawler measures are strict, and loading the URL directly with pyquery stopped working after a few runs. Instead, I download the page with requests first and then parse the saved page with pyquery to extract the information:
from pyquery import PyQuery as pq
import requests

# Browser-like request headers
head_req = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Referer': 'https://movie.douban.com/top250?start=0',
}

# Download the page with requests and save it to a local file
r = requests.get("https://movie.douban.com/top250?start=0", headers=head_req)
with open("1.html", "wb") as html:
    html.write(r.content)

# Parse the saved file with pyquery
d = pq(filename="1.html")
# print(d('ol').find('li').html())
for data in d('ol').items('li'):
    print(data.find('.hd').find('.title').eq(0).text())      # movie title
    print(data.find('.star').find('.rating_num').text())     # rating
    print(data.find('.quote').find('.inq').text())           # one-line quote
    print("")                                                # blank line between movies
Run it and see the result:
肖申克的救赎
9.6
希望让人自由。

这个杀手不太冷
9.4
怪蜀黍和小萝莉不得不说的故事。

阿甘正传
9.4
一部美国近现代史。

霸王别姬
9.4
风华绝代。

美丽人生
9.5
最美的谎言。

千与千寻
9.2
最好的宫崎骏,最好的久石让。

辛德勒的名单
9.4
拯救一个人,就是拯救整个世界。

海上钢琴师
9.2
每个人都要走一条自己坚定了的路,就算是粉身碎骨。

机器人总动员
9.3
小瓦力,大人生。

盗梦空间
9.2
诺兰给了我们一场无法盗取的梦。

泰坦尼克号
9.1
失去的才是永恒的。

三傻大闹宝莱坞
9.1
英俊版憨豆,高情商版谢耳朵。

放牛班的春天
9.2
天籁一般的童声,是最接近上帝的存在。

忠犬八公的故事
9.2
永远都不能忘记你所爱的人。

龙猫
9.1
人人心中都有个龙猫,童年就永远不会消失。

大话西游之大圣娶亲
9.1
一生所爱。

教父
9.2
千万不要记恨你的对手,这样会让你失去理智。

乱世佳人
9.2
Tomorrow is another day.

天堂电影院
9.1
那些吻戏,那些青春,都在影院的黑暗里被泪水冲刷得无比清晰。

当幸福来敲门
8.9
平民励志片。

搏击俱乐部
9.0
邪恶与平庸蛰伏于同一个母体,在特定的时间互相对峙。

楚门的世界
9.0
如果再也不能见到你,祝你早安,午安,晚安。

触不可及
9.1
满满温情的高雅喜剧。

指环王3:王者无敌
9.1
史诗的终章。

罗马假日
8.9
爱情哪怕只有一天。
Of course, these are only the 25 entries on the first page. The URL of the Douban Top 250 list is:
https://movie.douban.com/top250?start=0
The start parameter begins at 0 and increases by 25 per page, up to:
https://movie.douban.com/top250?start=225
So a loop over these ten URLs is enough to fetch the whole list.