Installing and Using PyQuery, a Helper Module for Python Crawlers

高洛峰

Mar 04, 2017, 04:04 PM

This article explains how to install and use PyQuery, a helper module for Python crawlers. PyQuery makes it easy to parse HTML content, which has made it popular with many crawler developers. Readers who need it can use this article as a reference.

Installation under Windows:
Download address: https://pypi.python.org/pypi/pyquery/#downloads

After downloading, install the package:

C:\Python27>easy_install E:\python\pyquery-1.2.4.zip

You can also install it directly from PyPI (on current Python versions, pip install pyquery does the same job):

C:\Python27>easy_install pyquery

pyquery is a Python library modeled on jQuery. It lets you use jQuery-like syntax to extract any data from a web page, which makes it a very good third-party library for extracting and mining data from HTML pages. Let's look at how pyquery is used.

Extract information from an HTML string

#!/usr/bin/python
# -*- coding: utf-8 -*-

from pyquery import PyQuery as pq

html = '''
<html>
<head>
 <title>this is title</title>
</head>
<body>
 <p id="hi">Hello, World</p>
 <p id="hi2">Nihao</p>
 <div class="class1">
  <img src="1.jpg" />
 </div>
 <ul>
  <li>list1</li>
  <li>list2</li>
 </ul>
</body>
</html>
'''
d = pq(html)

print(d('title'))  # like a CSS selector: get elements by HTML tag name
print(d('title').text())  # text() returns the text of the current selection

print(d('#hi').text())  # like an id selector: get the element by its id
print(d('p').filter('#hi2').text())  # narrow the matched elements by id or class
print(d('.class1'))  # like a class selector
print(d('.class1').html())  # html() returns the inner HTML of the current selection
print(d('.class1').find('img').attr('src'))  # find a nested element and read an attribute
print(d('ul').find('li').eq(0).text())  # pick one of several matching elements by index
print(d('ul').children())  # get all child elements
print(d('ul').children().eq(0))  # get a child element by index
print(d('img').parents())  # get the ancestor (parent) elements
print(d('#hi').next())  # get the next sibling element
print(d('#hi').nextAll())  # get all following sibling elements
print(d('p').not_('#hi2'))  # get the elements that do not match the selector
# iterate over all matched elements
for i in d.items('li'):
    print(i.text())
print([i.text() for i in d.items('li')])  # iteration inside a list comprehension
print(d.make_links_absolute(base_url='http://www.baidu.com'))  # turn relative links in the document into absolute links

The code above demonstrates the most commonly used pyquery operations. We first define a piece of HTML, then use a series of pyquery methods on it, mainly to select specific elements and read their text. pyquery can do more than select elements: it can also set element attributes, add elements, and so on. Since the methods shown above are the ones used most often, the others are not covered in detail here.

Extract information from a URL or a local HTML file

pyquery can parse more than HTML strings like the one above. It can also load a page like this:

d = pq(url='http://www.baidu.com/')

We can load a URL directly, and the resulting object is used exactly as before. By default this makes the HTTP request with the urllib module, but if requests is installed on your system, requests is used instead, which means you can pass any requests parameter, for example:

pq('http://www.baidu.com/', headers={'user-agent': 'pyquery'})

Or, if you already have the HTML file on your local disk, you can do this:

d = pq(filename=path_to_html_file)

This form points directly at a local HTML file, and the resulting object is again used in the same way as above.
As you can see, pyquery makes it convenient to select any element, just like jQuery.

Use pyquery to scrape the Douban Top 250 movies

Now that we have seen the pyquery syntax, let's look at an example that scrapes the Douban Top 250 movie list.
Douban's anti-crawler measures are strict, and loading the URL directly with pyquery stopped working after a few runs. Instead, I download the page with requests first and then parse the saved page with pyquery to extract the information:

from pyquery import PyQuery as pq
import requests

# Browser-like request headers, so the request is not rejected outright
head_req = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Referer': 'https://movie.douban.com/top250?start=0',
}
# Download the page and save it to a local file
r = requests.get("https://movie.douban.com/top250?start=0", headers=head_req)
with open("1.html", "wb") as html:
    html.write(r.content)

# Parse the saved page
d = pq(filename="1.html")

# print(d('ol').find('li').html())
for data in d('ol').items('li'):
    print(data.find('.hd').find('.title').eq(0).text())  # movie title
    print(data.find('.star').find('.rating_num').text())  # rating
    print(data.find('.quote').find('.inq').text())  # one-line quote
    print("")

Run it and see the result:

肖申克的救赎
9.6
希望让人自由。

这个杀手不太冷
9.4
怪蜀黍和小萝莉不得不说的故事。

阿甘正传
9.4
一部美国近现代史。

霸王别姬
9.4
风华绝代。

美丽人生
9.5
最美的谎言。

千与千寻
9.2
最好的宫崎骏，最好的久石让。

辛德勒的名单
9.4
拯救一个人，就是拯救整个世界。

海上钢琴师
9.2
每个人都要走一条自己坚定了的路，就算是粉身碎骨。

机器人总动员
9.3
小瓦力，大人生。

盗梦空间
9.2
诺兰给了我们一场无法盗取的梦。

泰坦尼克号
9.1
失去的才是永恒的。

三傻大闹宝莱坞
9.1
英俊版憨豆，高情商版谢耳朵。

放牛班的春天
9.2
天籁一般的童声，是最接近上帝的存在。

忠犬八公的故事
9.2
永远都不能忘记你所爱的人。

龙猫
9.1
人人心中都有个龙猫，童年就永远不会消失。

大话西游之大圣娶亲
9.1
一生所爱。

教父
9.2
千万不要记恨你的对手，这样会让你失去理智。

乱世佳人
9.2
Tomorrow is another day.

天堂电影院
9.1
那些吻戏，那些青春，都在影院的黑暗里被泪水冲刷得无比清晰。

当幸福来敲门
8.9
平民励志片。

搏击俱乐部
9.0
邪恶与平庸蛰伏于同一个母体，在特定的时间互相对峙。

楚门的世界
9.0
如果再也不能见到你，祝你早安，午安，晚安。

触不可及
9.1
满满温情的高雅喜剧。

指环王3：王者无敌
9.1
史诗的终章。

罗马假日
8.9
爱情哪怕只有一天。

Of course, these are only the 25 entries on the first page. We already know the URL of the Douban Top 250 list:

https://movie.douban.com/top250?start=0

The start parameter begins at 0 and increases by 25 for each page, up to:

https://movie.douban.com/top250?start=225

So a simple loop over these URLs can fetch all of them.
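The loop described above could be sketched as follows. The names page_urls, crawl, fetch_page and delay are hypothetical and not part of the original article; fetch_page stands for whatever function downloads and parses one page (for example, the requests plus pyquery code shown earlier), and the pause between requests is an assumption meant to keep the crawler polite.

```python
import time

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Return the URL of every page: start=0, 25, ..., 225."""
    return ["%s?start=%d" % (BASE_URL, start)
            for start in range(0, total, page_size)]


def crawl(urls, fetch_page, delay=1.0):
    """Call fetch_page(url) for each URL and collect the results.

    fetch_page is supplied by the caller (hypothetical helper) and should
    return a list of records for one page. delay is the pause in seconds
    between two requests.
    """
    records = []
    for url in urls:
        records.extend(fetch_page(url))
        time.sleep(delay)
    return records


if __name__ == "__main__":
    for url in page_urls():
        print(url)
```

Keeping the URL generation separate from the fetching makes it easy to test the loop without touching the network, for instance by passing a stub function as fetch_page.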
