


This article introduces how to install and use PyQuery, a helper module for Python crawlers. PyQuery makes it easy to parse HTML content, which has made it a favorite among developers of crawler programs. Readers who need it can use this as a reference.
Installation under Windows:
Download address: https://pypi.python.org/pypi/pyquery/#downloads
After downloading, install it:
C:\Python27>easy_install E:\python\pyquery-1.2.4.zip
You can also install it directly online (on current Python versions, pip install pyquery does the same job):
C:\Python27>easy_install pyquery
pyquery is a Python library modeled on jQuery. It lets you use jQuery-style syntax to extract data from a web page, which makes it a very good third-party library for extracting and mining data from HTML pages. Let's look at how pyquery is used.
Extracting information from an HTML string
#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq

html = '''
<html>
    <head>
        <title>this is title</title>
    </head>
    <body>
        <p id="hi">Hello, World</p>
        <p id="hi2">Nihao</p>
        <div class="class1">
            <img src="1.jpg" />
        </div>
        <ul>
            <li>list1</li>
            <li>list2</li>
        </ul>
    </body>
</html>
'''

d = pq(html)
print(d('title'))                                   # like a CSS selector: select elements by tag name
print(d('title').text())                            # text() returns the text of the current selection
print(d('#hi').text())                              # like an id selector: select an element by its id
print(d('p').filter('#hi2').text())                 # narrow a selection by id or class
print(d('.class1'))                                 # like a class selector
print(d('.class1').html())                          # html() returns the inner HTML of the current selection
print(d('.class1').find('img').attr('src'))         # find a nested element and read one of its attributes
print(d('ul').find('li').eq(0).text())              # pick one of several matching elements by index
print(d('ul').children())                           # get all child elements
print(d('ul').children().eq(0))                     # get a child element by index
print(d('img').parents())                           # get the ancestor elements
print(d('#hi').next())                              # get the next sibling element
print(d('#hi').nextAll())                           # get all following sibling elements
print(d('p').not_('#hi2'))                          # return the elements that do not match the selector

# Iterate over all matching elements
for i in d.items('li'):
    print(i.text())
print([i.text() for i in d.items('li')])            # the same iteration as a list comprehension

print(d.make_links_absolute(base_url='http://www.baidu.com'))  # turn relative links in the document into absolute ones
The snippet above covers the most commonly used pyquery operations. We first define a piece of HTML and then apply a series of pyquery methods to it, mainly to select specific elements and read their text. pyquery can do more than select elements: it can also set element attributes, add elements, and so on. Since the methods shown above are the ones used most often, the others are not covered in detail here.
Extracting information from a URL or a local HTML file
pyquery can parse more than HTML strings like the one above. It can also load a page directly:
d = pq(url='http://www.baidu.com/')
We can load a URL directly, and the resulting object is used exactly as before. By default this uses the urllib module to make the HTTP request, but if requests is installed on your system, requests is used instead, which means you can pass requests parameters such as headers:
pq('http://www.baidu.com/', headers={'user-agent': 'pyquery'})
If you already have the HTML file locally, you can load it like this:
d = pq(filename=path_to_html_file)
This form loads the local HTML file directly, and the resulting object is used the same way as above.
As you can see, pyquery makes it convenient to select any element, just like jQuery.
Using pyquery to scrape the Douban Top 250 movies
Now that we have covered pyquery's syntax, let's look at an example: scraping the Douban Top 250 movies.
Because Douban's anti-crawler measures are strict, fetching the page stopped working after I ran the script a few times. I had to download the page with requests first and then use pyquery to parse it and extract the information:
from pyquery import PyQuery as pq
import requests

head_req = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Referer': 'https://movie.douban.com/top250?start=0',
}

# Download the page with requests and save it to a local file.
r = requests.get("https://movie.douban.com/top250?start=0", headers=head_req)
with open("1.html", "wb") as f:
    f.write(r.content)

# Parse the saved page with pyquery.
d = pq(filename="1.html")
# print(d('ol').find('li').html())
for data in d('ol').items('li'):
    print(data.find('.hd').find('.title').eq(0).text())   # movie title
    print(data.find('.star').find('.rating_num').text())  # rating
    print(data.find('.quote').find('.inq').text())        # one-line quote
    print('')                                             # blank line between movies
Run it and see the result:
肖申克的救赎
9.6
希望让人自由。

这个杀手不太冷
9.4
怪蜀黍和小萝莉不得不说的故事。

阿甘正传
9.4
一部美国近现代史。

霸王别姬
9.4
风华绝代。

美丽人生
9.5
最美的谎言。

千与千寻
9.2
最好的宫崎骏,最好的久石让。

辛德勒的名单
9.4
拯救一个人,就是拯救整个世界。

海上钢琴师
9.2
每个人都要走一条自己坚定了的路,就算是粉身碎骨。

机器人总动员
9.3
小瓦力,大人生。

盗梦空间
9.2
诺兰给了我们一场无法盗取的梦。

泰坦尼克号
9.1
失去的才是永恒的。

三傻大闹宝莱坞
9.1
英俊版憨豆,高情商版谢耳朵。

放牛班的春天
9.2
天籁一般的童声,是最接近上帝的存在。

忠犬八公的故事
9.2
永远都不能忘记你所爱的人。

龙猫
9.1
人人心中都有个龙猫,童年就永远不会消失。

大话西游之大圣娶亲
9.1
一生所爱。

教父
9.2
千万不要记恨你的对手,这样会让你失去理智。

乱世佳人
9.2
Tomorrow is another day.

天堂电影院
9.1
那些吻戏,那些青春,都在影院的黑暗里被泪水冲刷得无比清晰。

当幸福来敲门
8.9
平民励志片。

搏击俱乐部
9.0
邪恶与平庸蛰伏于同一个母体,在特定的时间互相对峙。

楚门的世界
9.0
如果再也不能见到你,祝你早安,午安,晚安。

触不可及
9.1
满满温情的高雅喜剧。

指环王3:王者无敌
9.1
史诗的终章。

罗马假日
8.9
爱情哪怕只有一天。
Of course, these are only the 25 entries on the first page. We already know the URL of the Douban Top 250 list:
https://movie.douban.com/top250?start=0
The start parameter begins at 0 and increases by 25 for each page, up to
https://movie.douban.com/top250?start=225
So you can write a loop to fetch all of the pages.
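The paging rule above can be sketched as follows. This only builds the list of page URLs and makes no requests; BASE_URL, PAGE_SIZE, TOTAL and page_urls are names chosen for this illustration.

```python
BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25   # movies per page
TOTAL = 250      # movies in the whole list

def page_urls():
    """Return one URL per page: start=0, 25, ..., 225."""
    return ['{}?start={}'.format(BASE_URL, start)
            for start in range(0, TOTAL, PAGE_SIZE)]

urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each URL can then be passed through the same download-and-parse steps used for the first page. Pausing between requests is probably wise given the anti-crawler behavior described earlier.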
