Home  >  Article  >  Backend Development  >  Crawler parsing method four: PyQuery

Crawler parsing method four: PyQuery

爱喝马黛茶的安东尼
爱喝马黛茶的安东尼forward
2019-06-05 15:14:533288browse

Many languages ​​​​can crawl, but crawlers based on python are more concise and convenient. Crawlers have also become an essential part of the python language. There are also many ways to parse crawlers. The previous article told you about the third method of crawler analysis: regular expression. Today I bring you another method, PyQuery.

Crawler parsing method four: PyQuery

PyQuery

The PyQuery library is also a very powerful and flexible web page parsing library. If you have front-end development experience, you can use it. If you have been exposed to jQuery, then PyQuery is a very good choice for you. PyQuery is a strict implementation of Python modeled after jQuery. The syntax is almost identical to jQuery, so no more trying to remember weird methods.

There are generally three ways to pass in during initialization: pass in string, pass in url, pass in file.

String initialization

html = 
<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
</div>
from pyquery 
import PyQuery as pq
doc = pq(html)print(doc)
print(type(doc))
print(doc(&#39;li&#39;))

The results are as follows:

Crawler parsing method four: PyQuery

Since PyQuery is more troublesome to write, we import it Alias ​​will be added when:

from pyquery import PyQuery as pq

Here we can know that the doc in the above code is actually a pyquery object. We can select elements through doc. In fact, this is a css selector, so All CSS selector rules can be used. You can directly doc (tag name) to get all the content of the tag. If you want to get the class, then doc('.class_name'), if it is the id, then doc('#id_name') ....

URL initialization

from pyquery import PyQuery as pq
doc = pq(url="http://www.baidu.com",encoding=&#39;utf-8&#39;)print(doc(&#39;head&#39;))

File initialization

We can pass in url parameters or file parameters here in pq() , of course the file here is usually an html file, for example: pq(filename='index.html')

Basic CSS selector

html = &#39;&#39;&#39;
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
print(doc(&#39;#container .list li&#39;))

One thing we need to pay attention to here is the doc ('#container .list li'), the three here do not have to be next to each other, as long as there is a hierarchical relationship, the following is the commonly used CSS selector method:

Crawler parsing method four: PyQuery

Find elements

Child elements
children,find
Code example:

html = &#39;&#39;&#39;
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
items = doc(&#39;.list&#39;)
print(type(items))
print(items)
lis = items.find(&#39;li&#39;)
print(type(lis))
print(lis)

The running results are as follows

We can also see from the results that the result found through pyquery is actually a pyquery object, and you can continue to search. items.find('li') in the above code means to find all li in ul. Tag
Of course, the same effect can be achieved through children, and the result obtained through the .children method is also a pyquery object

li = items.children()
print(type(li))
print(li)

At the same time, the CSS selector can also be used in children

li2 = items.children(&#39;.active&#39;) 
print(li2)

Parent element
parent,parents method

You can find the content of the parent element through .parent. The example is as follows:

html = &#39;&#39;&#39;<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>&#39;&#39;&#39;from pyquery import PyQuery as pq
doc = pq(html)
items = doc(&#39;.list&#39;)
container = items.parent()
print(type(container))
print(container)

You can find the ancestor node through .parents The content, examples are as follows:

html = &#39;&#39;&#39;
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
items = doc(&#39;.list&#39;)
parents = items.parents()
print(type(parents))
print(parents)

The results are as follows: From the results we can see that two parts of content are returned, one is the information of the parent node, and the other is the information of the parent node of the parent node, that is, the information of the ancestor node

Similarly when we search through .parents, we can also add css selectors to filter the content

Sibling elements
siblings

html = &#39;&#39;&#39;
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
li = doc(&#39;.list .item-0.active&#39;)
print(li.siblings())

In the code, .tem-0 and .active in doc('.list .item-0.active') are next to each other, so they are in a merged relationship, so there is only one left that meets the conditions: the third item That tag
In this way, you can get all the sibling tags through .siblings, of course, your own is not included here
Similarly, in .siblings(), you can also filter through the CSS selector

Traversal

Single element

html = &#39;&#39;&#39;
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
</div>
&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
li = doc(&#39;.item-0.active&#39;)
print(li)
lis = doc(&#39;li&#39;).items()
print(type(lis))for li in lis:    
print(type(li))    
print(li)

The running results are as follows: From the results, we can see that a generator can be obtained through items(), And each element we get through the for loop is still a pyquery object.

Get information

Get attributes
pyquery object.attr(attribute name)
pyquery object.attr.attribute name

html = &#39;&#39;&#39;
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
a = doc(&#39;.item-0.active a&#39;)
print(a)
print(a.attr(&#39;href&#39;))
print(a.attr.href)

So here we can also know that when getting the attribute value, we can directly a.attr (attribute name) or a.attr.attribute name

Get the text
In many cases, we need to obtain the text information contained in the html tag. We can obtain the text information through .text()

html = &#39;&#39;&#39;
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
a = doc(&#39;.item-0.active a&#39;)
print(a)
print(a.text())

The result is as follows:

Crawler parsing method four: PyQuery

Get html


We can get the html information contained in the current tag through .html(). The example is as follows:

html = &#39;&#39;&#39;
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
li = doc(&#39;.item-0.active&#39;)
print(li)
print(li.html())

The results are as follows:

Crawler parsing method four: PyQuery

DOM operation

addClass、removeClass
熟悉前端操作的话,通过这两个操作可以添加和删除属性

html = &#39;&#39;&#39;
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
li = doc(&#39;.item-0.active&#39;)
print(li)
li.removeClass(&#39;active&#39;)
print(li)
li.addClass(&#39;active&#39;)
print(li)

attr,css
同样的我们可以通过attr给标签添加和修改属性,
如果之前没有该属性则是添加,如果有则是修改
我们也可以通过css添加一些css属性,这个时候,标签的属性里会多一个style属性

html = &#39;&#39;&#39;
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
&#39;&#39;&#39;
from pyquery import PyQuery as pq
doc = pq(html)
li = doc(&#39;.item-0.active&#39;)
print(li)
li.attr(&#39;name&#39;, &#39;link&#39;)
print(li)
li.css(&#39;font-size&#39;, &#39;14px&#39;)
print(li)

结果如下:

Crawler parsing method four: PyQuery

 

remove
有时候我们获取文本信息的时候可能并列的会有一些其他标签干扰,这个时候通过remove就可以将无用的或者干扰的标签直接删除,从而方便操作

html = &#39;&#39;&#39;<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
 </div>&#39;&#39;&#39;from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc(&#39;.wrap&#39;)
print(wrap.text())
wrap.find(&#39;p&#39;).remove()
print(wrap.text())

结果如下:

Crawler parsing method four: PyQuery

The above is the detailed content of Crawler parsing method four: PyQuery. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:csdn.net. If there is any infringement, please contact admin@php.cn delete