
html - Extracting the content of a tag with Python

I scraped a web page; part of its content is as follows:

I'm using the following code:

#coding=utf-8
import codecs
from lxml import etree
f=codecs.open("1.html","r","utf-8")
content=f.read()
f.close()
tree=etree.HTML(content)
node=tree.xpath("//p[@class='content']")[0]
print node.text.encode('gbk')

But it only prints: 奥迪阿萨德. Nothing after that first piece of text is printed. How can I fix this?

阿神阿神2720 days ago440

Replies (2)

  • 黄舟

    黄舟 2017-04-17 13:11:53

    lxml's element.text returns only the text that comes before the element's first child, so any text after a nested tag is missing from the output. You can use a getText helper like this one to collect all of the text:

    # require lxml
    # version: python2
    def getText(elem):
        rc = []
        for node in elem.itertext():
            rc.append(node.strip())
        return ''.join(rc)
    

    Add the helper to your script and change only the last line:

    #coding=utf-8
    import codecs
    from lxml import etree
    
    def getText(elem):
        rc = []
        for node in elem.itertext():
            rc.append(node.strip())
        return ''.join(rc)
    
    f=codecs.open("1.html","r","utf-8")
    content=f.read()
    f.close()
    tree=etree.HTML(content)
    # xpath returns lxml.etree._Element objects, which can be passed straight to getText.
    node=tree.xpath("//p[@class='content']")[0]
    print getText(node).encode('gbk')
    
    

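    This getText is only a simple implementation: it strips each text fragment and joins the fragments with no separator, so whitespace between fragments is lost. For example, the following XML prints abdc, which should meet your requirements: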

    <p class="content">
        a<em>b <em>d</em></em>c
    </p>
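
To make the difference between element.text and itertext() concrete, here is a minimal Python 3 sketch run against the snippet above. It assumes the standard library's xml.etree.ElementTree in place of lxml (the two share the same .text / .tail / itertext() model), and get_text, first_text, tail_text and full_text are names chosen only for this illustration.

```python
# Python 3 sketch. Assumption: the standard library's ElementTree is used
# here instead of lxml; both expose .text, .tail and itertext().
import xml.etree.ElementTree as ET

SNIPPET = """<p class="content">
    a<em>b <em>d</em></em>c
</p>"""


def get_text(elem):
    """Join every text fragment under elem, stripping each fragment."""
    return "".join(fragment.strip() for fragment in elem.itertext())


node = ET.fromstring(SNIPPET)

# .text holds only the text that precedes the first child element.
first_text = node.text.strip()
# Text that follows a child element is stored on that child as .tail.
tail_text = node[0].tail.strip()
# itertext() walks .text and .tail of the whole subtree in document order.
full_text = get_text(node)

print(first_text)  # a
print(tail_text)   # c
print(full_text)   # abdc
```

This is why printing node.text alone stops at the first nested tag: the remaining text lives in the children's .text and .tail attributes.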
    

  • 巴扎黑

    巴扎黑 2017-04-17 13:11:53

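    #!/usr/bin/env python3
    from bs4 import BeautifulSoup
    
    # Read the file as UTF-8 and name the parser explicitly,
    # which avoids BeautifulSoup's "no parser specified" warning.
    with open("1.html", "r", encoding="utf-8") as f:
        html = BeautifulSoup(f.read(), "html.parser")
    node = html.select(".content")[0]
    print(node.prettify())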
    

    The selector in html.select(".content") may need to be more specific to match only the element you want. This is just a rough outline of how BeautifulSoup works; for specific needs, check the manual: Beautiful Soup Document
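
Note that prettify() prints the markup of the node rather than its text. If only the text is wanted, a minimal sketch with BeautifulSoup's get_text() could look like the following; the inline HTML string and the p.content selector are assumptions made for illustration, standing in for the contents of 1.html.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the contents of 1.html.
HTML = '<p class="content">a<em>b <em>d</em></em>c</p>'

soup = BeautifulSoup(HTML, "html.parser")
node = soup.select_one("p.content")

# strip=True strips each text fragment before joining them.
joined = node.get_text(strip=True)
# A separator keeps the fragments apart instead of running them together.
spaced = node.get_text(separator=" ", strip=True)

print(joined)  # abdc
print(spaced)  # a b d c
```

Passing a separator is one way to keep words from nested tags from running together in the output.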
