
In Python, how can I get all of the page source up to a given element?

<html>
    <head>
        <title>The Dormouse's story </title>
    </head> 
    <body> 
        <p id="p1">p1p1p1
            <b id='b1'>b1b1b1</b>
        </p> 
        <p id="p2">p2p2p2
            <ul id='u1'>u1u1u1</ul>
            <a id="a1">a1a1a1</a>
            <div id='d1'>
                <a id="a2">a2a2a2 </a>
                <b id='b2'>b2b2b2</b>
                <p id='p3'>p3p3p3</p>
            </div>
            <a id="a3">a3a3a3 </a>
        </p> 
        <p id="p4">p4p4p4</p>
    </body>
</html>

For example, take the first a element, a#a1. I want all of the page source up to and including that element:

<html>
    <head>
        <title>The Dormouse's story </title>
    </head> 
    <body> 
        <p id="p1">p1p1p1
            <b id='b1'>b1b1b1</b>
        </p> 
        <p id="p2">p2p2p2
            <ul id='u1'>u1u1u1</ul>
            <a id="a1">a1a1a1</a>
        </p>
    </body>
</html>


# HTML
高洛峰 · 2714 days ago · 634 views

All replies (1)

  • 三叔

    三叔 2016-10-22 16:07:15

    Your original HTML is not valid, so I changed it slightly. The code below does this with lxml.

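    # HTML from the question, made valid. Assumed fix: <p id="p2"> becomes
    # <div id="p2">, since a <p> cannot contain <ul> or <div>.
    doc = '''
    <html>
        <head>
            <title>The Dormouse's story </title>
        </head>
        <body>
            <p id="p1">p1p1p1
                <b id='b1'>b1b1b1</b>
            </p>
            <div id="p2">p2p2p2
                <ul id='u1'>u1u1u1</ul>
                <a id="a1">a1a1a1</a>
                <div id='d1'>
                    <a id="a2">a2a2a2 </a>
                    <b id='b2'>b2b2b2</b>
                    <p id='p3'>p3p3p3</p>
                </div>
                <a id="a3">a3a3a3 </a>
            </div>
            <p id="p4">p4p4p4</p>
        </body>
    </html>
    '''

    from lxml import html

    tree = html.fromstring(doc)
    a = tree.get_element_by_id("a1")
    print(html.tostring(a))
    print(html.tostring(tree).decode())


    def dropnode(e=None):
        # Drop every sibling after e, then repeat for each ancestor up to <body>.
        if e is None:
            return
        if e.tag == 'body':
            return
        nd = e.getnext()
        while nd is not None:
            nd.drop_tree()
            nd = e.getnext()
        dropnode(e.getparent())


    dropnode(a)
    print(html.tostring(tree).decode())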
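If the markup has to stay exactly as written, including the non-standard nesting from the question, a text-based variant may be worth trying. The sketch below is not the lxml approach: it uses only the standard library's `html.parser` to find where the target element closes, slices the raw source there, and appends closing tags for whatever is still open. `OpenTagTracker`, `source_up_to`, and `VOID_TAGS` are names made up for this illustration, and the sketch assumes the target element has an explicit closing tag.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Records where the element with target_id closes and which tags are still open."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.stack = []           # tags opened but not yet closed
        self.target_depth = None  # stack depth right after the target was opened
        self.cut = None           # (line, column) of the tag that closes the target
        self.ancestors = None     # tags still open once the target is closed

    def handle_starttag(self, tag, attrs):
        if self.cut is not None or tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if self.target_depth is None and dict(attrs).get("id") == self.target_id:
            self.target_depth = len(self.stack)

    def handle_endtag(self, tag):
        if self.cut is not None or tag not in self.stack:
            return
        while self.stack.pop() != tag:  # also discards unclosed inner tags
            pass
        if self.target_depth is not None and len(self.stack) < self.target_depth:
            self.cut = self.getpos()
            self.ancestors = list(self.stack)


def source_up_to(source, target_id):
    """Return the raw source through the target element, with its ancestors closed."""
    tracker = OpenTagTracker(target_id)
    tracker.feed(source)
    tracker.close()
    if tracker.cut is None:
        raise ValueError(f"no closed element with id={target_id!r}")
    line, column = tracker.cut
    # getpos() gives a 1-based line and a 0-based column; convert to a string offset.
    offset = sum(len(text) + 1 for text in source.split("\n")[:line - 1]) + column
    end = source.index(">", offset) + 1
    closing = "".join(f"</{tag}>" for tag in reversed(tracker.ancestors))
    return source[:end] + "\n" + closing


# Shortened version of the question's HTML, with the original <p> nesting kept.
page = """<html>
    <body>
        <p id="p2">p2p2p2
            <ul id='u1'>u1u1u1</ul>
            <a id="a1">a1a1a1</a>
            <div id='d1'><a id="a2">a2a2a2 </a></div>
            <a id="a3">a3a3a3 </a>
        </p>
        <p id="p4">p4p4p4</p>
    </body>
</html>"""

result = source_up_to(page, "a1")
print(result)
```

Because the source is sliced rather than re-serialized, everything before the cut is byte-for-byte what the page contained; only the appended closing tags are generated.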

