
html - In Python, how do I get all of the page source that comes before a given element?

<html>
    <head>
        <title>The Dormouse's story </title>
    </head> 
    <body> 
        <p id="p1">p1p1p1
            <b id='b1'>b1b1b1</b>
        </p> 
        <p id="p2">p2p2p2
            <ul id='u1'>u1u1u1</ul>
            <a id="a1">a1a1a1</a>
            <p id='d1'>
                <a id="a2">a2a2a2 </a>
                <b id='b2'>b2b2b2</b>
                <p id='p3'>p3p3p3</p>
            </p>
            <a id="a3">a3a3a3 </a>
        </p> 
        <p id="p4">p4p4p4</p>
    </body>
</html>

For example, take the first a element, a#a1. I want all of the page source from the top of the document down to this element:

<html>
    <head>
        <title>The Dormouse's story </title>
    </head> 
    <body> 
        <p id="p1">p1p1p1
            <b id='b1'>b1b1b1</b>
        </p> 
        <p id="p2">p2p2p2
            <ul id='u1'>u1u1u1</ul>
            <a id="a1">a1a1a1</a>
        </p>
    </body>
</html>

Asked by 天蓬老师, 2742 days ago

All replies (5)

  • 阿神

    阿神 2017-04-18 09:49:08

    Your original HTML is not valid, so I changed it a little.
    Below is a solution using lxml.

    doc = '''
    <html>
        <head>
            <title>The Dormouse's story </title>
        </head> 
        <body> 
            <p id="p1">p1p1p1
                <b id='b1'>b1b1b1</b>
            </p> 
            <p id="p2">p2p2p2</p>
            <p id='d1'>
                <ul id='u1'>u1u1u1</ul>
                <a id="a1">a1a1a1</a>
                <p id='d2'>
                    <a id="a2">a2a2a2 </a>
                    <b id='b2'>b2b2b2</b>
                    <p id='p3'>p3p3p3</p>
                </p>
                <a id="a3">a3a3a3 </a>
            </p> 
            <p id="p4">p4p4p4</p>
        </body>
    </html>
    '''
    
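    from lxml import html
    
    tree = html.fromstring(doc)
    a = tree.get_element_by_id("a1")
    print(html.tostring(a).decode())     # the target element itself
    print(html.tostring(tree).decode())  # the whole document before trimming
    
    def dropnode(e=None):
        """Drop every sibling after e, then repeat for each ancestor up to <body>."""
        if e is None: return
        if e.tag == 'body': return
        nd = e.getnext()
        while nd is not None:
            nd.drop_tree()  # removes nd and its children, keeps its tail text
            nd = e.getnext()
        dropnode(e.getparent())
    
    dropnode(a)
    print(html.tostring(tree).decode())  # the document up to and including a#a1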

  • PHPz

    PHPz 2017-04-18 09:49:08

    Use bs4 to extract it.
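
A minimal sketch of what a bs4 approach might look like, assuming the `beautifulsoup4` package is installed. The helper name `truncate_after` and the cleaned-up sample document (with `<div>` in place of the nested `<p>` elements) are illustrative assumptions, not something given in this thread. The idea is to remove every sibling that follows the target, then repeat for each ancestor.

```python
from bs4 import BeautifulSoup

# Assumed sample: a cleaned-up version of the page from the question,
# with <div> used where the original nested block elements inside <p>.
doc = """
<html>
  <head><title>The Dormouse's story</title></head>
  <body>
    <p id="p1">p1p1p1 <b id="b1">b1b1b1</b></p>
    <div id="d1">
      <ul id="u1"><li>u1u1u1</li></ul>
      <a id="a1">a1a1a1</a>
      <div id="d2"><a id="a2">a2a2a2</a></div>
      <a id="a3">a3a3a3</a>
    </div>
    <p id="p4">p4p4p4</p>
  </body>
</html>
"""


def truncate_after(soup, element_id):
    """Remove everything that follows the element with this id (hypothetical helper)."""
    target = soup.find(id=element_id)
    if target is None:
        raise ValueError(f"no element with id {element_id!r}")
    node = target
    while node is not None:
        # Detach later siblings (tags and text nodes) at this level,
        # then move up to the parent and do the same there.
        for sibling in list(node.next_siblings):
            sibling.extract()
        node = node.parent
    return soup


soup = truncate_after(BeautifulSoup(doc, "html.parser"), "a1")
result = str(soup)
print(result)
```

Because the tree is serialized again after trimming, the enclosing `<div>`, `<body>` and `<html>` tags come out closed, which matches the shape of the output asked for in the question.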

  • PHP中文网

    PHP中文网 2017-04-18 09:49:08

    雷雷

  • 阿神

    阿神 2017-04-18 09:49:08

    I'm a beginner and have only learned the re module so far, so I extract it with just re plus ordinary string slicing.

    >>> html = '''
    <html>
        <head>
            <title>The Dormouse's story </title>
        </head> 
        <body> 
            <p id="p1">p1p1p1
                <b id='b1'>b1b1b1</b>
            </p> 
            <p id="p2">p2p2p2
                <ul id='u1'>u1u1u1</ul>
                <a id="a1">a1a1a1</a>
                <p id='d1'>
                    <a id="a2">a2a2a2 </a>
                    <b id='b2'>b2b2b2</b>
                    <p id='p3'>p3p3p3</p>
                </p>
                <a id="a3">a3a3a3 </a>
            </p> 
            <p id="p4">p4p4p4</p>
        </body>
    </html>
    '''
    >>> import re
    >>> match = re.search(r'<a\s[^>]*id="a1"', html)  # opening tag of a#a1
    >>> print(html[:match.start()])  # everything before that tag
    
    <html>
        <head>
            <title>The Dormouse's story </title>
        </head> 
        <body> 
            <p id="p1">p1p1p1
                <b id='b1'>b1b1b1</b>
            </p> 
            <p id="p2">p2p2p2
                <ul id='u1'>u1u1u1</ul>
                
    >>> 
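
The slice above ends just before a#a1 and leaves the enclosing tags open, whereas the output asked for in the question includes a#a1 and closes every open tag. One possible way to bridge that gap, using only the standard library, is sketched below. The names `OpenTagTracker`, `close_open_tags`, `VOID_TAGS` and the compact `page` string are hypothetical, and the regex assumes the target anchor contains no nested `<a>`.

```python
import re
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are never left "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Keeps a stack of the tags that are still open (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def close_open_tags(prefix):
    """Append closing tags for whatever is still open at the end of prefix."""
    tracker = OpenTagTracker()
    tracker.feed(prefix)
    tracker.close()
    closers = "".join(f"</{tag}>" for tag in reversed(tracker.stack))
    return prefix.rstrip() + closers


# Assumed sample: a compact, valid variant of the page from the question.
page = ('<html><head><title>t</title></head><body>'
        '<p id="p1">p1 <b id="b1">b1</b></p>'
        '<div id="d1"><a id="a1">a1a1a1</a><a id="a2">a2</a></div>'
        '<p id="p4">p4</p></body></html>')

# Slice up to and including the closing </a> of a#a1.
match = re.search(r'<a\s[^>]*id="a1"[^>]*>.*?</a>', page, re.S)
result = close_open_tags(page[:match.end()])
print(result)
```

This keeps the regex-and-slice idea and only uses the parser to work out which closing tags are missing. For messy real-world pages a tree-based parser is likely to be more robust.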

  • 伊谢尔伦

    伊谢尔伦 2017-04-18 09:49:08

    The be module is the handiest for this.
