<html> <head> <title>The Dormouse's story </title> </head> <body> <p id="p1">p1p1p1 <b id='b1'>b1b1b1</b> </p> <p id="p2">p2p2p2 <ul id='u1'>u1u1u1</ul> <a id="a1">a1a1a1</a> <div id='d1'> <a id="a2">a2a2a2 </a> <b id='b2'>b2b2b2</b> <p id='p3'>p3p3p3</p> </div> <a id="a3">a3a3a3 </a> </p> <p id="p4">p4p4p4</p> </body> </html>
For example, take the first a element, a#a1. I want to get all of the page source up to and including that element:
<html> <head> <title>The Dormouse's story </title> </head> <body> <p id="p1">p1p1p1 <b id='b1'>b1b1b1</b> </p> <p id="p2">p2p2p2 <ul id='u1'>u1u1u1</ul> <a id="a1">a1a1a1</a> </p> </body> </html>
三叔 2016-10-22 16:07:15
Your original HTML isn't valid, so I changed it slightly. Here is a solution using lxml.
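# Assumed reconstruction of the modified HTML: the question's markup with
# <p id="p2"> turned into a <div>, since a <p> cannot contain <ul> or <div>.
doc = '''<html> <head> <title>The Dormouse's story </title> </head> <body>
<p id="p1">p1p1p1 <b id='b1'>b1b1b1</b> </p>
<div id="p2">p2p2p2 <ul id='u1'>u1u1u1</ul> <a id="a1">a1a1a1</a>
<div id='d1'> <a id="a2">a2a2a2 </a> <b id='b2'>b2b2b2</b> <p id='p3'>p3p3p3</p> </div>
<a id="a3">a3a3a3 </a> </div>
<p id="p4">p4p4p4</p>
</body> </html>'''

from lxml import html

tree = html.fromstring(doc)
a = tree.get_element_by_id("a1")
print(html.tostring(a).decode())
print(html.tostring(tree).decode())


def dropnode(e=None):
    # Remove every sibling after e, then repeat for each ancestor up to <body>.
    if e is None:
        return
    if e.tag == 'body':
        return
    nd = e.getnext()
    while nd is not None:
        nd.drop_tree()
        nd = e.getnext()
    dropnode(e.getparent())


dropnode(a)
print(html.tostring(tree).decode())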
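The approach in this answer (drop every sibling that follows the element, then repeat for each ancestor) does not strictly need lxml. Below is a rough sketch of the same idea using only the standard library's xml.etree.ElementTree, assuming the markup is already well-formed XML. The names DOC and truncate_after are made up for this illustration, and the `<div id="p2">` wrapper is my assumption about how the question's HTML could be made valid, not necessarily what the answer used. Unlike drop_tree(), this variant also throws away any text that trails the removed elements.

```python
import xml.etree.ElementTree as ET

# Assumed, well-formed variant of the question's markup (<p id="p2"> became a <div>).
DOC = """<html><head><title>The Dormouse's story</title></head><body>
<p id="p1">p1p1p1 <b id="b1">b1b1b1</b></p>
<div id="p2">p2p2p2 <ul id="u1">u1u1u1</ul> <a id="a1">a1a1a1</a>
<div id="d1"><a id="a2">a2a2a2</a> <b id="b2">b2b2b2</b> <p id="p3">p3p3p3</p></div>
<a id="a3">a3a3a3</a></div>
<p id="p4">p4p4p4</p>
</body></html>"""


def truncate_after(root, target_id):
    """Remove, in place, everything that follows the element with this id."""
    # ElementTree has no getparent(), so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = next(e for e in root.iter() if e.get("id") == target_id)
    # Walk from the target up to the root, pruning later siblings at each level.
    while node in parent_of:
        parent = parent_of[node]
        siblings = list(parent)
        for later in siblings[siblings.index(node) + 1:]:
            parent.remove(later)  # the removed element's tail text goes with it
        node.tail = None          # discard text that trailed the kept element
        node = parent
    return root


root = truncate_after(ET.fromstring(DOC), "a1")
result = ET.tostring(root, encoding="unicode", method="html")
print(result)
```

If the input is real-world HTML rather than well-formed XML, ET.fromstring will raise a ParseError, so a forgiving parser such as lxml.html is still the more practical choice there.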