html - How do I get all the page source before a given element in Python?

Question

{code...} For example, take the first a element, a#a1. I want to get all the page source above this element: {code...}

阿神 · Answer

Your original HTML was not well-formed, so I changed it a little.
The following uses lxml.

doc = '''

    
        The Dormouse's story 
     
     
        p1p1p1
            b1b1b1
         
        p2p2p2
        
            
u1u1u1
            a1a1a1
            
                a2a2a2 
                b2b2b2
                
p3p3p3
            
            a3a3a3 
         
        p4p4p4
    

'''

from lxml import html

tree = html.fromstring(doc)
a = tree.get_element_by_id("a1")
print(html.tostring(a))
print(html.tostring(tree).decode())

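def dropnode(e=None):
    # Walk up from e, dropping every following sibling at each level,
    # and stop once the body element is reached.
    if e is None: return
    if e.tag == 'body': return
    nd = e.getnext()
    while nd is not None:
        # drop_tree() removes the node but merges its tail text into the previous node
        nd.drop_tree()
        nd = e.getnext()
    dropnode(e.getparent())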

dropnode(a)
print(html.tostring(tree).decode())
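
The lxml approach above prints a re-serialized tree, so the result is normalized markup rather than the original text. If the verbatim source before the element is what you need, one option is to let the standard library's html.parser report where the start tag begins and slice the string at that point. This is only a minimal sketch under stated assumptions: the sample markup, the class name StartTagLocator, and the helper source_before are made up for illustration and do not come from the question.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, used only for this illustration.
SAMPLE = """<html>
<head><title>The Dormouse's story</title></head>
<body>
<p id="p1">p1p1p1 <b>b1b1b1</b></p>
<ul id="u1">u1u1u1
  <li><a id="a1" href="#">a1a1a1</a></li>
  <li><a id="a2" href="#">a2a2a2</a></li>
</ul>
<p id="p4">p4p4p4</p>
</body>
</html>"""


class StartTagLocator(HTMLParser):
    """Remember where the first start tag with the wanted id begins."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.position = None  # (1-based line, 0-based column)

    def handle_starttag(self, tag, attrs):
        if self.position is None and dict(attrs).get("id") == self.target_id:
            # Inside this callback getpos() still points at the tag's "<".
            self.position = self.getpos()


def source_before(source, target_id):
    """Return the part of source that precedes the element, or None."""
    locator = StartTagLocator(target_id)
    locator.feed(source)
    locator.close()
    if locator.position is None:
        return None
    line, column = locator.position
    # The parser counts lines by "\n", so convert (line, column) the same way.
    offset = sum(len(text) + 1 for text in source.split("\n")[:line - 1]) + column
    return source[:offset]


print(source_before(SAMPLE, "a1"))
```

Unlike dropnode, this leaves the enclosing tags unclosed, because it returns the raw text exactly as it appears before the element.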

PHPz · Answer

Use bs4 to extract

PHP中文网 · Answer

from bs4 import BeautifulSoup as bs

def dropAllNextEle(eleOfBS, returnTrueOrFalseToKeepOrDropEleFunc = None):
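    # Remove every node that comes after the element, i.e. recursively remove the following siblings of eleOfBS and of each of its ancestors, nearest first. The second argument is a function that receives each of those siblings: if it returns True the sibling is kept, otherwise it is removed.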
    if eleOfBS is None: return
    if eleOfBS.name == 'body': return
    next_siblings = eleOfBS.next_siblings
    if next_siblings:
        next_siblings_list = []
        for item in next_siblings:
            if item:
                next_siblings_list.insert(0, item)

        for item in next_siblings_list:
            if returnTrueOrFalseToKeepOrDropEleFunc:
                if not returnTrueOrFalseToKeepOrDropEleFunc(item):
                    item.replace_with('')
            else:
                item.replace_with('')

        dropAllNextEle(eleOfBS.parent, returnTrueOrFalseToKeepOrDropEleFunc)
    else:
        dropAllNextEle(eleOfBS.parent, returnTrueOrFalseToKeepOrDropEleFunc)
        
soup = bs(html_source, 'html5lib')
a1_ele = soup.find('a', id = 'a1')
dropAllNextEle(a1_ele, lambda item: type(item) == type(soup.new_string('strstr')))
print(soup)

阿神 · Answer

I'm a beginner and have only learned the re module so far, so I extract it with re plus ordinary string slicing.

>>> html = '''

    
        The Dormouse's story 
     
     
        p1p1p1
            b1b1b1
         
        p2p2p2
            
u1u1u1
            a1a1a1
            
                a2a2a2 
                b2b2b2
                
p3p3p3
            
            a3a3a3 
         
        p4p4p4
    

'''
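>>> import re
>>> html_1 = re.search(r'…', html)  # the pattern (matching the start of the a#a1 tag) is cut off in the source
>>> html_1 = html_1.span()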
>>> print(html[:html_1[0]])


    
        The Dormouse's story 
     
     
        p1p1p1
            b1b1b1
         
        p2p2p2
            
u1u1u1
            
>>>
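
The session above boils down to two steps: find where the opening tag of a#a1 starts, then slice the source up to that offset. Here is a rough, self-contained sketch of the idea. The sample markup, the helper name source_before_tag, and the pattern itself are assumptions, since the original pattern did not survive. The pattern also assumes the id is written with double quotes, and a regex like this is easy to break on real-world HTML.

```python
import re

# Hypothetical sample markup, used only for this illustration.
SAMPLE = """<html>
<head><title>The Dormouse's story</title></head>
<body>
<p id="p1">p1p1p1 <b>b1b1b1</b></p>
<ul id="u1">u1u1u1
  <li><a id="a1" href="#">a1a1a1</a></li>
  <li><a id="a2" href="#">a2a2a2</a></li>
</ul>
<p id="p4">p4p4p4</p>
</body>
</html>"""

# Assumed pattern: an <a ...> start tag carrying a double-quoted id="a1".
A1_OPEN_TAG = r'<a\b[^>]*\sid="a1"'


def source_before_tag(source, pattern):
    """Return everything in source before the first match, or None."""
    match = re.search(pattern, source)
    if match is None:
        return None
    start, _end = match.span()
    return source[:start]


print(source_before_tag(SAMPLE, A1_OPEN_TAG))
```

Checking for None before calling span() avoids the AttributeError you would otherwise get when the element is not on the page.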

伊谢尔伦 · Answer

be module is the most convenient
