
BeautifulSoup: Combine top-level text with classic tag lookup functionality?

I'm trying to use BeautifulSoup to extract information from a non-uniformly structured HTML block. I'm looking for a way to include the blocks of text that sit between tags in the search/filter output. For example, given this HTML:

<span>
    <strong>Description</strong>
    Section1
    <ul>
        <li>line1</li>
        <li>line2</li>
        <li>line3</li>
    </ul>
    <strong>Section2</strong>
    Content2    
</span>

I want to create an output list that ignores certain types of tags (ul and li in the example above) but captures the top-level untagged text. The closest I've found is .select(':not(ul,li)') or .find_all(['strong']), but neither of them captures untagged top-level text and the various target tags at the same time. The ideal behavior would be something like this:

.find_all(['strong','UNTAGGED'])

Produces the following output:

[
<strong>Description</strong>,
Section1,
<strong>Section2</strong>,
Content2
]

P粉471207302  494 days ago  551


  • P粉905144514

    P粉905144514  2023-09-16 00:38:21

    To get that output, you can first select each <strong> element and then take its next_sibling, which is the untagged text node that follows it.

    Example
    from bs4 import BeautifulSoup
    html = '''
    <span>
        <strong>Description</strong>
        Section1
        <ul>
            <li>line1</li>
            <li>line2</li>
            <li>line3</li>
        </ul>
        <strong>Section2</strong>
        Content2    
    </span>
    '''
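    # Name the parser explicitly so bs4 does not warn or choose a different one per machine
    soup = BeautifulSoup(html, 'html.parser')
    
    data = []
    
    # For each <strong>, keep the tag itself and the stripped text node right after it
    for e in soup.select('strong'):
        data.extend([e, e.next_sibling.strip()])
    
    data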
    Output
    [<strong>Description</strong>,
     'Section1',
     <strong>Section2</strong>,
     'Content2']
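
The next_sibling approach above relies on every <strong> being followed directly by a text node. If that might not hold (for example, text appearing before the first <strong>, or a <strong> followed immediately by another tag), one alternative is to walk the direct children of the <span> and keep both the wanted tags and any non-empty text nodes. The following is only a sketch under that assumption; top_level_items and its keep_tags parameter are hypothetical names made up for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '''
<span>
    <strong>Description</strong>
    Section1
    <ul>
        <li>line1</li>
        <li>line2</li>
        <li>line3</li>
    </ul>
    <strong>Section2</strong>
    Content2
</span>
'''


def top_level_items(parent, keep_tags=('strong',)):
    """Hypothetical helper: return direct children of `parent` that are
    either tags named in `keep_tags` or non-empty untagged text."""
    items = []
    for child in parent.children:
        if isinstance(child, Tag):
            # Keep only the requested tags; others (ul, li, ...) are skipped
            if child.name in keep_tags:
                items.append(child)
        elif isinstance(child, NavigableString):
            # Whitespace-only text nodes between tags are dropped
            text = child.strip()
            if text:
                items.append(text)
    return items


soup = BeautifulSoup(html, 'html.parser')
data = top_level_items(soup.span)
print(data)
```

Because only direct children are visited, text nested inside the <ul> is never reached, which is roughly the behavior the imagined .find_all(['strong','UNTAGGED']) call asks for.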
