I'm trying to use BeautifulSoup to extract information from a non-uniformly structured HTML block. I'm looking for a way to include the blocks of text that sit between tags in the search/filter output. For example, given this HTML:
<span> <strong>Description</strong> Section1 <ul> <li>line1</li> <li>line2</li> <li>line3</li> </ul> <strong>Section2</strong> Content2 </span>
I want to create an output list that ignores certain types of tags (ul and li in the example above) but captures the top-level untagged text. The closest I've found is .select(':not(ul,li)') or .find_all(['strong']), but neither of them captures the untagged top-level text and the various target tags at the same time. The ideal behavior would be something like this:
.find_all(['strong','UNTAGGED'])
which would produce the following output:
[ <strong>Description</strong>, Section1, <strong>Section2</strong>, Content2 ]
P粉905144514 2023-09-16 00:38:21
To get that output, you can first select each <strong> element and then take its next_sibling, which is the untagged text node that follows it.
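from bs4 import BeautifulSoup

html = '''
<span>
<strong>Description</strong>
Section1
<ul>
<li>line1</li>
<li>line2</li>
<li>line3</li>
</ul>
<strong>Section2</strong>
Content2
</span>
'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

data = []
for e in soup.select('strong'):
    # Keep the tag itself, then the bare text node that follows it
    data.extend([e, e.next_sibling.strip()])

data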
[<strong>Description</strong>, 'Section1', <strong>Section2</strong>, 'Content2']
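The loop above relies on every <strong> being followed directly by a text node; if a <strong> is the last child, or is followed immediately by another tag, next_sibling will be None or a Tag and .strip() will fail. As a possible alternative for less regular markup, here is a minimal sketch that walks the direct children of the container instead, keeping the wanted tags and any non-empty bare text. The helper name collect_top_level and its keep_tags parameter are made up for this illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

HTML = (
    "<span> <strong>Description</strong> Section1 "
    "<ul> <li>line1</li> <li>line2</li> <li>line3</li> </ul> "
    "<strong>Section2</strong> Content2 </span>"
)


def collect_top_level(container, keep_tags=("strong",)):
    """Hypothetical helper: return the kept tags and the non-empty
    untagged text found among the direct children of `container`."""
    items = []
    for child in container.children:
        if isinstance(child, Tag):
            # Keep only the tag types we care about; ul, li, etc. are skipped
            if child.name in keep_tags:
                items.append(child)
        elif isinstance(child, NavigableString):
            # Bare text directly under the container; drop whitespace-only nodes
            text = child.strip()
            if text:
                items.append(text)
    return items


soup = BeautifulSoup(HTML, "html.parser")
result = collect_top_level(soup.span)
print(result)
```

Because only direct children are visited, the text inside the <li> elements is never reached, so there is no need to exclude ul and li explicitly. To capture more tag types, pass them in keep_tags, for example keep_tags=("strong", "em").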