I need to split a string on a specific set of tags (<li>, <ul>, ...). I figured out the regular expression
pattern = r'<li>|<ul>|<ol>|<dl>|<dt>|<dd>|<h1>|<h2>|<h3>|<h4>|<h5>|<h6>'
and used it with re.split.
It basically gets the job done:
test_string = '<p> Some text some text some text. </p> <p> Another text another text </p>. <li> some list </li>. <ul> another list </ul>'
res = re.split(pattern, test_string)
-> ['<p> Some text some text some text. </p> <p> Another text another text </p>. ', ' some list </li>. ', ' another list </ul>']
But I want to keep the opening and closing tags in the split text. Something like this:
['<p> Some text some text some text. </p> <p> Another text another text </p>. ', '<li> some list </li>. ', '<ul> another list </ul>']
P粉787806024 2024-04-01 10:26:40
To answer your specific question, you can use a backreference so that the closing tag has to match the opening one:
<(p|li|ul|ol|dl|h1|h2|h3|h4|h5|h6)>[^<]*</\1>
and match instead of split.
\1 refers to whatever was captured in the opening tag.
For example:
for match in re.finditer(r"<(p|li|ul|ol|dl|h1|h2|h3|h4|h5|h6)>[^<]*</\1>", subject, re.DOTALL):
    print(match.group(0))  # the whole element, opening and closing tags included
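To make that concrete, here is a small sketch run against the test_string from the question. The names element, matched, boundary and pieces are just illustrative. The first half is the backreference match described above; the second half is a different technique, offered as one possible way to get the exact list the question asks for: splitting on a zero-width lookahead, which relies on re.split accepting empty matches (Python 3.7 or later).

```python
import re

test_string = ('<p> Some text some text some text. </p> '
               '<p> Another text another text </p>. '
               '<li> some list </li>. <ul> another list </ul>')

# Matching: \1 must repeat whatever tag name group 1 captured.
# re.DOTALL is left out here because the pattern contains no '.'.
element = re.compile(r"<(p|li|ul|ol|dl|h1|h2|h3|h4|h5|h6)>[^<]*</\1>")
matched = [m.group(0) for m in element.finditer(test_string)]

# Splitting: the lookahead (?=...) matches the empty string just before
# each listed opening tag, so the tag itself is not consumed by the split.
boundary = re.compile(r"(?=<(?:li|ul|ol|dl|dt|dd|h[1-6])>)")
pieces = boundary.split(test_string)

print(matched)
print(pieces)
```

Note the difference: matched holds only the elements themselves, so the ". " text between them is dropped, while pieces keeps everything and only cuts in front of the listed tags. If the string starts with one of those tags, the split returns an empty first item that you may want to filter out.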
However, in most real cases this is not sufficient to handle HTML and you should consider a DOM parser.
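As a rough idea of what a parser-based version could look like, here is a sketch using html.parser from the standard library. Keep in mind that this is an event-based parser standing in for a full DOM parser, and that TagSplitter, SPLIT_TAGS and split_by_tags are hypothetical names made up for this example. It starts a new chunk whenever one of the listed tags opens at the top level, so nested lists stay in one piece.

```python
from html.parser import HTMLParser

# Tags that start a new chunk (taken from the question's pattern).
SPLIT_TAGS = {"li", "ul", "ol", "dl", "dt", "dd",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TagSplitter(HTMLParser):
    """Collects the input into chunks, cutting before top-level split tags."""

    def __init__(self):
        super().__init__()
        self.chunks = [""]
        self.depth = 0  # how many split tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag in SPLIT_TAGS:
            # Only cut at the top level, and never create an empty chunk.
            if self.depth == 0 and self.chunks[-1]:
                self.chunks.append("")
            self.depth += 1
        # Re-emit the start tag exactly as it appeared in the input.
        self.chunks[-1] += self.get_starttag_text()

    def handle_endtag(self, tag):
        self.chunks[-1] += f"</{tag}>"
        if tag in SPLIT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        self.chunks[-1] += data


def split_by_tags(html):
    parser = TagSplitter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks


test_string = ('<p> Some text some text some text. </p> '
               '<p> Another text another text </p>. '
               '<li> some list </li>. <ul> another list </ul>')
print(split_by_tags(test_string))
```

This is only a sketch: comments, character entities and self-closing tags such as <br/> are not reproduced verbatim, so real pages would need more handlers or a proper DOM library.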