Keep specific html tags when splitting string

I need to split a string on a specific set of tags (<li>, <ul>, ...). I came up with the regular expression

pattern = r'<li>|<ul>|<ol>|<dl>|<dt>|<dd>|<h1>|<h2>|<h3>|<h4>|<h5>|<h6>' and used it with re.split.

Basically, it gets the job done:

test_string = '<p> Some text some text some text. </p> <p> Another text another text </p>. <li> some list </li>. <ul> another list </ul>'
res = re.split(pattern, test_string)
-> `['<p> Some text some text some text. </p> <p> Another text another text </p>. ', ' some list </li>. ', ' another list </ul>']`

But I want to keep the opening and closing tags in the split text. Something like this:

`['<p> Some text some text some text. </p> <p> Another text another text </p>. ', '<li> some list </li>. ', '<ul> another list </ul>']`

P粉841870942, 277 days ago

All replies (1)

  • P粉787806024, 2024-04-01 10:26:40

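    To answer your specific question, use a pattern that matches a whole element:

    <(p|li|ul|ol|dl|h1|h2|h3|h4|h5|h6)>[^<]*</\1>

    and match instead of splitting.

    \1 is a backreference to the tag name captured in the opening tag, so the closing tag has to repeat it.

    For example:

    for match in re.finditer(r"<(p|li|ul|ol|dl|h1|h2|h3|h4|h5|h6)>[^<]*</\1>", subject, re.DOTALL):
        print(match.group(0))  # the whole element, including both tags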
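To make the difference concrete, here is a minimal sketch that runs the matching approach on the test string from the question. Matching returns only the elements themselves, so the ". " between them is dropped. The second variant is not from the answer above: it is an assumed alternative that splits on a zero-width lookahead (this needs Python 3.7 or later, where re.split accepts empty matches) and produces exactly the list asked for. The names element, split_before, matched and parts are made up for illustration.

```python
import re

test_string = ('<p> Some text some text some text. </p> '
               '<p> Another text another text </p>. '
               '<li> some list </li>. <ul> another list </ul>')

# Variant 1: match whole elements. \1 forces the closing tag to repeat
# the tag name captured in the opening tag. [^<]* means the element
# must not contain any nested tag.
element = re.compile(r"<(p|li|ul|ol|dl|h1|h2|h3|h4|h5|h6)>[^<]*</\1>")
matched = [m.group(0) for m in element.finditer(test_string)]

# Variant 2 (assumption, Python 3.7+): split at the position just before
# each opening tag. The lookahead consumes nothing, so the tag stays in
# the piece that follows the split point.
split_before = re.compile(r"(?=<(?:li|ul|ol|dl|dt|dd|h[1-6])>)")
parts = split_before.split(test_string)

print(matched)
print(parts)
```

Note that variant 2 yields an empty first item when the string starts with one of the listed tags, and neither variant copes with attributes such as <li class="x">.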

    However, in most real cases this is not sufficient to handle HTML and you should consider a DOM parser.
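As a rough illustration of the parser route, the sketch below uses html.parser from Python's standard library. It is an event-based parser rather than a full DOM, but it shows the idea: let a real HTML parser locate the opening tags, then slice the original string at those positions. SPLIT_TAGS, SplitPointFinder and split_keep_tags are hypothetical names, and the tag set is assumed from the question. Unlike the regular expression, this handles tags with attributes and ignores tags that appear inside comments.

```python
from html.parser import HTMLParser

# Tags that should start a new piece (assumed from the question).
SPLIT_TAGS = {"li", "ul", "ol", "dl", "dt", "dd",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class SplitPointFinder(HTMLParser):
    """Records the (line, column) position of every opening tag in SPLIT_TAGS."""

    def __init__(self):
        super().__init__()
        self.points = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names; getpos() reports where the
        # tag currently being handled begins.
        if tag in SPLIT_TAGS:
            self.points.append(self.getpos())


def split_keep_tags(text):
    finder = SplitPointFinder()
    finder.feed(text)
    finder.close()
    # Convert (line, column) pairs into absolute string offsets.
    line_starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == "\n"]
    cuts = [line_starts[line - 1] + col for line, col in finder.points]
    bounds = [0] + cuts + [len(text)]
    # Slice the original text so every piece is kept verbatim;
    # empty pieces (e.g. a tag at offset 0) are skipped.
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


test_string = ('<p> Some text some text some text. </p> '
               '<p> Another text another text </p>. '
               '<li> some list </li>. <ul> another list </ul>')
print(split_keep_tags(test_string))
```

This sketch splits at every listed opening tag, including nested ones such as an <li> inside a <ul>. Splitting only at top-level elements would need extra depth tracking or a tree-building parser.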
