Regular expression to remove spaces inside invalid HTML tags - e.g. "< / b >" should be "</b>"

I have some HTML that has been messed up by spaces inside the tags, and I want to make it valid again. For example:

    < div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >

should be converted to valid HTML like this:

    <div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>

Any > or < in the text that is preceded or followed by a space should be left unchanged. For example, 1 > 0 should be kept rather than compressed to 1>0.

I realize this may take several regular expressions, which is fine.

Here is what I have so far:

    <\s?\/\s*

This partially fixes </ b></ div > to </b></div >, but I'm struggling with the rest.

I could take a more drastic approach, but that would also break the text content between the tags, not just the tag names themselves.
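To make the starting point concrete, here is a minimal JavaScript sketch of what the partial pattern from the question does to the sample input. The replacement string "</" and the variable names are assumptions, since the question only gives the pattern itself.

```javascript
// Pattern from the question: "<", at most one whitespace character, "/", then any whitespace.
const closingSlash = /<\s?\/\s*/g;

const broken = "< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >";

// Assumed replacement: collapse every match to a plain "</".
const partlyFixed = broken.replace(closingSlash, "</");

console.log(partlyFixed);
// < div class='test' >1 > 0 is < b >true</b> and apples >>> bananas</div >
```

Only the start of each closing tag is repaired. The space after "<" in opening tags and the space before ">" are still there, which is the part the answers below address.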
P粉884667022, 438 days ago

All replies (2)

  • P粉323050780

    2023-09-03 16:42:37

    There is no reasonable way to salvage a document as badly broken as the one you posted. However, assuming you replace > and similar characters in the text with their corresponding entities (e.g. &gt;), you can hand the document to a proper library such as DOMDocument, which will handle the rest.

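    $input = <<<_E_
    < div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >
    _E_;

    // Remove whitespace right after "<" and right after "</" so the parser can see the tag names.
    $input = preg_replace([ '#<\s+#', '#</\s+#' ], [ '<', '</' ], $input);

    // Let libxml repair the rest; the flags suppress the implied <html>/<body> wrapper and the default doctype.
    $d = new DOMDocument();
    $d->loadHTML($input, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    var_dump($d->saveHTML());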
    

    Output:

    string(80) "<div class="test">1 &gt; 0 is <b>true</b> and apples &gt;&gt;&gt; bananas</div>
    "
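For readers who want to see what the parser actually receives, the following is a rough JavaScript port of only the preg_replace step above. It does not reproduce the DOMDocument part, and the variable names are illustrative.

```javascript
const input = "< div class='test' >1 > 0 is < b >true</ b> and apples >>> bananas< / div >";

// Same two substitutions as the preg_replace call, applied in the same order.
const preCleaned = input
  .replace(/<\s+/g, "<")     // "< div" -> "<div", "< / div" -> "</ div"
  .replace(/<\/\s+/g, "</"); // "</ b" -> "</b", "</ div" -> "</div"

console.log(preCleaned);
// <div class='test' >1 > 0 is <b >true</b> and apples >>> bananas</div >
```

The spaces before ">" are still present at this point. HTML permits whitespace there, so the parser can normalize them on output. Note that the first substitution also touches a bare "<" in text: "1 < 2" becomes "1 <2", which is one reason the answer recommends escaping such characters as entities first.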
    

  • P粉064448449

    2023-09-03 11:17:47

    This regular expression also works. It captures the meaningful parts of an HTML tag in four groups and replaces the whole match with those groups joined together, which drops the whitespace between them:

    /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g

    • (<) - captures the opening angle bracket (group 1)
    • \s* - matches any whitespace after the opening bracket
    • (\/?) - captures an optional forward slash (group 2)
    • \s* - matches any whitespace after the forward slash
    • ([^<>]*\S) - captures the content within the tag, without trailing whitespace (group 3)
    • \s* - matches whitespace after the content and before the closing angle bracket
    • (>) - captures the closing angle bracket (group 4)
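To see what each group holds in practice, this small sketch lists the captures for a shortened sample string. The names tagPattern, sample and groups are made up for illustration.

```javascript
const tagPattern = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g;
const sample = "< div class='test' >text< / div  >";

// matchAll yields one match object per tag; slice(1) keeps only the four capture groups.
const groups = [...sample.matchAll(tagPattern)].map((m) => m.slice(1));

console.log(groups);
```

The first tag yields the groups "<", "" (no slash), "div class='test'" and ">". The second yields "<", "/", "div" and ">". Joining the four groups of a match gives the tag with the stray whitespace removed.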

    const reg = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g;
    const str = "< div class='test' >1 > 0 is < b >true< / b > and apples >>> bananas< / div  >";
    // Rebuild each tag from its four captured groups, dropping the whitespace between them.
    const newStr = str.replace(reg, "$1$2$3$4");
    console.log(newStr);
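One thing to watch for: the pattern treats any "<" that is later followed by a ">" as a tag, so plain text containing both comparison signs gets rewritten too. The sketch below shows that behaviour next to a hypothetical stricter variant (not part of the answer above) that requires the tag name to start with a letter. It is a heuristic under that assumption, not a full fix.

```javascript
// Pattern from the answer above.
const loose = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g;
// Hypothetical stricter variant: the tag content must begin with an ASCII letter.
const strict = /<\s*(\/?)\s*([A-Za-z][^<>]*?)\s*>/g;

const text = "if 1 < 2 and 3 > 1 then < b >yes< / b >";

const looseResult = text.replace(loose, "$1$2$3$4");
const strictResult = text.replace(strict, "<$1$2>");

console.log(looseResult);
// if 1 <2 and 3> 1 then <b>yes</b>
console.log(strictResult);
// if 1 < 2 and 3 > 1 then <b>yes</b>
```

The stricter variant still misfires on text such as "a < b and c > d", where the character after "<" is a letter, so escaping comparison signs as entities or using a real parser remains the more robust route.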
