Regular expression to remove spaces between invalid HTML tags - e.g. "< / b >" should be "</b>"

Question

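I have some HTML that is messed up with spaces inside its tags and want to make it valid again. For example: < div class='test' >1 > 0 is < b >true< / b > and apples >>> bananas< / div > should be converted to the valid HTML <div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>, which, when rendered, is expected to produce: 1 > 0 is true and apples >>> bananas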

P粉323050780 · Answer

There is no reasonable way to salvage a document as corrupted as the one you posted. However, assuming you replace the > and similar characters in the text with their corresponding entities, e.g. &gt;, you can feed the document into an appropriate library such as DOMDocument, which will handle the rest.

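$input = <<<_E_
< div class='test' >1 &gt; 0 is < b >true</ b> and apples &gt;&gt;&gt; bananas< / div >
_E_;

// Strip whitespace directly after "<" and "</" so the parser sees real tags.
$input = preg_replace([ '#<\s+#', '#</\s+#' ], [ '<', '</' ], $input);

// NOIMPLIED and NODEFDTD stop libxml from adding <html>/<body> wrappers and a doctype.
$d = new DOMDocument();
$d->loadHTML($input, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

var_dump($d->saveHTML());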

Output:

string(80) "<div class="test">1 &gt; 0 is <b>true</b> and apples &gt;&gt;&gt; bananas</div>
"
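For readers working in JavaScript, the whitespace clean-up step (removing whitespace directly after `<` and after `</`) can be sketched roughly as below. This is only an illustration under stated assumptions: the helper name `tightenTagOpenings` is hypothetical, the sample assumes stray `>` characters in the text are already written as `&gt;`, and the sketch covers only the regex step, not the DOM parsing that the PHP answer relies on.

```javascript
// Hypothetical helper: removes whitespace directly after "<" and after "</".
function tightenTagOpenings(html) {
  return html
    .replace(/<\s+/g, "<")     // "< div" -> "<div", "< / div" -> "</ div"
    .replace(/<\/\s+/g, "</"); // "</ div" -> "</div"
}

// Assumed sample input, with text-level ">" already escaped as "&gt;".
const messy =
  "< div class='test' >1 &gt; 0 is < b >true</ b> and apples &gt;&gt;&gt; bananas< / div >";

const tightened = tightenTagOpenings(messy);
console.log(tightened);
```

Whitespace before a closing `>` (as in `<div class='test' >`) is left in place here, since HTML parsers accept it; the result would still need to go through a real parser to be normalised fully.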

P粉064448449 · Answer

This regular expression also works:

It captures the valid parts of an HTML tag in four groups and replaces the whole match, stray whitespace included, with just those four groups.

Regex101 Demo

/(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g

(<) - captures the opening angle bracket (group 1)
\s* - matches any whitespace after the opening bracket
(\/?) - captures an optional forward slash (group 2)
\s* - matches any whitespace after the slash
([^<>]*\S) - captures the content within the tag, without trailing whitespace (group 3)
\s* - matches whitespace after the content and before the closing angle bracket
(>) - captures the closing angle bracket (group 4)
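To see what each group holds, the breakdown above can be checked against single tags. The following is a minimal sketch; `describeTag` is a hypothetical helper introduced only for illustration, and the pattern is used without the `g` flag so that one match can be inspected at a time.

```javascript
// The pattern from the answer, without the "g" flag so a single match can be inspected.
const tagPattern = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/;

// Hypothetical helper: returns the four captured groups and the tag rebuilt from them.
function describeTag(raw) {
  const m = raw.match(tagPattern);
  if (m === null) return null;
  const [whole, open, slash, content, close] = m;
  return { whole, open, slash, content, close, rebuilt: open + slash + content + close };
}

const closing = describeTag("< / div  >");
const opening = describeTag("< div class='test' >");

console.log(closing); // slash is "/", content is "div", rebuilt is "</div>"
console.log(opening); // slash is "", content is "div class='test'"
```

Joining the four groups with nothing in between is what drops the whitespace: everything matched by the three `\s*` parts sits outside the groups and is therefore not copied into the result.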

const reg = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g;
const str = "< div class='test' >1 > 0 is < b >true< / b > and apples >>> bananas< / div  >";
// Rebuild each tag from the four captured groups, dropping the whitespace between them.
const newStr = str.replace(reg, "$1$2$3$4");
console.log(newStr);
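Wrapped in a function, the same replacement (rejoining the four groups with `"$1$2$3$4"`) can be tried on other inputs. This is a sketch rather than a robust sanitiser: `collapseTagSpaces` is a hypothetical name, and the second call illustrates that the pattern cannot tell a tag from an ordinary comparison written with `<` and `>`.

```javascript
// Hypothetical wrapper around the pattern from the answer.
function collapseTagSpaces(html) {
  const tagPattern = /(<)\s*(\/?)\s*([^<>]*\S)\s*(>)/g;
  return html.replace(tagPattern, "$1$2$3$4");
}

const fixed = collapseTagSpaces(
  "< div class='test' >1 > 0 is < b >true< / b > and apples >>> bananas< / div  >"
);
console.log(fixed);
// <div class='test'>1 > 0 is <b>true</b> and apples >>> bananas</div>

// Plain text containing both "<" and ">" is treated as if it were a tag.
const comparison = collapseTagSpaces("1 < 2 and 3 > 2");
console.log(comparison);
// 1 <2 and 3> 2
```

The second result is why escaping text-level `<` and `>` as entities first, as the other answer suggests, is the safer starting point.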
