For a given string (usually a paragraph), I want to replace some words/phrases, but ignore them if they happen to be surrounded by tags in some way. This also needs to be case insensitive.
As an example:
You can find a link here <a href="#">link</a> and a lot of things in different styles. Public platform can appear in bold: <b>public platform</b>, and we also have italics here too: <i>italics</i>. While I like soft pillows I am picky about soft <i>pillows</i>. While I want to find fox, I din't want foxes to show up. The text "shiny fruits" is in a span tag: one of the <span>shiny fruits</span>.
Suppose I want to replace these words:
link: appears 2 times. The first is plain text (match), the second is in A tags (ignored).
Public platform: the first is plain text (match, case insensitive), the second is in B tags (ignored).
soft pillows: 1 plain text match.
fox: 1 plain text match. It looks at complete words.
fruits: the first is plain text (match), the second is in span tags with other text (ignored).

As background: I'm searching for phrase matches (not individual words) and linking the matches to related pages.
I want to avoid nested HTML (bold tags inside links and vice versa) or other errors (e.g. <a href="#">phrase <b>goes</a> here</b>).
I tried a few things, such as searching for a sanitized copy of the text that had the HTML content removed, and while this told me there was a match, I ran into a whole new problem of mapping it back to the original content.
P粉594941301, 2024-03-28 12:56:47
I came across a mention of regex negative lookahead and, after racking my brain, ended up with this regex (it assumes your HTML is VALID, with properly paired tags):
// class method, written a bit verbosely just to show how the pattern comes together
public function replaceTextOutsideTags($sourceText = null, $toReplace = 'inner text', $dummyText = '(REPLACED TEXT HERE)')
{
    $string = $sourceText ?? "Inner text
You can find a link here link and a lot
of things in different styles. Public platform can appear in bold:
public platform, and we also have italics here too: italics.
While I like soft pillows I am picky about soft pillows.
While I want to find fox, I din't want foxes to show up.
The text \"shiny fruits\" is in a span tag: one of the shiny fruits.
The inner text like this inner inner text here to test too, event inner text
omg thats sad... or not
";
    // [[:punct:]] would be nicer, but it also counts < and > as punctuation, and those must stay out of the class
    $punctuation = "\.,!\?:;\|\/=\"#"; // this part might need extending for your content, but you get the point
    $stringPart = "\b" . preg_quote($toReplace, '/') . "\b"; // whole words only; regex metacharacters in the phrase are escaped
    $excludeSequence = "(?![\w\n\s>$punctuation]*?"; // opens a negative lookahead, closed in the two lines below
    $excludeOutside = "$excludeSequence<\/)"; // skip a match that is followed by a closing tag
    $excludeTag = "$excludeSequence>)"; // skip a match that sits inside a tag itself
    $pattern = "/" . $stringPart . $excludeOutside . $excludeTag . "/im";
    return preg_replace($pattern, $dummyText, $string);
}
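The method above needs a class around it and uses a long default text. As a smaller check, here is a minimal sketch that wraps the same pattern in a standalone function. The name replaceOutsideTags, its parameters, the sample text and the [LINK] / [PLATFORM] placeholders are assumptions made for illustration, not part of the original method.

```php
<?php
// Hypothetical standalone wrapper around the same regex (illustration only).
function replaceOutsideTags(string $text, string $phrase, string $replacement): string
{
    // Same punctuation list as the original; < is deliberately not in it.
    $punctuation = "\.,!\?:;\|\/=\"#";
    // Opening part shared by both negative lookaheads.
    $scan = "(?![\w\n\s>$punctuation]*?";

    $pattern = '/\b' . preg_quote($phrase, '/') . '\b' // whole phrase, literal
        . $scan . '<\/)'                                // not followed by a closing tag
        . $scan . '>)'                                  // not inside a tag itself
        . '/im';

    // Fall back to the unchanged text if preg_replace reports an error (null).
    return preg_replace($pattern, $replacement, $text) ?? $text;
}

$text = 'You can find a link here <a href="#">link</a>. '
      . 'Public platform in bold: <b>public platform</b>.';

$step1 = replaceOutsideTags($text, 'link', '[LINK]');
$step2 = replaceOutsideTags($step1, 'public platform', '[PLATFORM]');

echo $step2, "\n";
// You can find a [LINK] here <a href="#">link</a>. [PLATFORM] in bold: <b>public platform</b>.
```

Only the plain-text occurrences change. The copies wrapped in A and B tags are left alone because a closing tag follows them directly.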
Example output with default parameters
""" (REPLACED TEXT HERE)\r\n You can find a link here link and a lot \r\n of things in different styles. Public platform can appear in bold: \r\n public platform, and we also have italics here too: italics. \r\n While I like soft pillows I am picky about soft pillows. \r\n While I want to find fox, I din't want foxes to show up.\r\n The text "shiny fruits" is in a span tag: one of the shiny fruits.\r\n The (REPLACED TEXT HERE) like this inner inner text here to test too, event (REPLACED TEXT HERE)\r\n omg thats sad... or not """
Now step by step:
The phrase is wrapped in \b word boundaries, so only complete words match (if the text contains pillowS, we wouldn't want to match pillow).
After the phrase there may be \w word characters, \s spaces or \n newlines, plus the ending punctuation, and that run is allowed to end with a start tag.
If that run ends with a closing tag instead, we don't want the match; that is what the negative lookahead (?![\w\n\s>$punctuation]*?<\/) is for. We can be sure the lookahead never runs into a new tag, because < is not in the described sequence ($excludeOutside variable).
The $excludeTag variable is basically the same as $excludeOutside, but covers cases where $toReplace could be the HTML tag itself, such as a.
Note that this does not account for text that itself contains a literal < or >, and using these symbols may cause unexpected behavior.
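To make the role of the second lookahead and the < / > caveat concrete, here is a small sketch. The helper buildPattern and its $withTagGuard switch are hypothetical and exist only to compare the pattern with and without the $excludeTag part; the sample strings are made up as well.

```php
<?php
// Hypothetical helper: builds the pattern with or without the second lookahead.
function buildPattern(string $phrase, bool $withTagGuard): string
{
    $punctuation = "\.,!\?:;\|\/=\"#";
    $scan = "(?![\w\n\s>$punctuation]*?";

    // First lookahead: reject a match that is followed by a closing tag.
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b' . $scan . '<\/)';
    if ($withTagGuard) {
        // Second lookahead: reject a match that sits inside a tag itself.
        $pattern .= $scan . '>)';
    }
    return $pattern . '/im';
}

$html = 'a cat <a href="#">x</a>';

// Without the second lookahead, the name of the closing tag gets replaced too.
$withoutGuard = preg_replace(buildPattern('a', false), 'the', $html);
echo $withoutGuard, "\n"; // the cat <a href="#">x</the>

// With both lookaheads, only the plain-text word changes.
$withGuard = preg_replace(buildPattern('a', true), 'the', $html);
echo $withGuard, "\n"; // the cat <a href="#">x</a>

// Caveat: a literal > in plain text looks like the end of a tag,
// so the word before it is skipped.
$limited = preg_replace(buildPattern('fox', true), 'wolf', 'A fox is > a cat');
echo $limited, "\n"; // A fox is > a cat
```

The last call shows the unexpected behavior mentioned above: fox is plain text, but the > later in the sentence makes the second lookahead treat it as if it were inside a tag.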