Replace text in a string and ignore matches in HTML tags

For a given string (usually a paragraph), I want to replace some words/phrases, but ignore them if they happen to be surrounded by tags in some way. This also needs to be case insensitive.

As an example:

You can find a link here <a href="#">link</a> and a lot 
of things in different styles. Public platform can appear in bold: 
<b>public platform</b>, and we also have italics here too: <i>italics</i>. 
While I like soft pillows I am picky about soft <i>pillows</i>. 
While I want to find fox, I din't want foxes to show up.
The text "shiny fruits" is in a span tag:  one of the <span>shiny fruits</span>.

Suppose I want to replace these words:

As background: I'm searching for phrase matches (not individual words) and linking the matches to related pages.

I want to avoid nested HTML (bold tags within links and vice versa) or other errors (e.g. the <a href="#">phrase <b>goes</a> here</b>).

I tried a few things, such as searching for a sanitized copy of the text that had the HTML content removed, and while this told me there was a match, I ran into a whole new problem of mapping it back to the original content.

P粉676821490, 285 days ago

  • P粉594941301

    P粉594941301, 2024-03-28 12:56:47

    I came across regex negative lookahead and, after a lot of head-scratching, arrived at the regex below (it assumes you have VALID, properly paired HTML tags).

    // The function is deliberately verbose, to show how the pattern is put together
    public function replaceTextOutsideTags($sourceText = null, $toReplace = 'inner text', $dummyText = '(REPLACED TEXT HERE)')
    {
      $string = $sourceText ?? "Inner text
      You can find a link here link and a lot 
      of things in different styles. Public platform can appear in bold: 
      public platform, and we also have italics here too: italics. 
      While I like soft pillows I am picky about soft pillows. 
      While I want to find fox, I din't want foxes to show up.
      The text \"shiny fruits\" is in a span tag:  one of the shiny fruits.
      The inner text like this inner inner text  here to test too, event inner text
      omg thats sad... or not
      ";
      // [[:punct:]] would be convenient here, but the POSIX punct class also contains < and >
      $punctuation = "\.,!\?:;\|\/=\"#"; // extend this list for other characters (e.g. ' or -) that may appear in your text
      $stringPart = "\b" . preg_quote($toReplace, '/') . "\b"; // escape regex metacharacters in the search phrase
      $excludeSequence = "(?![\w\n\s>$punctuation]*?";
      $excludeOutside = "$excludeSequence<\/)"; // the ) closes the lookahead opened in $excludeSequence
      $excludeTag = "$excludeSequence>)"; // the ) closes the lookahead opened in $excludeSequence
      $pattern = "/" . $stringPart . $excludeOutside . $excludeTag . "/im";
      
      return preg_replace($pattern, $dummyText, $string);
    }
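
To see the pattern work on input that actually contains tags, here is a minimal sketch that applies the same two lookaheads to lines from the question. The function name replaceOutsideTags and the [LINK] placeholder are made up for illustration, and the sketch assumes the same thing as the answer: valid, paired tags and no literal < or > in the text.

```php
<?php
// Hypothetical standalone variant of the method above, for illustration only.
function replaceOutsideTags(string $html, string $phrase, string $replacement): string
{
    $punctuation = '\.,!\?:;\|\/="#';
    // Characters the lookaheads may skip over; < is deliberately absent.
    $run = '[\w\s>' . $punctuation . ']*?';
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b'
        . '(?!' . $run . '<\/)'   // reject if a closing tag follows
        . '(?!' . $run . '>)/i';  // reject if we are inside a tag such as <a href="#">
    return preg_replace($pattern, $replacement, $html);
}

$pillows = 'While I like soft pillows I am picky about soft <i>pillows</i>.';
echo replaceOutsideTags($pillows, 'pillows', '[LINK]'), "\n";
// While I like soft [LINK] I am picky about soft <i>pillows</i>.

$bold = 'Public platform can appear in bold: <b>public platform</b>, and we also have italics here too: <i>italics</i>.';
echo replaceOutsideTags($bold, 'public platform', '[LINK]'), "\n";
// [LINK] can appear in bold: <b>public platform</b>, and we also have italics here too: <i>italics</i>.

$fox = "While I want to find fox, I din't want foxes to show up.";
echo replaceOutsideTags($fox, 'fox', '[LINK]'), "\n";
// While I want to find [LINK], I din't want foxes to show up.
```

The match is case-insensitive, so "Public platform" at the start of the second line is replaced while the copy inside <b> is left alone.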
    

    Example output with default parameters

    """
         (REPLACED TEXT HERE)\r\n
         You can find a link here link and a lot \r\n
         of things in different styles. Public platform can appear in bold: \r\n
         public platform, and we also have italics here too: italics. \r\n
         While I like soft pillows I am picky about soft pillows. \r\n
         While I want to find fox, I din't want foxes to show up.\r\n
         The text "shiny fruits" is in a span tag:  one of the shiny fruits.\r\n
         The (REPLACED TEXT HERE) like this inner inner text  here to test too, event (REPLACED TEXT HERE)\r\n
         omg thats sad... or not     
         """

    Now step by step

    1. The \b word boundaries ensure the match is not followed by further word characters (when searching for pillow, we don't want pillowS).
    2. If the text is followed by any run of \w word characters, \s whitespace, \n newlines and the allowed punctuation, and that run ends at the start of a closing tag, we don't want the match. That is the negative lookahead (?![\w\n\s>$punctuation]*?<\/). The lookahead cannot run into a new tag, because < is not part of the described sequence ($excludeOutside variable).
    3. The $excludeTag variable is basically the same as $excludeOutside, but covers cases where $toReplace could be an HTML tag name itself, such as a.

    Please note that this code cannot handle text that contains a literal < or >; using these symbols may cause unexpected behavior.
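
The two lookahead conditions from the steps above can also be checked in isolation. The sketch below is only an illustration: the helper lookaheadsTriggered and its array keys are hypothetical names, and it uses positive lookaheads to report whether each condition fires, so a true value means the original pattern would skip that occurrence.

```php
<?php
// Hypothetical helper: reports which lookahead condition fires for $phrase in $text.
function lookaheadsTriggered(string $text, string $phrase): array
{
    // Same character run as in the answer: word chars, whitespace, > and some punctuation.
    $run  = '[\w\s>\.,!\?:;\|\/="#]*?';
    $word = '\b' . preg_quote($phrase, '/') . '\b';
    return [
        // Mirrors $excludeOutside: a closing tag follows the run.
        'closingTagAhead' => preg_match('/' . $word . '(?=' . $run . '<\/)/i', $text) === 1,
        // Mirrors $excludeTag: a > follows the run.
        'tagEndAhead'     => preg_match('/' . $word . '(?=' . $run . '>)/i', $text) === 1,
    ];
}

$cases = [
    ['<b>inner text</b>', 'inner text'],        // wrapped in a tag
    ['<span class="x">hi', 'span'],             // phrase is the tag name itself
    ['inner text > 5', 'inner text'],           // literal > in plain text
    ["<b>inner text isn't</b>", 'inner text'],  // apostrophe is not in the allowed run
];

foreach ($cases as [$text, $phrase]) {
    echo $text, ' => ', json_encode(lookaheadsTriggered($text, $phrase)), "\n";
}
// <b>inner text</b> => {"closingTagAhead":true,"tagEndAhead":false}
// <span class="x">hi => {"closingTagAhead":false,"tagEndAhead":true}
// inner text > 5 => {"closingTagAhead":false,"tagEndAhead":true}
// <b>inner text isn't</b> => {"closingTagAhead":false,"tagEndAhead":false}
```

The last two cases show the limits: a literal > in plain text makes the pattern skip a phrase that is not inside any tag, and a character missing from the punctuation list (here the apostrophe) lets a phrase inside <b> be replaced after all.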
