Replace text in a string while ignoring matches inside HTML tags

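For a given string (usually a paragraph), I want to replace certain words or phrases, but skip any occurrence that happens to be wrapped in a tag in some way. The matching also needs to be case-insensitive.

Take this as an example: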

You can find a link here <a href="#">link</a> and a lot 
of things in different styles. Public platform can appear in bold: 
<b>public platform</b>, and we also have italics here too: <i>italics</i>. 
While I like soft pillows I am picky about soft <i>pillows</i>. 
While I want to find fox, I din't want foxes to show up.
The text "shiny fruits" is in a span tag:  one of the <span>shiny fruits</span>.

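Suppose I want to replace these words:

For context: I am searching for phrase matches (not single words) and linking each match to a relevant page.

I want to avoid nested HTML (no links inside bold tags, or vice versa) and other errors (for example: the <a href="#">phrase <b>goes</a> here</b>).

I have tried a few approaches, such as searching a sanitized copy of the text with the HTML stripped out. That tells me a match exists, but it leaves me with a whole new problem: mapping the match back to the original content.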

P粉676821490 · 285 days ago · 330

All replies (1)

  • P粉594941301

    2024-03-28 12:56:47

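    I came across a mention of negative lookahead in regular expressions, and after racking my brain I arrived at this regex (it assumes your HTML tags are VALID and properly paired):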

    // made function a bit ugly just to try to show how it comes together
    public function replaceTextOutsideTags($sourceText = null, $toReplace = 'inner text', $dummyText = '(REPLACED TEXT HERE)')
    {
      $string = $sourceText ?? "Inner text
      You can find a link here link and a lot 
      of things in different styles. Public platform can appear in bold: 
      public platform, and we also have italics here too: italics. 
      While I like soft pillows I am picky about soft pillows. 
      While I want to find fox, I din't want foxes to show up.
      The text \"shiny fruits\" is in a span tag:  one of the shiny fruits.
      The inner text like this inner inner text  here to test too, event inner text
      omg thats sad... or not
      ";
      // [[:punct:]] would be convenient, but that class also contains < and >, which must stay excluded
      $punctuation = "\.,!\?:;\|\/=\"#"; // extend this list as needed for your content
      $stringPart = "\b" . preg_quote($toReplace, "/") . "\b"; // escape regex metacharacters in the phrase
      $excludeSequence = "(?![\w\n\s>$punctuation]*?"; // opens a negative lookahead over "plain" characters
      $excludeOutside = "$excludeSequence<\/)"; // closes the lookahead: no closing tag may follow the run
      $excludeTag = "$excludeSequence>)"; // closes the lookahead: no tag end ">" may follow the run
      $pattern = "/" . $stringPart . $excludeOutside . $excludeTag . "/im";
      
      return preg_replace($pattern, $dummyText, $string);
    }
    

    Example output with the default arguments:

    """
         (REPLACED TEXT HERE)\r\n
         You can find a link here link and a lot \r\n
         of things in different styles. Public platform can appear in bold: \r\n
         public platform, and we also have italics here too: italics. \r\n
         While I like soft pillows I am picky about soft pillows. \r\n
         While I want to find fox, I din't want foxes to show up.\r\n
         The text "shiny fruits" is in a span tag:  one of the shiny fruits.\r\n
         The (REPLACED TEXT HERE) like this inner inner text  here to test too, event (REPLACED TEXT HERE)\r\n
         omg thats sad... or not     
         """
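To see the two lookaheads at work on input that actually contains tags, here is a minimal standalone sketch. The helper name `replaceOutsideTagsDemo` and the `[X]` replacement are made up for illustration; the pattern is assembled in essentially the same way as in the function above.

```php
<?php
// Hypothetical standalone helper: builds essentially the same pattern as the answer's function.
function replaceOutsideTagsDemo(string $html, string $phrase, string $replacement): string
{
    $punctuation = '\.,!\?:;\|\/="#';
    $run = '[\w\s>' . $punctuation . ']*?';          // lazy run of "plain" characters; "<" is excluded
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b'
        . '(?!' . $run . '<\/)'                      // reject if the run leads straight into a closing tag
        . '(?!' . $run . '>)'                        // reject if the run leads to ">" (inside tag markup)
        . '/i';

    return preg_replace($pattern, $replacement, $html);
}

$html = 'Public platform can appear in bold: <b>public platform</b>.';
echo replaceOutsideTagsDemo($html, 'public platform', '[X]'), "\n";
// prints: [X] can appear in bold: <b>public platform</b>.

// The phrase inside an attribute value and inside element content is skipped.
echo replaceOutsideTagsDemo('<a title="fox">fox</a> fox, not foxes', 'fox', '[X]'), "\n";
// prints: <a title="fox">fox</a> [X], not foxes
```

Only the occurrences outside any element are replaced, and `foxes` is left alone because of the `\b` word boundary.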

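    Now step by step:

    1. No partial-word matches: thanks to the \b word boundaries, if the text only contains pillowS, pillow is not matched.
    2. If the phrase is followed by any run of \w word characters, \s whitespace or \n newlines, and the allowed punctuation, and that run ends at the start of a closing tag </, we do not want the match. This is what the negative lookahead (?![\w\n\s>$punctuation]*?<\/) is for ($excludeOutside variable). The run cannot cross into a new tag, because < is not part of the described sequence.
    3. The $excludeTag variable is basically the same as $excludeOutside, but it covers the case where $toReplace occurs in the HTML tag itself, for example a

    Note that this code does not handle text that itself contains < or >; using those symbols may lead to unexpected behavior.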
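The lookahead approach depends on the character class listing every character that may sit between the phrase and the next tag. A character outside that class (an apostrophe, for instance, as in `<i>fox, I din't</i>`) stops the run early, so the match goes through even though it is inside an element. As a possible alternative, and not the method used in the answer, here is a sketch that splits the string into tag and text tokens and replaces only in text that is not inside any element. The function name `linkPhraseOutsideElements`, the `$url` parameter and the short void-tag list are assumptions for illustration. Like the regex, it assumes well-formed, paired tags and no `>` inside attribute values.

```php
<?php
// Sketch of a different technique: tokenize, then replace only in top-level text.
// Assumes well-formed, paired tags and no ">" inside attribute values.
function linkPhraseOutsideElements(string $html, string $phrase, string $url): string
{
    // The capture group keeps the tags, so tokens alternate: text, tag, text, tag, ...
    $tokens = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $voidTags = ['br', 'hr', 'img', 'input', 'wbr'];  // illustrative list of tags without a closing tag
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
    $depth = 0;                                       // number of currently open elements
    $out = '';

    foreach ($tokens as $i => $token) {
        if ($i % 2 === 1) {                           // odd index: a tag token, copied through untouched
            if (preg_match('/^<(\/?)([a-z][a-z0-9]*)/i', $token, $m)) {
                $isVoid = in_array(strtolower($m[2]), $voidTags, true) || substr($token, -2) === '/>';
                if ($m[1] === '/') {
                    $depth = max(0, $depth - 1);      // closing tag
                } elseif (!$isVoid) {
                    $depth++;                         // opening tag
                }
            }
            $out .= $token;
        } elseif ($depth === 0) {
            // Top-level text: wrap each match in a link, keeping its original casing.
            $out .= preg_replace_callback($pattern, function (array $match) use ($url): string {
                return '<a href="' . htmlspecialchars($url) . '">' . $match[0] . '</a>';
            }, $token);
        } else {
            $out .= $token;                           // text inside an element: left as is
        }
    }

    return $out;
}

echo linkPhraseOutsideElements(
    'Soft pillows and soft <i>pillows</i>, <b>soft pillows</b>, soft pillows.',
    'soft pillows',
    '/pillows'
), "\n";
// prints: <a href="/pillows">Soft pillows</a> and soft <i>pillows</i>, <b>soft pillows</b>, <a href="/pillows">soft pillows</a>.
```

Because a phrase that spans a tag boundary (soft `<i>`pillows`</i>`) is never inside a single text token, it is skipped, which avoids the overlapping-tag problem described in the question. For markup that is not guaranteed to be well formed, a real HTML parser would be a safer basis than either regex.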
