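For a given string (usually a paragraph), I want to replace some words/phrases, but ignore them if they happen to be wrapped in a tag in some way. The matching also needs to be case-insensitive.
Take this as an example: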
You can find a link here <a href="#">link</a> and a lot of things in different styles. Public platform can appear in bold: <b>public platform</b>, and we also have italics here too: <i>italics</i>. While I like soft pillows I am picky about soft <i>pillows</i>. While I want to find fox, I din't want foxes to show up. The text "shiny fruits" is in a span tag: one of the <span>shiny fruits</span>.
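Suppose I want to replace these words:
link: appears 2 times. The first is plain text (match), the second is inside an A tag (ignore).
public platform: the first is plain text (match, case-insensitive), the second is inside a B tag (ignore).
soft pillows: 1 plain-text match.
fox: 1 plain-text match. Only whole words count.
fruits: the first is plain text (match), the second is inside a span tag (ignore).
For background: I am searching for phrase matches (not individual words) and linking each match to a relevant page.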
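The whole-word and case-insensitive requirements from the list above can be illustrated in isolation. This is only a minimal sketch: the sample sentence is shortened and contains no tags, so tag handling is deliberately left out here.

```php
<?php
// Shortened, tag-free sample text (an assumption made for this illustration).
$text = 'While I want to find fox, I did not want foxes. Public platform.';

// \b on both sides keeps "fox" from matching inside "foxes".
$foxCount = preg_match_all('/\bfox\b/i', $text);

// The i flag lets "public platform" match "Public platform".
$phraseCount = preg_match_all('/\bpublic platform\b/i', $text);

echo $foxCount, ' ', $phraseCount, PHP_EOL; // prints "1 1"
```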
I want to avoid nested HTML (no links inside bold tags, or vice versa) and other errors (for example: the <a href="#">phrase <b>goes</a> here</b>).
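I have tried several approaches, such as searching a sanitized copy of the text with the HTML stripped out. That tells me a match exists, but then I run into a whole new problem: mapping it back to the original content.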
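One possible way around the mapping problem, sketched here as an assumption rather than something taken from this thread, is to never strip the tags at all: split the string into tag chunks and text chunks, and replace only in text chunks that sit outside every element. The function name replaceAtTopLevel is hypothetical, and the sketch assumes every opening tag has a matching closing tag (a void tag such as <br> would need special handling).

```php
<?php
// Sketch: replace $phrase only in text that is not inside any element.
function replaceAtTopLevel(string $html, string $phrase, string $replacement): string
{
    // Keep the tags in the result by capturing the delimiter.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
    $depth = 0;
    $out = '';
    foreach ($parts as $part) {
        if ($part !== '' && $part[0] === '<') {
            if (isset($part[1]) && $part[1] === '/') {
                $depth = max(0, $depth - 1);      // closing tag
            } elseif (substr($part, -2) !== '/>') {
                $depth++;                          // opening tag (not self-closed)
            }
            $out .= $part;
        } else {
            // Text chunk: only touch it when no element is open.
            $out .= $depth === 0 ? preg_replace($pattern, $replacement, $part) : $part;
        }
    }
    return $out;
}

echo replaceAtTopLevel(
    'Public platform in bold: <b>public platform</b>.',
    'public platform',
    '[X]'
), PHP_EOL; // prints "[X] in bold: <b>public platform</b>."
```

Because the original string is rebuilt chunk by chunk, there is no separate stripped copy whose offsets would have to be mapped back.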
P粉594941301 2024-03-28 12:56:47
I found mentions of regex negative lookahead, and after racking my brain I arrived at this regex (assuming your HTML tags are VALIDLY paired):
// made function a bit ugly just to try to show how it comes together
public function replaceTextOutsideTags($sourceText = null, $toReplace = 'inner text', $dummyText = '(REPLACED TEXT HERE)')
{
$string = $sourceText ?? "Inner text
You can find a link here link and a lot
of things in different styles. Public platform can appear in bold:
public platform, and we also have italics here too: italics.
While I like soft pillows I am picky about soft pillows.
While I want to find fox, I din't want foxes to show up.
The text \"shiny fruits\" is in a span tag: one of the shiny fruits.
The inner text like this inner inner text here to test too, event inner text
omg thats sad... or not
";
// [[:punct:]] would be convenient, but that class also contains < and >, which must stay excluded here
$punctuation = "\.,!\?:;\|\/=\"#"; // extend as needed: any character missing from this list stops the lookahead scan early
$stringPart = "\b$toReplace\b";
$excludeSequence = "(?![\w\n\s>$punctuation]*?";
$excludeOutside = "$excludeSequence<\/)"; // note on closing )
$excludeTag = "$excludeSequence>)"; // note on closing )
$pattern = "/" . $stringPart . $excludeOutside . $excludeTag . "/im";
return preg_replace($pattern, $dummyText, $string);
}
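To try the pattern outside a class and on input that still contains tags, it can be wrapped in a plain function. This is a sketch under assumptions: the name replaceOutsideTags, the [X] placeholder and the preg_quote() call are additions for illustration and are not part of the method above.

```php
<?php
// Standalone variant of the pattern built in replaceTextOutsideTags().
function replaceOutsideTags(string $html, string $phrase, string $replacement): string
{
    $punctuation = '\.,!\?:;\|\/="#';
    $lookaheadStart = '(?![\w\s>' . $punctuation . ']*?';
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b'
        . $lookaheadStart . '<\/)'  // skip if the next tag is a closing tag
        . $lookaheadStart . '>)'    // skip if we are inside a tag's angle brackets
        . '/i';
    return preg_replace($pattern, $replacement, $html);
}

$html = 'You can find a link here <a href="#">link</a> and a lot of things.';
echo replaceOutsideTags($html, 'link', '[X]'), PHP_EOL;
// prints: You can find a [X] here <a href="#">link</a> and a lot of things.

// Text wrapped in any enclosing element is skipped as well, because the
// next tag after the phrase is a closing tag.
echo replaceOutsideTags('<p>a link here</p>', 'link', '[X]'), PHP_EOL;
// prints: <p>a link here</p>
```

The second call shows a consequence of the approach: the pattern only replaces text that is not inside any element at all, so a paragraph wrapped in <p> would be left untouched.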
Example output of replaceTextOutsideTags() with the default parameters:
""" (REPLACED TEXT HERE)\r\n You can find a link here link and a lot \r\n of things in different styles. Public platform can appear in bold: \r\n public platform, and we also have italics here too: italics. \r\n While I like soft pillows I am picky about soft pillows. \r\n While I want to find fox, I din't want foxes to show up.\r\n The text "shiny fruits" is in a span tag: one of the shiny fruits.\r\n The (REPLACED TEXT HERE) like this inner inner text here to test too, event (REPLACED TEXT HERE)\r\n omg thats sad... or not """
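Now step by step:
$stringPart wraps the phrase in \b word boundaries so that only whole words match (when searching for pillow, we do not want pillowS).
If the phrase is followed by a run of \w word characters, \s whitespace or \n newlines, and the allowed punctuation, and that run ends at the start of a closing tag </, we do not want the match. This is where the negative lookahead (?![\w\n\s>$punctuation]*?<\/) comes in. The run cannot cross into a new tag, because < is not among the allowed characters ($excludeOutside variable).
The $excludeTag variable is basically the same as $excludeOutside, but covers the case where $toReplace could be an HTML tag name itself, for example a.
Note that this code cannot handle text that contains a literal < or >; using those characters may cause unexpected behavior.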
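The role of each lookahead can be seen by running the pattern with and without the second one. This is a small sketch with a made-up input string and an [X] placeholder; the variable names are assumptions chosen for the illustration.

```php
<?php
// Same character run as in the method above, written with single quotes.
$run = '[\w\s>\.,!\?:;\|\/="#]*?';
$notBeforeClose = '(?!' . $run . '<\/)'; // corresponds to $excludeOutside
$notInsideTag   = '(?!' . $run . '>)';   // corresponds to $excludeTag

$html = 'a span: <span>x</span>';

// Only the first lookahead: the closing tag name is not protected.
$onlyFirst = preg_replace('/\bspan\b' . $notBeforeClose . '/i', '[X]', $html);

// Both lookaheads: tag names are left alone.
$both = preg_replace('/\bspan\b' . $notBeforeClose . $notInsideTag . '/i', '[X]', $html);

echo $onlyFirst, PHP_EOL; // prints "a [X]: <span>x</[X]>"
echo $both, PHP_EOL;      // prints "a [X]: <span>x</span>"
```

With only the first lookahead, the tag name in </span> is replaced because nothing after it leads to another closing tag; the second lookahead rejects it because a > follows immediately.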