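Converting plain-text URLs to links while preserving URLs inside HTML tags
In an HTML document, you may need to convert plain-text URLs into clickable links while leaving alone the URLs that already appear inside HTML tags. This is harder than it looks, because many common text-replacement approaches also match the URLs inside tags.
Problem statement
The following snippet illustrates the problem: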
I need your help here. I want to turn this:
sometext sometext http://www.somedomain.com/index.html sometext sometext
into:
sometext sometext <a href="http://www.somedomain.com/index.html">www.somedomain.com/index.html</a> sometext sometext
However, the existing regex solution also matches URLs inside img tags:
sometext sometext <img src="http://domain.com/image.jpg"> sometext sometext
Converting this accidentally produces:
sometext sometext <img src="<a href="http://domain.com/image.jpg">domain.com/image.jpg</a>"> sometext sometext
Solution
To isolate and replace only the URLs that are not inside HTML tags, we can use XPath and DOM manipulation. An XPath query selects the text nodes that contain a URL while excluding those that are descendants of anchor tags. Attribute values such as an img src are not text nodes, so the query never selects them:
$texts = $xPath->query(
    '/html/body//text()[
        not(ancestor::a) and
        (contains(., "http://") or contains(., "https://") or contains(., "ftp://"))
    ]'
);
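The query above assumes that `$xPath` is already bound to a parsed document. A minimal sketch of that setup follows, assuming the markup is available as a string; the `$html` sample and its URLs are made up for illustration. It collects only the text nodes the query selects, which makes it easy to check that the `src` attribute and the text of an existing link are left alone.

```php
<?php
// Hypothetical sample input; the URLs are placeholders.
$html = '<html><body>'
      . '<p>visit http://www.somedomain.com/index.html today</p>'
      . '<p><img src="http://domain.com/image.jpg"></p>'
      . '<p><a href="http://example.com/">http://example.com/</a></p>'
      . '</body></html>';

// Parse the markup and bind an XPath evaluator to the resulting document.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

// Same query as above: text nodes containing a URL scheme, outside any <a>.
$texts = $xPath->query(
    '/html/body//text()[not(ancestor::a) and (contains(.,"http://") or contains(.,"https://") or contains(.,"ftp://"))]'
);

// Collect the selected text so it can be inspected.
$selected = [];
foreach ($texts as $text) {
    $selected[] = $text->data;
}
print_r($selected);
```

With this input, only the text of the first paragraph is selected: the image has no text node at all, and the third URL sits inside an anchor.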
Once these text nodes are identified, we can replace them with document fragments containing the appropriate anchor elements. This ensures that the URLs are converted without affecting the surrounding HTML structure:
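foreach ($texts as $text) {
    $fragment = $dom->createDocumentFragment();
    // Wrap every URL in the text node in an anchor; $1 is the matched URL.
    $fragment->appendXML(
        preg_replace(
            "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i",
            '<a href="$1">$1</a>',
            $text->data
        )
    );
    // Swap the original text node for the fragment with the new links.
    $text->parentNode->replaceChild($fragment, $text);
}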
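One caveat with `appendXML()` is that it parses its argument as XML, so a text node containing a bare `&` (common in query strings) may fail to parse. A possible alternative, sketched here on the assumption that the same URL pattern is acceptable, builds the anchor elements with DOM methods instead of from a string. The function name `linkifyTextNodes` and the sample input are hypothetical, not part of the original solution.

```php
<?php
// Hypothetical helper: wraps plain-text URLs in <a> elements using DOM calls
// only, so characters such as "&" never pass through an XML parser.
function linkifyTextNodes(DOMDocument $dom): void
{
    $xPath = new DOMXPath($dom);
    $texts = $xPath->query(
        '/html/body//text()[not(ancestor::a) and (contains(.,"http://") or contains(.,"https://") or contains(.,"ftp://"))]'
    );
    // Same URL pattern as in the loop above.
    $pattern = "~((?:http|https|ftp)://(?:\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)~i";

    // Copy to an array first: the loop modifies the tree the node list came from.
    foreach (iterator_to_array($texts) as $text) {
        // With the delimiter captured, odd indexes hold URLs, even indexes plain text.
        $parts = preg_split($pattern, $text->data, -1, PREG_SPLIT_DELIM_CAPTURE);
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue; // nothing to emit for empty pieces
            }
            if ($i % 2 === 1) {
                // A URL: create <a href="URL">URL</a>.
                $a = $dom->createElement('a');
                $a->setAttribute('href', $part);
                $a->appendChild($dom->createTextNode($part));
                $fragment->appendChild($a);
            } else {
                // Surrounding plain text is copied through unchanged.
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }
}

// Usage with made-up input: one plain-text URL with a query string, one image.
$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body>'
    . '<p>sometext http://www.somedomain.com/index.html?a=1&amp;b=2 sometext</p>'
    . '<p><img src="http://domain.com/image.jpg"></p>'
    . '</body></html>'
);
linkifyTextNodes($dom);
echo $dom->saveHTML();
```

The trade-off is a slightly longer loop in exchange for not depending on the text being well-formed XML; the selection logic is unchanged.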