How to Truncate Text Containing HTML While Ensuring Correct Tag Closure?

Mary-Kate Olsen (original)
2024-11-12 03:39:01

Truncating Text Containing HTML, Ignoring Tags

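Truncating text that contains HTML by raw character count often cuts through a tag or leaves one unclosed, which corrupts the rendered result. Avoiding this requires walking through the HTML, leaving tags out of the length count, and tracking which tags are still open so they can be closed after the cut.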
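
To make the problem concrete, here is a minimal sketch of what a naive cut does. The sample string is a made-up input, not one from the article; only the built-in `substr` and `strip_tags` are used.

```php
<?php
// Hypothetical sample input, chosen only to illustrate the problem.
$html = '<b>Hello</b> <i>world</i>';

// Naive byte-based cut: tag characters count toward the limit,
// and the cut can land in the middle of a tag.
$naive = substr($html, 0, 10);
print($naive . "\n"); // <b>Hello</

// The visible text is longer than what survived the cut.
$visible = strip_tags($html);
print($visible . "\n"); // Hello world
```

The cut spends five of its ten characters on markup and ends inside `</b>`, so the fragment is neither ten visible characters long nor well formed.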

Here is a PHP-based approach that ensures tags are closed correctly during truncation:

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    // Regex matching HTML tags, entities, and (in UTF-8 mode) multibyte
    // sequences, so that each entity or multibyte character counts as one.
    $re = $isUtf8
        ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
        : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        // 1. Handle text leading up to the tag
        $str = substr($html, $position, $match[0][1] - $position);
        if ($printedLength + strlen($str) <= $maxLength)
        {
            print($str);
            $printedLength += strlen($str);
        }
        else
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        // Stop if the leading text used up the budget exactly; otherwise a
        // following entity or multibyte character would exceed the limit.
        if ($printedLength >= $maxLength) break;

        // 2. Handle the tag, entity, or multibyte sequence
        $tag = $match[0][0];
        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 character through unchanged
            print($tag);
            $printedLength++;
        }
        else
        {
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // Closing tag
                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // Ensure proper tag nesting
                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag
                print($tag);
            }
            else
            {
                // Opening tag
                print($tag);
                $tags[] = $tagName;
            }
        }

        $position = $match[0][1] + strlen($tag);
    }

    // 3. Print remaining text
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // 4. Close any open tags
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}
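
The counting rule the function applies (tags cost nothing, each entity or multibyte sequence costs one) can be sketched on its own. `visibleLength` is a hypothetical helper name introduced here for illustration; it reuses the same patterns as the function above and inherits their limits, such as matching only lowercase, letters-only tag names.

```php
<?php
/**
 * Hypothetical helper: counts characters the way the truncation
 * routine does, without printing or truncating anything.
 */
function visibleLength(string $html): int
{
    // Tags are dropped entirely; they do not count toward the length.
    $text = preg_replace('{</?[a-z]+[^>]*>}', '', $html);

    // Each entity or UTF-8 multibyte sequence collapses to one placeholder byte.
    $text = preg_replace('{&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}', '?', $text);

    return strlen($text);
}

print(visibleLength("<em>caf\xC3\xA9 &amp; tea</em>") . "\n"); // 10
```

Comparing this value with the limit is one way to decide whether truncation is needed at all before calling the function.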

To illustrate how it works:

printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!');
// Output: <b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> w

printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>');
// Output: <table><tr><td>Heck, </td><td>thro</td></tr></table>

printTruncated(10, "<em><b>Hello</b>&#20;w\xC3\xB8rld!</em>");
// Output: <em><b>Hello</b>&#20;wørl</em>
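
Because the function prints its result instead of returning it, callers that need a string may want to capture the output. The following is a sketch using PHP's output buffering; `captureOutput` and the stand-in printer are hypothetical names introduced here, not part of the original code.

```php
<?php
/**
 * Hypothetical helper: runs a printing callable and returns what it printed.
 */
function captureOutput(callable $printer, ...$args): string
{
    ob_start();
    try {
        $printer(...$args);
        return (string) ob_get_contents();
    } finally {
        // Always discard the buffer, even if the callable throws.
        ob_end_clean();
    }
}

// Stand-in printer so the sketch runs on its own.
$shout = function (string $s): void {
    print(strtoupper($s));
};

$result = captureOutput($shout, 'abc'); // $result now holds "ABC"
```

With the truncation function loaded, the equivalent call would presumably be `captureOutput('printTruncated', 10, $html)`.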
