How to Truncate Text Containing HTML While Ensuring Correct Tag Closure?

Mary-Kate Olsen
2024-11-12 03:39:01

Truncating Text Containing HTML While Ignoring Tags

When you cut HTML text at a fixed length, the cut can fall inside a tag or leave opened tags unclosed, which produces broken markup. To avoid this, the truncation routine has to parse the HTML, count only the visible characters toward the limit, and keep track of which tags are still open so that they can be closed afterwards.
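
As a small illustration of the problem, the sketch below applies a plain substr() cut to two sample strings. The sample strings and variable names are made up for this example.

```php
<?php
// Naive byte-based truncation: cuts wherever the limit falls,
// regardless of tags.
$html = '<b>Hello world</b>, and welcome!';
$naive = substr($html, 0, 10);
print($naive . "\n"); // <b>Hello w   (the <b> is never closed)

// The cut can also land inside a tag and leave a fragment behind.
$broken = substr('Hi <a href="/x">there</a>', 0, 8);
print($broken . "\n"); // Hi <a hr
```

In both cases the tag characters also count toward the limit, so fewer visible characters survive than the limit suggests.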

The following PHP function prints at most $maxLength visible characters of $html. Tags are passed through without being counted, and any tags still open at the cut point are closed:

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    // Regex pattern for matching HTML tags, entities, and UTF-8 characters
    $re = $isUtf8
        ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
        : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        // 1. Handle text leading up to the tag
        $str = substr($html, $position, $match[0][1] - $position);
        if ($printedLength + strlen($str) <= $maxLength)
        {
            print($str);
            $printedLength += strlen($str);
        }
        else
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        // 2. Handle the tag
        $tag = $match[0][0];
        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 character through unchanged
            print($tag);
            $printedLength++;
        }
        else
        {
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // Closing tag
                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // Ensure proper tag nesting
                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag
                print($tag);
            }
            else
            {
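                // Opening tag. Note: a void element written without a trailing
                // slash (e.g. <br>) is also pushed here as if it needed closing.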
                print($tag);
                $tags[] = $tagName;
            }
        }

        $position = $match[0][1] + strlen($tag);
    }

    // 3. Print remaining text
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // 4. Close any open tags
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}
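
To make the role of the regular expression more concrete, here is a minimal sketch that runs the UTF-8 pattern over a sample string with preg_match_all() and classifies each match. The sample string and the $tokens array are assumptions made for this illustration; the function itself walks the string with preg_match() and an offset instead.

```php
<?php
// Same alternation as the UTF-8 branch of the pattern:
// tag | entity | multi-byte UTF-8 character.
$re = '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}';
$html = "<b>caf\xC3\xA9</b> &lt; tea";

preg_match_all($re, $html, $matches, PREG_OFFSET_CAPTURE);

$tokens = array();
foreach ($matches[0] as $m) {
    list($text, $offset) = $m;
    if ($text[0] == '<') {
        $kind = 'tag';      // printed, but not counted
    } else if ($text[0] == '&') {
        $kind = 'entity';   // counted as one visible character
    } else {
        $kind = 'utf8';     // several bytes, one visible character
    }
    $tokens[] = array($kind, $offset, strlen($text));
    printf("%s at byte %d (%d bytes)\n", $kind, $offset, strlen($text));
}
// Matches found: tag at 0 (3 bytes), utf8 at 6 (2 bytes),
// tag at 8 (4 bytes), entity at 13 (4 bytes).
```

Everything between two matches is plain single-byte text, which is why the function can safely measure and cut it with strlen() and substr().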

To illustrate its functionality:

printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); // Output: <b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> w

printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); // Output: <table><tr><td>Heck, </td><td>thro</td></tr></table>

printTruncated(10, "<em><b>Hello</b>&#20;w\xC3\xB8rld!</em>"); // Output: <em><b>Hello</b>&#20;wørl</em>
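
Because the function prints its result rather than returning it, code that needs the truncated markup as a string can wrap the call in PHP's output buffering. The sketch below shows one possible way; captureOutput() is a hypothetical helper introduced for this example, and the closure stands in for a real call such as printTruncated(10, $html).

```php
<?php
// Hypothetical helper (not part of the function above): runs any callable
// that prints, and returns what it printed as a string.
function captureOutput(callable $fn)
{
    ob_start();
    try {
        $fn();
        return ob_get_contents();
    } finally {
        ob_end_clean(); // always stop buffering, even if $fn throws
    }
}

// Stand-in for a call such as printTruncated(10, $html)
$snippet = captureOutput(function () {
    print('<b>Hello</b> w');
});
```

After this runs, $snippet holds the printed markup and nothing has been sent to the browser.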
