Why is my PHP DOMDocument::loadHTML() Not Handling UTF-8 Encoding Correctly?

Barbara Streisand
2024-12-28 00:43:10

PHP DOMDocument loadHTML Not Encoding UTF-8 Correctly

When you parse HTML with DOMDocument::loadHTML(), UTF-8 text can come out garbled. Unless the input declares its encoding, DOMDocument treats the string as ISO-8859-1, so multi-byte UTF-8 characters are misinterpreted and show up as mojibake rather than as a parse error.
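
To see the problem concretely, here is a minimal sketch that loads a short UTF-8 fragment with no charset declaration and compares the result with a deliberately misread copy. The sample string and variable names are illustrative and not taken from the original question, and the exact result may depend on the libxml2 version PHP is built against.

```php
<?php
// UTF-8 fragment with no charset declaration (sample text is illustrative).
$html = '<p>café</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// DOMDocument hands strings back as UTF-8, so misread input usually shows up
// as double-encoded text (for example "cafÃ©") rather than as a parse error.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

// Simulate the misreading explicitly: treat the UTF-8 bytes as ISO-8859-1.
$misread = mb_convert_encoding('café', 'UTF-8', 'ISO-8859-1');

echo $text, PHP_EOL;
echo $text === $misread
    ? "bytes were read as ISO-8859-1\n"
    : "bytes were read as UTF-8\n";
```

If the first branch is reported, the input needs one of the fixes described in this article.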

Solution:

To get correctly decoded text, use one of the following approaches:

  • Prepend Encoding Declarations: Add an XML encoding declaration (<?xml encoding="UTF-8">) or an HTML meta charset declaration in front of the input so the parser knows it is UTF-8:

    $contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $dom->loadHTML($contentType . $profile);
  • Use the SmartDOMDocument workaround: If you cannot know whether the input already contains a declaration, convert the non-ASCII characters to HTML entities before loading, as the SmartDOMDocument library does:

    $dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
  • Alternative: On PHP 8.2+, where passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated, use mb_encode_numericentity() instead:

    $dom->loadHTML(mb_encode_numericentity($profile, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'));
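
The approaches listed above can be compared side by side. The following is a minimal sketch, assuming $profile holds a UTF-8 HTML fragment (the sample value and the variable names $viaMeta, $viaEntities, and $convmap are illustrative). It loads the same input once with a prepended meta declaration and once as numeric entities, then reads the text back.

```php
<?php
// Illustrative UTF-8 input; in the snippets above this is $profile.
$profile = '<p>イリノイ州シカゴにて</p>';

// Approach 1: prepend a meta charset declaration.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$viaMeta = new DOMDocument();
$viaMeta->loadHTML($meta . $profile);

// Approach 2: turn every non-ASCII character into a numeric entity,
// so the parser only ever sees ASCII bytes.
$convmap = [0x80, 0x10FFFF, 0, ~0]; // range start, range end, offset, mask
$viaEntities = new DOMDocument();
$viaEntities->loadHTML(mb_encode_numericentity($profile, $convmap, 'UTF-8'));

// Both documents should yield the original text when read back.
$a = $viaMeta->getElementsByTagName('p')->item(0)->textContent;
$b = $viaEntities->getElementsByTagName('p')->item(0)->textContent;
var_dump($a === $b);
```

The entity-based approach does not depend on where a declaration sits in the input, which is why it suits HTML of unknown origin.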

HTML5 Considerations:

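DOMDocument::loadHTML() relies on libxml2's HTML 4 parser, which does not follow the HTML5 parsing rules. For HTML5 documents, consider a parser designed for HTML5 compliance, such as the Dom\HTMLDocument class introduced in PHP 8.4.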
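
As one possible HTML5-aware alternative, here is a minimal sketch that assumes PHP 8.4 or later, where the Dom\HTMLDocument class is available. The sample markup is illustrative, and the class_exists() guard is there only so the snippet degrades gracefully on older versions.

```php
<?php
// Assumes PHP 8.4+, where Dom\HTMLDocument provides an HTML5 parser.
$html = '<!DOCTYPE html><html><body><p>イリノイ州シカゴにて</p></body></html>';

$text = null;
if (class_exists('Dom\HTMLDocument')) {
    // LIBXML_NOERROR silences parse-error warnings; the third argument
    // states the input encoding explicitly instead of relying on detection.
    $doc = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR, 'UTF-8');
    $text = $doc->getElementsByTagName('p')->item(0)->textContent;
    echo $text, PHP_EOL;
} else {
    echo "Dom\\HTMLDocument is not available on this PHP version\n";
}
```

Passing the encoding explicitly avoids relying on any detection rules, which keeps the behavior predictable for fragments without a meta tag.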

Example:

The following code uses mb_convert_encoding() so that the UTF-8 text stays intact when it is loaded:

    $profile = "<p>イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として</p>";
    $dom = new DOMDocument();
    $dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
    echo $dom->saveHTML();
