Why Does PHP DOMDocument's loadHTML Fail with UTF-8 Encoding, and How Can I Fix It?

Barbara Streisand
2024-12-30 16:48:09

PHP DOMDocument loadHTML Cannot Encode UTF-8 Correctly

DOMDocument's loadHTML method treats its input as ISO-8859-1 unless the markup declares another encoding. Each byte of a multi-byte UTF-8 sequence is then read as a separate character, so non-ASCII text comes out garbled.

The underlying parser used by DOMDocument (libxml2) expects HTML 4 input, so HTML5 documents can cause additional problems, such as parser warnings for newer elements.
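A minimal sketch of that HTML5 friction, assuming the symptom is parser warnings on newer elements such as <section> (whether any warning is actually raised depends on the libxml2 version PHP was built against). It collects libxml errors internally rather than letting them surface as PHP warnings; the sample markup and variable names are illustrative only.

```php
<?php
// Illustrative markup: <section> is an HTML5 element that an HTML 4 parser may complain about.
$html = '<section><p>Hello</p></section>';

// Collect parser errors internally instead of raising PHP warnings,
// remembering the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Fetch whatever the parser reported, then clear the buffer and restore the setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Print each reported message (there may be none on newer libxml2 builds).
foreach ($errors as $error) {
    echo trim($error->message), PHP_EOL;
}

// Print how many <section> elements ended up in the parsed tree.
echo $dom->getElementsByTagName('section')->length, PHP_EOL;
```

Collecting the errors this way keeps the output clean while still letting you inspect what the parser objected to.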

Solution:

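To resolve this issue, declare the character encoding of your HTML in one of the following ways:

XML Encoding Declaration: begin the markup with <?xml encoding="utf-8" ?>.

ContentType Header: include <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the markup.

XML Encoding Prefix: if you cannot edit the markup itself, prepend the XML encoding declaration to the string at the point where you call loadHTML.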
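As a rough sketch of these options, assuming the input string is known to be valid UTF-8: the helper names loadUtf8WithXmlDeclaration and loadUtf8WithMetaTag are hypothetical, made up for this illustration, and are not part of DOMDocument.

```php
<?php
// Hypothetical helpers for illustration; both assume $html is valid UTF-8.

function loadUtf8WithXmlDeclaration(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // The prepended XML declaration tells the parser to decode the bytes as UTF-8.
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    return $dom;
}

function loadUtf8WithMetaTag(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // The http-equiv Content-Type meta element declares the charset inside the markup itself.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    return $dom;
}

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';

// Read the paragraph text back from each document.
echo loadUtf8WithXmlDeclaration($profile)->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
echo loadUtf8WithMetaTag($profile)->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

Both helpers change only the string handed to loadHTML, so they can be dropped in without touching the rest of the DOM code.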

Workaround for Unknown HTML Content:

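If you cannot tell whether the markup already carries an encoding declaration, use a workaround such as SmartDOMDocument, or convert the non-ASCII characters to HTML entities before parsing. The following PHP code does the latter and still assumes the input is UTF-8: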

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

Caution for PHP 8.2+:

Since PHP 8.2, passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated and generates a deprecation warning. As an alternative:

$dom->loadHTML(mb_encode_numericentity($profile, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'));

While not ideal, this method is safe: every character outside ASCII becomes a numeric entity, so the resulting string is plain ASCII and reads the same under ISO-8859-1.
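The entity-based workaround could be wrapped in a small helper along these lines. This is a sketch that assumes the mbstring and dom extensions are loaded and that the input is valid UTF-8; the function name loadHtmlAsEntities is hypothetical.

```php
<?php
// Hypothetical helper for illustration; assumes $html is valid UTF-8.
function loadHtmlAsEntities(string $html): DOMDocument
{
    // Conversion map: code points 0x80..0x10FFFF, offset 0, mask ~0 (keep all bits).
    // Everything below 0x80 (plain ASCII, including the tags) is left untouched.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, ~0], 'UTF-8');

    $dom = new DOMDocument();
    // The string is now ASCII only, so the parser's default encoding cannot misread it.
    $dom->loadHTML($ascii);
    return $dom;
}

$dom = loadHtmlAsEntities('<p>イリノイ州シカゴにて</p>');

// The parser decodes the entities again; DOM strings are returned as UTF-8.
echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

Keeping the conversion in one place makes it easy to swap in an encoding declaration later if the input format becomes known.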
