Why is my PHP DOMDocument::loadHTML() not handling UTF-8 correctly?

Barbara Streisand
2024-12-25 12:12:14

PHP DOMDocument loadHTML Not Encoding UTF-8 Correctly

Problem:

When HTML is parsed with PHP's DOMDocument::loadHTML(), non-ASCII UTF-8 characters are misinterpreted, so the output contains garbled text (for example, é appears as Ã©).

Cause:

When the input does not declare an encoding, DOMDocument (through the underlying libxml2 HTML parser) assumes the string is ISO-8859-1. Most HTML5 content is UTF-8, so each byte of a multi-byte UTF-8 character is read as a separate ISO-8859-1 character and the text comes out garbled.
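The misreading can be reproduced without DOMDocument. The following is a minimal sketch, assuming the mbstring extension is available; it reinterprets the UTF-8 bytes of "é" as ISO-8859-1, which is the same kind of mistake the parser makes when no encoding is declared.

```php
<?php
// "é" in UTF-8 is the two-byte sequence 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Treat each byte as one ISO-8859-1 character and re-encode the result as UTF-8.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n"; // c3a9
echo $misread, "\n";       // Ã©
```

One character has become two, because 0xC3 maps to Ã and 0xA9 maps to © in ISO-8859-1.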

Solution:

To address this issue, you need to specify the correct encoding for the input string. You have several options:

  • Prepend an XML encoding declaration: Add an <?xml encoding="UTF-8"> declaration to the beginning of the string.
  • Use a meta charset declaration: Add a <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> tag to the <head> section of the document.
  • Use the SmartDOMDocument library: This library works around the issue by converting the string to HTML entities before loading it into DOMDocument.
  • Use the mb_encode_numericentity() function: This function converts UTF-8 characters to their HTML entity equivalents, which DOMDocument can then parse correctly.
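The first two options in the list can be sketched as follows. This is an illustration under stated assumptions rather than a definitive recipe: the variable names are made up for the example, the sample text is "café", and libxml_use_internal_errors(true) is added only to keep parser warnings out of the output.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Sample fragment: "café" encoded as UTF-8.
$fragment = "<p>caf\xC3\xA9</p>";

// Option 1: prepend an XML encoding declaration to the fragment.
$viaXmlDecl = new DOMDocument();
$viaXmlDecl->loadHTML('<?xml encoding="UTF-8">' . $fragment);
$textXmlDecl = $viaXmlDecl->getElementsByTagName('p')->item(0)->textContent;

// Option 2: wrap the fragment in a document whose <head> declares the charset.
$viaMeta = new DOMDocument();
$viaMeta->loadHTML(
    '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body>' . $fragment . '</body></html>'
);
$textMeta = $viaMeta->getElementsByTagName('p')->item(0)->textContent;

// Both checks compare the parsed text with the original UTF-8 string.
var_dump($textXmlDecl === "caf\xC3\xA9");
var_dump($textMeta === "caf\xC3\xA9");

libxml_clear_errors();
```

Note that the prepended declaration stays in the parsed document as an extra node, so code that serializes the document afterwards may want to remove it first.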

Example:

This code demonstrates using the mb_encode_numericentity() function:

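// Sample UTF-8 HTML fragment containing Japanese text.
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に</p>';
$dom = new DOMDocument();
// Convert every non-ASCII character (U+0080 and above) to a numeric entity,
// so the parser only ever sees ASCII and the input encoding no longer matters.
$dom->loadHTML(mb_encode_numericentity($profile, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'));
echo $dom->saveHTML();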
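To check that the text survives the round trip, the same technique can be wrapped in a small function and the parsed text compared with the original. In this sketch, loadUtf8Html() is a hypothetical helper written for the example and is not part of DOMDocument; the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not a built-in): loads a UTF-8 HTML string by first
// converting all non-ASCII characters to numeric entities.
function loadUtf8Html(string $html): DOMDocument
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, ~0], 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();

    return $dom;
}

$dom = loadUtf8Html('<p>イリノイ州</p>');

// DOM strings are returned as UTF-8, so the original text should come back intact.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === 'イリノイ州');
```

Reading values through DOM properties such as textContent returns real UTF-8 characters, whereas saveHTML() may still write the characters back out as entities because the document itself carries no encoding.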

By using these techniques, you can ensure that UTF-8 characters are parsed and displayed correctly in your PHP DOMDocument.
