
A transcoding bug in simple_html_dom, the PHP HTML parsing library

WBOY
2016-07-13 10:29:32

I have been using simple_html_dom to scrape some articles these days. Chinese websites are almost always encoded in GBK, GB2312, or UTF-8, with GB2312 and UTF-8 being the most common.

My version of simple_html_dom has a method convert_text that looks like this.


// PaperG - Function to convert the text from one character set to another if the two sets are not the same.
function convert_text($text)
{
    global $debug_object;
    if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}
    $converted_text = $text;
    $sourceCharset = "";
    $targetCharset = "";
    if ($this->dom)
    {
        $sourceCharset = strtoupper($this->dom->_charset);
        $targetCharset = strtoupper($this->dom->_target_charset);
    }
    if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}
    if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
    {
        // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
        if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
        {
            $converted_text = $text;
        }
        else
        {
            $converted_text = iconv($sourceCharset, $targetCharset, $text);
        }
    }
    // Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output.
    if ($targetCharset == 'UTF-8')
    {
        if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 3);
        }
        if (substr($converted_text, -3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 0, -3);
        }
    }
    return $converted_text;
}

Look at this line:


$converted_text = iconv($sourceCharset, $targetCharset, $text);

This iconv call can transcode the text incorrectly. For example, a gb2312 page comes out like this:


On April 26th in At the 2014 Longines FEI Jumping World Cup Chinese League Qualifying Tournament held at the Lian Yuanli Park Equestrian Stadium, 24-year-old Han Zhuangzhuang not only scored zero penalty points... he was the seventh to appear鍖椾HanOlympic rider Zhao Zhiwen was the first to score zero penalty points, with a time of 77 seconds 07...
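One plausible way to end up with output like this, offered as an assumption and not as a diagnosis of the case above, is that the bytes were already UTF-8 when iconv was told to treat them as GBK/GB2312. The fragment 鍖椾 in the sample appears consistent with that: it is what the UTF-8 bytes of 北 plus the first byte of the next character look like when decoded as GBK. The sketch below reproduces the effect on a short string. It assumes the iconv extension is available and supports GBK; the variable names are made up for illustration.

```php
<?php
// "北京" written out as raw UTF-8 bytes so the example does not depend
// on the encoding of this source file.
$original = "\xE5\x8C\x97\xE4\xBA\xAC";

// Wrong on purpose: treat the UTF-8 bytes as GBK and convert them to
// UTF-8 a second time. This produces mojibake instead of the original text.
$garbled = iconv('GBK', 'UTF-8', $original);

// This particular double conversion is reversible: encoding the garbled
// text back to GBK yields the original UTF-8 bytes again.
$restored = iconv('UTF-8', 'GBK', $garbled);

echo $garbled, "\n";
echo $restored, "\n";
```

If reversing the conversion on a garbled sample gives back readable text, that is a strong hint that the source charset reported to iconv was wrong for at least part of the input.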

The garbled output shows that the transcoding inside the library is not handled properly. I only use simple_html_dom to build the DOM, so I did not want to spend time fixing the bug properly. Instead, I simply changed

$converted_text = iconv($sourceCharset, $targetCharset, $text);

to

$converted_text = $text;

That's it. The idea is to disable the library's transcoding altogether, which means any charset conversion has to happen outside simple_html_dom. With that out of the way, the work can continue.
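For readers who would rather keep a conversion step than drop it, here is a minimal sketch of a more defensive version. It is not part of simple_html_dom: `convert_text_safely` is a hypothetical standalone helper, and it assumes the mbstring and iconv extensions are loaded. It leaves text alone when it is already valid UTF-8, decodes pages declared as gb2312 with GBK (a superset of GB2312), and falls back to the untouched input if iconv fails.

```php
<?php
// Hypothetical helper for illustration; not the library's own method.
function convert_text_safely(string $text, string $sourceCharset, string $targetCharset): string
{
    // Nothing to do when a charset is unknown or both are the same.
    if ($sourceCharset === '' || $targetCharset === ''
        || strcasecmp($sourceCharset, $targetCharset) === 0) {
        return $text;
    }

    // If the target is UTF-8 and the bytes already form valid UTF-8,
    // assume the declared source charset was wrong and skip conversion.
    if (strcasecmp($targetCharset, 'UTF-8') === 0
        && mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Pages that declare gb2312 sometimes contain GBK-only characters,
    // so decode with the superset.
    if (strcasecmp($sourceCharset, 'GB2312') === 0) {
        $sourceCharset = 'GBK';
    }

    // //IGNORE asks iconv to drop characters it cannot convert; the @
    // suppresses the notice iconv raises on invalid input.
    $converted = @iconv($sourceCharset, $targetCharset . '//IGNORE', $text);

    // Fall back to the original bytes if the conversion failed outright.
    return $converted === false ? $text : $converted;
}

// Usage: "北京" as GB2312 bytes, converted to UTF-8.
echo convert_text_safely("\xB1\xB1\xBE\xA9", 'GB2312', 'UTF-8'), "\n";
```

The validity check is a heuristic: a short GBK string can occasionally also be valid UTF-8, so this reduces double conversion but does not rule out every misdetection.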
