Solution to the problem of Chinese truncation using iconv in PHP

WBOY | Original | 2016-07-13 10:07:10 | 972 views

This article describes how to fix Chinese text being truncated when PHP converts encodings with iconv. It works through an example, explains why the truncation happens, and gives a specific fix. The details are as follows.

Today I wrote a scraper. The principle is simple: use curl to fetch the HTML of the target page, parse it, extract the required data, and save it to the database.

The target page is encoded in GB2312, while the local side uses UTF-8, so the content has to be converted after it is fetched.

The conversion uses the iconv function:

iconv — Convert a string to the required character encoding
string iconv ( string $in_charset , string $out_charset , string $str )

Converts the string str from in_charset to out_charset.

The conversion itself is simple; just call iconv directly:

<?php
$content = iconv('GB2312', 'UTF-8', $content); // $content is the collected content
?>
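To make the call concrete, here is a minimal self-contained sketch that converts a known GB2312 byte sequence instead of scraped content. The sample string and the variable names are chosen for illustration only and are not from the original program.

```php
<?php
// "中文" encoded as GB2312: two bytes per character.
$gb2312 = "\xD6\xD0\xCE\xC4";

// Convert the GB2312 bytes to UTF-8.
$utf8 = iconv('GB2312', 'UTF-8', $gb2312);

// Each of these characters takes three bytes in UTF-8, so the byte length grows.
printf("%d bytes in, %d bytes out: %s\n", strlen($gb2312), strlen($utf8), $utf8);
// prints "4 bytes in, 6 bytes out: 中文"
```

A converted string that is longer in bytes than the input is therefore normal for Chinese text; a converted string that is much shorter is a sign that something went wrong.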

I tested several pages and they were all scraped correctly. In later runs, however, the content of several pages came out incomplete.

I first suspected the regular expressions, but checking them ruled that out. Further investigation showed that the content after the iconv conversion was much shorter than the content that had been fetched.

The Apache log contained this message: Notice: iconv(): Detected an illegal character in input string.
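Rather than finding this notice in the log afterwards, the failure can be caught at the point of conversion. The following is a minimal sketch, assuming PHP 7 or later; convertOrFail is a hypothetical helper name, not part of PHP or of the original program. Depending on the PHP version, iconv() returns either false or a truncated string when it hits an illegal character, so the sketch treats both a false result and any raised notice as a failure.

```php
<?php
// Hypothetical helper: convert $input and throw if iconv() reports a problem.
function convertOrFail(string $input, string $from, string $to): string
{
    $notice = null;

    // Temporarily capture notices and warnings raised inside iconv().
    set_error_handler(function (int $errno, string $errstr) use (&$notice): bool {
        $notice = $errstr;
        return true; // handled; do not pass on to the default handler
    });

    try {
        $result = iconv($from, $to, $input);
    } finally {
        restore_error_handler();
    }

    // A false result or any captured message means the output cannot be trusted.
    if ($result === false || $notice !== null) {
        throw new RuntimeException('iconv failed: ' . ($notice ?? 'unknown error'));
    }

    return $result;
}

// Usage: "中" followed by a stray 0xFF byte, which is not valid GB2312, then "文".
try {
    convertOrFail("\xD6\xD0\xFF\xCE\xC4", 'GB2312', 'UTF-8');
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

Failing loudly like this makes an incomplete page visible immediately, instead of letting shortened content reach the database.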

The manual says the following:

If you append the string //TRANSLIT to out_charset, transliteration is enabled. This means that when a character cannot be represented in the target character set, it can be approximated by one or several similar-looking characters.

If you append the string //IGNORE, characters that cannot be represented in the target character set are silently discarded. Otherwise, str is cut from the first illegal character and an E_NOTICE is generated.

So when iconv encounters a character it cannot convert, it truncates the string at that character and raises an E_NOTICE, which is why everything after that point was lost. (Since PHP 5.4, iconv() returns false in this case instead of the partial string.)

Appending //IGNORE to the output character set makes iconv drop only the characters it cannot convert; the content after them is kept.

After changing the program accordingly, everything works:

<?php
$content = iconv('GB2312', 'UTF-8//IGNORE', $content); // $content is the collected content
?>
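The difference between the two calls can be seen on a small input. This is a sketch with a made-up sample string; the exact behavior of //IGNORE depends on the iconv implementation PHP was built against (glibc, libiconv, musl), so results on a given server may differ.

```php
<?php
// "中" + one stray byte (0xFF, not valid in GB2312) + "文".
$dirty = "\xD6\xD0\xFF\xCE\xC4";

// Strict conversion. The @ only hides the notice for this demonstration.
$strict = @iconv('GB2312', 'UTF-8', $dirty);

// Lenient conversion: the unconvertible byte is dropped, the rest is kept.
$lenient = @iconv('GB2312', 'UTF-8//IGNORE', $dirty);

var_dump($strict);
var_dump($lenient);
```

Note that //IGNORE hides the problem rather than reporting it: the dropped bytes are gone for good, so it is worth logging which pages needed it.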
Tip: when converting to or from UTF-8 with iconv, write the charset name as UTF-8 rather than UTF8, because UTF8 causes problems on some servers.

I hope this article is helpful for your PHP programming.