Home  >  Article  >  Backend Development  >  How to use php to convert unicode and utf8?

How to use php to convert unicode and utf8?

coldplay.xixi
coldplay.xixiOriginal
2020-07-10 11:03:163242browse

How to use PHP to convert unicode and utf8: First, there is no need to consider the [4-6] byte encoding; then [utf-8] characters with more than four bytes appear, which can be directly regarded as garbled characters. Just ignore it or convert it to unicode entity form, the code is [$utf8char = "{$c};"].

How to use php to convert unicode and utf8?

How to use php to convert unicode and utf8:

Unicode encoding is to implement utf-8 and gb series The basis for encoding (gb2312, gbk, gb18030) conversion. Although we can also directly make a comparison table from utf-8 to these encodings, few people will do this because the variable encoding of utf-8 is uncertain. Therefore, in general, the comparison table between unicode and gb encoding is used. Unicode (UCS-2) is actually the basic encoding of utf-8, and utf-8 is just an implementation of it. The two have the following correspondence:

  • Unicode symbol range | UTF-8 encoding method

  • ##u0000 0000 - u0000 007F | 0xxxxxxx

  • u0000 0080 - u0000 07FF | 110xxxxx 10xxxxxx

  • ##u0000 0800 - u0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
  • Due to the characters currently used by utf-8 They are all in UCS-2, so there is no need to consider the 4-6 byte encoding. Similarly, during reverse conversion, if more than four bytes of UTF-8 characters appear, they can be directly regarded as garbled characters. Ignore or convert to unicode entity form ("long int;" form), and then hand it over to the browser or related parsing program for processing. The algorithm for converting unicode to utf-8 encoding using PHP is as follows:
/*
 * 参数 $c 是unicode字符编码的int类型数值,如果是用二进制读取的数据,在php中通常要用 hexdec(bin2hex( $bin_unichar )) 这样转换
 */
function uni2utf8( $c )
{
  if ($c < 0x80)
  {
        $utf8char = chr($c);
  }
  else if ($c < 0x800)
  {
        $utf8char = chr(0xC0 | $c >> 0x06).chr(0x80 | $c & 0x3F);
  }
  else if ($c < 0x10000)
  {
        $utf8char = chr(0xE0 | $c >> 0x0C).chr(0x80 | $c >> 0x06 & 0x3F).chr(0x80 | $c & 0x3F);
  }
  //因为UCS-2只有两字节,所以后面的情况是不可能出现的,这里只是说明unicode HTML实体编码的用法。
  else
  {
        $utf8char = "&#{$c};";
  }
  return $utf8char;
}

Within the current environment, it can be considered that utf-8 character set == unicode (UCS-2), but in theory, the inclusion relationship of the main character set relationships is as follows:

utf-8 > unicode(UCS-2) > gb18030 > gbk > gb2312

Therefore, if encoding When all are correct:

gb2312 => gbk => gb18030 => unicode(UCS-2) => utf-8

Such a transformation process is basically lossless, but on the contrary, by

utf-8 => unicode(UCS-2) => gb18030=> gbk => gb2312

In such a conversion process, it is very likely that there will be unrecognizable characters. Therefore, if the system uses UTF-8 encoding, try not to reverse the encoding operation easily.

Related learning recommendations:
PHP programming from entry to proficiency

The above is the detailed content of How to use php to convert unicode and utf8?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn