How to convert unicode and utf8 in php-PHP Problem-php.cn

Home

Backend Development

PHP Problem

How to convert unicode and utf8 in php

coldplay.xixi

Jul 17, 2020 am 09:49 AM

phpunicodeutf8Convert

How to convert unicode to utf8 in php: 1. Convert to utf8. If the byte of a character is less than 128, there is no need to convert. Then take the binary digits from the low to high digits of the Unicode binary, and take 6 digits at a time. ;2. To convert utf8 to unicode, extract 0100 from the first high-order byte and shift it to the left in sequence.

How to convert unicode and utf8 in php

How to convert unicode and utf8 in php:

Unicode and Utf-8 encoding Difference

Unicode is a character set, and UTF-8 is one of Unicode. Unicode is fixed-length and is double-byte, while UTF-8 is variable. For Chinese characters Say Unicode occupies 1 byte less than UTF-8. Unicode is double bytes, while Chinese characters in UTF-8 occupy three bytes.

UTF-8 encoded characters can theoretically be up to 6 bytes long, but 16-bit BMP (Basic Multilingual Plane) characters can only be up to 3 bytes long. Let's take a look at the UTF-8 encoding table: The position of

U-00000000 - U-0000007F: 0xxxxxxx 
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx 
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx 
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

xxx is filled in by the binary representation of the character encoding number. The further to the right x has less special meaning, and only the shortest one is enough to express it. A multi-byte string of character encoding numbers. Note that in a multi-byte string, the number of "1"s at the beginning of the first byte is the number of bytes in the entire string. The first line starts with 0 to be compatible with ASCII encoding, which is one byte, the second line is a double-byte string, the third line is 3 bytes, such as Chinese characters, and so on. (Personally think: In fact, we can simply regard the number of 1’s in front as the number of bytes)

Related learning recommendations:PHP programming from entry to proficiency

How to convert Unicode to Utf-8

In order to convert Unicode to UTF-8, of course you must know what the difference is. Let’s take a look at how the encoding in Unicode is converted into UTF-8. In UTF-8, if the byte of a character is less than 0x80 (128), it is an ASCII character, occupying one byte, and no conversion is needed. Because UTF-8 is compatible with ASCII encoding. If the encoding of the Chinese character "you" in Unicode is "u4F60", convert it to binary to 100111101100000, and then convert it according to the UTF-8 method. Binary digits can be taken from the Unicode binary from low to high, taking 6 digits at a time. For example, the above binary digits can be taken out into the format shown below. The previous ones are filled according to the format, and any less than 8 bits are filled with 0.

unicode: 100111101100000                   4F60
utf-8:    11100100,10111101,10100000       E4BDA0

From the above, you can intuitively see the conversion between Unicode and UTF-8. Of course, after knowing the format of UTF-8, you can perform the inverse operation, which is to convert it in binary according to the format. Take it out from the corresponding position, and then convert it to the resulting Unicode character (this operation can be completed through "displacement"). For example, in the above conversion of "you", since its value is greater than 0x800 and less than 0x10000, it can be judged as three-byte storage. Then the highest bit needs to be shifted to the right by "12" bits and then according to the three-byte format, the highest bit is 11100000 (0xE0 ) or (|) to get the highest value. In the same way, the second digit is shifted to the right by "6" bits, and the binary value of the highest digit and the second digit is left. It can be calculated by performing the position (&) operation with 111111 (0x3F), and then summed with 11000000 (0x80). or (|). There is no need to shift the third bit, just take the last six bits directly (& with 111111 (ox3F)), and then OR (|) with 11000000 (0x80).

How to reverse Utf-8 back to Unicode

Of course, the conversion from UTF-8 to Unicode is also done through shifting, etc., that is, converting UTF-8 The binary number at the corresponding position in the format is extracted. In the above example, "you" is three bytes, so each byte must be processed, from high bit to low bit. In UTF-8 "you" is 11100100,10111101,10100000. Starting from the high bit, that is, the first byte 11100100 is to take out the "0100". This is very simple. Just take the AND (&) with 11111 (0x1F). From the three bytes, we can know that the highest position must be before the 12th bit. , because six digits are taken each time. Therefore, the obtained result needs to be shifted to the left by 12 bits, and the highest bit is now 0100,000000,000000. The second bit is to take out "111101", so you only need to AND (&) the second byte 10111101 and 111111 (0x3F). After shifting the result to the left by 6 bits and taking the result of the highest byte or (|), the second bit is completed, and the result is 0100,111101,000000. By analogy, the last digit is directly ANDed (&) with 111111 (0x3F), and then ORed (|) with the previous result to get the result 0100,111101,100000.

PHP code implementation

/**
 * utf8字符转换成Unicode字符
 * @param  [type] $utf8_str Utf-8字符
 * @return [type]           Unicode字符
 */
function utf8_str_to_unicode($utf8_str) {
    $unicode = 0;
    $unicode = (ord($utf8_str[0]) & 0x1F) << 12;
    $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
    $unicode |= (ord($utf8_str[2]) & 0x3F);
    return dechex($unicode);
}
/**
 * Unicode字符转换成utf8字符
 * @param  [type] $unicode_str Unicode字符
 * @return [type]              Utf-8字符
 */
function unicode_to_utf8($unicode_str) {
    $utf8_str = &#39;&#39;;
    $code = intval(hexdec($unicode_str));
    //这里注意转换出来的code一定得是整形，这样才会正确的按位操作
    $ord_1 = decbin(0xe0 | ($code >> 12));
    $ord_2 = decbin(0x80 | (($code >> 6) & 0x3f));
    $ord_3 = decbin(0x80 | ($code & 0x3f));
    $utf8_str = chr(bindec($ord_1)) . chr(bindec($ord_2)) . chr(bindec($ord_3));
    return $utf8_str;
}

Tested it

$utf8_str = &#39;我&#39;;
//这是汉字“你”的Unicode编码
$unicode_str = &#39;4f6b&#39;;
//输出 6211
echo utf8_str_to_unicode($utf8_str) . "<br/>";
//输出汉字“你”
echo unicode_str_to_utf8($unicode_str);

The above conversions are for tests of Chinese characters [which are usually non-ASCII], because if they are ASCII, It's the same thing all over again, so there's no need to spend so much effort.

The above is the detailed content of How to convert unicode and utf8 in php. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

php怎么把负数转为正整数Apr 19, 2022 pm 08:59 PM

php把负数转为正整数的方法：1、使用abs()函数将负数转为正数，使用intval()函数对正数取整，转为正整数，语法“intval(abs($number))”；2、利用“~”位运算符将负数取反加一，语法“~$number + 1”。

php怎么实现几秒后执行一个函数Apr 24, 2022 pm 01:12 PM

实现方法：1、使用“sleep(延迟秒数)”语句，可延迟执行函数若干秒；2、使用“time_nanosleep(延迟秒数,延迟纳秒数)”语句，可延迟执行函数若干秒和纳秒；3、使用“time_sleep_until(time()+7)”语句。

php字符串有没有下标Apr 24, 2022 am 11:49 AM

php字符串有下标。在PHP中，下标不仅可以应用于数组和对象，还可应用于字符串，利用字符串的下标和中括号“[]”可以访问指定索引位置的字符，并对该字符进行读写，语法“字符串名[下标值]”；字符串的下标值（索引值）只能是整数类型，起始值为0。

php怎么除以100保留两位小数Apr 22, 2022 pm 06:23 PM

php除以100保留两位小数的方法：1、利用“/”运算符进行除法运算，语法“数值 / 100”；2、使用“number_format(除法结果, 2)”或“sprintf("%.2f",除法结果)”语句进行四舍五入的处理值，并保留两位小数。

php怎么根据年月日判断是一年的第几天Apr 22, 2022 pm 05:02 PM

判断方法：1、使用“strtotime("年-月-日")”语句将给定的年月日转换为时间戳格式；2、用“date("z",时间戳)+1”语句计算指定时间戳是一年的第几天。date()返回的天数是从0开始计算的，因此真实天数需要在此基础上加1。

php怎么读取字符串后几个字符Apr 22, 2022 pm 08:31 PM

在php中，可以使用substr()函数来读取字符串后几个字符，只需要将该函数的第二个参数设置为负值，第三个参数省略即可；语法为“substr(字符串,-n)”，表示读取从字符串结尾处向前数第n个字符开始，直到字符串结尾的全部字符。

php怎么替换nbsp空格符Apr 24, 2022 pm 02:55 PM

方法：1、用“str_replace(" ","其他字符",$str)”语句，可将nbsp符替换为其他字符；2、用“preg_replace("/(\s|\&nbsp\;||\xc2\xa0)/","其他字符",$str)”语句。

php怎么查找字符串是第几位Apr 22, 2022 pm 06:48 PM

查找方法：1、用strpos()，语法“strpos("字符串值","查找子串")+1”；2、用stripos()，语法“strpos("字符串值","查找子串")+1”。因为字符串是从0开始计数的，因此两个函数获取的位置需要进行加1处理。

See all articles

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

How Long Does It Take To Beat Split Fiction?

1 months agoByDDD

R.E.P.O. Best Graphic Settings

2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

1 weeks agoByDDD

R.E.P.O. How to Fix Audio if You Can't Hear Anyone

3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Dreamweaver CS6

Visual web development tools

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

PhpStorm Mac version

The latest (2018.2.1) professional PHP integrated development tool

Dreamweaver Mac version

Visual web development tools

Hot Topics

Where is the login entrance for gmail email?

7414

CakePHP Tutorial

1359

What is the format of the account name of steam

win11 activation key permanent