How to Convert UTF-8 Characters to UCS-2 Code Points in PHP?

Linda Hamilton
2024-11-03 02:09:29

Converting UTF-8 Characters to UCS-2 Code Points

This article shows how to extract the UCS-2 code point of each character in a UTF-8 string. It explains the decoding process step by step and gives an implementation that works in PHP 4 and PHP 5.

Understanding UTF-8

UTF-8 is a character encoding that represents each Unicode character using one to four bytes. Every byte after the first in a multi-byte sequence is a continuation byte of the form 10xxxxxx. To determine how many bytes a character occupies, examine the high bits of its leading byte:

  • 0xxxxxxx: 1-byte character
  • 110xxxxx: 2-byte character
  • 1110xxxx: 3-byte character
  • 11110xxx: 4-byte character
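
As a minimal sketch of the table above, the helper below maps a leading byte to its sequence length. The name utf8_seq_len is hypothetical (it is not a built-in PHP function), and the sketch assumes the byte is passed as an integer, for example from ord(). It returns 0 for a byte that cannot start a sequence.

<code class="php">// Hypothetical helper: sequence length implied by a UTF-8 leading byte.
function utf8_seq_len($byte) {
    if (($byte & 0x80) == 0x00) { return 1; } // 0xxxxxxx
    if (($byte & 0xE0) == 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) == 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xF8) == 0xF0) { return 4; } // 11110xxx
    return 0; // 10xxxxxx continuation byte or invalid leading byte
}

echo utf8_seq_len(ord("h")), "\n";    // 1
echo utf8_seq_len(ord("\xC3")), "\n"; // 2
echo utf8_seq_len(ord("\xE2")), "\n"; // 3</code>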

Converting to UCS-2

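UCS-2 is a fixed-width encoding that stores each character in two bytes, so it can only represent code points U+0000 to U+FFFF (the Basic Multilingual Plane). It is the predecessor of UTF-16, which adds surrogate pairs to reach the remaining code points; the two are not the same. To obtain the code point from UTF-8, strip the prefix bits from each byte and combine the remaining payload bits, depending on the sequence length:

  • 1-byte character: The code point is the byte value itself.
  • 2-byte character: Mask the first byte with 0x1F and shift it left by 6 bits, then bitwise OR it with the second byte masked with 0x3F.
  • 3-byte character: Mask the first byte with 0x0F and shift it left by 12 bits, mask the second byte with 0x3F and shift it left by 6 bits, then bitwise OR both with the third byte masked with 0x3F.
  • 4-byte character: The code point lies above U+FFFF and cannot be represented in UCS-2.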
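
To make the bit arithmetic concrete, here is a worked sketch for two characters chosen purely for illustration: "é" (bytes 0xC3 0xA9) and "€" (bytes 0xE2 0x82 0xAC). The leading byte is masked to strip its length prefix, and each continuation byte is masked with 0x3F to strip its 10 prefix, before the pieces are combined.

<code class="php">// "é" is encoded as 0xC3 0xA9 (2-byte sequence).
$e_acute = ((0xC3 & 0x1F) << 6) | (0xA9 & 0x3F);
// (0x03 << 6) | 0x29 = 0xC0 | 0x29 = 0xE9
printf("U+%04X\n", $e_acute); // U+00E9

// "€" is encoded as 0xE2 0x82 0xAC (3-byte sequence).
$euro = ((0xE2 & 0x0F) << 12) | ((0x82 & 0x3F) << 6) | (0xAC & 0x3F);
// (0x02 << 12) | (0x02 << 6) | 0x2C = 0x2000 | 0x80 | 0x2C = 0x20AC
printf("U+%04X\n", $euro); // U+20AC</code>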

Implementation in PHP 4/5

For PHP versions 4 or 5, you can implement a function to perform this conversion:

<code class="php">// Expects $utf8 to hold the bytes of exactly one UTF-8 character.
function utf8_char_to_ucs2($utf8) {
    $lead = ord($utf8[0]);
    if (!($lead & 0x80)) {
        // 0xxxxxxx: ASCII, the byte is the code point
        return $lead;
    } elseif (($lead & 0xE0) == 0xC0) {
        // 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return (($lead & 0x1F) << 6) | (ord($utf8[1]) & 0x3F);
    } elseif (($lead & 0xF0) == 0xE0) {
        // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return (($lead & 0x0F) << 12) | ((ord($utf8[1]) & 0x3F) << 6) | (ord($utf8[2]) & 0x3F);
    } else {
        // 4-byte sequences (above U+FFFF) and invalid leading bytes
        return null;
    }
}</code>

Example Usage

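<code class="php">$utf8 = "hello";
$len = strlen($utf8);
for ($i = 0; $i < $len; $i += $n) {
    // Work out the sequence length from the leading byte so that a
    // multi-byte character is passed to the function as a whole.
    $b = ord($utf8[$i]);
    $n = ($b < 0x80) ? 1 : (($b < 0xE0) ? 2 : (($b < 0xF0) ? 3 : 4));
    $char = substr($utf8, $i, $n);
    printf("Code point for '%s': %d\n", $char, utf8_char_to_ucs2($char));
}</code>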

This will output:

Code point for 'h': 104
Code point for 'e': 101
Code point for 'l': 108
Code point for 'l': 108
Code point for 'o': 111
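
As an optional cross-check, and only on installations where the mbstring extension is loaded (an assumption, since it is not always enabled), the same code points can be obtained by converting the string to big-endian UCS-2 and unpacking it into 16-bit integers. The function name utf8_to_ucs2_codepoints_mb is hypothetical; this is a sketch for verifying results, not a replacement for the function above.

<code class="php">// Hypothetical helper: returns the UCS-2 code units of a UTF-8 string.
// Assumes the mbstring extension is available.
function utf8_to_ucs2_codepoints_mb($utf8) {
    // Big-endian UCS-2: exactly two bytes per character.
    $ucs2 = mb_convert_encoding($utf8, 'UCS-2BE', 'UTF-8');
    // 'n*' unpacks every 16-bit big-endian unsigned integer;
    // array_values() renumbers unpack()'s 1-based keys from 0.
    return array_values(unpack('n*', $ucs2));
}

print_r(utf8_to_ucs2_codepoints_mb("h\xC3\xA9\xE2\x82\xAC")); // "hé€"</code>

For the three characters in this sample string the result is 104, 233, and 8364. Characters outside the Basic Multilingual Plane are not preserved by a conversion to UCS-2.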
