
How to Extract UCS-2 Code Points from UTF-8 Characters in PHP?

DDD | Original | 2024-10-31 18:00:15 | 405 views

Determining UCS-2 Code Points for UTF-8 Characters in PHP

The goal is to extract the UCS-2 code point of each character in a given UTF-8 string. PHP's ord() returns only the value of a single byte, so a custom function is needed to decode multi-byte characters.

First, it's important to understand the UTF-8 encoding scheme. Each character is represented by a sequence of 1 to 4 bytes, depending on its Unicode code point. The bit patterns for each sequence length are as follows, where x marks the bits that carry the code point:

0xxxxxxx: 1 byte
110xxxxx 10xxxxxx: 2 bytes
1110xxxx 10xxxxxx 10xxxxxx: 3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4 bytes

To determine the number of bytes per character, examine the leading bits of the first byte:

0: 1 byte character
110: 2 byte character
1110: 3 byte character
11110: 4 byte character
10: Continuation byte
11111: Invalid character
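
The lead-byte rules above can be sketched as a small helper. The name utf8_sequence_length is hypothetical (it is not a PHP built-in), and returning 0 for a byte that cannot start a character is an assumption made for this illustration. The sketch only looks at the leading bits; it does not reject overlong lead bytes such as 0xC0.

<code class="php">// Hypothetical helper: length in bytes of the UTF-8 sequence that
// starts with $byte (0-255), or 0 if $byte cannot start a sequence.
function utf8_sequence_length(int $byte): int
{
    if (($byte & 0x80) === 0x00) {
        return 1; // 0xxxxxxx
    }
    if (($byte & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($byte & 0xF0) === 0xE0) {
        return 3; // 1110xxxx
    }
    if (($byte & 0xF8) === 0xF0) {
        return 4; // 11110xxx
    }
    return 0; // 10xxxxxx (continuation byte) or 11111xxx (invalid)
}

echo utf8_sequence_length(0x41), "\n"; // "A": prints 1
echo utf8_sequence_length(0xC3), "\n"; // lead byte of "ñ": prints 2
echo utf8_sequence_length(0xB1), "\n"; // continuation byte: prints 0</code>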

Once the number of bytes is determined, bit manipulation can be used to extract the code point.
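
As a worked example of that bit manipulation, the two bytes of "ñ" (0xC3 0xB1) can be decoded by hand. This is only a minimal sketch, and the variable names are illustrative.

<code class="php">$char = "\xC3\xB1";      // "ñ" encoded as UTF-8

$lead = ord($char[0]);   // 0xC3 = 11000011 (110xxxxx: 2-byte lead)
$cont = ord($char[1]);   // 0xB1 = 10110001 (10xxxxxx: continuation)

// Keep the 5 payload bits of the lead byte and the 6 payload bits
// of the continuation byte, then join them into one value.
$codePoint = (($lead & 0x1F) << 6) | ($cont & 0x3F);

printf("U+%04X (%d)\n", $codePoint, $codePoint); // U+00F1 (241)</code>

Here 0xC3 & 0x1F gives 0x03, which shifted left by 6 is 0xC0; 0xB1 & 0x3F gives 0x31; and 0xC0 | 0x31 is 0xF1, or 241 in decimal.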

Custom PHP Function:

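Based on the above analysis, here's a custom PHP function that takes a single UTF-8 character as input and returns its code point, or -1 if the first byte is not a valid lead byte. Note that UCS-2 covers only U+0000 to U+FFFF, so the result is a UCS-2 code point only for 1- to 3-byte sequences; 4-byte sequences decode to code points above that range.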

<code class="php">function get_ucs2_codepoint($char)
{
    // Initialize the code point
    $codePoint = 0;

    // Get the first byte
    $firstByte = ord($char);

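    // Determine the number of bytes from the lead byte
    if ($firstByte < 0x80) {
        $bytes = 1;            // 0xxxxxxx
    } elseif ($firstByte < 0xC0) {
        // 10xxxxxx: continuation byte, cannot start a character
        return -1;
    } elseif ($firstByte < 0xE0) {
        $bytes = 2;            // 110xxxxx
    } elseif ($firstByte < 0xF0) {
        $bytes = 3;            // 1110xxxx
    } elseif ($firstByte < 0xF8) {
        $bytes = 4;            // 11110xxx
    } else {
        // 11111xxx: invalid lead byte
        return -1;
    }

    // Reject input that is shorter than the lead byte announces
    if (strlen($char) < $bytes) {
        return -1;
    }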

    // Shift and extract code point
    switch ($bytes) {
        case 1:
            $codePoint = $firstByte;
            break;
        case 2:
            $codePoint = ($firstByte & 0x1F) << 6;
            $codePoint |= ord($char[1]) & 0x3F;
            break;
        case 3:
            $codePoint = ($firstByte & 0x0F) << 12;
            $codePoint |= (ord($char[1]) & 0x3F) << 6;
            $codePoint |= ord($char[2]) & 0x3F;
            break;
        case 4:
            $codePoint = ($firstByte & 0x07) << 18;
            $codePoint |= (ord($char[1]) & 0x3F) << 12;
            $codePoint |= (ord($char[2]) & 0x3F) << 6;
            $codePoint |= ord($char[3]) & 0x3F;
            break;
    }

    return $codePoint;
}</code>

Example Usage:

To use the function, simply provide a UTF-8 character as input:

<code class="php">$char = "ñ";
$codePoint = get_ucs2_codepoint($char);
echo "UCS-2 code point: $codePoint\n";</code>

Output:

UCS-2 code point: 241
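
For a string with several characters, or as a cross-check of the result, the mbstring extension can do the decoding instead. The following sketch assumes mbstring is installed and PHP 7.4 or later (for mb_str_split); it uses the built-in mb_ord rather than the custom function above, and the helper name codepoints_of is hypothetical.

<code class="php">// Hypothetical helper: code point of every character in a UTF-8 string,
// with a flag showing whether it fits in UCS-2 (U+0000 to U+FFFF).
function codepoints_of(string $utf8): array
{
    $result = [];
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        $result[] = [
            'char'      => $char,
            'codepoint' => $cp,
            'in_ucs2'   => $cp <= 0xFFFF,
        ];
    }
    return $result;
}

// "añ€" written as explicit UTF-8 bytes
foreach (codepoints_of("a\xC3\xB1\xE2\x82\xAC") as $entry) {
    printf("%s U+%04X\n", $entry['char'], $entry['codepoint']);
}</code>

For this input the loop prints U+0061, U+00F1 and U+20AC, all of which fall inside the UCS-2 range.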
