Determining UCS-2 Code Points for UTF-8 Characters in PHP
The task at hand is to extract the UCS-2 code points for characters within a given UTF-8 string. To accomplish this, a custom PHP function can be defined.
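Firstly, it's important to understand the UTF-8 encoding scheme. Each character is represented by a sequence of 1 to 4 bytes, depending on its Unicode code point. The ranges for each byte size are as follows: 1 byte for U+0000 to U+007F; 2 bytes for U+0080 to U+07FF; 3 bytes for U+0800 to U+FFFF; 4 bytes for U+10000 to U+10FFFF. UCS-2 itself only covers U+0000 to U+FFFF, so only sequences of 1 to 3 bytes have a UCS-2 code point.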
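As a quick check of these byte counts, the sketch below uses PHP's \u{...} string escapes (available from PHP 7.0), which produce the UTF-8 encoding of a code point, together with strlen(), which counts bytes rather than characters. The sample characters are illustrative choices, not taken from the article.

<code class="php">// Illustrative samples, one per UTF-8 sequence length
$samples = [
    'U+0041'  => "\u{41}",     // Latin capital letter A
    'U+00F1'  => "\u{F1}",     // Latin small letter n with tilde
    'U+20AC'  => "\u{20AC}",   // euro sign
    'U+1F600' => "\u{1F600}",  // emoji, outside the UCS-2 range
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes, so it reports the UTF-8 sequence length
    echo $label, ': ', strlen($char), " byte(s)\n";
}</code>

This should print lengths of 1, 2, 3 and 4 bytes respectively.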
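To determine the number of bytes per character, examine the first byte: 0xxxxxxx (0x00 to 0x7F) starts a 1-byte sequence; 110xxxxx (0xC0 to 0xDF) starts a 2-byte sequence; 1110xxxx (0xE0 to 0xEF) starts a 3-byte sequence; 11110xxx (0xF0 to 0xF7) starts a 4-byte sequence. A byte of the form 10xxxxxx (0x80 to 0xBF) is a continuation byte and cannot start a character.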
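A minimal sketch of the first-byte check, written with bit masks. The helper name utf8_sequence_length is hypothetical, and the sketch only looks at the bit pattern of the lead byte: it does not reject the overlong lead bytes 0xC0 and 0xC1 or validate the bytes that follow.

<code class="php">// Hypothetical helper: returns the sequence length announced by a lead byte,
// or 0 if the byte cannot start a UTF-8 character.
function utf8_sequence_length(int $leadByte): int {
    if (($leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if (($leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($leadByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0; // continuation byte (10xxxxxx) or invalid lead byte
}

echo utf8_sequence_length(ord("\u{F1}")), "\n"; // lead byte 0xC3, prints 2
echo utf8_sequence_length(0xB1), "\n";          // continuation byte, prints 0</code>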
Once the number of bytes is determined, bit manipulation can be used to extract the code point.
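To make the bit manipulation concrete, here is a worked sketch for one 3-byte character, the euro sign (U+20AC, encoded as 0xE2 0x82 0xAC). The choice of character and the variable names are assumptions made for illustration.

<code class="php">$char = "\u{20AC}"; // euro sign: bytes 0xE2 0x82 0xAC

$b0 = ord($char[0]); // 11100010: lead byte, carries 4 payload bits
$b1 = ord($char[1]); // 10000010: continuation byte, carries 6 payload bits
$b2 = ord($char[2]); // 10101100: continuation byte, carries 6 payload bits

// Mask off the marker bits, then shift each payload into position
$codePoint = (($b0 & 0x0F) << 12)   // 0x2000
           | (($b1 & 0x3F) << 6)    // 0x0080
           |  ($b2 & 0x3F);         // 0x002C

printf("U+%04X\n", $codePoint); // prints U+20AC</code>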
Custom PHP Function:
Based on the above analysis, here's a custom PHP function that takes a single UTF-8 character as input and returns its UCS-2 code point:
<code class="php">function get_ucs2_codepoint($char) {
    // Get the first (lead) byte; ord() reads only the first byte of the string
    $firstByte = ord($char);

    // Determine the number of bytes from the lead byte
    if ($firstByte < 0x80) {
        $bytes = 1;      // 0xxxxxxx
    } elseif ($firstByte < 0xC0) {
        return -1;       // 10xxxxxx is a continuation byte, not a lead byte
    } elseif ($firstByte < 0xE0) {
        $bytes = 2;      // 110xxxxx
    } elseif ($firstByte < 0xF0) {
        $bytes = 3;      // 1110xxxx
    } elseif ($firstByte < 0xF8) {
        $bytes = 4;      // 11110xxx
    } else {
        return -1;       // Invalid lead byte
    }

    // Reject empty or truncated input
    if (strlen($char) < $bytes) {
        return -1;
    }

    // Shift and extract code point
    $codePoint = 0;
    switch ($bytes) {
        case 1:
            $codePoint = $firstByte;
            break;
        case 2:
            $codePoint = ($firstByte & 0x1F) << 6;
            $codePoint |= ord($char[1]) & 0x3F;
            break;
        case 3:
            $codePoint = ($firstByte & 0x0F) << 12;
            $codePoint |= (ord($char[1]) & 0x3F) << 6;
            $codePoint |= ord($char[2]) & 0x3F;
            break;
        case 4:
            // The result is above 0xFFFF, i.e. outside the UCS-2 range
            $codePoint = ($firstByte & 0x07) << 18;
            $codePoint |= (ord($char[1]) & 0x3F) << 12;
            $codePoint |= (ord($char[2]) & 0x3F) << 6;
            $codePoint |= ord($char[3]) & 0x3F;
            break;
    }

    return $codePoint;
}</code>
Example Usage:
To use the function, simply provide a UTF-8 character as input:
<code class="php">$char = "ñ";
$codePoint = get_ucs2_codepoint($char);
echo "UCS-2 code point: $codePoint\n";</code>
Output:
UCS-2 code point: 241
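As a cross-check, and for handling a whole string rather than a single character, the following sketch uses the built-in mbstring functions instead of the custom function. It assumes the mbstring extension is loaded (mb_ord requires PHP 7.2 or later, mb_str_split PHP 7.4 or later); the sample string is an illustrative choice.

<code class="php">// Assumes the mbstring extension is available
$text = "a\u{F1}\u{20AC}"; // "a", n with tilde, euro sign

foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
    $codePoint = mb_ord($char, 'UTF-8');
    // UCS-2 covers only U+0000 to U+FFFF
    $inUcs2 = $codePoint <= 0xFFFF ? 'yes' : 'no';
    printf("U+%04X (%d), in UCS-2: %s\n", $codePoint, $codePoint, $inUcs2);
}</code>

For this string the loop should report the code points 97, 241 and 8364, all within the UCS-2 range.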