How to Extract UCS-2 Code Points from UTF-8 Strings?

Barbara Streisand
2024-11-01 17:45:30

Determining UCS-2 Code Points for UTF-8 Characters

When working with UTF-8 strings, you may need the UCS-2 code point of each character. There are two ways to get it: rely on the decoding utilities built into your language, or decode the UTF-8 byte sequences yourself, which requires knowing how the encoding is laid out.
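As an illustration of the built-in route, here is a minimal Python sketch. The function name `code_points_builtin` is made up for this example; `bytes.decode` and `ord` are standard Python. It assumes the input is well-formed UTF-8 and does not check the UCS-2 range.

```python
def code_points_builtin(data: bytes) -> list:
    """Return the code point of every character in UTF-8 encoded bytes."""
    # bytes.decode raises UnicodeDecodeError if the input is not valid UTF-8.
    text = data.decode("utf-8")
    # ord() maps each decoded character to its integer code point.
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    sample = "aé€".encode("utf-8")
    print([f"U+{cp:04X}" for cp in code_points_builtin(sample)])
    # ['U+0061', 'U+00E9', 'U+20AC']
```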

UTF-8 encodes characters using a variable-length byte sequence. Each code point is represented by 1 to 4 bytes, depending on its value. The following ranges apply:

  • U+0000 to U+007F: 1 byte (0xxxxxxx)
  • U+0080 to U+07FF: 2 bytes (110xxxxx 10xxxxxx)
  • U+0800 to U+FFFF: 3 bytes (1110xxxx 10xxxxxx 10xxxxxx)
  • U+10000 to U+10FFFF: 4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)
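The ranges above can be turned into a small lookup. The following Python sketch (the name `utf8_length` is hypothetical) returns how many bytes UTF-8 needs for a given code point. It only mirrors the table and does not treat the surrogate range specially.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses to encode code_point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if code_point <= 0x7F:
        return 1  # 0xxxxxxx
    if code_point <= 0x7FF:
        return 2  # 110xxxxx 10xxxxxx
    if code_point <= 0xFFFF:
        return 3  # 1110xxxx 10xxxxxx 10xxxxxx
    return 4      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```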

To determine how many bytes a character occupies, examine the leading bits of its first byte:

  • 0xxxxxxx: 1 byte
  • 110xxxxx: 2 bytes
  • 1110xxxx: 3 bytes
  • 11110xxx: 4 bytes
  • 10xxxxxx: Continuation byte (not the start of a character)
  • 11111xxx: Invalid byte
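One way to apply this classification is to mask off the leading bits and compare them against each pattern. The Python sketch below uses a hypothetical helper, `sequence_length`, and an arbitrary convention of returning 0 for a continuation byte.

```python
def sequence_length(first_byte: int) -> int:
    """Classify a byte by its leading bits.

    Returns the total length of the sequence the byte starts (1 to 4),
    or 0 if it is a continuation byte. Raises ValueError for 11111xxx.
    """
    if (first_byte & 0x80) == 0x00:  # 0xxxxxxx
        return 1
    if (first_byte & 0xE0) == 0xC0:  # 110xxxxx
        return 2
    if (first_byte & 0xF0) == 0xE0:  # 1110xxxx
        return 3
    if (first_byte & 0xF8) == 0xF0:  # 11110xxx
        return 4
    if (first_byte & 0xC0) == 0x80:  # 10xxxxxx
        return 0
    raise ValueError(f"invalid UTF-8 byte: 0x{first_byte:02X}")


if __name__ == "__main__":
    euro = "€".encode("utf-8")       # bytes E2 82 AC
    print(sequence_length(euro[0]))  # 3
    print(sequence_length(euro[1]))  # 0 (continuation byte)
```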

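Once the byte count is known, the code point can be extracted through bit manipulation: take the payload bits (the x positions) of the first byte, then append the low six bits of each continuation byte. Note that UCS-2 is limited to the range U+0000 to U+FFFF, so characters that require a 4-byte UTF-8 sequence cannot be represented in it.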
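Putting the pieces together, a manual decoder might look like the Python sketch below. The function name `ucs2_code_points` and the choice to raise `ValueError` for characters beyond U+FFFF are assumptions made for illustration. The sketch does not reject overlong encodings or surrogate code points, which a strict decoder would.

```python
def ucs2_code_points(data: bytes) -> list:
    """Decode UTF-8 bytes by hand and return the UCS-2 code points."""
    result = []
    i = 0
    while i < len(data):
        first = data[i]
        # Pick the sequence length and the payload bits of the first byte.
        if first < 0x80:
            length, cp = 1, first
        elif (first & 0xE0) == 0xC0:
            length, cp = 2, first & 0x1F
        elif (first & 0xF0) == 0xE0:
            length, cp = 3, first & 0x0F
        elif (first & 0xF8) == 0xF0:
            length, cp = 4, first & 0x07
        else:
            raise ValueError(f"unexpected byte 0x{first:02X} at offset {i}")

        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        for byte in tail:
            if (byte & 0xC0) != 0x80:
                raise ValueError(f"bad continuation byte 0x{byte:02X}")
            # Shift in the low six bits of each continuation byte.
            cp = (cp << 6) | (byte & 0x3F)

        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is outside the UCS-2 range")
        result.append(cp)
        i += length
    return result


if __name__ == "__main__":
    print([f"U+{cp:04X}" for cp in ucs2_code_points("aé€".encode("utf-8"))])
    # ['U+0061', 'U+00E9', 'U+20AC']
```

Raising an error for 4-byte sequences is only one option; a caller could instead skip such characters or substitute U+FFFD, depending on what the surrounding application expects.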
