How to Determine the True Length of a UTF-8 Encoded std::string in C++?

Linda Hamilton
2024-10-27 20:43:30

Determining the True Length of a UTF-8 Encoded std::string

In C++, a std::string is a sequence of char elements, each occupying one byte of memory. UTF-8, however, is a variable-width encoding: a single character (code point) may be represented by a sequence of one to four bytes. As a result, str.length() reports the number of bytes, which can be larger than the number of characters the string actually contains.
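To make the discrepancy concrete, here is a minimal sketch. The names `greeting` and `byte_count` are hypothetical and chosen only for illustration; the text is written with explicit byte escapes so the result does not depend on how the source file is encoded.

```cpp
#include <cstddef>
#include <string>

// "héllo" spelled out byte by byte: 'é' (U+00E9) is the two bytes 0xC3 0xA9
// in UTF-8, so a reader sees 5 characters but the string stores 6 bytes.
const std::string greeting = "h\xC3\xA9llo";

// length() (like size()) counts char elements, i.e. bytes, not characters.
const std::size_t byte_count = greeting.length();
```

Here `byte_count` is 6 even though the text has five visible characters.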

In UTF-8, the number of bytes in a sequence depends on the code point being encoded, and the first byte of each sequence indicates how long the sequence is:

  • 0x00000000 - 0x0000007F: 1 byte
  • 0x00000080 - 0x000007FF: 2 bytes
  • 0x00000800 - 0x0000FFFF: 3 bytes
  • 0x00010000 - 0x0010FFFF: 4 bytes
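As an illustration of how the first byte signals the sequence length, the sketch below inspects its leading bits. The function name `utf8_sequence_length` is hypothetical, and the check is deliberately loose: it classifies the bit pattern only and does not reject overlong or out-of-range lead bytes.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with
// `lead`, judged from its leading bits alone. Returns 0 if `lead` cannot
// start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: single-byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: two-byte sequence
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: three-byte sequence
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: four-byte sequence
    return 0;  // 10xxxxxx continuation byte, or 0xF8..0xFF (not a lead byte)
}
```

For example, the lead byte 0xE2 (the first byte of the euro sign, U+20AC) yields 3, while 0x80 yields 0 because it is a continuation byte.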

To determine the actual length of a UTF-8 encoded std::string, you can employ the following approach:

  1. Iterate through the string byte by byte, for example by dereferencing a pointer s with *s.
  2. For each byte, mask its top two bits with & 0xC0 and compare the result with 0x80 to test for the continuation byte pattern (10xxxxxx).

If the byte does not match the continuation pattern, increment the length count: it is the first byte of a new character sequence.

Here's an example implementation:

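<code class="c++">const char* s = str.c_str();  // str is the UTF-8 encoded std::string
int len = 0;
// Count each byte that is not a continuation byte (10xxxxxx).
while (*s) len += (*s++ & 0xc0) != 0x80;</code>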
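The same idea can be wrapped in a reusable function. This is a minimal sketch, assuming the input is valid UTF-8; the name `utf8_length` is hypothetical. Iterating over the std::string itself rather than a NUL-terminated pointer means embedded '\0' bytes are counted too.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points in `str`, assuming valid UTF-8.
// Every byte that is not a continuation byte (10xxxxxx) starts a new sequence.
std::size_t utf8_length(const std::string& str) {
    std::size_t count = 0;
    for (unsigned char byte : str) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Called on the six-byte string "h\xC3\xA9llo" ("héllo"), this returns 5. Note that it counts code points, so text built from combining sequences may still report more than the number of characters a reader perceives.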

Following this approach gives the number of code points in a UTF-8 encoded std::string, provided the string is valid UTF-8. This count, rather than the byte count returned by str.length(), is what operations such as character counting, string manipulation, and data parsing rely on.
