Determining the True Length of a UTF-8 Encoded std::string
In C++, a std::string is a sequence of char elements, each occupying one byte of memory. In UTF-8, however, a single character (code point) may be encoded as a sequence of one to four bytes. As a result, str.length() reports the number of bytes, which can differ from the number of characters the string contains.
In UTF-8, bytes are grouped into sequences, and the first byte indicates the length of the sequence: a byte of the form 0xxxxxxx is a complete one-byte character, 110xxxxx starts a two-byte sequence, 1110xxxx starts a three-byte sequence, and 11110xxx starts a four-byte sequence. Every remaining byte of a multi-byte sequence is a continuation byte of the form 10xxxxxx.
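The lead-byte patterns of UTF-8 can be checked with simple bit masks. The following is a minimal sketch, not part of the original answer; the function name utf8_sequence_length is a hypothetical helper chosen for illustration, and it only inspects the lead byte without validating the bytes that follow.

```cpp
#include <cstddef>

// Hypothetical helper (name is an assumption, not a standard function).
// Returns the number of bytes announced by a UTF-8 lead byte,
// or 0 if the byte is a continuation byte or cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: single-byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: two-byte sequence
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: three-byte sequence
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: four-byte sequence
    return 0;                            // 10xxxxxx or an invalid lead byte
}
```

For instance, the byte 0xC3 starts a two-byte sequence, while 0x80 is a continuation byte and yields 0.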
To determine the actual length of a UTF-8 encoded std::string, iterate over its bytes and test each one against the continuation pattern 10xxxxxx, that is, check whether (byte & 0xC0) == 0x80.
If a byte does not match the continuation pattern, increment the length count, because that byte marks the start of a new character sequence.
Here's an example implementation:
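<code class="c++">const char *s = str.c_str(); // str is the UTF-8 encoded std::string
int len = 0;
while (*s) len += (*s++ & 0xc0) != 0x80; // count every byte that is not a continuation byte</code>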
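The same counting rule can be wrapped in a function that takes the std::string directly. This is a sketch under stated assumptions: the name utf8_length is hypothetical, the input is assumed to be valid UTF-8, and the loop walks the full size of the string instead of stopping at the first null byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (name is an assumption): counts the code points in a
// UTF-8 encoded string by counting every byte that is NOT a continuation
// byte (10xxxxxx). Assumes the input is valid UTF-8; no validation is done.
std::size_t utf8_length(const std::string& str) {
    std::size_t count = 0;
    for (unsigned char byte : str) {
        if ((byte & 0xC0) != 0x80) {
            ++count; // lead byte or single-byte character
        }
    }
    return count;
}
```

Iterating over the string itself, not a char pointer, means embedded null bytes are counted as characters instead of ending the scan early. Casting each byte to unsigned char keeps the mask comparison independent of whether char is signed on the platform.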
For valid UTF-8 input, this approach yields the number of code points in a std::string rather than the number of bytes, which is useful for operations such as character counting, string manipulation, and data parsing.