Iterating through Unicode Codepoints in Java Strings
String#codePointAt() returns the Unicode codepoint that starts at a given index, where the index counts char values (UTF-16 code units) rather than codepoints. A single lookup is therefore straightforward, but it is less obvious how to step through a string one codepoint at a time.
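To make the indexing concrete, here is a minimal sketch that calls codePointAt() at every char index of a string. The class and method names (CodePointAtDemo, codePointAtEachIndex) are hypothetical and exist only for this illustration.

```java
final class CodePointAtDemo {
    // Hypothetical helper: calls codePointAt() at every char index of s.
    static int[] codePointAtEachIndex(String s) {
        int[] result = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            result[i] = s.codePointAt(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (stored as the surrogate pair D83D DE00), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println("char length: " + s.length());
        for (int cp : codePointAtEachIndex(s)) {
            System.out.println(String.format("U+%04X", cp));
        }
    }
}
```

Index 1 yields the full supplementary codepoint, while index 2 lands in the middle of the pair and yields only the low surrogate. This is why a plain i++ loop over codePointAt() does not iterate codepoints correctly.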
One potential approach involves using String#charAt() to retrieve characters and then checking if they fall within the high-surrogate range. If a high surrogate is detected, String#codePointAt() can be used to obtain the codepoint and the index can be incremented by 2. For characters outside this range, the char value can be directly treated as the codepoint and the index can be incremented by 1.
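The manual approach described above could be sketched roughly as follows. The class and method names are hypothetical. The extra check that a low surrogate actually follows the high surrogate is an addition not mentioned in the text: without it, advancing by 2 after an unpaired high surrogate would skip the next character.

```java
import java.util.ArrayList;
import java.util.List;

final class ManualSurrogateWalk {
    // Hypothetical helper: collects codepoints by inspecting each char by hand.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean startsPair = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (startsPair) {
                // Valid surrogate pair: combine both chars and skip past them.
                out.add(s.codePointAt(i));
                i += 2;
            } else {
                // BMP character (or an unpaired surrogate): the char is the codepoint.
                out.add((int) c);
                i += 1;
            }
        }
        return out;
    }
}
```

For example, ManualSurrogateWalk.codePoints("a\uD83D\uDE00b") returns the three values 0x61, 0x1F600 and 0x62.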
This approach raises two questions: does Java actually encode characters outside the Basic Multilingual Plane (BMP) as surrogate pairs, and is testing every char by hand an efficient way to iterate?
On the first question: yes. A Java String exposes its contents as a sequence of UTF-16 code units, and each character outside the BMP is represented by a surrogate pair, a high surrogate followed by a low surrogate. Given that, developers can iterate over codepoints with the following canonical approach:
    final int length = s.length();
    for (int offset = 0; offset < length; ) {
        final int codepoint = s.codePointAt(offset);
        // perform operations on the codepoint
        offset += Character.charCount(codepoint);
    }
This loop visits each codepoint exactly once. codePointAt() returns the full codepoint when the offset sits on a valid surrogate pair and the single char value otherwise. Character.charCount() returns 2 for codepoints above U+FFFF and 1 for all others, so the offset always advances to the start of the next codepoint without any manual surrogate checks.
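As a runnable illustration, the loop can be wrapped in a small helper that formats each codepoint it visits. This is a sketch only: the class name CodePointLoopDemo and the method describe are hypothetical, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointLoopDemo {
    // Hypothetical helper: formats every codepoint of s as "U+XXXX".
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out.add(String.format("U+%04X", codepoint));
            // Advance by 1 char for BMP codepoints, 2 for supplementary ones.
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [U+0061, U+1F600, U+0062]
        System.out.println(describe("a\uD83D\uDE00b"));
    }
}
```

The number of entries returned matches s.codePointCount(0, s.length()), which is a convenient sanity check when adapting the loop.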