How do you iterate through Unicode codepoints in Java Strings?

Linda Hamilton
2024-10-25 14:10:02

Iterating through Unicode Codepoints in Java Strings

Introduction

Iterating through the Unicode codepoints of a Java String takes some care because a String is a sequence of UTF-16 code units (char values), not a sequence of codepoints. A codepoint outside the Basic Multilingual Plane (BMP) occupies two char values, known as a surrogate pair. This article explores different strategies and addresses concerns regarding the encoding of such characters.

Approaching the Problem

Initially, one might consider calling String#codePointAt(int) directly. However, this raises two concerns: its argument is a char offset rather than a codepoint offset, and it is not obvious how far to advance the offset after reading a codepoint outside the BMP.
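To make the difference between char offsets and codepoint offsets concrete, here is a minimal sketch. The class name CodePointIndexDemo and the helper codePointAtCodePointIndex are hypothetical names chosen for illustration; the sample string "a😀b" is an assumption, written with escapes so that the surrogate pair for U+1F600 is visible.

```java
class CodePointIndexDemo {
    // Hypothetical helper: look up a codepoint by codepoint index rather than
    // by char index, using String#offsetByCodePoints to translate the index.
    static int codePointAtCodePointIndex(String s, int codePointIndex) {
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        // "a", then U+1F600 stored as the surrogate pair D83D DE00, then "b".
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                       // 4 char values
        System.out.println(s.codePointCount(0, s.length()));  // 3 codepoints

        // codePointAt takes a char index. Index 1 is the high surrogate,
        // so the whole pair is decoded.
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600

        // Index 2 lands in the middle of the pair; only the low surrogate
        // is returned.
        System.out.println(Integer.toHexString(s.codePointAt(2))); // de00

        // Codepoint index 2 is "b", which sits at char index 3.
        System.out.println((char) codePointAtCodePointIndex(s, 2)); // b
    }
}
```

Note that offsetByCodePoints walks the string from the starting index, so calling the helper once per position in a loop would rescan the string every time.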

An alternative approach involves using String#charAt(int) to read each char and testing whether it falls in the high-surrogate range. While this method provides a way to determine whether a codepoint is outside the BMP, it raises the following concerns:

  • Uncertainty about how codepoints inside the BMP are represented
  • The apparent computational cost of range-testing every char
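For comparison, the charAt-based approach described above might look like the following sketch. The class name SurrogateScan and the method codePointsViaCharAt are hypothetical; the Character methods it calls (isHighSurrogate, isLowSurrogate, toCodePoint) are part of the standard library. Treating an unpaired surrogate as a codepoint of its own is an assumption made here for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateScan {
    // Hypothetical helper: decode codepoints by hand with charAt and
    // explicit surrogate checks.
    static List<Integer> codePointsViaCharAt(String s) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean startsPair = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (startsPair) {
                // Combine the two halves into one supplementary codepoint.
                result.add(Character.toCodePoint(c, s.charAt(i + 1)));
                i += 2;
            } else {
                // A BMP codepoint (or an unpaired surrogate) is one char.
                result.add((int) c);
                i += 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(codePointsViaCharAt("a\uD83D\uDE00b")); // [97, 128512, 98]
    }
}
```

This works, but it reimplements by hand the pairing logic that String#codePointAt already performs.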

The Optimal Solution

Fortunately, the canonical way to iterate over codepoints combines String#codePointAt(int) with Character.charCount(int), which reports how many char values to skip:

<code class="java">final int length = s.length(); // s is the String to iterate over
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}</code>
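As a self-contained illustration, the loop can be wrapped in a small method and checked against String#codePoints(), the stream-based alternative available since Java 8. The class name CodePointLoop, the method toCodePoints, and the sample string are assumptions made for this sketch.

```java
import java.util.Arrays;

class CodePointLoop {
    // Hypothetical helper: collect the codepoints of s with the
    // codePointAt / charCount loop.
    static int[] toCodePoints(String s) {
        final int length = s.length();
        int[] out = new int[s.codePointCount(0, length)];
        int n = 0;
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out[n++] = codepoint;
            // Advance by 1 char for BMP codepoints, 2 for surrogate pairs.
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // "a", U+1F600, "b"
        int[] viaLoop = toCodePoints(s);
        int[] viaStream = s.codePoints().toArray();

        System.out.println(Arrays.toString(viaLoop));           // [97, 128512, 98]
        System.out.println(Arrays.equals(viaLoop, viaStream));  // true
    }
}
```

The explicit loop is convenient when the char offset itself is needed, for example to record where each codepoint starts; codePoints() is more compact when only the values matter.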

Addressing Concerns

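  • Java Strings are indeed sequences of UTF-16 code units, and a codepoint outside the BMP is stored as a surrogate pair (a high surrogate followed by a low surrogate).
  • The code provided above handles BMP-range codepoints correctly: each occupies a single char, for which Character.charCount returns 1.
  • Increasing the offset by Character.charCount(codepoint) correctly navigates surrogate pairs, because it returns 2 for any codepoint outside the BMP and the offset therefore skips both halves.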

Conclusion

To summarize, iterating through Unicode codepoints in Java Strings requires understanding that a String is indexed by char, not by codepoint. Reading with String#codePointAt(int) and advancing by Character.charCount(int) provides a correct and efficient solution for this common need.
