Unraveling Java's String Representation: UTF-16 or Modified UTF-8?
In the realm of Java, the internal representation of strings has been a subject of debate. Two seemingly reliable sources present conflicting information:
One source suggests Java employs UTF-16 for internal text representation, while the other posits a modified version of UTF-8. Which of these claims holds true?
The Answer: UTF-16 for Internal Representation
Java adopts UTF-16 for its internal representation of text, including strings, string builders, and related structures. In this encoding, each character in the range U+0000 to U+FFFF (the Basic Multilingual Plane) is represented by a single 16-bit code unit, while characters beyond that range require two code units (a surrogate pair).
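A quick way to observe this is through String.length(), which counts UTF-16 code units rather than code points. The following minimal sketch (the class name is illustrative) shows the difference:

```java
public class Utf16Demo {
    public static void main(String[] args) {
        String bmp = "A";             // U+0041, inside the BMP
        String clef = "\uD834\uDD1E"; // U+1D11E (musical G clef), outside the BMP

        // length() counts 16-bit code units, not user-perceived characters
        System.out.println(bmp.length());  // 1
        System.out.println(clef.length()); // 2 (a surrogate pair)

        // codePointCount() counts actual Unicode code points
        System.out.println(clef.codePointCount(0, clef.length())); // 1
    }
}
```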
Modified UTF-8 for Serialization
While Java favors UTF-16 internally, it employs a non-standard variant of UTF-8, known as modified UTF-8, when serializing strings. Serialization transforms Java objects into a storable and transmittable format, and serialized strings are written in this modified encoding. It differs from standard UTF-8 in two ways: the null character U+0000 is encoded as two bytes (0xC0 0x80) rather than one, and supplementary characters are encoded as two separately encoded surrogate values rather than a single four-byte sequence.
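This encoding can be observed via DataOutputStream.writeUTF, which the serialization machinery also uses for string data. A minimal sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            // writeUTF emits a 2-byte length prefix followed by modified UTF-8
            out.writeUTF("A\u0000B");
        }
        // In modified UTF-8, U+0000 becomes the two bytes 0xC0 0x80
        // instead of a single 0x00 byte as in standard UTF-8.
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b);
        }
        // Prints: 00 04 41 C0 80 42
    }
}
```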
In-Memory Storage: Compressed Strings
At the JVM level, Java has also offered compressed in-memory storage: Java 6 introduced compressed strings (activated by -XX:+UseCompressedStrings), where strings that do not actually need UTF-16 can be stored using the 8-bit ISO-8859-1 encoding instead. This optimization reduces memory usage for strings whose characters all fit in a single byte. (In modern JVMs, Java 9's Compact Strings feature applies the same Latin-1 optimization by default.)
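Conceptually, a string is eligible for the 8-bit form only when every code unit fits in ISO-8859-1. The check below is an illustrative sketch of that eligibility test, not the JVM's actual implementation:

```java
public class Latin1Check {
    // Illustrative only: a string can be stored one byte per char
    // when every UTF-16 code unit is <= 0xFF (the ISO-8859-1 range).
    static boolean fitsInLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(fitsInLatin1("café")); // true  (é is U+00E9)
        System.out.println(fitsInLatin1("日本"));  // false (outside Latin-1)
    }
}
```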
Byte Usage for Char
A char variable in Java always occupies 2 bytes (16 bits), regardless of any padding the JVM may add around fields within an object.
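The standard library exposes this directly through constants on the Character class:

```java
public class CharSizeDemo {
    public static void main(String[] args) {
        System.out.println(Character.BYTES); // 2  (bytes per char)
        System.out.println(Character.SIZE);  // 16 (bits per char)
    }
}
```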
Code Points and Character Representation
It's important to note that a single Unicode code point may occupy either one or two char values (i.e., 2 or 4 bytes): code points above U+FFFF, which exceed the 16-bit limit of 65,535, are encoded as a surrogate pair of two chars.
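The Character class makes this explicit. A short sketch (the example code points are arbitrary):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        int bmpChar = 0x0041;   // 'A', fits in one char (2 bytes)
        int suppChar = 0x1F600; // emoji, needs a surrogate pair (4 bytes)

        // charCount() reports how many chars a code point requires
        System.out.println(Character.charCount(bmpChar));  // 1
        System.out.println(Character.charCount(suppChar)); // 2

        // toChars() produces the actual UTF-16 code units
        char[] pair = Character.toChars(suppChar);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // Prints: D83D DE00 (high and low surrogate)
    }
}
```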