
A question about char and String in Java with respect to code points and code units

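Java uses Unicode and encodes text as UTF-16.
First, Unicode has 17 code planes, and every plane except the first needs two code units per character. That leads to my questions:
1. The official API documentation says that String.length() returns the number of code units in the string. If the string contains Chinese characters (as I understand it, Chinese belongs to the other 16 planes), each Chinese character should take two code units. In my tests, however, length() returned the number of characters (that is, code points), not the number of code units. This is my first question.
2. A char in Java is 16 bits wide, and one code unit is 16 bits. A code point in the first plane encodes to a single 16-bit UTF-16 code unit, while a code point in any other plane encodes to two code units, i.e. 16 * 2 = 32 bits. So a single char should not be able to hold a code point from another plane, nor a Chinese character that needs 32 bits. Yet in my tests it could. This is my second question.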

public class Hello {

    public static void main(String[] args) {
        String green = "国家";
        int countUnit = green.length();                             // number of UTF-16 code units
        int countPoint = green.codePointCount(0, green.length());   // number of code points
        char character = green.charAt(0);                           // first code unit
        System.out.println(character + " " + countUnit + " " + countPoint);
    }
    // Output is "国 2 2", but by the reasoning above I expected "(some unknown code unit) 4 2"

}

PHP中文网PHP中文网2741 days ago797

All replies (2)

  • 高洛峰 2017-04-18 09:53:35

    Unicode has historically had two fixed-width encoding forms, 16-bit and 32-bit, known as UCS-2 and UCS-4 respectively. Java originally used a 16-bit encoding equivalent to UCS-2; since Java 5, strings are UTF-16 and a char is one 16-bit UTF-16 code unit. The first 128 characters are identical to ASCII, followed by other scripts such as Latin, Greek, and Chinese characters.

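    A char in Java is 2 bytes (16 bits), which is enough for any character in the first plane (the Basic Multilingual Plane), including 国 and 家. That is why length() returns 2 here: each of the two characters is a single code unit.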
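A minimal sketch of the difference between code units and code points, for anyone who wants to check it locally. The class name CodeUnitDemo and the helper counts are hypothetical, and U+20000 is simply a supplementary-plane character picked for illustration, not one taken from this thread. Unicode escapes are used so the result does not depend on the source file encoding.

```java
class CodeUnitDemo {

    // Returns "<code units> <code points>" for the given string.
    static String counts(String s) {
        return s.length() + " " + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+56FD U+5BB6: the two characters from the question, both in the BMP.
        String bmp = "\u56FD\u5BB6";
        // U+20000: a code point outside the BMP (assumed example).
        String supplementary = new String(Character.toChars(0x20000));

        System.out.println(counts(bmp));            // prints "2 2"
        System.out.println(counts(supplementary));  // prints "2 1"
    }
}
```

For the BMP string the two counts agree, which is the result the question observed; they only diverge once a character outside the first plane is involved.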

  • 天蓬老师 2017-04-18 09:53:35

    Not all Chinese characters occupy two code units. The two characters in "国家" have the code points U+56FD and U+5BB6, and each one takes only a single code unit. Some Chinese characters do need two code units, such as those in CJK Unified Ideographs Extension B. For example: "
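As a hedged illustration of how a code point maps to UTF-16 code units, the sketch below prints the units for one BMP character and one supplementary character. The class name SurrogateDemo and the helper describe are hypothetical, and U+20000 is an assumed example, not the character the answer above originally cited.

```java
class SurrogateDemo {

    // Describes how one code point is stored in UTF-16,
    // e.g. "U+20000 -> D840 DC00" for a surrogate pair.
    static String describe(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder(String.format("U+%04X ->", codePoint));
        for (char unit : units) {
            sb.append(String.format(" %04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0x56FD));   // prints "U+56FD -> 56FD"
        System.out.println(describe(0x20000));  // prints "U+20000 -> D840 DC00"

        String s = new String(Character.toChars(0x20000));
        // charAt(0) sees only the first half of the pair (a high surrogate) ...
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // prints "true"
        // ... while codePointAt(0) reassembles the full code point.
        System.out.println(Integer.toHexString(s.codePointAt(0)));   // prints "20000"
    }
}
```

This is why a single char can hold 国 (one code unit) but could not hold a character from a supplementary plane; for such text, codePointAt and codePointCount are the safer tools than charAt and length.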
