In everyday Java web development, character-encoding conversion comes up often, and Chinese text frequently ends up as garbled characters. I did not really understand why conversion produces garbled text, how to fix it, or how the conversion works, so I wrote some test code to try it out and finally got the encodings straight. I will give the conclusions first. In summary:
utf8 can represent the characters of every language. Mainstream development currently uses utf8 for both encoding and decoding, which does not produce garbled text. The following situations do lead to garbled text:
1. If you encode with another charset such as gbk (which supports Chinese) or iso-8859-1 (which does not), you can only decode with that same charset; otherwise the result is garbled.
2. If you encode with utf8 and decode with another charset, the result is garbled and needs one more conversion to recover.
3. If you encode with a charset (iso-8859-1) that has no corresponding characters (Chinese), the result is garbled and decoding cannot restore it.
1. Decode the same way you encoded
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/**
 * Encoding conversion test: Chinese => utf-8 encode, then utf-8 decode
 */
@Test
public void test0() {
    String test = "测试";
    System.out.println(Arrays.toString(test.getBytes(StandardCharsets.UTF_8))); // [-26, -75, -117, -24, -81, -107]
    System.out.println(new String(test.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)); // 测试
}

/**
 * Encoding conversion test: Chinese => gbk encode, then gbk decode
 */
@Test
public void test1() throws UnsupportedEncodingException {
    String test = "测试";
    System.out.println(Arrays.toString(test.getBytes("gbk"))); // [-78, -30, -54, -44]
    System.out.println(new String(test.getBytes("gbk"), "GBK")); // 测试
}
2. Encode with utf8, decode with the wrong charset
/**
 * Encoding conversion test: Chinese => utf-8 encode, then gbk decode
 */
@Test
public void test2() throws UnsupportedEncodingException {
    String test = "测试";
    System.out.println(Arrays.toString(test.getBytes(StandardCharsets.UTF_8))); // [-26, -75, -117, -24, -81, -107]
    System.out.println(new String(test.getBytes(StandardCharsets.UTF_8), "gbk")); // 娴嬭瘯
}
The fix is to use the wrongly decoded string as an intermediate: re-encode it with the charset that was wrongly used for decoding (gbk-encode), which gives back the original bytes, and then decode those bytes correctly with utf8 (utf8-decode) to get the original characters.
/**
 * Encoding conversion test: Chinese => utf-8 encode, gbk decode ===> gbk encode, utf-8 decode
 * "测试" => (utf8-encode) [-26, -75, -117, -24, -81, -107] => (gbk-decode) "娴嬭瘯"
 * "娴嬭瘯" => (gbk-encode) [-26, -75, -117, -24, -81, -107] => (utf8-decode) "测试"
 */
@Test
public void test3() throws UnsupportedEncodingException {
    String test = "测试";
    String test_gbk_utf8 = new String(test.getBytes(StandardCharsets.UTF_8), "gbk");
    System.out.println(test_gbk_utf8); // 娴嬭瘯
    String test_utf8_gbk = new String(test_gbk_utf8.getBytes("gbk"), StandardCharsets.UTF_8);
    System.out.println(test_utf8_gbk); // 测试
}
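The repair step can be wrapped in a small helper. This is only a sketch: the class name `MojibakeRepair` and the method `repair` are made up for illustration, and the test string is written with Unicode escapes (`\u6d4b\u8bd5` is 测试) so that the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original tests.
final class MojibakeRepair {

    // Reverses a wrong decode: turn the garbled text back into bytes with the
    // charset that was wrongly applied, then decode those bytes with the
    // charset that actually produced them.
    static String repair(String garbled, Charset wronglyUsed, Charset actual) {
        byte[] originalBytes = garbled.getBytes(wronglyUsed);
        return new String(originalBytes, actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u6d4b\u8bd5";
        // Simulate the mistake: utf8 bytes read as gbk.
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8), gbk);
        System.out.println(garbled);
        System.out.println(repair(garbled, gbk, StandardCharsets.UTF_8));
    }
}
```

The round trip only works when the wrong decode kept every byte. If the garbled text already contains "?" or "�", the original bytes are gone and this helper cannot bring them back.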
3. Encoding with a charset that lacks the characters
@Test
public void test4() throws UnsupportedEncodingException {
    String test = "测试";
    System.out.println(Arrays.toString(test.getBytes(StandardCharsets.ISO_8859_1))); // [63, 63]
    System.out.println(new String(test.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1)); // ??
}
In this case each Chinese character has already been replaced by the byte 63 ("?") during encoding, so even decoding with the same charset cannot restore the characters. The conversion is irreversible.
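One way to catch this before any data is lost is to ask the charset whether it can represent the text at all. The sketch below uses the standard `CharsetEncoder.canEncode` method; the class and method names (`EncodableCheck`, `isLossless`) are illustrative only, and `\u6d4b\u8bd5` is the test string 测试 written with Unicode escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration.
final class EncodableCheck {

    // True when every character of text has a representation in charset,
    // so getBytes(charset) will not substitute '?' for anything.
    static boolean isLossless(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String test = "\u6d4b\u8bd5";
        System.out.println(isLossless(test, StandardCharsets.ISO_8859_1));
        System.out.println(isLossless(test, StandardCharsets.UTF_8));
    }
}
```

For the test string this prints `false` for iso-8859-1 and `true` for utf8.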
The following line of code means: get the bytes of the target string str in the gbk encoding, then decode those bytes into a string as utf8. For Chinese text this will produce garbled characters, because the encoding and decoding charsets do not match. (Pure ASCII text survives, since gbk and utf8 encode ASCII identically.)
new String(str.getBytes("gbk"),"utf8")
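By default `new String(bytes, charset)` silently substitutes a replacement character for bytes it cannot decode. If you would rather be told about a mismatch, a `CharsetDecoder` configured with `CodingErrorAction.REPORT` raises an exception instead. A minimal sketch, with hypothetical names `StrictDecode`, `decodeStrict` and `isDecodable` (`\u5f69\u8679` is 彩虹 written with Unicode escapes):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration.
final class StrictDecode {

    // Decodes bytes and throws instead of inserting replacement characters.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper: true when the bytes are valid in the given charset.
    static boolean isDecodable(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] gbkBytes = "\u5f69\u8679".getBytes(gbk);
        System.out.println(isDecodable(gbkBytes, StandardCharsets.UTF_8)); // gbk bytes are not valid utf8
        System.out.println(isDecodable(gbkBytes, gbk));
    }
}
```

This check is not a complete safeguard: utf8 bytes often happen to form valid gbk sequences, so a strict gbk decode of utf8 data can succeed and still yield garbled text.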
To transmit a string, it must first be converted into a byte stream using some encoding. When the byte stream reaches the receiver, it must be converted back into a string using some encoding. Garbled characters arise during this conversion back into a string. The following is my test of garbled Chinese characters:
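// Needs a surrounding method that declares throws UnsupportedEncodingException.
String str = "彩虹";
String[] a = new String[] {"gbk", "unicode", "utf8", "gb2312"};
for (int i = 0; i < a.length; i++) {        // charset used to encode the string into bytes
    for (int j = 0; j < a.length; j++) {    // charset used to decode the bytes back into a string
        System.out.println("Binary format: " + a[i] + " encoding format: " + a[j]);
        System.out.println("Encoded string: " + new String(str.getBytes(a[i]), a[j]));
    }
}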
Binary format: gbk encoding format: gbk
Encoded string: 彩虹
Binary format: gbk encoding format: unicode
Encoded string: 닊뫧
Binary format: gbk encoding format: utf8
Encoded string: �ʺ�
Binary format: gbk encoding format: gb2312
Encoded string: 彩虹
Binary format: unicode encoding format: gbk
Encoded string: �_i唝
Binary format: unicode encoding format: unicode
Encoded string: 彩虹
Binary format: unicode encoding format: utf8
Encoded string: ��_i�y
Binary format: unicode encoding format: gb2312
Encoded string: ��_i�y
Binary format: utf8 encoding format: gbk
Encoded string: 褰╄櫣
Binary format: utf8 encoding format: unicode
Encoded string: ꧨ馹
Binary format: utf8 encoding format: utf8
Encoded string: 彩虹
Binary format: utf8 encoding format: gb2312
Encoded string: 褰╄��
Binary format: gb2312 encoding format: gbk
Encoded string: 彩虹
Binary format: gb2312 encoding format: unicode
Encoded string: 닊뫧
Binary format: gb2312 encoding format: utf8
Encoded string: �ʺ�
Binary format: gb2312 encoding format: gb2312
Encoded string: 彩虹
As the results show, whenever the encoding used to produce the bytes differs from the encoding used to decode them into a string, the result is garbled.
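To see why the results differ, it helps to look at the bytes each encoding produces. The following sketch prints them in hexadecimal; `ByteDump` and `hex` are names invented here, and `\u5f69\u8679` is the test string 彩虹 written with Unicode escapes.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration.
final class ByteDump {

    // Encodes text with the named charset and renders the bytes as hex pairs.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "\u5f69\u8679";
        for (String name : new String[] {"gbk", "unicode", "utf8", "gb2312"}) {
            System.out.println(name + ": " + hex(str, name));
        }
    }
}
```

For 彩虹 the utf8 bytes are E5 BD A9 E8 99 B9, while "unicode" (an alias for UTF-16 in Java) yields FE FF 5F 69 86 79, where FE FF is a byte order mark. Reading one byte sequence under the other's rules cannot give the same characters.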
The conversions between gbk and gb2312 show no garbled text because gbk is an extended version of gb2312 that covers more Chinese characters. For the same reason, if the bytes are encoded as gbk and decoded as gb2312, some Chinese characters may still come out garbled.
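The superset relationship can be checked directly. The sketch below takes the last character of the garbled string 褰╄櫣 from the results above, which gbk handles but gb2312 does not, and asks each encoder about it. The class name `GbkSuperset` and its methods are illustrative, and the example assumes the JDK in use ships the GBK and GB2312 charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration.
final class GbkSuperset {

    // The last character of the string obtained by reading the utf8 bytes of
    // the test string as gbk (the garbled text shown in the results above).
    static char sampleChar() {
        String original = "\u5f69\u8679";
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8), Charset.forName("GBK"));
        return garbled.charAt(garbled.length() - 1);
    }

    // Asks the named charset whether it has a byte sequence for c.
    static boolean canEncode(char c, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char c = sampleChar();
        System.out.println("GBK:    " + canEncode(c, "GBK"));
        System.out.println("GB2312: " + canEncode(c, "GB2312"));
    }
}
```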
The garbled results above fall roughly into two kinds: complex combinations of Chinese characters and symbols, and "?" (or the replacement character "�").
If the garbled data you want to recover contains question marks, the chances of recovery are slim. Garbled characters other than "?" still follow the rules of some encoding, so they can be recovered by reversing the wrong decode (re-encoding them into bytes) and then decoding with the correct encoding. "?" is the exception: when a byte stream is decoded with a given encoding, every byte that cannot be converted into a meaningful character under that encoding is turned into "?". Even if the string is then encoded back into a byte stream, every "?" becomes the same bytes, so the original values are lost.
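The loss can be demonstrated by decoding bytes with the wrong charset and immediately re-encoding with that same charset. In this sketch (the names `LossyRoundTrip` and `survives` are invented for illustration, and `\u5f69\u8679` is 彩虹 written with Unicode escapes) the round trip is compared byte for byte with the input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration.
final class LossyRoundTrip {

    // Decodes with the wrong charset, re-encodes with it, and reports
    // whether the original bytes came back unchanged.
    static boolean survives(byte[] original, Charset wrong) {
        String garbled = new String(original, wrong);
        return Arrays.equals(original, garbled.getBytes(wrong));
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String str = "\u5f69\u8679";
        // gbk bytes read as utf8: invalid bytes become replacement characters.
        System.out.println(survives(str.getBytes(gbk), StandardCharsets.UTF_8));
        // utf8 bytes read as gbk: every byte pair maps to some character.
        System.out.println(survives(str.getBytes(StandardCharsets.UTF_8), gbk));
    }
}
```

The first call prints `false` and the second prints `true`, which matches the results above: the gbk-to-utf8 garble contains "�" and is lost, while the utf8-to-gbk garble can still be reversed.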
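If the garbled text contains no "?", there is still hope of converting it back. Taking "褰╄櫣" from the garbled results above as an example, I ran the conversion again with the following code: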
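// Needs a surrounding method that declares throws UnsupportedEncodingException.
String str = "褰╄櫣";
String[] charset = new String[] {"gbk", "unicode", "utf8", "gb2312"};
for (int i = 0; i < charset.length; i++) {        // charset used to encode the garbled string into bytes
    for (int j = 0; j < charset.length; j++) {    // charset used to decode the bytes back into a string
        System.out.println("Binary format: " + charset[i] + " encoding format: " + charset[j]);
        System.out.println("Encoded string: " + new String(str.getBytes(charset[i]), charset[j]));
    }
}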
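Binary format: gbk encoding format: gbk
Encoded string: 褰╄櫣
Binary format: gbk encoding format: unicode
Encoded string: ꧨ馹
Binary format: gbk encoding format: utf8
Encoded string: 彩虹
Binary format: gbk encoding format: gb2312
Encoded string: 褰╄��
Binary format: unicode encoding format: gbk
Encoded string: ��0%Dj�
Binary format: unicode encoding format: unicode
Encoded string: 褰╄櫣
Binary format: unicode encoding format: utf8
Encoded string: ���0%Dj�
Binary format: unicode encoding format: gb2312
Encoded string: ���0%Dj�
Binary format: utf8 encoding format: gbk
Encoded string: 瑜扳晞娅�
Binary format: utf8 encoding format: unicode
Encoded string: 냢閄�
Binary format: utf8 encoding format: utf8
Encoded string: 褰╄櫣
Binary format: utf8 encoding format: gb2312
Encoded string: 瑜扳��娅�
Binary format: gb2312 encoding format: gbk
Encoded string: 褰╄?
Binary format: gb2312 encoding format: unicode
Encoded string: ꧨ�
Binary format: gb2312 encoding format: utf8
Encoded string: 彩�?
Binary format: gb2312 encoding format: gb2312
Encoded string: 褰╄?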
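As you can see, one of the conversions successfully turned the garbled text back into the proper Chinese characters:
Binary format: gbk encoding format: utf8
Encoded string: 彩虹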
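Picking the right pair by eye gets tedious as the list of charsets grows. One possible refinement, sketched below under the assumption that a lossy conversion is never the one you want, is to run every pair with strict encoders and decoders and keep only the pairs that complete without error and actually change the text. The names `RecoveryCandidates` and `candidates` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration.
final class RecoveryCandidates {

    // Tries every (encode, decode) pair strictly. Pairs that would need a
    // '?' or replacement character are skipped; the rest are returned keyed
    // by "from -> to".
    static Map<String, String> candidates(String garbled, String... charsetNames) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String from : charsetNames) {
            for (String to : charsetNames) {
                try {
                    ByteBuffer bytes = Charset.forName(from).newEncoder()
                            .onMalformedInput(CodingErrorAction.REPORT)
                            .onUnmappableCharacter(CodingErrorAction.REPORT)
                            .encode(CharBuffer.wrap(garbled));
                    String decoded = Charset.forName(to).newDecoder()
                            .onMalformedInput(CodingErrorAction.REPORT)
                            .onUnmappableCharacter(CodingErrorAction.REPORT)
                            .decode(bytes)
                            .toString();
                    if (!decoded.equals(garbled)) {
                        result.put(from + " -> " + to, decoded);
                    }
                } catch (CharacterCodingException lossy) {
                    // This pair cannot convert the text without losing data.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Rebuild the garbled sample: utf8 bytes of the test string read as gbk.
        String garbled = new String("\u5f69\u8679".getBytes(StandardCharsets.UTF_8),
                Charset.forName("GBK"));
        candidates(garbled, "gbk", "unicode", "utf8", "gb2312")
                .forEach((pair, text) -> System.out.println(pair + ": " + text));
    }
}
```

The surviving candidates still need a human to judge which one is readable; the filter only removes conversions that are certain to have lost data.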