What is the cause of Chinese garbled characters?
Garbled Chinese characters appear when the decoding method does not match the encoding method. A Chinese character encoded in UTF-8 becomes 3 bytes, and encoded in gbk it becomes 2 bytes; an English character becomes 1 byte in either UTF-8 or gbk.
The operating environment of this tutorial: Windows 7 system, Dell G3 computer.
Let’s first talk about what garbled characters are
You may once have assumed, as I did, that a string holds not only characters but also information about its encoding. Take String str = "你好"; in Java: I used to think that str secretly carried an encoding such as unicode, gbk, or iso-8859-1. That understanding is wrong. Characters are just characters and carry no other information. The correct picture is that the text you see in a file is produced when the system reads the stored bytes, decodes them into characters, and displays the result. When you double-click a text file, the system reads its bytes and decodes them for display. When you save a text file, the system encodes the characters with the encoding you chose and writes the resulting bytes to storage. So garbled characters are also just characters, only strange ones; there is no "code" in them.
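As a rough illustration of the point that a string carries no encoding of its own, the sketch below encodes one String with two different charsets. The class name StringHasNoEncoding and the helper roundTrip are made up for this example, and it assumes the running JVM ships the GBK charset. The two byte arrays differ, yet each decodes back to an equal String when the matching charset is used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringHasNoEncoding {
    // Hypothetical helper: encode with a charset, then decode with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // "\u4f60\u597d" is 你好, escaped so the source file's own encoding does not matter.
        String text = "\u4f60\u597d";
        Charset gbk = Charset.forName("GBK"); // assumes GBK is available in this JVM

        // The encoding only shows up at the moment characters become bytes.
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asGbk = text.getBytes(gbk);
        System.out.println(Arrays.toString(asUtf8));
        System.out.println(Arrays.toString(asGbk));

        // Decoded with the matching charset, both byte arrays give the same characters.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(roundTrip(text, gbk)));
    }
}
```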
Let’s talk about the reasons for garbled characters
We often see this explanation on the Internet: garbled characters are caused by an inconsistency between the decoding method and the encoding method. There is nothing wrong with this sentence, but it only summarizes the problem and does not help you understand it.
So the question we really want to ask is: why does a mismatch between the decoding method and the encoding method produce garbled characters?
The examples below use three encodings: utf-8, gbk, and iso-8859-1.
@Test
public void testEncode() throws Exception {
    String str = "你好", en = "h?h";

    System.out.println("========中文字符utf-8=======");
    // Encode as utf-8 explicitly; getBytes() with no argument uses the platform default charset
    byte[] utf8 = str.getBytes("utf-8");
    for (byte b : utf8) {
        System.out.print(b + "\t");
    }

    System.out.println("\n" + "========英文字符utf-8=======");
    byte[] utf8_en = en.getBytes("utf-8");
    for (byte b : utf8_en) {
        System.out.print(b + "\t");
    }

    System.out.println("\n" + "========中文字符gbk=========");
    byte[] gbk = str.getBytes("gbk");
    for (byte b : gbk) {
        System.out.print(b + "\t");
    }

    System.out.println("\n" + "========英文字符gbk=========");
    byte[] gbk_en = en.getBytes("gbk");
    for (byte b : gbk_en) {
        System.out.print(b + "\t");
    }

    // Decode the same utf-8 bytes twice: once correctly, once as gbk
    String s = new String(utf8, "utf-8");
    String s1 = new String(utf8, "gbk");
    System.out.println("\n" + s + "====gbk:" + s1);
}
Test the above method and the printed result is:
========中文字符utf-8=======
-28	-67	-96	-27	-91	-67	
========英文字符utf-8=======
104	63	104	
========中文字符gbk=========
-60	-29	-70	-61	
========英文字符gbk=========
104	63	104	
你好====gbk:浣犲ソ
It can be concluded that:
A Chinese character encoded in utf-8 becomes 3 bytes; encoded in gbk it becomes 2 bytes.
An English character encoded in utf-8 becomes 1 byte; encoded in gbk it also becomes 1 byte.
The last line of the output, together with the final three statements of the code, shows that decoding the byte array utf8 as utf-8 produces no garbled characters and gives back the original "你好", while decoding it as gbk produces three garbled characters. Why three instead of two? The array utf8 holds 6 bytes and gbk reads 2 bytes per Chinese character, so 6/2=3.
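The 6/2=3 arithmetic can be made visible by decoding the utf-8 bytes two at a time as gbk. This is only a sketch: the class PairwiseGbkDecode and the method decodeInPairs are hypothetical names, and the pairwise split matches a real gbk decoder only when every byte pair happens to be a valid double-byte gbk character, which is the case for the bytes of "你好".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PairwiseGbkDecode {
    // Hypothetical helper: decode the bytes two at a time as GBK and join the pieces.
    // Assumes each pair is a valid double-byte GBK character.
    static String decodeInPairs(byte[] bytes, Charset gbk) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            out.append(new String(bytes, i, 2, gbk));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumes GBK is available in this JVM
        // "\u4f60\u597d" is 你好; its UTF-8 form is 6 bytes long.
        byte[] utf8 = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8);

        String pairwise = decodeInPairs(utf8, gbk);
        String whole = new String(utf8, gbk);

        System.out.println(utf8.length + " bytes / 2 = " + pairwise.length() + " characters");
        System.out.println(pairwise.equals(whole));
    }
}
```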
Next, let’s talk about iso-8859-1. This is a single-byte encoding designed for Western European languages, so it cannot represent Chinese (to carry Chinese you must rely on another encoding that is compatible with iso-8859-1). When a string is encoded, any character that iso-8859-1 cannot represent is replaced by the English question mark '?', whose iso-8859-1 code is 63 (decimal). (In fact, in almost all encodings an English character is represented by a fixed 1 byte; unicode encoding is the exception.)
@Test
public void testISO() throws Exception {
    String str = "你好";
    // iso-8859-1 cannot represent Chinese, so each character is encoded as '?' (63)
    byte[] bs = str.getBytes("iso-8859-1");
    for (byte b : bs) {
        System.out.println(b);
    }
    // Decode the same two bytes with four different charsets
    System.out.println(new String(bs, "iso-8859-1"));
    System.out.println(new String(bs, "utf-8"));
    System.out.println(new String(bs, "gbk"));
    System.out.println(new String(bs, "unicode"));
}
Print results
63
63
??
??
??
㼿
Explanation: 63 is '?'. Every Chinese character is turned into '?', so as soon as byte[] bs = "你好".getBytes("iso-8859-1"); executes, the information has already been lost.
After that, String str = new String(bs, "any charset"); can never equal "你好" again; with most charsets it is just two question marks ??. This is why in tomcat we often see Chinese text turn into a long run of ??????.
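One way to notice this kind of loss before it happens is to ask the charset's encoder whether it can represent the text. The sketch below uses the standard CharsetEncoder.canEncode method; the class LossyEncodingCheck and the helper isLossless are names invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyEncodingCheck {
    // Hypothetical helper: true if every character of text can be encoded by the charset.
    static boolean isLossless(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // "\u4f60\u597d" is 你好, escaped so the source file's own encoding does not matter.
        String chinese = "\u4f60\u597d";

        // iso-8859-1 cannot represent Chinese, so encoding would replace each character with '?'.
        System.out.println(isLossless(chinese, StandardCharsets.ISO_8859_1));
        System.out.println(isLossless(chinese, StandardCharsets.UTF_8));

        // The replacement byte is 63, the code of '?'.
        System.out.println(Arrays.toString(chinese.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Checking with isLossless before calling getBytes lets a program fail loudly instead of silently writing question marks.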
In iso-8859-1, utf-8, and gbk, one byte represents one English character.
In unicode encoding (the name Java uses for UTF-16), a single byte cannot represent any character; every character takes two bytes (sometimes 4).
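The byte counts mentioned so far can be checked with a small sketch. The class ByteCounts and the helper byteCount are hypothetical names; UTF_16BE is used here instead of Java's "unicode" charset on the assumption that we want to count only the character's own bytes, without the 2-byte byte-order mark that "unicode" adds when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteCounts {
    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumes GBK is available in this JVM
        String english = "h";
        String chinese = "\u4f60";            // 你
        String emoji = "\uD83D\uDE00";        // a character outside the basic plane (surrogate pair)

        System.out.println("h  utf-8:  " + byteCount(english, StandardCharsets.UTF_8));
        System.out.println("h  gbk:    " + byteCount(english, gbk));
        System.out.println("h  utf-16: " + byteCount(english, StandardCharsets.UTF_16BE));
        System.out.println("你 utf-8:  " + byteCount(chinese, StandardCharsets.UTF_8));
        System.out.println("你 gbk:    " + byteCount(chinese, gbk));
        System.out.println("你 utf-16: " + byteCount(chinese, StandardCharsets.UTF_16BE));
        System.out.println("emoji utf-16: " + byteCount(emoji, StandardCharsets.UTF_16BE));
    }
}
```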
Having said so much, many people may ask why so many encoding methods are used. Isn’t it possible to unify them into utf-8 to represent all characters?
Encoding not only considers whether any characters can be represented, but also considers transmission and storage.
1. UTF-8 can indeed represent almost all known characters. But as shown earlier, UTF-8 needs 3 bytes for one Chinese character where gbk needs only 2, so it takes more space and is less favorable for transmission and storage (both of which operate on binary data).
To understand the rules of the various encoding methods, see: https://jingyan.baidu.com/article/020278118741e91bcd9ce566.html
2. Undoubtedly, one byte per character, as in iso-8859-1, is the most space-saving. But the world has more than English characters; there are characters from every region and country, so the total number of characters is far greater than 2 to the 8th power (256), which is all that one byte can distinguish.
So combining the above two points, many encoding methods naturally appear.