
How to solve Java Chinese garbled characters


As computers developed and spread, countries designed their own encodings to suit their own languages and scripts. This proliferation of encoding schemes meant that the same binary value could be interpreted as different symbols under different encodings. To resolve this incompatibility, the great idea of Unicode was born.

Unicode

Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a single, unique binary code to every character in every language, meeting the requirements of cross-language, cross-platform text conversion and processing. You can picture Unicode as a "large character container" holding every symbol in the world, each with its own unique code, which eliminates garbled characters at the root. In other words, Unicode is an encoding for all symbols [2].

Unicode was developed alongside the Universal Character Set standard and is also published in book form. It is an industry standard that organizes and encodes most of the world's writing systems so that computers can present and process text more simply. Unicode is still being revised and so far includes more than 100,000 characters. It is widely recognized by the industry and widely used in the internationalization and localization of computer software.

We know that Unicode was created to overcome the limitations of traditional character encoding schemes, which share a common problem: they cannot support multilingual environments, something the open environment of the Internet cannot tolerate. At present, almost all computer systems support the basic Latin alphabet, while each additionally supports various other encodings. For compatibility, Unicode reserves its first 256 code points for the characters defined by ISO 8859-1, so existing Western European text can be converted without special handling; it also deliberately encodes many identical characters at multiple code points, so that old, complicated encodings can be converted to and from Unicode directly without losing any information [1].

Implementation method

A character's Unicode code point is fixed, but in actual transmission, because system platforms differ in design and to save space, the byte-level representation of Unicode varies. A concrete representation of Unicode is called a Unicode Transformation Format (UTF for short) [1].

Unicode is a character set; its main implementations are UTF-8, UTF-16, and UTF-32. Since UTF-8 is currently the mainstream implementation and UTF-16 and UTF-32 are comparatively rare, the following mainly introduces UTF-8.
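As a quick illustration, a minimal Java sketch can show that the same character occupies a different number of bytes under each implementation (note that "UTF-32BE" is an extended charset whose availability may vary by JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    public static void main(String[] args) {
        String s = "严"; // U+4E25, a CJK character

        // UTF-8 uses 3 bytes for most CJK characters
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3

        // UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane
        // (UTF_16BE is used here to avoid the extra BOM bytes of UTF_16)
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2

        // UTF-32 always uses 4 bytes per code point
        System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length); // 4
    }
}
```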

UCS

When discussing Unicode, it helps to know about UCS. UCS (Universal Character Set) is the standard character set defined by ISO 10646 (ISO/IEC 10646). It includes all other character sets and guarantees two-way compatibility with them: if you translate any text string to UCS format and then translate it back to the original encoding, you lose no information.

UCS not only assigns a code to each character but also gives it an official name. A hexadecimal number representing a UCS or Unicode value is usually preceded by "U+"; for example, "U+0041" represents the character "A".
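In Java, a small sketch like the following (illustrative only) can print a character's code point in the conventional "U+" notation and look up its official UCS name:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        char a = 'A';

        // Print the code point in the conventional "U+XXXX" notation
        System.out.printf("U+%04X%n", (int) a);        // U+0041

        // The reverse direction: build the character from its code point
        System.out.println((char) 0x0041);             // A

        // The official character name (available since Java 7)
        System.out.println(Character.getName(0x0041)); // LATIN CAPITAL LETTER A
    }
}
```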

Little endian & Big endian

Because system platforms are designed differently, some platforms may interpret characters differently (for example, in byte order), so the same byte stream can be read as different content. Suppose a character's hexadecimal value is 4E59, split into the bytes 4E and 59. A Mac reads starting from the low-order byte, so it parses the stream as 594E and finds the character "奎"; Windows reads starting from the high-order byte, getting 4E59, and finds the character "乙". In other words, "乙" saved on Windows becomes "奎" on the Mac. This would inevitably cause confusion, so Unicode distinguishes two byte orders: big endian, where the high-order byte comes first, and little endian, where the low-order byte comes first. A question then arises: how does a computer know which byte order a given file uses?

The Unicode specification defines that a character indicating the byte order be placed at the front of each file. This character is named "ZERO WIDTH NO-BREAK SPACE" and has the value FEFF. Conveniently, these are exactly two different bytes, with FF one greater than FE.

If the first two bytes of a text file are FE FF, the file uses big-endian order; if they are FF FE, it uses little-endian order.
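A minimal Java sketch of this check might read the first two bytes of a file and compare them against the two BOM patterns (the file path here is a placeholder):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomSniffer {
    // Returns a best guess at the UTF-16 byte order from the first two bytes
    static String detectUtf16ByteOrder(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            int b1 = in.read();
            int b2 = in.read();
            if (b1 == 0xFE && b2 == 0xFF) return "big-endian (UTF-16BE)";
            if (b1 == 0xFF && b2 == 0xFE) return "little-endian (UTF-16LE)";
            return "no UTF-16 BOM found";
        }
    }

    public static void main(String[] args) throws IOException {
        // "sample.txt" is a placeholder path for illustration
        System.out.println(detectUtf16ByteOrder("sample.txt"));
    }
}
```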

UTF-8

UTF-8 is a variable-length character encoding for Unicode. It uses 1 to 4 bytes per symbol, with the length varying by symbol. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so systems that originally processed ASCII text can keep working with no or only minor modifications. For this reason it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.

UTF-8 uses one to four bytes to encode each character. The encoding rules are as follows:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. Thus, for English letters, UTF-8 encoding is identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each following byte are set to 10. All remaining bits hold the symbol's Unicode code point.

The conversion table is as follows:

Unicode code point range (hex)   | UTF-8 encoding (binary)
0000 0000 - 0000 007F            | 0xxxxxxx
0000 0080 - 0000 07FF            | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF            | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF            | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

With this table, decoding UTF-8 becomes very simple: if the first bit of a byte is 0, the byte stands alone as a single character; if it is 1, the number of consecutive leading 1s tells how many bytes the character occupies.
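This rule translates directly into Java; the sketch below (names are illustrative) derives a sequence's byte count from the leading bits of its first byte:

```java
public class Utf8SequenceLength {
    // Number of bytes in a UTF-8 sequence, derived from its first byte
    static int sequenceLength(byte first) {
        int b = first & 0xFF;
        if ((b & 0b1000_0000) == 0)             return 1; // 0xxxxxxx: single byte
        if ((b & 0b1110_0000) == 0b1100_0000)   return 2; // 110xxxxx: two bytes
        if ((b & 0b1111_0000) == 0b1110_0000)   return 3; // 1110xxxx: three bytes
        if ((b & 0b1111_1000) == 0b1111_0000)   return 4; // 11110xxx: four bytes
        throw new IllegalArgumentException("not a valid UTF-8 leading byte");
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 'A'));  // 1
        System.out.println(sequenceLength((byte) 0xE4)); // 3 (first byte of "严")
    }
}
```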

Take the Chinese character "严" (yán) as an example to demonstrate how UTF-8 encoding works [3].

The Unicode code point of "严" is 4E25 (binary 100 1110 0010 0101). From the table above, 4E25 falls in the third row's range (0000 0800 - 0000 FFFF), so the UTF-8 encoding of "严" requires three bytes in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of "严", fill in the x positions from back to front, padding the remaining positions with 0. The result is that the UTF-8 encoding of "严" is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
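This result is easy to verify in Java, since String.getBytes performs exactly this conversion:

```java
import java.nio.charset.StandardCharsets;

public class Utf8OfYan {
    public static void main(String[] args) {
        // Encode "严" as UTF-8 and print the bytes in hexadecimal
        byte[] utf8 = "严".getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim());          // E4 B8 A5

        // And the Unicode code point itself:
        System.out.printf("U+%04X%n", "严".codePointAt(0)); // U+4E25
    }
}
```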

Conversion between Unicode and UTF-8

The example above shows that the Unicode code point of "严" is 4E25 while its UTF-8 encoding is E4B8A5. They are different, and a program must perform the conversion. On the Windows platform, the simplest and most intuitive way to observe this is with Notepad.

At the bottom of Notepad's Save As dialog, the "Encoding (E)" drop-down offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

ANSI: Notepad's default encoding: ASCII for English files and GB2312 for Simplified Chinese files. Note that different ANSI encodings are incompatible with each other; when exchanging information internationally, text in two different languages cannot be stored in the same ANSI-encoded file.

Unicode: the UCS-2 encoding, which stores each character's Unicode code point directly in two bytes. Notepad uses little-endian byte order for this option.

Unicode big endian: the UCS-2 encoding in big-endian byte order.

UTF-8: see the UTF-8 section above.

> Viewer" and get the following results:

ANSI: two bytes "D1 CF", which is exactly the GB2312 encoding of "严".

Unicode: four bytes "FF FE 25 4E", where "FF FE" marks little-endian storage and the real encoding is "25 4E".

Unicode big endian: four bytes "FE FF 4E 25", where "FE FF" marks big-endian storage and the real encoding is "4E 25".

UTF-8: six bytes "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate UTF-8 encoding (the UTF-8 BOM), and the last three bytes "E4 B8 A5" are the actual encoding of "严"; the storage order matches the encoding order.
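Coming back to the article's title: garbled Chinese usually appears when bytes are decoded with the wrong charset. When the wrong charset was ISO-8859-1 (which maps bytes to code points one-to-one), the damage is reversible. A hedged sketch under that assumption ("GB2312" is an extended charset whose availability may vary by JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixMojibake {
    public static void main(String[] args) {
        // Suppose the bytes "D1 CF" (GB2312 for "严") were wrongly decoded
        // as ISO-8859-1 somewhere in the pipeline, producing garbled text.
        byte[] gb2312Bytes = {(byte) 0xD1, (byte) 0xCF};
        String garbled = new String(gb2312Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // two Latin-1 symbols, not "严"

        // Because ISO-8859-1 maps bytes to code points one-to-one, the
        // original bytes can be recovered and re-decoded correctly.
        byte[] recovered = garbled.getBytes(StandardCharsets.ISO_8859_1);
        String fixed = new String(recovered, Charset.forName("GB2312"));
        System.out.println(fixed); // 严
    }
}
```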

