


Detailed explanation of the JavaScript language's support for the Unicode character set
Last month, I gave a talk introducing the Unicode character set in detail and explaining how the JavaScript language supports it. The following is the transcript of that talk.
1. What is Unicode?
Unicode originated from a very simple idea: include all the characters in the world in one set. As long as the computer supports this character set, it can display all characters, and there will no longer be garbled characters.
Starting from 0, Unicode assigns a number to each symbol, which is called a "code point". For example, the symbol at code point 0 is null (meaning that all binary bits are 0), written U+0000.
In this notation, U+ indicates that the hexadecimal number immediately following is a Unicode code point.
Currently, the latest version of Unicode is version 7.0, which contains a total of 109,449 symbols, including 74,500 Chinese, Japanese and Korean characters. Roughly speaking, more than two-thirds of the world's existing symbols come from East Asian scripts. For example, the code point of the Chinese character 好 ("good") is U+597D, that is, 597D in hexadecimal.
With so many symbols, Unicode was not defined all at once but in partitions. Each partition can hold 65,536 (2^16) characters and is called a plane. Currently there are 17 planes in total, so the plane number fits in 5 bits and every code point fits in 21 bits (16 bits within a plane plus 5 bits for the plane).
The first 65,536 code points are called the basic plane (abbreviated BMP). Its code points range from 0 to 2^16 - 1, or in hexadecimal from U+0000 to U+FFFF. All the most common characters are placed on this plane, which was the first plane defined and published by Unicode.
The remaining characters are placed in the auxiliary planes (also called supplementary planes), with code points ranging from U+010000 to U+10FFFF.
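As a small illustration of how code points map onto planes, the sketch below derives the plane number from a code point. The helper name planeOf is hypothetical and not part of any standard API; it only restates the arithmetic described above.

```javascript
// Hypothetical helper: each plane holds 0x10000 (65,536) code points,
// so the plane number is the code point shifted right by 16 bits.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  return codePoint >> 16;
}

console.log(planeOf(0x597D));   // 0  (basic plane)
console.log(planeOf(0x1D306));  // 1  (first auxiliary plane)
console.log(planeOf(0x10FFFF)); // 16 (last plane)
```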
2. UTF-32 and UTF-8
Unicode only specifies the code point of each character. Which sequence of bytes represents that code point is a matter of the encoding method.
The most intuitive encoding method represents every code point with four bytes whose content corresponds one-to-one to the code point. This encoding method is called UTF-32. For example, code point U+0000 is represented by four zero bytes (0x0000 0000), and code point U+597D is padded with two leading zero bytes (0x0000 597D).
The advantage of UTF-32 is that the conversion rules are simple and intuitive, and lookup is efficient because every character has the same width. The disadvantage is that it wastes space: the same English text is four times larger than in ASCII. This shortcoming is so serious that hardly anyone actually uses this encoding method. The HTML5 standard explicitly stipulates that web pages must not be encoded as UTF-32.
What people really needed was a space-saving encoding method, which led to the birth of UTF-8. UTF-8 is a variable-length encoding method, with character lengths ranging from 1 byte to 4 bytes. The more commonly used a character is, the fewer bytes it takes. The first 128 characters are represented by only 1 byte, which is exactly the same as ASCII.
Code point range         Bytes
0x0000 - 0x007F          1
0x0080 - 0x07FF          2
0x0800 - 0xFFFF          3
0x010000 - 0x10FFFF      4
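The ranges listed above can be checked with a short sketch. The helper utf8ByteLength is a hypothetical name introduced only for illustration; TextEncoder is the standard UTF-8 encoder, assumed here to be available as a global, as it is in current browsers and Node.js.

```javascript
// Hypothetical helper that mirrors the ranges listed above.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x007F) return 1;
  if (codePoint <= 0x07FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4; // 0x010000 - 0x10FFFF
}

// Cross-check against the platform's own UTF-8 encoder.
const encoder = new TextEncoder();
for (const cp of [0x41, 0xE9, 0x597D, 0x1D306]) {
  const actual = encoder.encode(String.fromCodePoint(cp)).length;
  console.log('U+' + cp.toString(16).toUpperCase(), utf8ByteLength(cp), actual);
}
```

For each of the four sample code points, the length predicted from the ranges and the length produced by the encoder should agree.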
Because it saves space, UTF-8 has become the most common web page encoding on the Internet. However, it has little to do with today's topic, so I won't go into details. For the specific transcoding method, you can refer to "Character Encoding Notes".
3. Introduction to UTF-16
UTF-16 sits between UTF-32 and UTF-8, combining the characteristics of fixed-length and variable-length encoding methods.
Its encoding rule is very simple: characters in the basic plane occupy 2 bytes, and characters in the auxiliary planes occupy 4 bytes. That is to say, a UTF-16 encoding is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).
This raises a question. When we encounter two bytes, how do we know whether they are a character by themselves or need to be interpreted together with the next two bytes?
The answer is clever, and I don't know whether it was an intentional design. In the basic plane, the range from U+D800 to U+DFFF is an empty segment, that is, these code points do not correspond to any characters. This empty segment can therefore be used to map auxiliary plane characters.
Specifically, the auxiliary planes contain 2^20 code points, which means that at least 20 binary bits are needed to address them. UTF-16 splits these 20 bits in half. The first 10 bits are mapped into U+D800 to U+DBFF (a space of 2^10 values) and are called the high bits (H); the last 10 bits are mapped into U+DC00 to U+DFFF (also 2^10 values) and are called the low bits (L). This means that one auxiliary plane character is represented as two basic plane code units.
Therefore, when we encounter two bytes whose value lies between U+D800 and U+DBFF, we can conclude that the following two bytes must lie between U+DC00 and U+DFFF, and that these four bytes must be read together.
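The reading rule above can be sketched as follows. The function names are hypothetical helpers introduced for illustration; the arithmetic simply reverses the 10-bit/10-bit split just described.

```javascript
// Hypothetical helpers: classify a 16-bit unit by the range it falls in.
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Recombine the two 10-bit halves into an auxiliary plane code point.
function decodeSurrogatePair(high, low) {
  if (!isHighSurrogate(high) || !isLowSurrogate(low)) {
    throw new RangeError('Not a valid high/low pair');
  }
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

console.log(decodeSurrogatePair(0xD834, 0xDF06).toString(16)); // 1d306
```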
4. UTF-16 transcoding formula
When converting a Unicode code point to UTF-16, first determine whether it is a basic plane character or an auxiliary plane character. If it is the former, convert the code point directly to the corresponding hexadecimal form, two bytes long.
If it is an auxiliary plane character, Unicode version 3.0 provides a transcoding formula, where c is the code point:
H = floor((c - 0x10000) / 0x400) + 0xD800
L = (c - 0x10000) mod 0x400 + 0xDC00
Take the character 𝌆 as an example. It is an auxiliary plane character with code point U+1D306. Subtracting 0x10000 gives 0xD306; dividing by 0x400 gives a quotient of 0x34 and a remainder of 0x306, so H = 0xD834 and L = 0xDF06.
Therefore, the UTF-16 encoding of 𝌆 is 0xD834 DF06, and its length is four bytes.
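The transcoding formula can be written out as a short sketch. The function name toUtf16CodeUnits is hypothetical; for simplicity it does not reject code points inside the empty segment U+D800 to U+DFFF, which a stricter version would.

```javascript
// Hypothetical helper: convert a code point to its UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
  if (codePoint <= 0xFFFF) {
    return [codePoint]; // basic plane: one 2-byte unit
  }
  const offset = codePoint - 0x10000;
  const high = Math.floor(offset / 0x400) + 0xD800; // H
  const low = (offset % 0x400) + 0xDC00;            // L
  return [high, low];
}

const units = toUtf16CodeUnits(0x1D306);
console.log(units.map(function (u) { return u.toString(16); })); // [ 'd834', 'df06' ]
```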
5. Which encoding does JavaScript use?
The JavaScript language uses the Unicode character set, but supports only one encoding method.
This encoding is neither UTF-16, nor UTF-8, nor UTF-32. None of the above coding methods are used in JavaScript.
JavaScript uses UCS-2!
6. UCS-2 encoding
Where did UCS-2 suddenly come from? This requires a little history.
In the era before the Internet, two teams both wanted to create a unified character set. One was the Unicode team, established in 1989; the other was the earlier UCS team, established in 1988. When they discovered each other's existence, they quickly reached an agreement: the world does not need two unified character sets.
In October 1991, the two teams decided to merge their character sets. From then on, only one character set would be released, namely Unicode, and the previously released character sets would be revised so that the code points of UCS were completely consistent with Unicode.
The actual situation at the time was that UCS was developing faster than Unicode. As early as 1990 it had announced its first encoding method, UCS-2, which used 2 bytes to represent the characters that already had code points. (At that time there was only one plane, the basic plane, so 2 bytes were enough.) UTF-16 was not announced until July 1996, explicitly as a superset of UCS-2: basic plane characters keep their UCS-2 encoding, and auxiliary plane characters get a 4-byte representation.
Simply put, UTF-16 replaced UCS-2, or UCS-2 was absorbed into UTF-16. So now there is only UTF-16, and no UCS-2.
7. Background of the birth of JavaScript
So why didn't JavaScript choose the more advanced UTF-16 instead of the obsolete UCS-2?
The answer is simple: it is not that it didn't want to, but that it couldn't. When the JavaScript language appeared, UTF-16 did not yet exist.
In May 1995, Brendan Eich spent 10 days designing the JavaScript language; in October, the first interpreter came out; in November of the following year, Netscape officially submitted the language standard to ECMA (for details of the whole process, see "The Birth of JavaScript"). Compare this with the release date of UTF-16 (July 1996) and you will understand that Netscape had no other choice at the time: UCS-2 was the only encoding method available!
8. Limitations of JavaScript character functions
Since JavaScript can only handle UCS-2 encoding, all characters in this language are 2 bytes. If it is a 4-byte character, it will be treated as two double-byte characters. JavaScript's character functions are all affected by this and cannot return correct results.
Again take the character 𝌆 as an example. Its UTF-16 encoding is the 4 bytes 0xD834 DF06. Here the problem arises: a 4-byte encoding does not belong to UCS-2, so JavaScript does not recognize it and treats it as two separate characters, U+D834 and U+DF06. As mentioned before, these two code points are empty, so JavaScript regards 𝌆 as a string composed of two empty characters!
As a result, JavaScript reports the length of 𝌆 as 2, charAt(0) returns an empty-looking character (the lone unit U+D834), and charCodeAt(0) returns 0xD834. None of these results are correct!
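These results can be reproduced with a few lines. The sketch writes 𝌆 with escape sequences so that it does not depend on the file's own encoding; the variable name is arbitrary.

```javascript
var tetragram = '\uD834\uDF06'; // the single character 𝌆 (U+1D306)

console.log(tetragram.length);                     // 2
console.log(tetragram.charAt(0) === '\uD834');     // true: only half of the character
console.log(tetragram.charCodeAt(0).toString(16)); // d834
console.log(tetragram.charCodeAt(1).toString(16)); // df06
```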
To solve this problem, you must check each code point and adjust manually. When traversing a string correctly, every 2-byte unit has to be examined: as long as it falls in the range from 0xD800 to 0xDBFF, it must be read together with the following 2 bytes.
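A minimal sketch of such a traversal is shown below. The function name getSymbols is hypothetical, and the code deliberately sticks to pre-ES6 features; it also checks that the second unit really is in the range 0xDC00 to 0xDFFF, so that a stray first half is emitted on its own.

```javascript
// Hypothetical helper: split a string into whole characters,
// keeping each 4-byte character (two 2-byte units) together.
function getSymbols(string) {
  var output = [];
  var index = 0;
  while (index < string.length) {
    var character = string.charAt(index);
    var charCode = string.charCodeAt(index);
    if (charCode >= 0xD800 && charCode <= 0xDBFF && index + 1 < string.length) {
      var next = string.charCodeAt(index + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // First and second half found: read the 4 bytes together.
        output.push(character + string.charAt(index + 1));
        index += 2;
        continue;
      }
    }
    output.push(character);
    index += 1;
  }
  return output;
}

console.log(getSymbols('a\uD834\uDF06b').length); // 3
```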
Similar problems exist with all JavaScript character manipulation functions.
String.prototype.replace()
String.prototype.substring()
String.prototype.slice()
...
The above functions are only valid for 2-byte code points. To handle 4-byte code points correctly, you must write your own versions one by one, each checking the code point range of the current character.
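As one example of such a hand-written version, the sketch below is a variant of slice() that counts whole characters. The name sliceBySymbols is hypothetical, and the regular expression is just one possible way to keep the two halves of a 4-byte character together.

```javascript
// Hypothetical helper: like slice(), but start and end count whole
// characters instead of 2-byte units.
function sliceBySymbols(string, start, end) {
  // Match either a complete high/low pair or any single 2-byte unit.
  var symbols = string.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g) || [];
  return symbols.slice(start, end).join('');
}

var text = 'a\uD834\uDF06b'; // 'a', 𝌆, 'b'
console.log(text.slice(0, 2) === 'a\uD834');                 // true: 𝌆 is cut in half
console.log(sliceBySymbols(text, 0, 2) === 'a\uD834\uDF06'); // true: 𝌆 stays intact
```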
9. ECMAScript 6
The next version of JavaScript, ECMAScript 6 (ES6 for short), has greatly enhanced Unicode support and basically solved this problem.
(1) Correctly identify characters
ES6 can automatically recognize 4-byte code points. Iterating over a string is therefore much simpler: a for...of loop visits whole characters rather than 2-byte units.
However, to maintain compatibility, the length property still behaves in its original way. To get the correct length of a string, you can use Array.from(string).length.
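Both points can be seen in a short sketch, assuming an ES6-capable engine. The variable names are arbitrary.

```javascript
const text = 'a\uD834\uDF06b'; // 'a', 𝌆, 'b'

// for...of walks the string one whole character at a time.
const symbols = [];
for (const symbol of text) {
  symbols.push(symbol);
}

console.log(symbols.length);          // 3
console.log(text.length);             // 4: length still counts 2-byte units
console.log(Array.from(text).length); // 3
```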
(2) Code point representation
JavaScript allows a Unicode character to be written directly by its code point, in the form of a backslash, the letter u, and four hexadecimal digits, for example '\u597D' for 好.
However, this notation does not work for 4-byte code points. ES6 fixes this problem: the code point is recognized correctly as long as it is placed within curly brackets, as in '\u{1D306}'.
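The difference between the two notations can be checked as follows, assuming an ES6-capable engine.

```javascript
// Four-digit form: fine for basic plane characters.
console.log('\u597D' === '好'); // true

// With five digits, only the first four are read as the code point;
// the trailing '6' is an ordinary character.
console.log('\u1D306' === '\u1D30' + '6'); // true

// ES6 curly-bracket form: the whole code point is recognized.
console.log('\u{1D306}' === '\uD834\uDF06'); // true
```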
(3) String processing function
ES6 adds several new functions that specifically handle 4-byte code points.
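String.fromCodePoint(): returns the corresponding character from a Unicode code point
String.prototype.codePointAt(): returns the corresponding code point from a character
String.prototype.at(): returns the character at the given position in the string (a proposal at the time of this talk; it did not ship in ES6, and the at() method standardized later in ES2022 indexes 2-byte units rather than whole characters)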
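The first two functions can be compared with their older 2-byte counterparts in a brief sketch, assuming an ES6-capable engine.

```javascript
const tetragram = String.fromCodePoint(0x1D306); // 𝌆

console.log(tetragram === '\uD834\uDF06');               // true
// The older fromCharCode keeps only the low 16 bits (0xD306).
console.log(String.fromCharCode(0x1D306) === tetragram); // false

console.log(tetragram.codePointAt(0).toString(16)); // 1d306
console.log(tetragram.charCodeAt(0).toString(16));  // d834
// Starting in the middle of the character yields only the second half.
console.log(tetragram.codePointAt(1).toString(16)); // df06
```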
(4) Regular expression
ES6 provides the u modifier, which adds support for 4-byte code points in regular expressions.
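The effect of the u modifier can be seen by matching 𝌆 against a pattern that expects exactly one character. This is a small sketch assuming an ES6-capable engine.

```javascript
const tetragram = '\u{1D306}'; // 𝌆

// Without u, the dot matches one 2-byte unit, so a 4-byte character
// does not count as a single character.
console.log(/^.$/.test(tetragram));  // false

// With u, the dot matches the whole 4-byte character.
console.log(/^.$/u.test(tetragram)); // true

// The curly-bracket escape is also available inside a u pattern.
console.log(/\u{1D306}/u.test(tetragram)); // true
```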
(5) Unicode normalization
In addition to the letter itself, some characters carry an additional symbol. For example, in the Chinese Pinyin letter Ǒ, the tone mark above the letter is an additional symbol. For many European languages, such marks are very important.
Unicode provides two ways to represent them. One is a single character that already includes the additional symbol, that is, one code point represents one character; for example, the code point of Ǒ is U+01D1. The other treats the additional symbol as a separate code point that is combined with the main character, that is, two code points represent one character; for example, Ǒ can be written as O (U+004F) followed by the combining caron ˇ (U+030C).
// Method 1
'\u01D1'
// 'Ǒ'
// Method 2
'\u004F\u030C'
// 'Ǒ'
These two representation methods are exactly the same visually and semantically, and should be treated as equivalent. However, JavaScript can't tell.
'\u01D1' === '\u004F\u030C'
// false
ES6 provides the normalize method, allowing "Unicode normalization", that is, converting the two methods into the same sequence.
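A short sketch of normalize applied to the two forms of Ǒ, assuming an ES6-capable engine. Called without an argument, normalize uses the composed form (NFC); passing 'NFD' asks for the decomposed form instead.

```javascript
const precomposed = '\u01D1';    // Ǒ as one code point
const combined = '\u004F\u030C'; // O followed by a combining caron

console.log(precomposed === combined);                         // false
console.log(precomposed.normalize() === combined.normalize()); // true
console.log(combined.normalize('NFC') === precomposed);        // true
console.log(precomposed.normalize('NFD') === combined);        // true
```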
For more about ES6, please see "Introduction to ECMAScript 6".