Encoding problems often encountered in HTML and JavaScript

PHP中文网

May 16, 2016, 06:57 PM

In day-to-day front-end development we work with HTML, JavaScript, CSS and other languages. Like a natural language, a computer language has its own alphabet, grammar, vocabulary, encoding methods and so on.

Here I will briefly talk about the encoding issues that come up most often in everyday front-end work with HTML and JavaScript.
In a computer, the information we store is represented in binary. Encoding is the conversion between the symbols shown on the screen, such as English letters and Chinese characters, and the binary codes used to store them.

There are two basic concepts that need to be explained, charset and character encoding:

charset (character set): a table that maps symbols to numbers. It is what decides that 107 is the 'k' in koubei and that 21475 is the "口" in 口碑 (Koubei). Different tables define different mappings, for example ASCII, GB2312 and Unicode. With such a table of numbers and characters, we can turn a number stored in binary into a character.
character encoding: the way such a number is written out. For example, for the number 21475, which stands for "口", should we write \u53e3 or %E5%8F%A3? That is decided by the character encoding.
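The mapping between characters and numbers can be checked directly in JavaScript. The following is a minimal sketch; the helper names `codeOf` and `charOf` are made up for illustration, and `console.log` is used so that it also runs outside a browser.

```javascript
// Look up the number a character maps to in the Unicode character set.
function codeOf(ch) {
  return ch.charCodeAt(0);
}

// Go the other way: from the number back to the character.
function charOf(code) {
  return String.fromCharCode(code);
}

console.log(codeOf('k'));               // 107
console.log(codeOf('口'));              // 21475
console.log(codeOf('口').toString(16)); // 53e3
console.log(charOf(21475));             // 口
```

The number is the same in every case; only the way it is written (decimal, hexadecimal, or later as bytes) changes.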

For a string like 'koubei.com', made up of the characters Americans commonly use, there is a character set called ASCII, short for American Standard Code for Information Interchange. It uses the 128 numbers 0-127 (2 to the 7th power, 0x00-0x7F) to represent 128 common characters such as 123abc. That takes 7 bits; the eighth, leading bit of the byte is left unused by ASCII and was often used as a parity bit. Together they make up the 8 bits of one byte. The Americans were a bit stingy back then. If a byte had been designed from the start to have 16 or 32 bits, there would be fewer problems in the world. At the time, though, they probably thought that 8 bits, with room for 128 different characters, was enough!
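As a small illustration of the 0-127 range, the sketch below checks whether every character of a string fits into 7 bits. The function name `isAscii` is an assumption for this example, not a built-in.

```javascript
// Returns true when every character code is in the ASCII range 0x00-0x7F.
function isAscii(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 0x7f) {
      return false;
    }
  }
  return true;
}

console.log(isAscii('koubei.com')); // true
console.log(isAscii('口碑'));        // false
console.log((0x7f).toString(2));    // 1111111 (seven bits, all set)
```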

Since computers were invented by Americans, they took the easy route and encoded only the symbols they themselves used, which worked nicely for them. But when computers started to spread internationally, problems arose. Take China as an example: there are tens of thousands of Chinese characters. What should we do?

The existing system of 8 bits to a byte is the foundation and cannot be torn up and changed to 16 bits or the like; the change would be far too big. So we can only take another path: use several bytes to represent one character. That is MBCS (Multi-Byte Character Set).
With the MBCS idea we can represent many more characters. For example, 2 bytes give 16 bits, which in theory allows 2 to the 16th power, or 65536, characters. But how are these codes assigned to characters? For example, the Unicode code of the "口" in 口碑 (Koubei) is 21475. Who decided that? The character set, the charset just introduced. ASCII is the most basic character set. On top of it there are MBCS character sets such as GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

Finally, an organization called the Unicode Consortium decided to create a character set that includes all characters (UCS, Universal Character Set) together with a standard for the corresponding encoding methods, namely Unicode. The first version of the Unicode standard was released in 1991 (the printed Version 4.0 carries ISBN 0-321-18578-1), and the International Organization for Standardization (ISO) takes part in this work as well, with ISO/IEC 10646: the Universal Character Set.

In short, Unicode is a character standard that covers basically every symbol in use on earth, and it is being used more and more widely. The ECMA standard also stipulates that the JavaScript language uses Unicode for its characters internally (this means that JavaScript variable names, function names and so on may be written in Chinese!).
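The remark about Chinese identifiers is easy to try out. The names below are invented for this sketch; any Unicode letters that the language accepts in identifiers would do.

```javascript
// Identifiers may contain Unicode letters, so Chinese names are legal.
var 口碑 = 'koubei';

function 问候(名字) {
  return 'hello, ' + 名字;
}

console.log(问候(口碑)); // hello, koubei
```

Whether this is a good idea in a shared codebase is a separate question; the point is only that the language allows it.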

Developers in China are more likely to run into problems such as converting between GBK, GB2312 and UTF-8. Strictly speaking, that way of putting it is not quite accurate. GBK and GB2312 are character sets (charset), while UTF-8 is an encoding method (character encoding): it is one way of encoding the UCS character set of the Unicode standard. Because web pages that use the Unicode character set are mostly encoded in UTF-8, people often mention them side by side, which is actually inaccurate.

With Unicode we have a master key, at least until human civilization meets aliens, so everyone should use it. The most widely used Unicode encoding today is UTF-8 (8-bit UCS/Unicode Transformation Format), which has two particularly good features:

It encodes the UCS character set, which is used universally around the world.
It is a variable-length character encoding that is compatible with ASCII.
The second point is a big advantage: systems that previously used pure ASCII stay compatible, and no extra storage is needed (if a fixed-length encoding required 2 bytes for every character, the storage taken up by ASCII characters would double).
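The storage argument can be sketched with `TextEncoder`, which always produces UTF-8. This assumes a runtime that provides `TextEncoder` (current browsers and recent Node.js versions do); `fixedTwoByteLength` stands for a hypothetical fixed-width encoding and is not a real API.

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// Number of bytes the text occupies in UTF-8.
function utf8Length(text) {
  return encoder.encode(text).length;
}

// Size under a hypothetical fixed-width encoding of 2 bytes per character.
function fixedTwoByteLength(text) {
  return text.length * 2;
}

console.log(utf8Length('koubei.com'));         // 10
console.log(fixedTwoByteLength('koubei.com')); // 20
console.log(utf8Length('口'));                  // 3
console.log(encoder.encode('k')[0]);           // 107, the same value as in ASCII
```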

To explain UTF-8 clearly, it will be more convenient to introduce a table:

U-00000000 – U-0000007F: 0xxxxxxx
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

To understand this table, we only need to look at the first two rows.

U-00000000 – U-0000007F: 0xxxxxxx
This is the first row. It means that if the binary form of a UTF-8 encoded byte is 0xxxxxxx, that is, it starts with 0 and lies between 0 and 127 in decimal, then that single byte represents one character, with exactly the same meaning as in ASCII. All other bytes in UTF-8 start with 1 (1xxxxxxx), are greater than 127, and at least 2 of them are needed to represent one symbol. So the first bit of a byte is a switch that says whether the character is ASCII. This is the compatibility mentioned above. In the words of the English definition, these are two properties of UTF-8:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

Then let’s look at the second line:

U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
Look at the first byte first: 110xxxxx. It says: I am not ASCII (because the first bit is not 0); I am the first byte of a multi-byte character (the second bit is 1); the character I belong to is made up of 2 bytes (the third bit is 0); the character information is stored from the fourth bit on.
Now the second byte: 10xxxxxx. It says: I am not ASCII (because the first bit is not 0); I am not the first byte of a multi-byte character (the second bit is 0); the character information is stored from the third bit on.
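To make the second row concrete, here is a minimal sketch that reads a two-byte sequence exactly as described: check the marker bits, then join the 5 + 6 information bits. The function name `decodeTwoBytes` is an assumption for this example, and the sketch does not reject "overlong" sequences that a strict decoder would refuse.

```javascript
// Decode a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) into its code number.
function decodeTwoBytes(first, second) {
  // The first byte must start with 110, the second with 10.
  if ((first & 0xe0) !== 0xc0 || (second & 0xc0) !== 0x80) {
    throw new Error('not a two-byte UTF-8 sequence');
  }
  const high = first & 0x1f; // the 5 information bits of the first byte
  const low = second & 0x3f; // the 6 information bits of the second byte
  return (high << 6) | low;
}

// 0xC3 0xA9 is the UTF-8 form of U+00E9 (é).
const code = decodeTwoBytes(0xc3, 0xa9);
console.log(code.toString(16));         // e9
console.log(String.fromCharCode(code)); // é
```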

This example shows that in UTF-8, within a long run of consecutive bytes, a symbol may take 2 to 6 bytes (the table above is the original design; RFC 3629 has since limited UTF-8 to at most 4 bytes, up to U+10FFFF). Compared with ASCII, where one byte is one symbol, we need room for two extra pieces of information. First, where a symbol starts, a "starter" position; in biological terms, it is the start codon AUG in protein translation. Second, how many bytes the symbol uses (strictly, if every symbol has a starter the length does not need to be given, but giving it improves fault tolerance when some bytes are lost).

The solution: the second bit of a byte says whether the byte is the first byte of a character (the first bit is already taken: 0 means ASCII, 1 means non-ASCII). So the first byte of a multi-byte symbol is always 11xxxxxx, a binary number between 192 and 255. The length information comes next, from the third bit on. If the third bit is 0, the symbol has 2 bytes, and each additional 1 from the third bit on adds one byte to the character. UTF-8 as defined here allows up to 6 bytes, which takes 4 more 1s than the 2-byte starter 110xxxxx, so that starter is 1111110x, as the table above shows.
Look at the standard definition in English again, it expresses the same meaning:

The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
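The rule about leading bits can be written down as a small function that looks at a single byte and reports how long the sequence starting there is. This is a sketch that follows the six-row table above; the name `sequenceLength` and the convention of returning 0 for "not a start byte" are assumptions of this example.

```javascript
// Given one byte, return the length of the UTF-8 sequence it starts,
// or 0 if the byte cannot start a character.
function sequenceLength(byte) {
  if ((byte & 0x80) === 0x00) return 1; // 0xxxxxxx: ASCII, one byte
  if ((byte & 0xc0) === 0x80) return 0; // 10xxxxxx: continuation byte
  // Count the leading 1 bits: 110xxxxx -> 2, 1110xxxx -> 3, and so on.
  let length = 0;
  for (let mask = 0x80; (byte & mask) !== 0; mask >>= 1) {
    length++;
  }
  // 0xFE and 0xFF would give 7 and 8; the table stops at 6.
  return length <= 6 ? length : 0;
}

console.log(sequenceLength(0x6b)); // 1 ('k')
console.log(sequenceLength(0xc3)); // 2
console.log(sequenceLength(0xe5)); // 3
console.log(sequenceLength(0x8f)); // 0 (continuation byte)
```

Because a continuation byte can always be recognized on its own, a reader that lands in the middle of a stream can skip bytes until `sequenceLength` returns a non-zero value and carry on from there.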

The real information bits (that is, the number assigned in the character set) are placed, in order and in binary, on the 'x' positions in the table above. Take Chinese characters, the ones we Chinese programmers deal with most. Their codes lie in the range U-00000800 – U-0000FFFF, and the table shows that UTF-8 uses three bytes for this range (which is why Chinese text encoded in UTF-8 takes more storage than the GB2312 character set in EUC-CN encoding, where each Chinese character takes 2 bytes). Using the "口" of 口碑 (Koubei) again, its Unicode number looks like this:
口: 21475 == 0x53e3 == binary 101001111100011
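Placing those bits onto the 'x' positions of the three-byte row can be done by hand with shifts and masks. The function below is a sketch for that one row only; the name `encodeThreeBytes` is an assumption, and it does not reject the surrogate range that a full encoder would treat specially.

```javascript
// Encode a code in U+0800..U+FFFF as 1110xxxx 10xxxxxx 10xxxxxx.
function encodeThreeBytes(code) {
  if (code < 0x0800 || code > 0xffff) {
    throw new RangeError('expected a code in U+0800..U+FFFF');
  }
  return [
    0xe0 | (code >> 12),         // 1110xxxx: the top 4 bits
    0x80 | ((code >> 6) & 0x3f), // 10xxxxxx: the middle 6 bits
    0x80 | (code & 0x3f),        // 10xxxxxx: the low 6 bits
  ];
}

const bytes = encodeThreeBytes(0x53e3); // 口
console.log(bytes.map(function (b) { return b.toString(16); }).join(' '));
// e5 8f a3
console.log(bytes.map(function (b) { return b.toString(2); }).join(' '));
// 11100101 10001111 10100011
```

Reading the 'x' bits back out of the second line gives 0101 001111 100011, which is the binary number above padded with one leading zero.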

In JavaScript, run this code (use the Firebug console, or edit an HTML file and put the code between a pair of script tags):

alert('\u53e3'); // get '口'
alert(escape('口')); // get '%u53E3'
alert(String.fromCharCode(21475)); // get '口'
alert('口'.charCodeAt(0)); // get 21475
alert(encodeURI('口')); // get '%E5%8F%A3'

As you can see, a string literal can produce the character '口' from the \u form with the hexadecimal Unicode code, and the fromCharCode method takes the decimal Unicode code and returns the character '口'.

The second alert gives '%u53E3'. This is a non-standard way of writing a Unicode code and is a variant of the percent-encoding used in URIs. It was officially rejected by the W3C and is not part of any RFC. The ECMA-262 standard does specify this behavior for escape, but that is probably only temporary.
More interesting is the '%E5%8F%A3' from the fifth alert. What is it, and where does it come from?

This is percent-encoding, which is commonly used in URIs and is specified in RFC 3986.
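For a non-ASCII character, the percent-encoded form is simply its UTF-8 bytes, each written as % followed by two hexadecimal digits. The sketch below rebuilds that by hand and compares it with the built-in `encodeURI`. The name `percentEncodeUtf8` is an assumption for this example, and `TextEncoder` is assumed to be available (current browsers and recent Node.js versions).

```javascript
// Percent-encode every UTF-8 byte of the text.
// Unlike encodeURI, this sketch also encodes plain ASCII characters.
function percentEncodeUtf8(text) {
  const utf8 = new TextEncoder().encode(text);
  let result = '';
  for (const b of utf8) {
    result += '%' + b.toString(16).toUpperCase().padStart(2, '0');
  }
  return result;
}

console.log(percentEncodeUtf8('口')); // %E5%8F%A3
console.log(encodeURI('口'));         // %E5%8F%A3
console.log(decodeURI('%E5%8F%A3')); // 口
```

The three bytes E5 8F A3 are what the three-byte row of the UTF-8 table produces for 0x53e3.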
