Chinese auto-detection | should also be GB2312 |
|
As the table shows, browsers differ in how they parse a page that declares no encoding by any means. For the simplest pages the choice makes no difference, whatever encoding is used (provided it is a superset of ASCII), but the results are enough to show how important it is to declare the encoding correctly.
Encoding Declaration
HTML4 and HTML5 each devote a chapter to how an encoding is declared; see the relevant chapters of the HTML4 and HTML5 specifications.
First of all, what is an encoding? Declaring an encoding tells the browser (or user agent) which algorithm to use to turn the byte stream into the correct content. In the HTML standard, an encoding is referred to by a name or alias. These names come from the IANA registry, and only encodings that appear in that list are recognized by browsers. Therefore, if UTF-8 is written as UTF8, the browser may ignore the declaration entirely. Encoding names are also case-insensitive.
In HTML4 there are three ways to specify the encoding of a page. In order of priority, they are:
The charset parameter of the Content-Type field in the HTTP header.
A declaration in a <meta> tag.
For some external resources, such as JS files loaded by a <script> tag, the charset attribute on the tag that loads them.
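The priority order listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `charset_from_content_type` and `pick_encoding` are hypothetical, the `<meta>` and `charset` attribute values are assumed to have been extracted already, and real browsers apply more rules than this.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


def pick_encoding(http_content_type=None, meta_charset=None, link_charset=None):
    """Return the first declared encoding, following the HTML4 priority order."""
    candidates = (
        charset_from_content_type(http_content_type) if http_content_type else None,
        meta_charset.lower() if meta_charset else None,
        link_charset.lower() if link_charset else None,
    )
    for candidate in candidates:
        if candidate:
            return candidate
    return None  # nothing declared: the browser falls back to guessing
```

For example, `pick_encoding("text/html; charset=GB2312", "utf-8")` picks the HTTP header value, while `pick_encoding("text/html", "UTF-8")` falls through to the `<meta>` declaration.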
That much is uncontroversial. Note, however, that if the page declares its encoding through a <meta> tag, then when the browser reaches the tag and finds that the encoding it has been using does not match the declaration, it goes back to the beginning and re-parses the page. Part of the page is therefore parsed twice, so if you declare the encoding with a tag, write that tag as early as possible. A best practice is to place it immediately after the <head> tag and before any other tag. Google PageSpeed gives the same advice.
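One way to check this advice on an existing page is to look at where the declaration sits in the raw bytes. The following Python sketch is an assumption-laden illustration, not a browser algorithm: `declaration_report` is a hypothetical helper, and the regular expression is a deliberately loose match for a `<meta>` tag that mentions `charset`.

```python
import re

# Loose match: a <meta ...> tag whose attributes mention "charset".
META_CHARSET = re.compile(rb"<meta[^>]*charset", re.IGNORECASE)


def declaration_report(page: bytes):
    """Report where the <meta> declaration starts and whether it is early enough."""
    match = META_CHARSET.search(page)
    if match is None:
        return None  # no <meta> declaration found
    offset = match.start()
    # Index of the first byte outside the ASCII range, if any.
    first_non_ascii = next((i for i, b in enumerate(page) if b > 0x7F), None)
    return {
        "offset": offset,
        "before_non_ascii": first_non_ascii is None or offset < first_non_ascii,
    }
```

A small `offset` together with `before_non_ascii` being true suggests the browser can settle on the encoding before it has decoded anything that depends on it.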
Evolution over Time
But as time went by, developers gradually noticed something. Much as with the shortest form of the DOCTYPE, browsers do not strictly follow the standard when they read the encoding from a <meta> tag. Because the encoding of the page must be determined before the tokenizer stage of HTML parsing, the browser cannot analyze the structure of the <meta> tag the way it would once the DOM tree is built, extract the http-equiv and content attributes, and only then determine the encoding.
In reality, the browser does something very simple to read the encoding declared by a <meta> tag:
- Make sure this is a <meta> tag, according to the state machine of HTML parsing …
<meta charset="utf-8" />
…
...and many other odd ways of writing it. So, as history moved on, the browser vendors eventually sat down together to discuss the issue, and were surprised to find that their implementations were very similar (perhaps they had simply learned from one another). They decided to turn this behavior into a standard, and after a long discussion the widely loved HTML5 encoding declaration was born. HTML5 calls it the "meta charset element", and its simplest form is: <meta charset=utf-8>
Of course, this is HTML syntax. If you follow XHTML and find it more congenial, writing it as <meta charset="utf-8" /> is fine too.
The concrete algorithm for obtaining the encoding, described above, has also been written down in detail in the standard.
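As a rough illustration of this kind of byte-level scan, here is a minimal Python sketch. It is not the algorithm from the standard: the name `sniff_meta_charset` and both regular expressions are assumptions made for this example, and the real procedure handles comments, attribute parsing, and many edge cases that this ignores.

```python
import re

# A <meta ...> tag, matched on raw bytes before any decoding has happened.
META_TAG = re.compile(rb"<meta[\s/][^>]*>", re.IGNORECASE)
# The word "charset", an equals sign, an optional quote, then the value.
CHARSET = re.compile(rb"""charset\s*=\s*["']?\s*([^\s"'/>;]+)""", re.IGNORECASE)


def sniff_meta_charset(head: bytes):
    """Return the lower-cased charset named by the first matching <meta> tag."""
    for tag in META_TAG.finditer(head):
        found = CHARSET.search(tag.group(0))
        if found:
            return found.group(1).decode("ascii", "replace").lower()
    return None  # no declaration found in the scanned bytes
```

Because the scan only looks for the word `charset` inside a `<meta>` tag, it accepts `<meta charset=utf-8>`, `<meta charset="UTF-8" />`, and the older `http-equiv` form alike, which is the kind of leniency the text describes.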
In the HTML5 era, the standard once again revised and refined the ways of declaring an encoding. Overall, the differences are as follows:
Other Miscellany
Besides the basic ways of declaring an encoding, the standard contains quite a few details worth noting:
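If the encoding is declared with a <meta> tag, it must be a superset of ASCII. Put simply, a superset of ASCII is an encoding that supports the 128 ASCII characters, representing them with the same bytes as ASCII does.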
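HTML5 strongly recommends UTF-8.
The standard advises against character sets such as UTF-32, JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, and JOHAB, and forbids CESU-8, UTF-7, BOCU-1, and SCSU. In practice, however, browsers can recognize at least UTF-7.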
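Developers who want to follow XHTML strictly should specify the encoding with the XML declaration, that is, <?xml version="1.0" encoding="UTF-8"?>. Under IE6, however, this interferes with the DOCTYPE, so developers have to compromise on this point and fall back on the HTML way of declaring the encoding.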
As for the real-world priority of the various declaration methods, and some other details worth noting, this article is worth reading.
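The ASCII-superset requirement mentioned above can be approximated with a short check. This Python sketch is an assumption for illustration only: `is_ascii_superset` is a hypothetical helper, it relies on Python's own codec tables rather than on what any browser implements, and it only tests that the 128 ASCII bytes decode to the same characters.

```python
ASCII_BYTES = bytes(range(128))
ASCII_TEXT = ASCII_BYTES.decode("ascii")


def is_ascii_superset(encoding: str) -> bool:
    """True if the codec decodes every ASCII byte to the same ASCII character."""
    try:
        return ASCII_BYTES.decode(encoding) == ASCII_TEXT
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes, or an encoding name Python does not know.
        return False
```

With Python's codecs, `is_ascii_superset("utf-8")` holds while `is_ascii_superset("utf-16")` does not, which matches the reason UTF-16 cannot be declared reliably from inside the page itself.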
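Best Practices
Specify the encoding in the HTTP header whenever possible.
Use UTF-8 whenever possible, or at least use one encoding consistently for every resource on the site.
If you want to use UTF-16, add a BOM to the file so that it is clear whether it is little-endian or big-endian.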
If you specify the encoding with a <meta> tag, you need not use the http-equiv form, but place the tag as early as possible, and at the very least before any non-ASCII character.
For externally linked scripts, add the charset attribute if you cannot be sure their encoding is the same as the page's.
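The UTF-16 advice above can be illustrated with Python's standard `codecs` constants. This is a minimal sketch under assumptions: `utf16_byte_order` and `encode_utf16_with_bom` are hypothetical helper names, and the check looks only at the first two bytes.

```python
import codecs


def utf16_byte_order(data: bytes):
    """Return the byte order announced by a UTF-16 BOM, or None if there is none."""
    # Note: a UTF-32 little-endian BOM also starts with FF FE; this sketch
    # does not try to tell the two apart.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    return None


def encode_utf16_with_bom(text: str) -> bytes:
    """Encode text as little-endian UTF-16 with an explicit BOM in front."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

A file produced by `encode_utf16_with_bom` starts with the bytes FF FE, so a reader can determine the byte order without any other declaration.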