In JS, given var name = "张三", what does name.length equal under UTF-8 and under GB2312 respectively? And why???
伊谢尔伦 2017-04-17 11:22:09
This can't really be settled in a few comments, so I'll write it up as an answer.
First of all, the internal representation of String in JavaScript is always UTF-16, and length is always counted in UTF-16 code units. In short, length returns the number of code units (which, for common characters like 张三, equals the number of characters), never the byte size!
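A quick console sketch of that distinction (TextEncoder is a standard browser API; it always encodes to UTF-8):

var name = "张三";
console.log(name.length); // 2: UTF-16 code units, independent of how the file is saved
console.log(new TextEncoder().encode(name).length); // 6: the UTF-8 byte size, a different thing entirely
// (in GB2312 the same two characters would occupy 4 bytes; length never sees bytes)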
So why did someone test the very same string "张三" and get 2 when the file was saved as UTF-8, but 3 when it was saved as GBK? Because the browser failed to identify the file's encoding correctly and decoded it with the wrong one. The 3 is simply the output of a program gone wrong.
There are quite a few concepts tangled up here. Let's first discuss JS written inline in an HTML page, and then the case where src loads external JS.
Suppose the encoding of the test.html file itself is GBK. Then when the browser loads test.html, how does it know its encoding?
There are roughly the following ways, tried in order:

1. The HTTP header Content-Type: text/html;charset=gbk. In PHP, for example, you can emit it with header("Content-Type: text/html;charset=gbk").

2. A tag in the HTML head: <meta http-equiv="Content-Type" content="text/html;charset=gbk"/> or <meta charset="GBK"/>. If found, the browser parses the HTML with the encoding declared here. Note that any Chinese characters appearing before this tag may already have been parsed into mojibake:
<head>
<title>可能显示成乱码</title><!-- "may render as mojibake" -->
<meta charset="GBK"/><!-- too late by this point -->
</head>

3. If no charset is found in either of the above, a BOM can still settle it. If there is no BOM either, the browser can only guess. How exactly it guesses doesn't matter here. Why would we need to know? Surely we don't write code that relies on the browser winning a guessing game.
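Incidentally, you can ask the browser what encoding it finally settled on (document.characterSet is the standard property; the older document.charset is a legacy alias):

// run in the console on any page
console.log(document.characterSet); // e.g. "UTF-8" or "GBK"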
And what if the browser ultimately fails to determine the HTML file's encoding correctly? Mojibake! For text in the HTML you can spot the garbage at a glance, but inside JS it isn't so visible, like this:
a="两个"
alert(a+"\n"+a.length); //当a.length不为2的时候,前面a肯定显示成乱码
And don't ask why this case outputs 3. Stop asking!! Wrong input, wrong output! Even an output of 2 next to garbled text would still count as wrong, and it isn't worth the space to dissect exactly which garbage came out.
OK, want to manufacture an output of 3? Very simple: save the following code to test.html using UTF-8 encoding:
<meta charset="GBK" />
<script>
a="两个"
alert(a+"\n"+a.length); //当a.length不为2的时候,前面a肯定显示成乱码
</script>
Now the browser trusts the GBK you declared in meta charset, but the file is actually UTF-8, so mojibake it is! "两个" takes 6 bytes in UTF-8, and GBK reads those 6 bytes as 3 two-byte characters, which is exactly where the length of 3 comes from.
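You can reproduce the misreading directly in the console (a small sketch; the "gbk" label for TextDecoder comes from the WHATWG Encoding standard and is supported in modern browsers):

// "两个" encoded as UTF-8 gives 6 bytes
var bytes = new TextEncoder().encode("两个"); // Uint8Array of length 6
// now decode those same 6 bytes as GBK, exactly as a confused browser would
var misread = new TextDecoder("gbk").decode(bytes);
console.log(misread, misread.length); // three mojibake characters, length 3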
Okay, let’s talk about the case where src refers to external JS.
Assume the external file test.js is encoded as UTF-8 while test.html is encoded as GBK, and the JS is referenced like this: <script src="test.js" charset="UTF-8"></script>. So how does the browser determine the encoding of the JS file? Along a similar chain:

1. The Content-Type: text/javascript;charset=utf-8 HTTP header returned with test.js. (Of course, over the file:// protocol there is no such header.)

2. The charset attribute on the script tag, as in the snippet above.

3. Failing both, the encoding of the HTML page itself.

So if the HTML's encoding differs from the external JS's, and neither a <script charset="XXX"> attribute nor an HTTP header is specified, the JS is decoded into mojibake and you get the same effect as before.
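In other words, the earlier mismatch is fixed by a single attribute. A minimal sketch, assuming test.js is served without any charset in its HTTP header:

<!-- test.html, saved and declared as GBK -->
<meta charset="GBK" />
<!-- test.js is saved as UTF-8; without charset="UTF-8" the browser would fall back
     to the page's GBK and every string literal inside would turn to mojibake -->
<script src="test.js" charset="UTF-8"></script>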
Ugh, I'm really rambling now. Plenty of articles have discussed this already; just read them!
For example these two (fair warning, they're long): http://ued.taobao.org/blog/2011/08/encode-war/ and http://tgideas.qq.com/webplat/info/news_version3/804/808/811/m579/201307/218730.shtml. And what the W3C says about script charset: http://www.w3.org/TR/html5/scripting-1.html#attr-script-charset
"Too long, didn't read" is not a good habit!!!
Finally, the best practice is of course: use UTF-8 encoding everywhere, and explicitly declare the charset for external JS.
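A minimal skeleton of that best practice (file names are placeholders; both files saved as UTF-8):

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" /><!-- declare it early, before any non-ASCII text -->
<script src="test.js" charset="UTF-8"></script><!-- matches the file's real encoding -->
</head>
<body></body>
</html>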
阿神 2017-04-17 11:22:09
To understand this problem, let's first go back to the definition of the String.length property. Look it up on MDN: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length
This property returns the number of code units in the string. UTF-16, the string format used by JavaScript, uses a single 16-bit code unit to represent the most common characters, but needs to use two code units for less commonly-used characters, so it's possible for the value returned by length to not match the actual number of characters in the string.
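That last caveat is easy to see in the console:

// "𠮷" (U+20BB7) lies outside the BMP, so UTF-16 stores it as a surrogate pair
var s = "𠮷";
console.log(s.length);      // 2: two UTF-16 code units
console.log([...s].length); // 1: one actual character (code point)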
MDN describes it clearly: JS measures the length of a string in UTF-16 code units. If the browser detects your file's encoding correctly (you can check what it decided via document.charset), a UTF-8 file gives the expected result. But if your file is saved as gb2312, JavaScript has no way to know that; if the browser falls back to the wrong encoding, unexpected results appear. The fix is to declare the encoding correctly (many thanks to @Jex for the detailed explanation of this part):
<meta charset="utf-8" />