
About optional character encoding in Windows Notepad

Feb 18, 2021 pm 05:37 PM

This notepad tutorial column introduces the optional character encodings in Windows Notepad. I hope it helps those who need it.


A brief analysis of the optional character encodings in Windows Notepad

This article simply tests the behavior of Windows Notepad.


▲ The encodings offered by Windows Notepad are ANSI, Unicode, Unicode big endian and UTF-8.

WARNING

This article only explains technical facts about a widely used piece of software; it does not mean that the author supports or opposes using it.
In fact, the author recommends never using Windows Notepad to edit program code.
This article was verified only on one installation of the Simplified Chinese edition of 64-bit Windows 7 and is for reference only. There is no guarantee that the results can be reproduced on other systems, identical or not.

Note

This article strictly distinguishes between Unicode encoding and byte serialization.
Unicode encoding refers only to assigning each character its own number (usually written in hexadecimal). The range of these numbers is constrained only by the Unicode standard and has nothing to do with computers.
Byte serialization refers to representing a number within the Unicode range as N bytes so that it can be written to computer storage.

Test case

The test case is: "锟斤拷[line break]a[line break]". (锟斤拷 is an article of faith.)

The GBK and Unicode values of the Chinese characters are:

  • 锟 GBK=EFBF Unicode=U+951F
  • 斤 GBK=BDEF Unicode=U+65A4
  • 拷 GBK=BFBD Unicode=U+62F7

For the following ASCII characters, both the GBK and the Unicode values are identical to ASCII:

a=0x61 CR=0x0D LF=0x0A
(A line break in Windows occupies two characters: CR LF)
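
The values listed above can be cross-checked with a short sketch. It assumes that Python's "gbk" codec matches the GBK used by the system; the describe helper is a name invented for this example.

```python
TEST_CASE = "锟斤拷\r\na\r\n"


def describe(ch: str) -> str:
    """Return one table row: the character, its GBK bytes and its Unicode number."""
    gbk_hex = ch.encode("gbk").hex().upper()
    return f"{ch!r} GBK={gbk_hex} Unicode=U+{ord(ch):04X}"


# Print each distinct character of the test case once, in order of appearance.
for ch in dict.fromkeys(TEST_CASE):
    print(describe(ch))
```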

ANSI

On a Simplified Chinese system, ANSI means the GBK encoding defined by the national standards of the People's Republic of China.

When Windows Notepad stores this file as ANSI, the result is as follows:

EF BF  BD EF  BF BD  0D  0A  61  0D  0A
-----  -----  -----  --  --  --  --  --

All characters are simply stored in GBK. A byte whose highest bit is 0 is a single-byte character equivalent to ASCII; otherwise the character takes two bytes.
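
The single-byte/double-byte rule can be sketched as a tiny splitter. This is only an illustration of the rule as stated here: it treats every byte with the high bit set as the first byte of a two-byte character and does no validation, and split_ansi_gbk is a hypothetical helper name.

```python
from typing import List


def split_ansi_gbk(data: bytes) -> List[bytes]:
    """Split GBK bytes into characters using only the high-bit rule."""
    units = []
    i = 0
    while i < len(data):
        # High bit 0: one ASCII-compatible byte. High bit 1: a two-byte character.
        width = 1 if data[i] < 0x80 else 2
        units.append(data[i:i + width])
        i += width
    return units


data = "锟斤拷\r\na\r\n".encode("gbk")
print([unit.hex().upper() for unit in split_ansi_gbk(data)])
```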

Byte order (endianness) deserves attention here [Note A]. As you can see, the byte order here is big-endian.

But there is no need to say "big-endian GBK" specifically, because ever since GB2312 the standard has stipulated big-endian storage [Note B]. The later GBK and GB18030-2000 are backward compatible with it.

The trouble with ANSI is that it depends on the system: on systems in other languages ANSI is not GBK, so a file written there and opened as GBK will inevitably be garbled. Moreover, the GBK character set itself is too small.
(Never say "I only use Chinese": without Unicode's symbols you cannot even type the emoticons used on the Internet.)
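
The system dependence is easy to reproduce. The sketch below assumes Python's "gbk" codec for the Simplified Chinese ANSI code page and picks "cp1252" (the Western European Windows code page) as an example of a different ANSI; read_ansi is a made-up helper that stands in for "open the file with the system's ANSI code page".

```python
def read_ansi(data: bytes, system_code_page: str) -> str:
    """Decode raw file bytes the way a system with this ANSI code page would."""
    return data.decode(system_code_page)


saved = "锟斤拷".encode("gbk")                  # bytes written on a Simplified Chinese system
on_chinese_system = read_ansi(saved, "gbk")     # same code page: text comes back intact
on_western_system = read_ansi(saved, "cp1252")  # other code page: same bytes, garbled text

print(on_chinese_system)
print(on_western_system)
```

The bytes on disk never change; only the code page used to interpret them does, which is why an ANSI file carries no guarantee outside the system that wrote it.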

Unicode Series

The "Unicode", "Unicode big endian" and UTF-8 options in Windows Notepad are all different byte serializations of the same Unicode encoding.

UTF-16 and BOM

Here "Unicode" means UTF-16 [Note C]. UTF-16 is an extremely simple and crude serialization: most Unicode characters fall in the range U+0000 to U+FFFF [Note D], so each character takes two bytes, and the raw value of its Unicode number is written to disk.

Note that ASCII characters also have to waste twice the space to store a high byte of 0x00, because if the high zero byte were omitted, the parser would have no other basis for splitting the byte stream into characters.
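
A minimal sketch of this fixed-width parsing, assuming little-endian input and ignoring the 4-byte case described in Note D; utf16le_units is a hypothetical helper built on the standard struct module.

```python
import struct
from typing import List


def utf16le_units(data: bytes) -> List[int]:
    """Cut little-endian UTF-16 data into 16-bit values, two bytes at a time."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


data = "a拷".encode("utf-16-le")
print(" ".join(f"{b:02X}" for b in data))              # 'a' still occupies two bytes
print([f"U+{unit:04X}" for unit in utf16le_units(data)])
```

Because every unit has the same width, the parser needs no marker bits to find character boundaries; the price is the padding byte on every ASCII character.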

UTF-16 therefore has a byte order problem: it does not specify whether the high byte or the low byte comes first, and the encoded data itself carries no information indicating the byte order. You can hardly check by hand which interpretation is not garbled...

The solution Unicode provides is to serialize the zero-width no-break space character (U+FEFF ZERO WIDTH NO-BREAK SPACE) in UTF-16 and put it at the very beginning of the file. The UTF-16 parser then reads the first two bytes of the file: FE FF means big-endian, and FF FE means little-endian.

This inserted character is called the BOM (Byte Order Mark).
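
The detection step can be sketched as follows. This is a simplified illustration that only considers UTF-16 (it does not try to recognize other encodings whose first bytes might look similar); utf16_byte_order and decode_utf16_with_bom are names invented for the example, and the sample bytes are the Notepad "Unicode" dump shown later in this article.

```python
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return 'big' or 'little' according to the BOM, or None if there is none."""
    if data[:2] == b"\xFE\xFF":
        return "big"
    if data[:2] == b"\xFF\xFE":
        return "little"
    return None


def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 data, using the leading BOM to choose the byte order."""
    order = utf16_byte_order(data)
    if order is None:
        raise ValueError("no UTF-16 BOM found; byte order is unknown")
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return data[2:].decode(codec)  # drop the 2-byte BOM before decoding


sample = bytes.fromhex("FF FE 1F 95 A4 65 F7 62 0D 00 0A 00 61 00 0D 00 0A 00")
print(utf16_byte_order(sample))
print(repr(decode_utf16_with_bom(sample)))
```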

It is worth mentioning that the zero-width no-break space is also often used as a valid character to get around minimum-length limits in various places, including SegmentFault's Q&A and comments.

"Unicode" and "Unicode big endian" in Notepad

Writing "Unicode" alone does not fully describe a storage method at all, because it names only the encoding and not the byte serialization.

I am not at all surprised that M$ makes this kind of mistake. Just memorize the conclusion: "Unicode" in Windows Notepad means UTF-16.

With "Unicode" = little-endian UTF-16, Windows Notepad stores this file as follows:

 FF FE 1F 95 A4 65 F7 62 0D 00 0A 00 61 00 0D 00 0A 00
 -BOM- ----- ----- ----- ----- ----- ----- ----- -----
U+FEFF  951F  65A4  62F7  000D  000A  0061  000D  000A

With "Unicode big endian" = big-endian UTF-16, Windows Notepad stores this file as follows:

 FE FF 95 1F 65 A4 62 F7 00 0D 00 0A 00 61 00 0D 00 0A
 -BOM- ----- ----- ----- ----- ----- ----- ----- -----
U+FEFF  951F  65A4  62F7  000D  000A  0061  000D  000A

UTF-8

UTF-8 is a variable-length byte serialization that represents one Unicode character in 1 to 4 bytes. For the implementation details, see this article. The advantages of UTF-8 are:

  1. Both in the IETF's recommendations and in actual industry practice, UTF-8 is the standard of the Internet.
  2. It is backward compatible: ASCII characters are unchanged by UTF-8 serialization, and any ASCII file is also a valid UTF-8 file.
  3. There is no byte order problem. The byte order of UTF-8 is fixed by RFC 3629.

When Windows Notepad stores this file as UTF-8, the result is as follows:

 EF BB BF  E9 94 9F  E6 96 A4  E6 8B B7  0D   0A   61   0D   0A
 --BOM---  --------  --------  --------  --   --   --   --   --
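U+ FEFF      951F      65A4      62F7   000D 000A 0061 000D 000A

Note that the result of serializing U+FEFF in UTF-8, EF BB BF, is still inserted at the front as the BOM (byte order mark) mentioned earlier. The UTF-8 that Windows Notepad stores is UTF-8 with a BOM.

But for UTF-8 alone, byte order is meaningless. Because the byte order of UTF-8 is fixed by the specification, encoding U+FEFF always yields EF BB BF and nothing else. With no ambiguity, the BOM loses its original purpose. Perhaps its only remaining use is to tell UTF-8 files from UTF-16 files...

How to treat the BOM in UTF-8 files is specified in detail in Section 6 of RFC 3629 and is not repeated here.

It is worth mentioning that many PHP programmers have, I think, run into the BOM and hate it to the bone: PHP does not recognize the BOM at the start of a file and sends it out as part of the HTTP response body. Without output buffering, this even causes functions such as header(), which must run before the response begins, to fail outright.

So PHP programmers always prefer UTF-8 without BOM. This basically means that for PHP development on Windows, Windows Notepad is completely out of the running, even for a quick temporary change to a line or two of code.

Extra: character encoding tests in Notepad++

ANSI is no different, but Notepad++ lets you choose among the ANSI code pages of many languages (much like choosing an encoding in a browser), so it can easily create or read files in other character sets such as Shift-JIS. This is handy for dealing with documents such as the README of old Japanese games.

UCS-2 Big Endian and UCS-2 Little Endian match the two UTF-16 examples above. Note that there is no "without BOM" option for UTF-16 files (offering one would cause trouble).

UTF-8 still means "UTF-8 with a BOM". But Notepad++ also offers the PHP programmer's favorite, UTF-8 without BOM, like this:

 E9 94 9F  E6 96 A4  E6 8B B7  0D   0A   61   0D   0A
 --------  --------  --------  --   --   --   --   --
U+ 951F      65A4      62F7   000D 000A 0061 000D 000A

Simple and clean.

Notes
[Note A] A two-byte (or multi-byte) number is always split into 8-bit bytes before being written to disk. Whether the lowest 8 bits or the highest 8 bits are written first is the so-called byte order (endianness) question. For example, when the number 0x01020304 is written to disk, is it written lowest byte first as 04 03 02 01, or highest byte first as 01 02 03 04?
Writing the low byte first is called little-endian; writing the high byte first is called big-endian. Which byte order is actually used depends on the system environment, the relevant standards and how the software is written, so no general rule applies.
[Note B] If I am not mistaken, the byte order is determined by the EUC character encoding method that GB2312 uses.
[Note C] This article does not strictly distinguish between UTF-16 and UCS-2.
[Note D] The maximum Unicode value actually reaches U+10FFFF, beyond what two bytes can hold.
However, for historical reasons Unicode leaves the range U+D800 to U+DFFF permanently reserved and unused.
So for characters at U+10000 and above, UTF-16 makes use of this gap: it breaks the 2-byte, 16-bit convention and represents these characters with 4 bytes (32 bits).
These characters, whose values are too large, are outside the GBK character set, so this article ignores them completely. They may be tested further if the opportunity arises.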
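
Putting the four Notepad dumps in this article together, the observed behavior can be summarized in a short sketch. It is a model of the test results only, not of how Notepad is implemented: it assumes Python's "gbk" codec for ANSI on a Simplified Chinese system, and NOTEPAD_OPTIONS, notepad_bytes and hexdump are names invented for the illustration.

```python
BOM = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE

# Option name in Notepad -> (Python codec, whether a BOM is written first)
NOTEPAD_OPTIONS = {
    "ANSI": ("gbk", False),  # assumption: Simplified Chinese system
    "Unicode": ("utf-16-le", True),
    "Unicode big endian": ("utf-16-be", True),
    "UTF-8": ("utf-8", True),
}


def notepad_bytes(text: str, option: str) -> bytes:
    """Return the bytes observed in this article for the given save option."""
    codec, has_bom = NOTEPAD_OPTIONS[option]
    return ((BOM if has_bom else "") + text).encode(codec)


def hexdump(data: bytes) -> str:
    """Format bytes as spaced uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


text = "锟斤拷\r\na\r\n"
for option in NOTEPAD_OPTIONS:
    print(f"{option:20}", hexdump(notepad_bytes(text, option)))
```

When reading such files back in Python, the "utf-8-sig" codec removes a leading UTF-8 BOM during decoding, which avoids the stray U+FEFF that plain "utf-8" would leave at the start of the text.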


Statement
This article is reproduced from SegmentFault. If there is any infringement, please contact admin@php.cn to have it removed.