Garbled text - how are C++ strings encoded?

Question

迷茫 · Answer

This question is hard to explain briefly. For background, see the article New Options for Managing Character Sets in the Microsoft C/C++ Compiler. Whether a string constant in a source file displays correctly in the console window depends on the following four factors:

The character set used when the source file (.cpp) is saved (C1 below)
The internal character set (source character set) the compiler uses when reading the source file (C2 below)
The execution character set the compiler uses during compilation (C3 below)
The character set used by the console window that runs the executable (C4 below)

First, the character set used when saving the source file (C1) determines which bytes the abstract characters in the source file become when written to disk. "Abstract characters" here means characters a human can recognize (such as "张"). The bytes are mapped from the abstract characters according to the rules of the character set, and different character sets give different results (for example, "张" maps to the three bytes E5 BC A0 under UTF-8, but to the two bytes D5 C5 under GBK). Take the following line of code:

const char *s = "张三";

The string constant "张三" is stored as different bytes depending on the character set (C1) used to save the file.
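As a rough illustration of the character-to-bytes mapping, the sketch below encodes a Unicode code point as UTF-8 by hand. The helper names encode_utf8 and to_hex are made up for this example and are not from the referenced articles. The GBK mapping is table-driven, so it is not reproduced here; D5 C5 is simply the value quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render bytes as space-separated upper-case hex.
std::string to_hex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned char>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

The character "张" is U+5F20, so to_hex(encode_utf8(0x5F20)) yields "E5 BC A0", matching the UTF-8 bytes given above.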

Second, the internal character set the compiler uses when reading the source file (C2) determines how the compiler converts the byte stream it reads. "Conversion" here means re-encoding a byte stream from one encoding to another. For example, the abstract character "张" is the byte stream D5 C5 in GBK; converting it to UTF-8 turns that byte stream into E5 BC A0.

This character set (C2) is chosen by the compiler and is not specified by the standard. Different compilers, and different versions of the same compiler, may use different internal character sets. In some versions of Visual C++, the internal character set is UTF-8. The compiler tries to determine the character set of the source file (C1) and converts the file to the internal character set (C2). If the compiler cannot determine the file's character set, it assumes C1 is the current operating system's default code page. (A wrong guess here leads to garbled characters or compilation errors.)

For example, if the source file is UTF-8 with a BOM, Visual C++ recognizes it correctly, and the byte stream E5 BC A0 for the abstract character "张" is converted to E5 BC A0 (unchanged). If the source file is UTF-8 without a BOM, Visual C++ does not recognize it. The compiler converts using the current code page (CP936 in this example), so the byte stream E5 BC A0 for "张" is treated as GBK and incorrectly converted into a different byte stream, E5 AF AE ... (寮).
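The BOM check described here can be sketched roughly as follows. This is a simplified model for illustration only, not Visual C++'s actual detection logic (a real compiler also handles other signatures such as UTF-16). The names has_utf8_bom and guess_source_charset are hypothetical.

```cpp
#include <string>

// Returns true if the bytes start with the UTF-8 byte order mark EF BB BF.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Simplified model of the decision in the text: with a BOM the file is
// taken as UTF-8, otherwise the default code page is assumed for C1.
std::string guess_source_charset(const std::string& file_bytes) {
    return has_utf8_bom(file_bytes) ? "UTF-8" : "default code page";
}
```

Under this model, a file containing only the bytes E5 BC A0 falls back to the default code page, which is the case that goes wrong on a CP936 system.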

Third, the execution character set the compiler uses during compilation (C3) determines how the compiler converts character and string constants, which are held in the internal character set (C2), into the encoding stored in the compiled program. This character set (C3) is also chosen by the compiler and not specified by the standard. C++ has several types of character and string constants, and each type corresponds to a different C3.

For Visual C++, the article String and Character Literals and the blog post mentioned above let you deduce the C3 for each type of character/string constant:

// Character literals
auto c0 =   'A'; // char, encoded as default code page
auto c1 = u8'A'; // char, encoded as UTF-8
auto c2 =  L'A'; // wchar_t, encoded as UTF-16LE
auto c3 =  u'A'; // char16_t, encoded as UTF-16LE
auto c4 =  U'A'; // char32_t, encoded as UTF-32LE

// String literals
auto s0 =   "hello"; // const char*, encoded as default code page
auto s1 = u8"hello"; // const char*, encoded as UTF-8
auto s2 =  L"hello"; // const wchar_t*, encoded as UTF-16LE
auto s3 =  u"hello"; // const char16_t*, encoded as UTF-16LE
auto s4 =  U"hello"; // const char32_t*, encoded as UTF-32LE

The compiler converts each string constant from C2 to C3 according to its type. (If the earlier guess about C1 was wrong, the error carries through this step.)
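The effect of the prefix on the stored encoding can be checked at compile time. The sketch below uses the universal-character-name \u5F20 for "张", so the result does not depend on how the source file itself is saved. The helper code_units is hypothetical. Only the prefixed literals are checked, because the length of an unprefixed literal depends on the implementation's execution character set.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+5F20 takes three UTF-8 code units, one UTF-16 unit, one UTF-32 unit.
constexpr std::size_t utf8_units  = code_units(u8"\u5F20");
constexpr std::size_t utf16_units = code_units(u"\u5F20");
constexpr std::size_t utf32_units = code_units(U"\u5F20");

// First stored byte of the UTF-8 literal (E5 of E5 BC A0).
constexpr unsigned char utf8_first_byte =
    static_cast<unsigned char>(u8"\u5F20"[0]);

static_assert(utf8_units == 3, "UTF-8 uses three bytes for U+5F20");
static_assert(utf16_units == 1, "U+5F20 is in the BMP");
static_assert(utf32_units == 1, "one code point, one UTF-32 unit");
static_assert(utf8_first_byte == 0xE5, "UTF-8 lead byte for U+5F20");
```

The template also accepts the char8_t-based u8 literals of C++20, so the checks hold there as well.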

For example, with auto s1 = "张";, the byte stream E5 BC A0 that represents "张" in C2 (UTF-8) is converted into its C3 (CP936, GBK) byte stream, D5 C5.

With auto s2 = u8"张";, the byte stream E5 BC A0 that represents "张" in C2 (UTF-8) is converted into its C3 (UTF-8) byte stream, E5 BC A0 (unchanged).

Fourth, the character set used by the console window running the executable (C4) determines how the byte stream in the compiled program is mapped back to abstract characters and displayed in the console. For example, the byte stream of s1 from the previous step is mapped back to the abstract character "张" through C4 (CP936), which looks correct to us. The byte stream of s2 from the previous step is mapped to the abstract character "寮" through C4 (CP936), which looks garbled to us.
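To see why a CP936 console misreads UTF-8 bytes, the sketch below splits a byte stream the way a GBK decoder would group it: a lead byte in 0x81-0xFE followed by a trail byte in 0x40-0xFE (excluding 0x7F) forms one two-byte character. It only groups bytes and does not look up the resulting characters, since that needs the full GBK table. The function name split_as_gbk is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace {
// GBK byte ranges for two-byte characters.
bool is_gbk_lead(unsigned char b)  { return b >= 0x81 && b <= 0xFE; }
bool is_gbk_trail(unsigned char b) { return b >= 0x40 && b <= 0xFE && b != 0x7F; }
}  // namespace

// Split bytes into the units a GBK decoder would see: lead+trail pairs,
// or single bytes (ASCII, or a stray byte with no valid partner).
std::vector<std::string> split_as_gbk(const std::string& bytes) {
    std::vector<std::string> units;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (is_gbk_lead(b) && i + 1 < bytes.size() &&
            is_gbk_trail(static_cast<unsigned char>(bytes[i + 1]))) {
            units.push_back(bytes.substr(i, 2));  // one two-byte character
            i += 2;
        } else {
            units.push_back(bytes.substr(i, 1));  // single or stray byte
            i += 1;
        }
    }
    return units;
}
```

For the GBK bytes D5 C5 this gives one unit, the character "张". For the UTF-8 bytes E5 BC A0 it gives the pair E5 BC, which CP936 displays as "寮", plus a stray A0 byte.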

That is my understanding of character and string encoding in C++. Please point out any mistakes :-)

The asker can try saving the following code in Visual C++ as CP936, UTF-8 with BOM, and UTF-8 without BOM, and compare the output in each case.

#include <cstring>   // strlen
#include <fstream>   // ofstream
#include <iostream>  // cout, cin
using namespace std;

int main() {
  // String literals are const; before C++20 a u8 literal has type const char[].
  const char *s1 = u8"张";
  const char *s2 = "张";
  cout << "s1 " << sizeof(s1) << " " << strlen(s1) << " -> " << s1 << endl;  // Error in console
  cout << "s2 " << sizeof(s2) << " " << strlen(s2) << " -> " << s2 << endl;  // OK in console

  ofstream os("s1.txt");
  if (os.is_open()) {
    os << "s1 " << sizeof(s1) << " " << strlen(s1) << " -> " << s1 << endl;  
    os.close();
  }
  ofstream os2("s2.txt");
  if (os2.is_open()) {
    os2 << "s2 " << sizeof(s2) << " " << strlen(s2) << " -> " << s2 << endl;  
    os2.close();
  }
  ofstream os3("s3.txt");
  if (os3.is_open()) {
    os3 << "s1 " << sizeof(s1) << " " << strlen(s1) << " -> " << s1 << endl;
    os3 << "s2 " << sizeof(s2) << " " << strlen(s2) << " -> " << s2 << endl;
    os3.close();
  }

  cin.get();
  return 0;
}

Of the three files the program writes, a typical text editor can guess the encoding of the first two, s1.txt and s2.txt, and display them correctly. The third file, s3.txt, displays partially or completely garbled, because it contains both a UTF-8 byte stream and a GBK byte stream, so the editor cannot tell which encoding to use to map the bytes back to text.
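One way to see the problem with s3.txt is a structural UTF-8 check, sketched below. It only verifies lead and continuation byte patterns and does not reject overlong forms or surrogates, so it is not a full validator. Some GBK sequences can also pass by coincidence, which is why editors can only guess. The function name is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Returns true if every byte fits the UTF-8 lead/continuation pattern.
bool is_well_formed_utf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80) {
            extra = 0;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (extra > bytes.size() - i - 1) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a 10xxxxxx byte
        }
        i += extra + 1;
    }
    return true;
}
```

The UTF-8 bytes E5 BC A0 pass this check and the GBK bytes D5 C5 fail it. A file that contains both, like s3.txt, therefore fails as UTF-8, while read as GBK its UTF-8 part turns into the wrong characters.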

Character Encoding Model

Standard citations and reference documents

From ISO C++11 § 2.3/1
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

From ISO C++11 § 2.3/2
The universal-character-name construct provides a way to name other characters.
hex-quad:
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad
The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. ...

From ISO C++11 § 2.3/3
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits.
...
The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

From ISO C++11 § 2.2/1
The precedence among the syntax rules of translation is specified by the following phases.

1. Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. ... Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. ...

2. ...

3. ...

4. ...

5. Each source character set member in a character literal or a string literal, as well as each escape sequence and universal-character-name in a character literal or a non-raw string literal, is converted to the corresponding member of the execution character set (2.14.3, 2.14.5); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.

...

From ISO C++11 § 2.14.3/5
A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation-defined encoding. [ Note: In translation phase 1, a universal-character-name is introduced whenever an actual extended character is encountered in the source text. Therefore, all extended characters are described in terms of universal-character-names. However, the actual compiler implementation may use its own native character set, so long as the same results are obtained. — end note ]

From ISO C++11 § 2.14.5/6, 7, 15
6 After translation phase 6, a string literal that does not begin with an encoding-prefix is an ordinary string literal, and is initialized with the given characters.
7 A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
...
15 Escape sequences and universal-character-names in non-raw string literals have the same meaning as in character literals (2.14.3), except that the single quote ' is representable either by itself or by the escape sequence \', and the double quote " shall be preceded by a \. In a narrow string literal, a universal-character-name may map to more than one char element due to multibyte encoding. ...

Character Sets
String and Character Literals
New Options for Managing Character Sets in the Microsoft C/C++ Compiler
