Example tutorial on correctly reading Chinese encoded files in .NET (C#)

Y2J

Apr 24, 2017 pm 04:56 PM

If you are not familiar with character encodings or the byte order mark (BOM), it is worth reading this article first: .NET (C#): Character encoding (Encoding) and byte order mark (BOM).
Chinese text encodings fall broadly into two categories:
1. ANSI extended encodings, such as GB2312, GBK, and GB18030. Files in these encodings normally have no BOM. The newer standards, GBK and GB18030, are backward compatible with GB2312.
2. Unicode encodings, such as UTF-8, UTF-16, and UTF-32. Files in these encodings may or may not have a BOM.
Some Unicode encodings also have a byte order (endianness), either little-endian or big-endian, and each byte order has its own BOM. UTF-16 is one example; UTF-8 has no byte order issue.
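As a quick illustration of the point about byte order and BOMs, the following sketch prints the BOM bytes that .NET associates with each Unicode encoding. `Encoding.GetPreamble()` is the framework's name for the BOM; the class name `BomDemo` and the helper `BomHex` are made up for this example.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Returns the BOM (preamble) of an encoding as hex, e.g. "EF-BB-BF".
    // An encoding configured without a BOM yields an empty string.
    public static string BomHex(Encoding enc)
    {
        return BitConverter.ToString(enc.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine("UTF-8:     " + BomHex(new UTF8Encoding(true)));           // EF-BB-BF
        Console.WriteLine("UTF-16 LE: " + BomHex(new UnicodeEncoding(false, true))); // FF-FE
        Console.WriteLine("UTF-16 BE: " + BomHex(new UnicodeEncoding(true, true)));  // FE-FF
        Console.WriteLine("UTF-32 LE: " + BomHex(new UTF32Encoding(false, true)));   // FF-FE-00-00
        Console.WriteLine("UTF-32 BE: " + BomHex(new UTF32Encoding(true, true)));    // 00-00-FE-FF
    }
}
```

Note that the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, so any hand-written check has to test the longer pattern first.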

With the basics covered, let us return to the topic: how to open Chinese text files correctly. The first thing to establish is whether your Unicode-encoded files contain a BOM.

If they always contain a BOM, everything is simple. A BOM identifies the exact encoding, and a file without a BOM is then known not to be Unicode, so it can be opened with the system's default ANSI extended Chinese encoding.
If Unicode files may lack a BOM (and you obviously cannot guarantee that every Unicode file users give you has one), you have to work out from the raw bytes whether the file is GBK, UTF-8, or something else. This requires an encoding detection algorithm (search for "charset detection" or "encoding detection"). Such algorithms are not 100% accurate, which is why Windows Notepad has the "Bush hid the facts" bug and why Chrome sometimes shows garbled pages. In my experience, Notepad++ detects encodings quite accurately.
There are many encoding detection implementations, for example this project: https://code.google.com/p/ude
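To give a feel for what BOM-less detection involves, here is a minimal sketch of one common heuristic: check whether the bytes are strictly valid UTF-8 and, if not, assume an ANSI encoding such as GBK. This is not the algorithm used by the ude project or by Notepad++; the class `Utf8Sniffer` and the method `LooksLikeUtf8` are hypothetical names for illustration only.

```csharp
using System;
using System.Text;

public static class Utf8Sniffer
{
    // Hypothetical helper: returns true if the bytes decode as strictly valid UTF-8.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        // Second argument (throwOnInvalidBytes: true) makes the decoder throw
        // on malformed input instead of silently substituting U+FFFD.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("中文");
        byte[] gbk = { 0xD6, 0xD0, 0xCE, 0xC4 }; // "中文" encoded as GB2312/GBK

        Console.WriteLine(LooksLikeUtf8(utf8)); // True
        Console.WriteLine(LooksLikeUtf8(gbk));  // False
    }
}
```

The heuristic has the same weakness as any detector: a short GBK byte sequence can happen to be valid UTF-8, and pure ASCII passes the test under either encoding, so the result is a guess rather than a guarantee.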

If your Unicode files always have a BOM, no third-party library is needed. There are still a few points to explain, however.

The problem is that the text-reading APIs in .NET (the File class and StreamReader) decode as UTF-8 by default, so a GBK text file opened without specifying an encoding is bound to come out garbled.
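The effect is easy to reproduce. The sketch below writes the GBK bytes for "中文" to a temporary file and reads it back with `File.ReadAllText` without an encoding argument. The class and method names are invented for this example, and the raw bytes are written directly so that no GBK encoder is needed.

```csharp
using System;
using System.IO;

public static class DefaultUtf8Demo
{
    // Writes the bytes to a temporary file and reads them back with the
    // default settings of File.ReadAllText (UTF-8 when no BOM is present).
    public static string ReadWithDefaults(byte[] fileBytes)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, fileBytes);
            return File.ReadAllText(path);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        byte[] gbk = { 0xD6, 0xD0, 0xCE, 0xC4 }; // "中文" encoded as GB2312/GBK
        string text = ReadWithDefaults(gbk);

        // The invalid UTF-8 sequences are replaced by U+FFFD, so the text is lost.
        Console.WriteLine(text.Contains("\uFFFD")); // True
    }
}
```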

The most straightforward solution is to read the text with the system's default ANSI extended encoding, that is, the system's default non-Unicode encoding. Reference code:

    // Print the system default non-Unicode encoding
    Console.WriteLine(Encoding.Default.EncodingName);
    // Open the file with the system default non-Unicode encoding
    // (verbatim string, so that \t in the path is not treated as a tab)
    var fileContent = File.ReadAllText(@"C:\test.txt", Encoding.Default);

On Simplified Chinese Windows, this should output:

Simplified Chinese (GB2312)...

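This approach is not limited to Simplified Chinese, because Encoding.Default follows the ANSI code page configured on the system. Note that this holds on .NET Framework; on .NET Core and .NET 5 or later, Encoding.Default always returns UTF-8.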
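On runtimes where `Encoding.Default` is UTF-8 (.NET Core and .NET 5 or later), one possible way to get a comparable "system ANSI encoding" is to look up the ANSI code page of a culture. The sketch below assumes .NET Core 3.0 or later, where `CodePagesEncodingProvider` is available and has to be registered before code pages such as 936 (GBK) can be used; the class `AnsiEncoding` and the method `ForCulture` are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AnsiEncoding
{
    // Hypothetical helper: returns the encoding for the ANSI code page
    // associated with the given culture.
    public static Encoding ForCulture(CultureInfo culture)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+. Legacy code pages are not
        // available there until this provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(culture.TextInfo.ANSICodePage);
    }

    public static void Main()
    {
        Encoding current = ForCulture(CultureInfo.CurrentCulture);
        Console.WriteLine(current.EncodingName + " (code page " + current.CodePage + ")");
    }
}
```

The code page reported for a culture depends on the operating system and globalization settings, so treat the result as a reasonable default rather than a fact about any particular file.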

Of course, you can also specify an encoding such as GBK explicitly. But if you open a Unicode file with GBK specified, will it still be read correctly? The answer is yes. By default, .NET detects the BOM when opening a file and uses the encoding the BOM indicates. If there is no BOM, the file is read with the encoding the caller specified, and if none was specified, UTF-8 is used.

This automatic BOM detection can be controlled through the detectEncodingFromByteOrderMarks parameter of the StreamReader constructor.

It cannot be controlled through the corresponding File methods (for example, File.ReadAllText).
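The encoding a reader actually settled on can be inspected through `StreamReader.CurrentEncoding` after the first read. The following sketch decodes UTF-16 bytes that start with a BOM, once with detection enabled and once with it disabled. `Encoding.ASCII` is used here only as a stand-in for "some non-Unicode encoding" so that the example needs no code page registration; the class and method names are made up.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Decodes the bytes with the given fallback encoding and reports
    // which encoding the reader ended up using.
    public static string Decode(byte[] data, Encoding fallback, bool detectBom, out Encoding used)
    {
        using (var reader = new StreamReader(new MemoryStream(data), fallback, detectBom))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            used = reader.CurrentEncoding;
            return text;
        }
    }

    public static void Main()
    {
        // UTF-16 little-endian with BOM: FF FE, then 'a' (U+0061) and '刘' (U+5218)
        byte[] data = { 0xFF, 0xFE, 0x61, 0x00, 0x18, 0x52 };
        Encoding used;

        string text = Decode(data, Encoding.ASCII, true, out used);
        Console.WriteLine(text + " via " + used.WebName); // a刘 via utf-16

        Decode(data, Encoding.ASCII, false, out used);
        Console.WriteLine(used.WebName);                  // us-ascii
    }
}
```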

For example, the following code uses:

GB2312 encoding, with BOM detection, to read GB2312 text

GB2312 encoding, with BOM detection, to read Unicode text

GB2312 encoding, without BOM detection, to read Unicode text

    static void Main()
    {
        var gb2312 = Encoding.GetEncoding("GB2312");
        // GB2312 encoding, BOM detection on, reading GB2312 text
        ReadFile("gbk.txt", gb2312, true);
        // GB2312 encoding, BOM detection on, reading Unicode text
        ReadFile("unicode.txt", gb2312, true);
        // GB2312 encoding, BOM detection off, reading Unicode text
        ReadFile("unicode.txt", gb2312, false);
    }

    // Read a text file through StreamReader and print its content
    static void ReadFile(string path, Encoding enc, bool detectEncodingFromByteOrderMarks)
    {
        using (var sr = new StreamReader(path, enc, detectEncodingFromByteOrderMarks))
        {
            Console.WriteLine(sr.ReadToEnd());
        }
    }

Output:

a刘
a刘
???

The third line is garbled.

As shown above, opening a Unicode file with GB2312 specified also succeeds. Because BOM detection is enabled, .NET finds the BOM, recognizes the file as Unicode, and decodes it accordingly. If there is no BOM, the specified encoding is used. GB2312 text has no BOM, so GB2312 must be specified explicitly; otherwise .NET falls back to UTF-8 and the text cannot be read correctly. The third line is garbled because BOM detection is disabled: .NET decodes a Unicode file that has a BOM directly as GB2312, which cannot work.

You can also check for the BOM yourself and, if there is none, open the text with a default encoding of your choice. I covered this in an earlier article (.NET(C#): Detect the encoding from the file).
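To reproduce the three results, the two input files have to exist first. The sketch below creates gbk.txt and unicode.txt with the content "a刘", which is inferred from the output shown earlier. It assumes .NET Core 3.0 or later, where GB2312 is only available after `CodePagesEncodingProvider` has been registered; on .NET Framework the registration line is not needed and can be removed. The class `SampleFiles` is a made-up name.

```csharp
using System;
using System.IO;
using System.Text;

public static class SampleFiles
{
    // Writes gbk.txt (GB2312, no BOM) and unicode.txt (UTF-16 LE with BOM)
    // into the given directory.
    public static void Create(string dir)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding("GB2312") works.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var gb2312 = Encoding.GetEncoding("GB2312");

        File.WriteAllText(Path.Combine(dir, "gbk.txt"), "a刘", gb2312);
        // Encoding.Unicode is UTF-16 little-endian and writes the FF FE BOM.
        File.WriteAllText(Path.Combine(dir, "unicode.txt"), "a刘", Encoding.Unicode);
    }

    public static void Main()
    {
        Create(Directory.GetCurrentDirectory());
        // Dump the raw bytes so the BOM (or its absence) is visible.
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes("gbk.txt")));
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes("unicode.txt")));
    }
}
```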

Code:

    static void Main()
    {
        PrintText("gb2312.txt");
        PrintText("unicode.txt");
    }

    // Detect the file's encoding and print its content
    static void PrintText(string path)
    {
        var enc = GetEncoding(path, Encoding.GetEncoding("GB2312"));
        using (var sr = new StreamReader(path, enc))
        {
            Console.WriteLine(sr.ReadToEnd());
        }
    }

    /// <summary>
    /// Tries to determine the character encoding of a file
    /// </summary>
    /// <param name="file">File path</param>
    /// <param name="defEnc">Default encoding returned when there is no BOM</param>
    /// <returns>null if the file cannot be read; otherwise the encoding indicated by the BOM, or the default encoding if there is no BOM.</returns>
    static Encoding GetEncoding(string file, Encoding defEnc)
    {
        using (var stream = File.OpenRead(file))
        {
            // Is the stream readable?
            if (!stream.CanRead)
                return null;
            // Buffer for the BOM bytes
            var bom = new byte[4];
            // Number of bytes actually read
            int readc = stream.Read(bom, 0, 4);
            if (readc >= 2)
            {
                if (readc >= 4)
                {
                    // UTF-32, big-endian
                    if (CheckBytes(bom, 4, 0x00, 0x00, 0xFE, 0xFF))
                        return new UTF32Encoding(true, true);
                    // UTF-32, little-endian (checked before UTF-16 LE, which shares FF FE)
                    if (CheckBytes(bom, 4, 0xFF, 0xFE, 0x00, 0x00))
                        return new UTF32Encoding(false, true);
                }
                // UTF-8
                if (readc >= 3 && CheckBytes(bom, 3, 0xEF, 0xBB, 0xBF))
                    return new UTF8Encoding(true);
                // UTF-16, big-endian
                if (CheckBytes(bom, 2, 0xFE, 0xFF))
                    return new UnicodeEncoding(true, true);
                // UTF-16, little-endian
                if (CheckBytes(bom, 2, 0xFF, 0xFE))
                    return new UnicodeEncoding(false, true);
            }
            return defEnc;
        }
    }

    // Helper: checks whether the first count bytes match the given values
    static bool CheckBytes(byte[] bytes, int count, params int[] values)
    {
        for (int i = 0; i < count; i++)
            if (bytes[i] != values[i])
                return false;
        return true;
    }

In the code above, GetEncoding returns the Unicode encoding indicated by the BOM (for UTF-16 text, the big-endian or little-endian UTF-16 encoding that matches the BOM). For a file without a BOM, it returns the default value, GB2312.

.NET(C#): Detect the encoding from the file

.NET(C#): Character encoding (Encoding) and byte order mark (BOM)

.NET(C#): Use the System.Text.Decoder class to process "stream text"

.NET(C#): A brief discussion of assembly manifest resources and RESX resources
