Example tutorial on correctly reading Chinese encoded files in .NET (C#)

If you are not familiar with encodings or the BOM, it is recommended to read this article first: .NET (C#): Character Encoding (Encoding) and Byte Order Mark (BOM).
Chinese text encodings can basically be divided into two categories:
1. ANSI extended encodings: GBK, GB2312, GB18030, and so on. Files in these encodings have no BOM. (The newer standards, GBK and GB18030, are backward compatible with GB2312.)
2. Unicode encodings: UTF-8, UTF-16, UTF-32, and so on. Files in these encodings may or may not carry a BOM.
Some Unicode encodings also have a byte order (endianness) issue, the so-called little endian and big endian. Different byte orders correspond to different BOMs, as with UTF-16. UTF-8 has no byte order issue.
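
To make the BOM and byte order discussion concrete, here is a minimal sketch that prints the BOM each built-in Unicode encoding reports through Encoding.GetPreamble(). The class and method names (BomTable, BomOf, Print) are made up for this illustration.

```csharp
using System;
using System.Text;

public static class BomTable
{
    // Returns the BOM of an encoding as a hex string such as "FF-FE",
    // or an empty string when the encoding is configured without a BOM.
    public static string BomOf(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    // Prints the BOM of each Unicode encoding discussed above.
    public static void Print()
    {
        Console.WriteLine("UTF-8:     " + BomOf(new UTF8Encoding(true)));
        Console.WriteLine("UTF-16 LE: " + BomOf(new UnicodeEncoding(false, true)));
        Console.WriteLine("UTF-16 BE: " + BomOf(new UnicodeEncoding(true, true)));
        Console.WriteLine("UTF-32 LE: " + BomOf(new UTF32Encoding(false, true)));
        Console.WriteLine("UTF-32 BE: " + BomOf(new UTF32Encoding(true, true)));
    }
}
```

Note that the UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian BOM, so any hand-written check has to test the longer pattern first.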

OK, with the basics covered, let us return to the topic: how to open Chinese text files correctly. The first thing to confirm is: do your Unicode-encoded files contain a BOM?

If a BOM is always included, everything is easy. If we find a BOM, we know the specific encoding. If no BOM is found, the file is not Unicode, and we can open it with the system's default ANSI extended Chinese encoding.
If Unicode files may lack a BOM (obviously, you cannot guarantee that every Unicode file users give you has one), then you have to work out from the raw bytes whether the file is GBK, UTF-8, or some other encoding. This requires an encoding detection algorithm (you can google "charset|encoding detection"). Of course, such algorithms are not 100% accurate. This is exactly why Windows Notepad has the "Bush hid the facts" bug, and why you sometimes run into garbled pages when browsing the web in Chrome. Personally, I find Notepad++'s encoding detection quite accurate.
There are many encoding detection implementations, such as this project: https://code.google.com/p/ude
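
As a rough illustration of what byte-level detection can look like, the sketch below tries a strict UTF-8 decode first and falls back to a caller-supplied encoding (for example GBK) when the bytes are not valid UTF-8. This is only a simple heuristic, not the algorithm used by ude or Notepad++, and the names Utf8OrFallback, LooksLikeUtf8, and Decode are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf8OrFallback
{
    // Strict UTF-8: emits no BOM and throws on invalid byte sequences
    // instead of silently substituting replacement characters.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // True when the whole buffer decodes as well-formed UTF-8.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        try
        {
            StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Decodes as UTF-8 when possible, otherwise with the fallback encoding.
    public static string Decode(byte[] bytes, Encoding fallback)
    {
        return LooksLikeUtf8(bytes)
            ? StrictUtf8.GetString(bytes)
            : fallback.GetString(bytes);
    }
}
```

The heuristic works reasonably well because GBK-encoded Chinese text rarely happens to form valid UTF-8 sequences, but pure ASCII input is valid in both encodings and short inputs can still be misjudged.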


If Unicode files come with a BOM, there is no need for a third-party library. However, a few things need explaining.

The problem is that the text-reading APIs in .NET (the File class and StreamReader) read as UTF-8 by default, so a GBK text file opened directly with .NET (with no encoding specified) is bound to come out garbled!

First of all, the most effective solution here is to read the text with the system default ANSI extended encoding, that is, the system default non-Unicode encoding. Reference code:

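// Print the system default non-Unicode encoding
Console.WriteLine(Encoding.Default.EncodingName);
// Open the file using the system default non-Unicode encoding
// (verbatim string, so that "\t" is not treated as a tab escape)
var fileContent = File.ReadAllText(@"C:\test.txt", Encoding.Default);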

On a Simplified Chinese Windows system, this should output:

Simplified Chinese (GB2312)...

This approach is not limited to Simplified Chinese: it uses whatever the system default non-Unicode encoding happens to be.
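
One caveat for newer runtimes: on .NET Core and .NET 5 or later, Encoding.Default always returns UTF-8, and code page encodings such as GB2312 are only available after registering CodePagesEncodingProvider. The sketch below assumes such a runtime; the class and method names (LegacyEncodings, GetGb2312) are hypothetical.

```csharp
using System;
using System.Text;

public static class LegacyEncodings
{
    // Returns the GB2312 encoding on runtimes where code page encodings
    // must be registered first (assumption: .NET Core / .NET 5 or later).
    public static Encoding GetGb2312()
    {
        // Makes the Windows code pages (including GB2312/GBK) resolvable.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB2312");
    }
}
```

With this in place, a call such as File.ReadAllText(path, LegacyEncodings.GetGb2312()) behaves like the Encoding.Default version above does on a Simplified Chinese .NET Framework system.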

Of course, you can also specify an encoding manually, such as GBK. But if you open a Unicode file with GBK specified, will the file still be read correctly? The answer is yes. The reason is that, by default, .NET automatically detects the BOM when opening a file and uses the encoding indicated by the BOM. If there is no BOM, the file is opened with the encoding specified by the user. If the user does not specify an encoding, UTF-8 is used.

This automatic BOM detection can be controlled in the StreamReader constructor through the detectEncodingFromByteOrderMarks parameter.

It cannot, however, be set in the corresponding methods of the File class (for example, File.ReadAllText).

For example, the following code reads, in order:

GB2312 text, using GB2312 encoding with automatic BOM detection

Unicode text, using GB2312 encoding with automatic BOM detection

Unicode text, using GB2312 encoding without BOM detection

static void Main()
{
    var gb2312 = Encoding.GetEncoding("GB2312");
    // GB2312 encoding, BOM detection on, reading GB2312 text
    ReadFile("gbk.txt", gb2312, true);
    // GB2312 encoding, BOM detection on, reading Unicode text
    ReadFile("unicode.txt", gb2312, true);
    // GB2312 encoding, BOM detection off, reading Unicode text
    ReadFile("unicode.txt", gb2312, false);
}

// Read the text through a StreamReader
static void ReadFile(string path, Encoding enc, bool detectEncodingFromByteOrderMarks)
{
    using (var sr = new StreamReader(path, enc, detectEncodingFromByteOrderMarks))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

Output:

a刘
a刘
???

The third line is garbled.

As shown above, opening a Unicode file with GB2312 specified also succeeds. Because the BOM detection parameter is true, .NET sees the BOM, concludes that the file is Unicode, and opens it with the matching Unicode encoding. If there is no BOM, the specified encoding is used instead. GB2312 text obviously has no BOM, so GB2312 must be specified explicitly; otherwise .NET parses the file with the default UTF-8 encoding and the result is unreadable. The third line is garbled because BOM detection is false: .NET applies the specified GB2312 encoding directly to a Unicode text file that has a BOM, which obviously cannot work.
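
The same three cases can be reproduced without any files on disk. The sketch below feeds hand-built byte arrays through a MemoryStream; it assumes a runtime where CodePagesEncodingProvider is available (.NET Core / .NET 5 or later), and the names BomDetectionDemo, Gb2312, and Read are made up for the illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Resolves GB2312 (assumption: code page encodings need registering first).
    public static Encoding Gb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB2312");
    }

    // Decodes raw file content the way StreamReader would, with BOM
    // detection switched on or off.
    public static string Read(byte[] content, Encoding enc, bool detectBom)
    {
        using (var ms = new MemoryStream(content))
        using (var sr = new StreamReader(ms, enc, detectBom))
        {
            return sr.ReadToEnd();
        }
    }

    public static void Run()
    {
        var gb2312 = Gb2312();
        byte[] gbText = { 0x61, 0xC1, 0xF5 };                      // "a刘" in GB2312, no BOM
        byte[] utf16Text = { 0xFF, 0xFE, 0x61, 0x00, 0x18, 0x52 }; // "a刘" in UTF-16 LE with BOM

        Console.WriteLine(Read(gbText, gb2312, true));     // decoded as GB2312
        Console.WriteLine(Read(utf16Text, gb2312, true));  // BOM wins, decoded as UTF-16
        Console.WriteLine(Read(utf16Text, gb2312, false)); // BOM ignored, garbled
    }
}
```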

Of course, you can also check for the BOM yourself and, if there is none, fall back to a default encoding for opening the text. I wrote about this in a previous article (.NET (C#): Encoding detection from files).

Code:

static void Main()
{
    PrintText("gb2312.txt");
    PrintText("unicode.txt");
}

// Detect the encoding of a file and print its content
static void PrintText(string path)
{
    var enc = GetEncoding(path, Encoding.GetEncoding("GB2312"));
    using (var sr = new StreamReader(path, enc))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

/// <summary>
/// Tries to determine the character encoding of a file
/// </summary>
/// <param name="file">File path</param>
/// <param name="defEnc">Default encoding returned when there is no BOM</param>
/// <returns>null if the file cannot be read. Otherwise, the encoding determined from the BOM, or the default encoding (no BOM).</returns>
static Encoding GetEncoding(string file, Encoding defEnc)
{
    using (var stream = File.OpenRead(file))
    {
        // Is the stream readable?
        if (!stream.CanRead)
            return null;
        // Byte array that holds the BOM
        var bom = new byte[4];
        // Number of bytes actually read
        int readc = stream.Read(bom, 0, 4);
        if (readc >= 2)
        {
            if (readc >= 4)
            {
                // UTF-32, big-endian
                if (CheckBytes(bom, 4, 0x00, 0x00, 0xFE, 0xFF))
                    return new UTF32Encoding(true, true);
                // UTF-32, little-endian (checked before UTF-16 LE, which shares its first two bytes)
                if (CheckBytes(bom, 4, 0xFF, 0xFE, 0x00, 0x00))
                    return new UTF32Encoding(false, true);
            }
            // UTF-8
            if (readc >= 3 && CheckBytes(bom, 3, 0xEF, 0xBB, 0xBF))
                return new UTF8Encoding(true);
            // UTF-16, big-endian
            if (CheckBytes(bom, 2, 0xFE, 0xFF))
                return new UnicodeEncoding(true, true);
            // UTF-16, little-endian
            if (CheckBytes(bom, 2, 0xFF, 0xFE))
                return new UnicodeEncoding(false, true);
        }
        return defEnc;
    }
}

// Helper: checks whether the first count bytes match the given values
static bool CheckBytes(byte[] bytes, int count, params int[] values)
{
    for (int i = 0; i < count; i++)
        if (bytes[i] != values[i])
            return false;
    return true;
}

In the code above, for the Unicode text file, GetEncoding returns the UTF-16 encoding (more precisely, big-endian or little-endian UTF-16 depending on the BOM), while for a file without a BOM it returns the default, GB2312.
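
As an alternative to checking the BOM bytes by hand, StreamReader can report which encoding its own BOM detection settled on: its CurrentEncoding property reflects the detected encoding once the first read has happened. The following is a minimal sketch of that idea; the names BomSniffer and Detect are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by the stream's BOM, or the fallback
    // encoding when no BOM is present. The stream is closed afterwards.
    public static Encoding Detect(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, true))
        {
            // Detection only runs on the first read, so trigger one
            // without consuming any characters.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

A call such as BomSniffer.Detect(File.OpenRead(path), Encoding.GetEncoding("GB2312")) would then play the same role as the GetEncoding method above, at the cost of opening the file once more when reading the content.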

Related Posts:

.NET(C#): Detect the encoding from the file

.NET(C#): Character encoding (Encoding) and byte order mark (BOM)

.NET(C#): Use the System.Text.Decoder class to process "stream text"

.NET(C#): A brief discussion of assembly manifest resources and RESX resources
