
How to Read Non-UTF-8 Encoded Text Files (e.g., GBK) in Go?

Susan Sarandon
2024-12-01 12:37:16


Reading Non-UTF-8 Text Files in Go

Reading and writing non-UTF-8 text files takes extra work in Go, because Go strings and the standard library's text handling assume UTF-8. This article shows how to convert such files on the fly with the golang.org/x/text module, one of Go's sub-repositories.

Problem:

How can we read text files encoded in non-UTF-8 formats, such as GBK, in Go?

Solution:

To read files in non-UTF-8 encodings, use the golang.org/x/text/encoding package. It defines the Encoding interface, a generic abstraction for character encodings that can be converted to and from UTF-8.

For GBK specifically, use the golang.org/x/text/encoding/simplifiedchinese sub-package, which provides the GB18030, GBK, and HZ-GB2312 encodings. Each of them implements the encoding.Encoding interface.

Implementation:

The following example writes a GBK-encoded file and then reads it back line by line:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"

    "golang.org/x/text/encoding/simplifiedchinese"
    "golang.org/x/text/transform"
)

var enc = simplifiedchinese.GBK

func main() {
    // Example filename
    const filename = "example_GBK_file"

    exampleWriteGBK(filename)
    exampleReadGBK(filename)
}

func exampleReadGBK(filename string) {
    f, err := os.Open(filename)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Convert GBK to UTF-8 on the fly
    r := transform.NewReader(f, enc.NewDecoder())

    sc := bufio.NewScanner(r)
    for sc.Scan() {
        fmt.Printf("Read line: %s\n", sc.Bytes())
    }
    if err := sc.Err(); err != nil {
        log.Fatal(err)
    }
}

func exampleWriteGBK(filename string) {
    f, err := os.Create(filename)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Convert UTF-8 to GBK on the fly
    w := transform.NewWriter(f, enc.NewEncoder())

    // Example text with Chinese characters
    _, err = fmt.Fprintln(w,
        `In 1995, China National Information Technology Standardization
Technical Committee set down the Chinese Internal Code Specification
(Chinese: 汉字内码扩展规范(GBK); pinyin: Hànzì Nèimǎ
Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a
slight extension of Codepage 936. The newly added 95 characters were not
found in GB 13000.1-1993, and were provisionally assigned Unicode PUA
code points.`)
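    if err != nil {
        log.Fatal(err)
    }

    // Close flushes any bytes still buffered in the transform writer.
    // It does not close the underlying file.
    if err := w.Close(); err != nil {
        log.Fatal(err)
    }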
}

Playground:

https://go.dev/play/p/fFIy9VES6cL
