How to Correctly Read and Parse UTF-16 Encoded Text Files in Go?

DDD
2024-12-30 21:16:17

How to Read UTF-16 Text Files into Strings in Go

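When a text file is encoded in UTF-16, reading it with Go's standard bufio package alone does not produce usable text. bufio.Scanner splits on the single byte 0x0A and Go strings are expected to hold UTF-8, whereas UTF-16 stores each character in two-byte code units, so a line break is the byte pair 0A 00 (little-endian) or 00 0A (big-endian). Lines read this way come out misaligned and full of NUL bytes, so the file's contents have to be decoded to UTF-8 before they can be used as a string.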
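The problem can be reproduced without any file. The following is a minimal sketch using only the standard library; encodeUTF16LE and scanRaw are hypothetical helper names chosen for this illustration. It encodes a two-line string as UTF-16LE and feeds the raw bytes straight to bufio.Scanner:

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// encodeUTF16LE returns s as UTF-16 little-endian bytes, without a BOM.
// (Hypothetical helper, used only to build test input.)
func encodeUTF16LE(s string) []byte {
	units := utf16.Encode([]rune(s))
	out := make([]byte, 2*len(units))
	for i, u := range units {
		binary.LittleEndian.PutUint16(out[2*i:], u)
	}
	return out
}

// scanRaw splits data with a plain bufio.Scanner, without any decoding.
func scanRaw(data []byte) []string {
	var lines []string
	sc := bufio.NewScanner(bytes.NewReader(data))
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return lines
}

func main() {
	for _, line := range scanRaw(encodeUTF16LE("hi\nyo")) {
		fmt.Printf("%q\n", line)
	}
	// Prints:
	// "h\x00i\x00"
	// "\x00y\x00o\x00"
}
```

The scanner cuts at the 0x0A byte only, so the second half of the newline code unit (0x00) ends up at the start of the next line and every character is followed by a NUL byte.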

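One solution is the golang.org/x/text/encoding/unicode package, which provides unicode.BOMOverride. This function wraps a fallback decoder: if the input starts with a byte order mark (BOM), it switches to the encoding the BOM indicates (UTF-8, UTF-16LE, or UTF-16BE); if there is no BOM, the fallback decoder is used. The following program reads a whole file and decodes it into a UTF-8 string: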

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"
    "os"
    "strings"

    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

// ReadFileUTF16 is similar to os.ReadFile() but decodes UTF-16 to UTF-8.
func ReadFileUTF16(filename string) ([]byte, error) {
    raw, err := os.ReadFile(filename)
    if err != nil {
        return nil, err
    }

    // Fallback decoder: big-endian UTF-16, used only when the file has no BOM.
    win16be := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    // If a BOM is present, it overrides the fallback.
    utf16bom := unicode.BOMOverride(win16be.NewDecoder())

    // The reader yields UTF-8 bytes as it consumes the UTF-16 input.
    unicodeReader := transform.NewReader(bytes.NewReader(raw), utf16bom)

    return io.ReadAll(unicodeReader)
}

func main() {
    data, err := ReadFileUTF16("inputfile.txt")
    if err != nil {
        log.Fatal(err)
    }
    // Normalize Windows line endings.
    final := strings.ReplaceAll(string(data), "\r\n", "\n")
    fmt.Println(final)
}
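To make the BOM detection concrete, here is a dependency-free sketch of the idea using only the standard library. It is not how golang.org/x/text implements it, and decodeUTF16 is a hypothetical name for this illustration. It assumes the whole input is in memory, treats FF FE as little-endian and FE FF as big-endian, falls back to big-endian when no BOM is found, and does not handle a UTF-8 BOM:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16 converts UTF-16 bytes to a Go (UTF-8) string.
// A leading BOM selects the byte order and is stripped; without a BOM,
// big-endian is assumed. A trailing odd byte is ignored.
func decodeUTF16(raw []byte) string {
	var order binary.ByteOrder = binary.BigEndian // fallback when no BOM is present
	if len(raw) >= 2 {
		switch {
		case raw[0] == 0xFF && raw[1] == 0xFE:
			order, raw = binary.LittleEndian, raw[2:]
		case raw[0] == 0xFE && raw[1] == 0xFF:
			order, raw = binary.BigEndian, raw[2:]
		}
	}

	// Assemble 16-bit code units in the chosen byte order.
	units := make([]uint16, 0, len(raw)/2)
	for i := 0; i+1 < len(raw); i += 2 {
		units = append(units, order.Uint16(raw[i:]))
	}

	// utf16.Decode combines surrogate pairs into runes.
	return string(utf16.Decode(units))
}

func main() {
	le := []byte{0xFF, 0xFE, 'G', 0, 'o', 0} // BOM + "Go", little-endian
	be := []byte{0xFE, 0xFF, 0, 'G', 0, 'o'} // BOM + "Go", big-endian
	fmt.Println(decodeUTF16(le), decodeUTF16(be))
	// Prints: Go Go
}
```

For real files the x/text decoder remains the better choice, since it works as a streaming reader and reports malformed input instead of silently skipping it.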

To parse the text line by line, wrap the decoding reader in a bufio.Scanner. The NewScannerUTF16 helper below opens the file and returns such a reader:

package main

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "os"

    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

// utf16File pairs the decoding reader with the underlying file so that
// callers can close the file when they are done.
type utf16File struct {
    io.Reader
    io.Closer
}

// NewScannerUTF16 opens the file like os.Open() but returns a reader that
// decodes UTF-16 to UTF-8, suitable for wrapping in a bufio.Scanner.
func NewScannerUTF16(filename string) (io.ReadCloser, error) {
    file, err := os.Open(filename)
    if err != nil {
        return nil, err
    }

    // Fallback decoder: big-endian UTF-16, used only when the file has no BOM.
    win16be := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    // If a BOM is present, it overrides the fallback.
    utf16bom := unicode.BOMOverride(win16be.NewDecoder())

    unicodeReader := transform.NewReader(file, utf16bom)
    return utf16File{Reader: unicodeReader, Closer: file}, nil
}

func main() {
    s, err := NewScannerUTF16("inputfile.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer s.Close()

    // The scanner now sees UTF-8, so line splitting works as usual.
    scanner := bufio.NewScanner(s)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "reading inputfile:", err)
    }
}
