search
HomeBackend DevelopmentGolangIgnore lines containing pattern in long text file in Go

在 Go 中忽略长文本文件中包含模式的行

php Editor Apple In the Go language, we often need to process large text files. Sometimes we are only interested in rows containing a specific pattern and ignore other rows. Fortunately, in Go, we can use regular expressions and bufio.Scanner to achieve this goal. By using regular expressions to match lines and running the file through a Scanner line by line, we can easily filter out lines that we are not interested in. This tip not only improves efficiency, but also makes our code more concise and readable. Next, let’s take a look at how to ignore lines containing patterns in long text files in Go.

Question content

I'm trying to implement a function to ignore lines containing patterns in long text files (guaranteed ascii) in go

My functions in withoutignore and withignore both accept a filename parameter as input and return *byte.buffer, which can then be used to write io.writer.

withignore The function takes additional arguments pattern to exclude lines containing a pattern from the file. The function works, but through benchmarking it was found to be 5 times slower than without ignoring . Is there any way it can be improved?

package main

import (
    "bufio"
    "bytes"
    "io"
    "log"
    "os"
)

func withoutignore(f string) (*bytes.buffer, error) {
    rfd, err := os.open(f)
    if err != nil {
        log.fatal(err)
    }

    defer func() {
        if err := rfd.close(); err != nil {
            log.fatal(err)
        }
    }()

    inputbuffer := make([]byte, 1048576)
    var bytesread int

    var bs []byte
    opbuffer := bytes.newbuffer(bs)

    for {
        bytesread, err = rfd.read(inputbuffer)

        if err == io.eof {
            return opbuffer, nil
        }

        if err != nil {
            return nil, nil
        }

        _, err = opbuffer.write(inputbuffer[:bytesread])
        if err != nil {
            return nil, err
        }
    }
    return opbuffer, nil
}

func withignore(f, pattern string) (*bytes.buffer, error) {
    rfd, err := os.open(f)
    if err != nil {
        log.fatal(err)
    }

    defer func() {
        if err := rfd.close(); err != nil {
            log.fatal(err)
        }
    }()

    scanner := bufio.newscanner(rfd)
    var bs []byte
    buffer := bytes.newbuffer(bs)
    for scanner.scan() {
        if !bytes.contains(scanner.bytes(), []byte(pattern)) {
            _, err := buffer.writestring(scanner.text() + "\n")
            if err != nil {
                return nil, nil
            }
        }
    }

    return buffer, nil
}

func main() {
    // buff, err := withoutignore("base64dump.log")
    buff, err := withignore("base64dump.log", "audit")
    if err != nil {
        log.fatal(err)
    }

    _, err = buff.writeto(os.stdout)
    if err != nil {
        log.fatal(err)
    }
}

Benchmarks

package main

import "testing"

func benchmarktestwithoutignore(b *testing.b) {
    for i := 0; i < b.n; i++ {
        _, err := withoutignore("base64dump.log")
        if err != nil {
            b.fatal(err)
        }
    }
}

func benchmarktestwithignore(b *testing.b) {
    for i := 0; i < b.n; i++ {
        _, err := withignore("base64dump.log", "audit")
        if err != nil {
            b.fatal(err)
        }
    }
}

and can be generated from the command line using "base64dump.log"

base64 /dev/urandom | head -c 10000000 > base64dump.log

Solution

Since ascii is guaranteed, it can work directly at the byte level.

However, if you check each byte for a newline character while reading the input, and then search again for the pattern within the line, the operation will be applied to each byte.

On the other hand, if you read a block of input and perform an optimized search for patterns in the text, without even checking each input byte, you can minimize the number of operations per input byte.

For example, boyer-moore string search algorithm. Go's built-in bytes.index function has also been optimized. The speed achieved will of course depend on the input data and the actual mode. For the input specified in the question, the performance of `bytes.index improves significantly when measured.

program

  • Reading a block where the block size should be significantly longer than the maximum line length, a value >= 64kb should probably be good, in testing the 1mb in the question was used.
  • A block usually does not end with a newline character, so search from the end of the block to the next newline character, limiting the search to this slice and remembering the remaining data for the next pass
  • The last block does not necessarily end with a newline character
  • With the help of the high-performance go function bytes.index you can find where in the block the pattern occurs
  • Search for the preceding and following newline characters from the found position
  • The block is then output to the beginning of the corresponding line
  • And continue searching from the end of the line where the pattern appears
  • If the search does not find other locations, output the remaining locations
  • Read the next block and apply the steps described again until the end of the file is reached

Noteworthy

A read operation may return less data than the block size, so it makes sense to repeat the read operation until the block size of data is read.

Benchmark

Optimized code is usually much more complex, but also performs significantly better, as we will see later.

benchmarktestwithoutignore-8         270       4137267 ns/op
benchmarktestwithignore-8             54      22403931 ns/op
benchmarktestfilter-8                150       7947454 ns/op

Here, the optimized code benchmarktestfilter-8 is only about 1.9 times slower than the operation without filtering, while the benchmarktestwithignore-8 method is 5.4 times slower than the comparison value without filtering.

Looking at it from another perspective: the optimized code is 2.8 times faster than the unoptimized code.

Code

Of course, this is the code for your own testing:

func filterfile(f, pattern string) (*bytes.buffer, error) {
    rfd, err := os.open(f)
    if err != nil {
        log.fatal(err)
    }
    defer func() {
        if err := rfd.close(); err != nil {
            log.fatal(err)
        }
    }()

    reader := bufio.newreader(rfd)
    return filter(reader, []byte(pattern), 1024*1024)
}

// chunksize must be larger than the longest line
// a reasonable size is probably >= 64k
func filter(reader io.reader, pattern []byte, chunksize int) (*bytes.buffer, error) {
    var bs []byte
    buffer := bytes.newbuffer(bs)

    chunk := make([]byte, chunksize)

    var remaining []byte
    for lastchunk := false; !lastchunk; {
        n, err := readchunk(reader, chunk, remaining, chunksize)
        if err != nil {
            if err == io.eof {
                lastchunk = true
            } else {
                return nil, err
            }
        }

        remaining = remaining[:0]
        if !lastchunk {
            for i := n - 1; i > 0; i-- {
                if chunk[i] == '\n' {
                    remaining = append(remaining, chunk[i+1:n]...)
                    n = i + 1
                    break
                }
            }
        }

        s := 0
        for s < n {
            hit := bytes.index(chunk[s:n], pattern)
            if hit < 0 {
                break
            }
            hit += s
            startofline := hit
            for ; startofline > 0; startofline-- {
                if chunk[startofline] == '\n' {
                    startofline++
                    break
                }
            }
            endofline := hit + len(pattern)
            for ; endofline < n; endofline++ {
                if chunk[endofline] == '\n' {
                    break
                }
            }
            endofline++

            _, err = buffer.write(chunk[s:startofline])
            if err != nil {
                return nil, err
            }
            s = endofline
        }

        if s < n {
            _, err = buffer.write(chunk[s:n])
            if err != nil {
                return nil, err
            }
        }
    }

    return buffer, nil
}

func readchunk(reader io.reader, chunk, remaining []byte, chunksize int) (int, error) {
    copy(chunk, remaining)
    r := len(remaining)
    for r < chunksize {
        n, err := reader.read(chunk[r:])
        r += n
        if err != nil {
            return r, err
        }
    }
    return r, nil
}

The benchmark section might look like this:

func BenchmarkTestFilter(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := filterFile("base64dump.log", "AUDIT")
        if err != nil {
            b.Fatal(err)
        }
    }
}

The filter function is split, and the actual work is done in func filter(reader io.reader, pattern []byte, chunksize int) (*bytes.buffer, error).

The creation of unit tests has been prepared or considered by injecting the reader and chunksize, which is missing here but is definitely recommended when working with indexes.

However, the point here is to find a way to significantly improve performance.

The above is the detailed content of Ignore lines containing pattern in long text file in Go. For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced at:stackoverflow. If there is any infringement, please contact admin@php.cn delete
go语言有没有缩进go语言有没有缩进Dec 01, 2022 pm 06:54 PM

go语言有缩进。在go语言中,缩进直接使用gofmt工具格式化即可(gofmt使用tab进行缩进);gofmt工具会以标准样式的缩进和垂直对齐方式对源代码进行格式化,甚至必要情况下注释也会重新格式化。

go语言为什么叫gogo语言为什么叫goNov 28, 2022 pm 06:19 PM

go语言叫go的原因:想表达这门语言的运行速度、开发速度、学习速度(develop)都像gopher一样快。gopher是一种生活在加拿大的小动物,go的吉祥物就是这个小动物,它的中文名叫做囊地鼠,它们最大的特点就是挖洞速度特别快,当然可能不止是挖洞啦。

一文详解Go中的并发【20 张动图演示】一文详解Go中的并发【20 张动图演示】Sep 08, 2022 am 10:48 AM

Go语言中各种并发模式看起来是怎样的?下面本篇文章就通过20 张动图为你演示 Go 并发,希望对大家有所帮助!

【整理分享】一些GO面试题(附答案解析)【整理分享】一些GO面试题(附答案解析)Oct 25, 2022 am 10:45 AM

本篇文章给大家整理分享一些GO面试题集锦快答,希望对大家有所帮助!

tidb是go语言么tidb是go语言么Dec 02, 2022 pm 06:24 PM

是,TiDB采用go语言编写。TiDB是一个分布式NewSQL数据库;它支持水平弹性扩展、ACID事务、标准SQL、MySQL语法和MySQL协议,具有数据强一致的高可用特性。TiDB架构中的PD储存了集群的元信息,如key在哪个TiKV节点;PD还负责集群的负载均衡以及数据分片等。PD通过内嵌etcd来支持数据分布和容错;PD采用go语言编写。

go语言能不能编译go语言能不能编译Dec 09, 2022 pm 06:20 PM

go语言能编译。Go语言是编译型的静态语言,是一门需要编译才能运行的编程语言。对Go语言程序进行编译的命令有两种:1、“go build”命令,可以将Go语言程序代码编译成二进制的可执行文件,但该二进制文件需要手动运行;2、“go run”命令,会在编译后直接运行Go语言程序,编译过程中会产生一个临时文件,但不会生成可执行文件。

go语言是否需要编译go语言是否需要编译Dec 01, 2022 pm 07:06 PM

go语言需要编译。Go语言是编译型的静态语言,是一门需要编译才能运行的编程语言,也就说Go语言程序在运行之前需要通过编译器生成二进制机器码(二进制的可执行文件),随后二进制文件才能在目标机器上运行。

golang map怎么删除元素golang map怎么删除元素Dec 08, 2022 pm 06:26 PM

删除map元素的两种方法:1、使用delete()函数从map中删除指定键值对,语法“delete(map, 键名)”;2、重新创建一个新的map对象,可以清空map中的所有元素,语法“var mapname map[keytype]valuetype”。

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Tools

mPDF

mPDF

mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

VSCode Windows 64-bit Download

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

DVWA

DVWA

Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software