Ignore lines containing pattern in long text file in Go
When processing large text files in Go, we are often only interested in lines that contain (or do not contain) a specific pattern. A common approach is to run the file through a bufio.Scanner line by line and filter out the lines we do not want; this keeps the code concise and readable. The question and answer below look at how to ignore lines containing a pattern in a long text file in Go, and how to do it fast.
I'm trying to implement a function in Go that ignores lines containing a pattern in a long text file (guaranteed ASCII).

My functions `withoutIgnore` and `withIgnore` both accept a filename as a parameter and return a `*bytes.Buffer`, which can then be written to an `io.Writer`.

The `withIgnore` function takes an additional argument `pattern` to exclude lines containing that pattern from the file. The function works, but benchmarking showed it to be five times slower than `withoutIgnore`. Is there any way to improve it?
```go
package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
)

func withoutIgnore(f string) (*bytes.Buffer, error) {
	rfd, err := os.Open(f)
	if err != nil {
		log.Fatal(err)
	}
	defer func() {
		if err := rfd.Close(); err != nil {
			log.Fatal(err)
		}
	}()
	inputBuffer := make([]byte, 1048576)
	var bytesRead int
	var bs []byte
	opBuffer := bytes.NewBuffer(bs)

	for {
		bytesRead, err = rfd.Read(inputBuffer)
		if err == io.EOF {
			return opBuffer, nil
		}
		if err != nil {
			return nil, err
		}
		_, err = opBuffer.Write(inputBuffer[:bytesRead])
		if err != nil {
			return nil, err
		}
	}
}

func withIgnore(f, pattern string) (*bytes.Buffer, error) {
	rfd, err := os.Open(f)
	if err != nil {
		log.Fatal(err)
	}
	defer func() {
		if err := rfd.Close(); err != nil {
			log.Fatal(err)
		}
	}()
	scanner := bufio.NewScanner(rfd)
	var bs []byte
	buffer := bytes.NewBuffer(bs)
	for scanner.Scan() {
		if !bytes.Contains(scanner.Bytes(), []byte(pattern)) {
			_, err := buffer.WriteString(scanner.Text() + "\n")
			if err != nil {
				return nil, err
			}
		}
	}
	return buffer, nil
}

func main() {
	// buff, err := withoutIgnore("base64dump.log")
	buff, err := withIgnore("base64dump.log", "AUDIT")
	if err != nil {
		log.Fatal(err)
	}
	_, err = buff.WriteTo(os.Stdout)
	if err != nil {
		log.Fatal(err)
	}
}
```
Benchmarks
```go
package main

import "testing"

func BenchmarkTestWithoutIgnore(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, err := withoutIgnore("base64dump.log")
		if err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkTestWithIgnore(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, err := withIgnore("base64dump.log", "AUDIT")
		if err != nil {
			b.Fatal(err)
		}
	}
}
```
The test file `base64dump.log` can be generated on the command line with:

```shell
base64 /dev/urandom | head -c 10000000 > base64dump.log
```
Since ASCII is guaranteed, you can work directly at the `byte` level.

However, if you check each byte for a newline character while reading the input, and then search the resulting line again for the pattern, you touch every input byte at least twice. If you instead read a block of input and run an optimized search for the pattern over the whole block, without even inspecting each input byte, you minimize the number of operations per input byte.

See, for example, the Boyer-Moore string-search algorithm; Go's built-in `bytes.Index` function is likewise optimized. The speed achieved will of course depend on the input data and the actual pattern. For the input specified in the question, the measured performance of `bytes.Index` improves significantly.
Program

With `bytes.Index` you can find where in a block the pattern occurs.

Noteworthy: a read operation may return less data than the block size, so it makes sense to repeat the read until a full block of data has been read.
Benchmark
Optimized code is usually much more complex, but also performs significantly better, as we will see later.
```
BenchmarkTestWithoutIgnore-8    270    4137267 ns/op
BenchmarkTestWithIgnore-8        54   22403931 ns/op
BenchmarkTestFilter-8           150    7947454 ns/op
```
Here the optimized code, `BenchmarkTestFilter-8`, is only about 1.9 times slower than the run without filtering, while the `BenchmarkTestWithIgnore-8` variant is 5.4 times slower than the unfiltered baseline.
Looking at it from another perspective: the optimized code is 2.8 times faster than the unoptimized code.
Code
And of course, here is the code for your own testing:
```go
func filterFile(f, pattern string) (*bytes.Buffer, error) {
	rfd, err := os.Open(f)
	if err != nil {
		log.Fatal(err)
	}
	defer func() {
		if err := rfd.Close(); err != nil {
			log.Fatal(err)
		}
	}()
	reader := bufio.NewReader(rfd)
	return filter(reader, []byte(pattern), 1024*1024)
}

// chunkSize must be larger than the longest line;
// a reasonable size is probably >= 64k
func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error) {
	var bs []byte
	buffer := bytes.NewBuffer(bs)
	chunk := make([]byte, chunkSize)
	var remaining []byte
	for lastChunk := false; !lastChunk; {
		n, err := readChunk(reader, chunk, remaining, chunkSize)
		if err != nil {
			if err == io.EOF {
				lastChunk = true
			} else {
				return nil, err
			}
		}
		remaining = remaining[:0]
		if !lastChunk {
			// Carry the partial line after the last newline
			// over into the next chunk.
			for i := n - 1; i > 0; i-- {
				if chunk[i] == '\n' {
					remaining = append(remaining, chunk[i+1:n]...)
					n = i + 1
					break
				}
			}
		}
		s := 0
		for s < n {
			hit := bytes.Index(chunk[s:n], pattern)
			if hit < 0 {
				break
			}
			hit += s
			// Expand the hit to the enclosing line.
			startOfLine := hit
			for ; startOfLine > 0; startOfLine-- {
				if chunk[startOfLine] == '\n' {
					startOfLine++
					break
				}
			}
			endOfLine := hit + len(pattern)
			for ; endOfLine < n; endOfLine++ {
				if chunk[endOfLine] == '\n' {
					break
				}
			}
			endOfLine++
			// Write everything before the matching line, then
			// skip past it.
			_, err = buffer.Write(chunk[s:startOfLine])
			if err != nil {
				return nil, err
			}
			s = endOfLine
		}
		if s < n {
			_, err = buffer.Write(chunk[s:n])
			if err != nil {
				return nil, err
			}
		}
	}
	return buffer, nil
}

func readChunk(reader io.Reader, chunk, remaining []byte, chunkSize int) (int, error) {
	copy(chunk, remaining)
	r := len(remaining)
	for r < chunkSize {
		n, err := reader.Read(chunk[r:])
		r += n
		if err != nil {
			return r, err
		}
	}
	return r, nil
}
```
The benchmark section might look like this:
```go
func BenchmarkTestFilter(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_, err := filterFile("base64dump.log", "AUDIT")
		if err != nil {
			b.Fatal(err)
		}
	}
}
```
The filter function is split: the actual work happens in `func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error)`. Injecting the reader and the chunk size prepares the code for unit tests, which are missing here but are definitely recommended when working with indices like this.
However, the point here is to find a way to significantly improve performance.