In Go, we often need to process large text files in which lines containing a specific pattern should be dropped and all other lines kept. A straightforward approach reads the file line by line with bufio.Scanner and tests each line for the pattern; a faster approach searches large blocks of the file with bytes.Index. Let's take a look at how to ignore lines containing a pattern in long text files in Go.
Question content
I'm trying to implement a function that ignores lines containing a pattern in long text files (guaranteed ASCII) in Go.
My functions withoutIgnore and withIgnore both accept a filename as input and return a *bytes.Buffer, which can then be written to an io.Writer. withIgnore takes an additional argument, pattern, and excludes lines containing that pattern from the file. The function works, but benchmarking shows it to be 5 times slower than withoutIgnore. Is there any way it can be improved?
    package main

    import (
    	"bufio"
    	"bytes"
    	"io"
    	"log"
    	"os"
    )

    // withoutIgnore copies the whole file into a buffer in 1 MiB reads.
    func withoutIgnore(f string) (*bytes.Buffer, error) {
    	rfd, err := os.Open(f)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer func() {
    		if err := rfd.Close(); err != nil {
    			log.Fatal(err)
    		}
    	}()
    	inputBuffer := make([]byte, 1048576)
    	var bytesRead int
    	var bs []byte
    	opBuffer := bytes.NewBuffer(bs)
    	for {
    		bytesRead, err = rfd.Read(inputBuffer)
    		if err == io.EOF {
    			return opBuffer, nil
    		}
    		if err != nil {
    			return nil, err
    		}
    		_, err = opBuffer.Write(inputBuffer[:bytesRead])
    		if err != nil {
    			return nil, err
    		}
    	}
    }

    // withIgnore copies the file line by line, skipping lines that contain pattern.
    func withIgnore(f, pattern string) (*bytes.Buffer, error) {
    	rfd, err := os.Open(f)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer func() {
    		if err := rfd.Close(); err != nil {
    			log.Fatal(err)
    		}
    	}()
    	scanner := bufio.NewScanner(rfd)
    	var bs []byte
    	buffer := bytes.NewBuffer(bs)
    	for scanner.Scan() {
    		if !bytes.Contains(scanner.Bytes(), []byte(pattern)) {
    			_, err := buffer.WriteString(scanner.Text() + "\n")
    			if err != nil {
    				return nil, err
    			}
    		}
    	}
    	// Report a read error or an over-long line instead of silently truncating.
    	if err := scanner.Err(); err != nil {
    		return nil, err
    	}
    	return buffer, nil
    }

    func main() {
    	// buff, err := withoutIgnore("base64dump.log")
    	buff, err := withIgnore("base64dump.log", "AUDIT")
    	if err != nil {
    		log.Fatal(err)
    	}
    	_, err = buff.WriteTo(os.Stdout)
    	if err != nil {
    		log.Fatal(err)
    	}
    }
Benchmarks
    package main

    import "testing"

    func BenchmarkTestWithoutIgnore(b *testing.B) {
    	for i := 0; i < b.N; i++ {
    		_, err := withoutIgnore("base64dump.log")
    		if err != nil {
    			b.Fatal(err)
    		}
    	}
    }

    func BenchmarkTestWithIgnore(b *testing.B) {
    	for i := 0; i < b.N; i++ {
    		_, err := withIgnore("base64dump.log", "AUDIT")
    		if err != nil {
    			b.Fatal(err)
    		}
    	}
    }
The file "base64dump.log" can be generated from the command line with:

    base64 /dev/urandom | head -c 10000000 > base64dump.log
Solution
Since ASCII is guaranteed, the code can work directly at the byte level.
However, if every input byte is checked for a newline character while reading, and each line is then searched again for the pattern, several operations are applied to each byte.
If, on the other hand, you read a block of input and run an optimized pattern search over the whole block, the search does not even need to examine every input byte, which minimizes the number of operations per input byte.
The Boyer-Moore string search algorithm is one example of such a search. Go's built-in bytes.Index function is also optimized. The speed achieved will of course depend on the input data and the actual pattern. For the input specified in the question, bytes.Index performed significantly better when measured.
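To make the idea concrete, here is a minimal in-memory sketch: one bytes.Index search over a whole block, with newline lookups only around each hit. The name removeLines is hypothetical and not part of the original answer, and the sketch assumes that the block holds only complete lines and that the pattern is non-empty and contains no newline.

```go
package main

import (
	"bytes"
	"fmt"
)

// removeLines returns a copy of block without the lines that contain pattern.
// Assumptions of this sketch: block holds only complete lines, and pattern is
// non-empty and contains no newline.
func removeLines(block, pattern []byte) []byte {
	out := make([]byte, 0, len(block))
	s := 0 // start of the part of block that has not been copied yet
	for s < len(block) {
		hit := bytes.Index(block[s:], pattern)
		if hit < 0 {
			break
		}
		hit += s
		// Start of the matching line: one past the previous newline (0 if none).
		start := bytes.LastIndexByte(block[:hit], '\n') + 1
		// End of the matching line: one past the next newline (len(block) if none).
		end := len(block)
		if i := bytes.IndexByte(block[hit:], '\n'); i >= 0 {
			end = hit + i + 1
		}
		out = append(out, block[s:start]...) // keep everything before the matching line
		s = end                              // resume after the matching line
	}
	return append(out, block[s:]...)
}

func main() {
	block := []byte("keep 1\ndrop AUDIT\nkeep 2\n")
	fmt.Printf("%q\n", removeLines(block, []byte("AUDIT"))) // prints "keep 1\nkeep 2\n"
}
```

Newlines are only looked up around a hit, so blocks with few matches are handled almost entirely by bytes.Index.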
Procedure
- Read a block. The block size should be significantly larger than the maximum line length; a value >= 64 KB should probably be good. The tests used the 1 MB from the question.
- A block usually does not end with a newline character, so search backward from the end of the block for the last newline, limit the pattern search to the slice up to that newline, and keep the remaining data for the next pass.
- The last block does not necessarily end with a newline character.
- Use Go's high-performance bytes.Index function to find where the pattern occurs in the block.
- From the position found, search for the preceding and the following newline character.
- Output the block up to the beginning of the matching line.
- Continue searching from the end of the line in which the pattern occurred.
- If the search finds no further match, output the rest of the block.
- Read the next block and repeat the steps described until the end of the file is reached.
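The carry-over step in the list above (cut the block at its last newline and keep the tail for the next pass) can be sketched in isolation. The name splitAtLastNewline is hypothetical. Note one deliberate difference from the answer's code: for a block with no newline at all, this sketch carries the whole block over.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitAtLastNewline splits chunk into the part that ends with its last
// newline and the trailing partial line. If chunk contains no newline,
// complete is empty and the whole chunk is returned as rest.
func splitAtLastNewline(chunk []byte) (complete, rest []byte) {
	i := bytes.LastIndexByte(chunk, '\n') // -1 if there is no newline
	return chunk[:i+1], chunk[i+1:]
}

func main() {
	var rest []byte
	// Two reads that cut "line 3" in the middle.
	for _, data := range []string{"line 1\nline 2\nli", "ne 3\n"} {
		// Prepend the partial line left over from the previous pass
		// (copied so that the previous chunk is not modified).
		chunk := append(append([]byte{}, rest...), data...)
		var complete []byte
		complete, rest = splitAtLastNewline(chunk)
		fmt.Printf("complete=%q carried over=%q\n", complete, rest)
	}
}
```

Only the complete part is searched for the pattern, so a match can never be split across two blocks as long as every line fits into one block.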
Note
A read operation may return less data than the block size, so it makes sense to repeat the read operation until the block size of data is read.
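The standard library's io.ReadFull already repeats reads until the buffer is full, so one way to sketch this is a thin wrapper around it. The names readFull and shortReader are hypothetical; shortReader exists only to simulate a source that returns fewer bytes than requested.

```go
package main

import (
	"fmt"
	"io"
)

// shortReader returns at most max bytes per Read call, like a slow source.
type shortReader struct {
	data []byte
	max  int
}

func (s *shortReader) Read(p []byte) (int, error) {
	if len(s.data) == 0 {
		return 0, io.EOF
	}
	n := s.max
	if n > len(p) {
		n = len(p)
	}
	if n > len(s.data) {
		n = len(s.data)
	}
	copy(p, s.data[:n])
	s.data = s.data[n:]
	return n, nil
}

// readFull fills buf from r, repeating reads until buf is full or the input
// ends. It reports the number of bytes read and whether the input has ended.
func readFull(r io.Reader, buf []byte) (n int, last bool, err error) {
	n, err = io.ReadFull(r, buf)
	// io.EOF: nothing was read; io.ErrUnexpectedEOF: a partial final block.
	if err == io.EOF || err == io.ErrUnexpectedEOF {
		return n, true, nil
	}
	return n, false, err
}

func main() {
	r := &shortReader{data: []byte("0123456789"), max: 3}
	buf := make([]byte, 4)
	for {
		n, last, err := readFull(r, buf)
		if err != nil {
			fmt.Println("read error:", err)
			return
		}
		fmt.Printf("%q last=%v\n", buf[:n], last)
		if last {
			return
		}
	}
}
```

Even though shortReader hands out at most 3 bytes per call, each block except the last comes back with the full 4 bytes.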
Benchmark
Optimized code is usually much more complex, but also performs significantly better, as we will see later.
    BenchmarkTestWithoutIgnore-8    270     4137267 ns/op
    BenchmarkTestWithIgnore-8        54    22403931 ns/op
    BenchmarkTestFilter-8           150     7947454 ns/op
Here, the optimized code (BenchmarkTestFilter-8) is only about 1.9 times slower than the operation without filtering, while the BenchmarkTestWithIgnore-8 method is 5.4 times slower than the comparison value without filtering.
Looking at it from another perspective: the optimized code is 2.8 times faster than the unoptimized code.
Code
Here is the code for your own testing:

    func filterFile(f, pattern string) (*bytes.Buffer, error) {
    	rfd, err := os.Open(f)
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer func() {
    		if err := rfd.Close(); err != nil {
    			log.Fatal(err)
    		}
    	}()
    	reader := bufio.NewReader(rfd)
    	return filter(reader, []byte(pattern), 1024*1024)
    }

    // chunkSize must be larger than the longest line
    // a reasonable size is probably >= 64K
    func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error) {
    	var bs []byte
    	buffer := bytes.NewBuffer(bs)
    	chunk := make([]byte, chunkSize)
    	var remaining []byte
    	for lastChunk := false; !lastChunk; {
    		n, err := readChunk(reader, chunk, remaining, chunkSize)
    		if err != nil {
    			if err == io.EOF {
    				lastChunk = true
    			} else {
    				return nil, err
    			}
    		}

    		// Cut the chunk at its last newline; the partial line after it
    		// is carried over to the next pass.
    		remaining = remaining[:0]
    		if !lastChunk {
    			for i := n - 1; i >= 0; i-- {
    				if chunk[i] == '\n' {
    					remaining = append(remaining, chunk[i+1:n]...)
    					n = i + 1
    					break
    				}
    			}
    		}

    		s := 0 // start of the data not yet written
    		for s < n {
    			hit := bytes.Index(chunk[s:n], pattern)
    			if hit < 0 {
    				break
    			}
    			hit += s
    			// Walk back to the first byte of the matching line.
    			startOfLine := hit
    			for ; startOfLine > 0; startOfLine-- {
    				if chunk[startOfLine-1] == '\n' {
    					break
    				}
    			}
    			// Walk forward to the newline that ends the matching line.
    			endOfLine := hit + len(pattern)
    			for ; endOfLine < n; endOfLine++ {
    				if chunk[endOfLine] == '\n' {
    					break
    				}
    			}
    			endOfLine++

    			_, err = buffer.Write(chunk[s:startOfLine])
    			if err != nil {
    				return nil, err
    			}
    			s = endOfLine
    		}

    		// No further match: write the rest of the chunk.
    		if s < n {
    			_, err = buffer.Write(chunk[s:n])
    			if err != nil {
    				return nil, err
    			}
    		}
    	}

    	return buffer, nil
    }

    // readChunk copies remaining to the front of chunk and then reads until
    // the chunk is full or the reader reports an error such as io.EOF.
    func readChunk(reader io.Reader, chunk, remaining []byte, chunkSize int) (int, error) {
    	copy(chunk, remaining)
    	r := len(remaining)
    	for r < chunkSize {
    		n, err := reader.Read(chunk[r:])
    		r += n
    		if err != nil {
    			return r, err
    		}
    	}
    	return r, nil
    }
The benchmark section might look like this:
    func BenchmarkTestFilter(b *testing.B) {
    	for i := 0; i < b.N; i++ {
    		_, err := filterFile("base64dump.log", "AUDIT")
    		if err != nil {
    			b.Fatal(err)
    		}
    	}
    }
The filter function is split in two; the actual work is done in func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error).
Injecting the reader and chunkSize prepares the function for unit tests. The tests themselves are missing here, but they are definitely recommended for code that does this much index arithmetic.
However, the point here is to find a way to significantly improve performance.
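One way such a test might be organized is to compare a chunked filter against a deliberately simple reference at several chunk sizes. Everything in this sketch is an assumption rather than part of the original answer: the names filterFunc, referenceFilter, compare and viaReference are hypothetical, and viaReference is only a stand-in so that the example runs on its own. In a real test, the filter function shown above would be passed instead.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// filterFunc has the same signature as the filter function under test.
type filterFunc func(r io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error)

// referenceFilter is a slow but obviously correct implementation:
// split into lines (keeping the newline) and drop those containing pattern.
func referenceFilter(input, pattern []byte) []byte {
	var out []byte
	for _, line := range bytes.SplitAfter(input, []byte("\n")) {
		if !bytes.Contains(line, pattern) {
			out = append(out, line...)
		}
	}
	return out
}

// compare runs f with each chunk size and checks it against the reference.
func compare(f filterFunc, input, pattern []byte, chunkSizes []int) error {
	want := referenceFilter(input, pattern)
	for _, size := range chunkSizes {
		got, err := f(bytes.NewReader(input), pattern, size)
		if err != nil {
			return fmt.Errorf("chunk size %d: %w", size, err)
		}
		if !bytes.Equal(got.Bytes(), want) {
			return fmt.Errorf("chunk size %d: got %q, want %q", size, got.Bytes(), want)
		}
	}
	return nil
}

// viaReference is a stand-in implementation so the sketch is runnable;
// it ignores the chunk size and simply applies referenceFilter.
func viaReference(r io.Reader, pattern []byte, _ int) (*bytes.Buffer, error) {
	input, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	return bytes.NewBuffer(referenceFilter(input, pattern)), nil
}

func main() {
	input := []byte("keep 1\ndrop AUDIT\nkeep 2\nAUDIT again")
	if err := compare(viaReference, input, []byte("AUDIT"), []int{16, 64, 1024}); err != nil {
		fmt.Println("mismatch:", err)
		return
	}
	fmt.Println("all chunk sizes agree")
}
```

Chunk sizes only slightly larger than the longest line are worth including, because they move the chunk boundaries across many different positions in the input.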