Regular expressions - Java: how can I quickly read a txt corpus and match a specified string?

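I have a corpus with more than 9M lines; the file is 4 GB. I need to match a specified verb and output the sentences that meet the condition.
But the file is very large. I read one line at a time, and the matching takes a long time. Is there any way to speed up the processing?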

BufferedReader cpreader = new BufferedReader(new InputStreamReader(new FileInputStream(this.getCorpusPath())));
String line = cpreader.readLine();
while (line != null)
{
    ArrayList<String> verbList = new ArrayList<>();
    Matcher matcher_line = Pattern.compile("(.*\\%\\&\\$cook\\%\\&\\$VB.*)").matcher(line);
    if (matcher_line.find())
    {
        System.out.println(line);
    }

    line = cpreader.readLine();
}
迷茫, 2811 days ago, 736

All replies (4)

  • ringa_lee  2017-04-17 17:52:07

    Reading the file itself should not be the problem, but you could try switching to buffered (block) reading, since line lengths are unpredictable and that can hurt efficiency.
    If you are matching a single word, a cheaper matching method may work better; I am not sure a regular expression is needed here.
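To illustrate the second point: once the regex escapes are resolved, the pattern in the question matches the literal text %&$cook%&$VB anywhere in a line, so a plain substring check should select the same lines. A minimal sketch, assuming the corpus is UTF-8 and the path is passed as the first command-line argument (both are assumptions; the class and method names are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class LiteralMatch {
    // The literal text the question's regex looks for once its escapes are resolved.
    static final String TOKEN = "%&$cook%&$VB";

    // Plain substring test; no regex engine involved.
    static boolean matches(String line) {
        return line.contains(TOKEN);
    }

    // Reads line by line through the buffered reader and counts matching lines.
    static long countMatches(BufferedReader reader) throws IOException {
        long count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (matches(line)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a hypothetical corpus path; UTF-8 is assumed, not known from the question.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.println(countMatches(reader));
        }
    }
}
```

Whether this is noticeably faster than a precompiled regex on a 4 GB file is something to measure rather than assume; disk throughput may dominate either way.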

  • 高洛峰  2017-04-17 17:52:07

    Your program processes the file line by line on a single thread, which is bound to be slow. Use multiple threads: each thread processes a line and then requests the next one. When reading, it is best to read a batch of lines into a cache and then distribute them across the threads, so that the CPU is fully used.
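The batching idea above could be sketched roughly like this: one reader thread fills batches of lines and hands each batch to a fixed thread pool. The batch size, the thread count, the substring check, and all names here are assumptions for illustration, not something taken from the question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchedMatcher {
    static final String TOKEN = "%&$cook%&$VB";

    // Work done by a pool thread: keep only the lines that contain the token.
    static List<String> filterBatch(List<String> batch) {
        List<String> hits = new ArrayList<>();
        for (String line : batch) {
            if (line.contains(TOKEN)) {
                hits.add(line);
            }
        }
        return hits;
    }

    static Future<List<String>> submit(ExecutorService pool, List<String> batch) {
        Callable<List<String>> task = () -> filterBatch(batch);
        return pool.submit(task);
    }

    // Reads on the calling thread, matches on the pool, returns hits in file order.
    static List<String> findMatches(BufferedReader reader, int threads, int batchSize)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    pending.add(submit(pool, batch));
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                pending.add(submit(pool, batch)); // final, partially filled batch
            }
            List<String> hits = new ArrayList<>();
            for (Future<List<String>> f : pending) {
                hits.addAll(f.get()); // futures are drained in submission order
            }
            return hits;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is a hypothetical corpus path; UTF-8 and a batch of 10,000 lines are assumptions.
        int threads = Runtime.getRuntime().availableProcessors();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            findMatches(reader, threads, 10_000).forEach(System.out::println);
        }
    }
}
```

Note that this sketch does not bound the work queue: if the workers were slower than the reader, unprocessed batches would pile up in memory. For a 4 GB file a bounded queue or a semaphore around submission would be a sensible addition.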

  • PHP中文网  2017-04-17 17:52:07

    Use NIO plus multithreading.
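One possible reading of this suggestion is a parallel stream over java.nio.file.Files.lines. The sketch below assumes a UTF-8 corpus and a substring check in place of the regex; the class and method names are hypothetical:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class NioParallelMatch {
    static final String TOKEN = "%&$cook%&$VB";

    // Streams the file lazily and filters lines on the common fork-join pool.
    static List<String> findMatches(Path corpus) throws IOException {
        // try-with-resources closes the underlying file when the stream is done.
        try (Stream<String> lines = Files.lines(corpus, StandardCharsets.UTF_8)) {
            return lines.parallel()
                        .filter(line -> line.contains(TOKEN))
                        .collect(Collectors.toList()); // keeps file order
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a hypothetical corpus path.
        findMatches(Paths.get(args[0])).forEach(System.out::println);
    }
}
```

How well a parallel Files.lines stream actually splits the work depends on the JDK version and the file, so it is worth timing this against the single-threaded loop before relying on it. Collecting into a list also assumes the number of matching lines fits comfortably in memory.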

  • 怪我咯  2017-04-17 17:52:07

    Pattern.compile("(.*\\%\\&\\$cook\\%\\&\\$VB.*)")

    This call sits inside the loop, so the regular expression is recompiled for every line, which is very slow. Move it outside the while loop and see whether that helps.
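A minimal sketch of that change: the pattern is compiled once as a constant, and a single Matcher is reused via reset(). The regex is the one from the question; the class name, method name, and the command-line path are assumptions for illustration:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrecompiledPattern {
    // Compiled exactly once, outside any loop.
    static final Pattern COOK_VB = Pattern.compile("(.*\\%\\&\\$cook\\%\\&\\$VB.*)");

    static List<String> findMatches(BufferedReader reader) throws IOException {
        List<String> hits = new ArrayList<>();
        // One Matcher object, pointed at each new line with reset().
        Matcher matcher = COOK_VB.matcher("");
        String line;
        while ((line = reader.readLine()) != null) {
            if (matcher.reset(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // args[0] stands in for getCorpusPath() from the question (hypothetical).
        // Like the question's code, this uses the platform default charset.
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(args[0])))) {
            for (String hit : findMatches(reader)) {
                System.out.println(hit);
            }
        }
    }
}
```

Since find() already searches anywhere in the line, the leading and trailing .* and the capturing group could probably be dropped as well, which would spare the regex engine some backtracking.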
