How can I search in parallel for the lines that contain keywords in Python?

Question

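I have several hundred thousand keywords in the file 4.txt and want to extract the lines of 3.txt that contain a keyword, saving them to 5.txt. 3.txt has 2 million lines. The code below does what I need, but it is very slow: it still had not finished after a whole afternoon. Does anyone have a faster approach? Rewriting it to run in parallel...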

阿神 · Answer

Without the actual files I can't give you a 100% guarantee, but I do have a few suggestions for making your code more efficient:

(You may find that the improved code does not need a parallel solution at all.)

The first big problem is readlines(). This method reads every line of the file object in one go, which is clearly poor for both speed and memory use. Reading hundreds of thousands or millions of lines all at once is a scary thing to do.

For a detailed analysis and discussion, see Never call readlines() on a file.

(This passage from the article could almost serve as a warning label.)

There are hundreds of questions on places like StackOverflow about the readlines method, and in every case, the answer is the same.
"My code takes forever before it even gets started, but it's pretty fast once it gets going."
That's because you're calling readlines.
"My code seems to be worse than linear on the size of the input, even though it's just a simple loop."
That's because you're calling readlines.
"My code can't handle giant files because it runs out of memory."
That's because you're calling readlines.

Conclusion: replace every use of readlines.

For example, code like this:

with open('XXX', 'r') as f:
    for line in f.readlines():
       # do something...

should always be changed to:

with open('XXX', 'r') as f:
    for line in f:
       # do something...

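Iterating over the file object reads one line at a time instead of building a list of every line first, so intuitively it should be considerably more efficient.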

Second, you use a list to look up keywords, which is also quite inefficient:

for i in a:
    if i in new_line:

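To check whether new_line contains keyword i, this walks the entire keyword list a. That may be fine in ordinary cases, but with hundreds of thousands of keywords, walking a once for every line wastes a great deal of time. If a holds x keywords, f3 has y lines, and each line has z characters, the time spent is on the order of x*y*z (given the size of your files, that is a staggering number).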

Simply switching to a container that uses hashing for lookups, such as a dictionary or a set, will certainly be better.
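To make the difference concrete, here is a minimal sketch comparing a membership test on a list with the same test on a set. The keyword data is made up for illustration (it is not the asker's 4.txt), and note that a set lookup tests exact equality, whereas the original `i in new_line` is a substring test, so the two are only interchangeable when the extracted field is expected to equal a keyword. For scale, assuming roughly 10^5 keywords (an illustrative figure) and 2×10^6 lines, the nested loop performs on the order of 2×10^11 substring tests, while a set needs about one hash lookup per line.

```python
import timeit

# Hypothetical data: 100,000 made-up keywords, not the asker's real file.
keyword_list = [f"kw{i}" for i in range(100_000)]
keyword_set = set(keyword_list)

# A probe that is absent is the worst case for the list:
# every element has to be compared before giving up.
probe = "not-a-keyword"

list_time = timeit.timeit(lambda: probe in keyword_list, number=20)
set_time = timeit.timeit(lambda: probe in keyword_set, number=20)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

The exact timings depend on the machine, but the list time grows with the number of keywords while the set time stays roughly constant.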

Finally, about your search step:

for li in f3.readlines():
    new_line = li.strip().split()[1][:-2]
    for i in a:
        if i in new_line:
            f5.writelines(li)

I don't quite follow this part. new_line looks like a substring, and now you want to compare that string against the keywords?

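Setting that aside, once a line whose new_line contains a keyword has been written out, there seems to be no reason to keep looping over a, unless you mean to write the line once for every keyword found in new_line. Otherwise, adding a break will also speed things up.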

I suggest changing your code to:

with open('3.txt') as f3, open('4.txt') as f4, open('result.txt', 'w') as f5:
    # A set gives constant-time average lookups, unlike scanning a list.
    keywords = set(line.strip() for line in f4)
    for line in f3:  # iterate lazily instead of calling readlines()
        new_line = line.strip().split()[1][:-2]
        # Note: new_line is a str, so this loop visits single characters.
        # To match the whole field instead, test `new_line in keywords`.
        for word in new_line:
            if word in keywords:
                # line still ends with '\n', so suppress print's own newline
                print(line, end='', file=f5)
                break  # write each matching line only once

If I have misunderstood what you mean, feel free to tell me and we can discuss it further. Intuitively, you should not need parallelism to solve this problem.

伊谢尔伦 · Answer

Use an Aho-Corasick automaton (AC automaton).
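The answer gives no code, so the following is only a minimal pure-Python sketch of the idea, assuming the goal is to test whether a line contains any keyword as a substring. The function names `build_automaton` and `contains_any` are hypothetical, chosen for illustration. The automaton is built once from all keywords, after which each line is scanned in a single pass regardless of how many keywords there are.

```python
from collections import deque


def build_automaton(keywords):
    """Build the trie, failure links and output flags for the keywords."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state to fall back to on a mismatch
    out = [False]   # out[state] -> True if some keyword ends at this state
    for word in keywords:
        if not word:
            continue  # skip empty keywords
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append(False)
            state = nxt
        out[state] = True
    # Breadth-first traversal: failure links of shallower states come first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A keyword ending at the fallback state also ends here.
            out[nxt] = out[nxt] or out[fail[nxt]]
    return goto, fail, out


def contains_any(text, automaton):
    """Return True if text contains at least one keyword as a substring."""
    goto, fail, out = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            return True
    return False


automaton = build_automaton(["he", "she", "hers"])
print(contains_any("ushers", automaton))
print(contains_any("xyz", automaton))
```

With hundreds of thousands of keywords, a dict-per-node trie in pure Python may use a lot of memory, so a compiled implementation would likely be more practical; this sketch is only meant to show the mechanism.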

黄舟 · Answer

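Based on @dokelung's answer, with slight modifications, this basically does what I need. Its output differs somewhat from that of grep -f 4.txt 3.txt > 5.txt, and I am comparing the two result files to see where they differ.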

with open('3.txt') as f3, open('4.txt') as f4, open('result.txt', 'w') as f5:
    keywords = set(line.strip() for line in f4)
    for line in f3:
        new_line = line.strip().split()[1][:-2]
        if new_line in keywords:
            print(line.strip(), file=f5)
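One likely source of the difference is that the two approaches test different things: the code above keeps a line only when the extracted field equals a keyword exactly, while grep -f matches a pattern anywhere in the line. The sketch below illustrates this on made-up sample lines. The helper names are hypothetical, and the substring test only approximates grep: without -F, grep treats each line of 4.txt as a regular expression, so plain substring matching is closer to grep -F -f.

```python
def field_key(line):
    """Extract the same field as the code above: second column minus its last two characters."""
    parts = line.strip().split()
    if len(parts) < 2:
        return None  # the original code would raise IndexError here
    return parts[1][:-2]


def exact_field_match(line, keywords):
    """The Python approach: the extracted field must equal a keyword."""
    return field_key(line) in keywords


def substring_match(line, keywords):
    """Approximation of grep -F -f: any keyword anywhere in the line."""
    return any(k in line for k in keywords)


# Made-up sample data for illustration only.
keywords = {"abc"}
lines = [
    "id1 abc/1 x",    # field is exactly "abc"
    "id2 xabcy/1 x",  # keyword is only part of the field
    "id3 zzz/1 abc",  # keyword appears in a different column
]

for line in lines:
    print(line, exact_field_match(line, keywords), substring_match(line, keywords))
```

On this sample the substring test accepts all three lines while the exact field test accepts only the first, which would make the grep output a superset of result.txt.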
