Labeling data with Python: how can I make it faster for 5 million records?

Question

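Beginner asking for help: I need to use Python to label a column of words. The rule is that a word containing certain keywords gets a certain label. There are a lot of words, about 5 million. I am labeling them with the code below, but it is very slow and takes more than an hour to finish. Is there a faster way to do this...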

迷茫 · Answer

So you are really only using pandas as a tool for reading the data?

I added a column, is_tobacco, to hold the label you described.

filter_query returns a list of the queries that contain these words, which is more efficient.

Secondly, you can split the data into chunks and process them with multiprocessing, which should speed things up considerably.

import pandas as pd

# One query per line; the single column is named 'query'.
word = pd.read_table('test.txt', encoding='utf-8', names=['query'])

# Keywords that mark a query as tobacco-related.
tobacco = [u'烟', u'白沙', u'黄金叶', u'利群', u'南京九五', u'黄鹤楼软', u'黄鹤楼硬', u'娇子', u'钻石荷花', u'玉溪', u'七匹狼尚品', u'七匹狼软灰']

def contains_tobacco(name):
    # Substring test: True if any keyword occurs anywhere in the query.
    # (name in tobacco would only match queries equal to a keyword.)
    return any(t in name for t in tobacco)

def signquery(word):
    # Add a boolean column is_tobacco to the DataFrame.
    word['is_tobacco'] = word['query'].apply(contains_tobacco)
    return word

def filter_query(word):
    # Return the matching queries as a plain list.
    return word[word['query'].apply(contains_tobacco)]['query'].tolist()

result = filter_query(word)

print(result)
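
The suggestion above to split the data and use multiprocessing could look roughly like the sketch below. It is only an illustration: it works on a plain list of strings rather than a DataFrame, and the helper names split_chunks, label_chunk and label_parallel are made up for this example. Whether it actually helps depends on how the cost of sending the chunks to the worker processes compares with the matching work itself.

```python
import re
from multiprocessing import Pool

# Keywords taken from the answer above.
TOBACCO = ['烟', '白沙', '黄金叶', '利群', '南京九五', '黄鹤楼软',
           '黄鹤楼硬', '娇子', '钻石荷花', '玉溪', '七匹狼尚品', '七匹狼软灰']
PATTERN = re.compile('|'.join(re.escape(t) for t in TOBACCO))

def split_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def label_chunk(queries):
    """Return one boolean per query: True if any keyword occurs in it."""
    return [PATTERN.search(q) is not None for q in queries]

def label_parallel(queries, processes=4):
    """Label all queries with a pool of worker processes, keeping the order."""
    chunks = split_chunks(queries, processes)
    with Pool(processes) as pool:
        results = pool.map(label_chunk, chunks)
    # Flatten the per-chunk results back into one list.
    return [flag for chunk in results for flag in chunk]

if __name__ == '__main__':
    # Hypothetical sample data standing in for the 5 million queries.
    sample = ['买烟', '天气预报', '玉溪多少钱', '火车票']
    print(label_parallel(sample, processes=2))
```

With a DataFrame, the list passed in could be word['query'].tolist(), and the returned flags could be assigned back as the is_tobacco column.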

怪我咯 · Answer

You can try using regular expressions:

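import re
# One alternation pattern covering every keyword.
pattern = re.compile(u'烟|白沙|黄金叶|利群|南京九五|黄鹤楼软|黄鹤楼硬|娇子|钻石荷花|玉溪|七匹狼尚品|七匹狼软灰')
# word is the DataFrame read in the answer above; list() is needed on Python 3, where filter is lazy.
result = list(filter(pattern.search, word['query']))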
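
If the data is already in a pandas DataFrame, the same regular expression can also be applied to the whole column at once with Series.str.contains. The snippet below is a minimal sketch with a small made-up sample in place of the real file; it may or may not be faster than the filter version on your data, so it is worth timing both.

```python
import re
import pandas as pd

tobacco = ['烟', '白沙', '黄金叶', '利群', '南京九五', '黄鹤楼软',
           '黄鹤楼硬', '娇子', '钻石荷花', '玉溪', '七匹狼尚品', '七匹狼软灰']
# Escape each keyword so it is matched literally, then join with '|'.
pattern = '|'.join(re.escape(t) for t in tobacco)

# Hypothetical sample data standing in for the real 5-million-row column.
word = pd.DataFrame({'query': ['买烟', '天气预报', '玉溪多少钱', '火车票']})

# Test the whole column in one call; na=False treats missing values as non-matches.
word['is_tobacco'] = word['query'].str.contains(pattern, regex=True, na=False)

# Keep only the matching queries, as a plain list.
result = word.loc[word['is_tobacco'], 'query'].tolist()
print(result)
```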

ringa_lee · Answer

KMP algorithm
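
For reference, a minimal sketch of KMP substring search is shown below; the function names build_failure and kmp_find are made up for this example. KMP handles one keyword at a time, so it would have to be run once per keyword, and in CPython the built-in substring test (keyword in text) is implemented in C and is likely to be faster than a pure-Python loop like this one.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1

def contains_any(text, keywords):
    """True if at least one keyword occurs in text."""
    return any(kmp_find(text, kw) != -1 for kw in keywords)
```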

天蓬老师 · Answer

KMP
Manacher
Trie tree
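
Of these, the trie is the one aimed at matching many keywords at once. The sketch below is one simple way to use it, assuming nested dicts as trie nodes and an empty-string key as the end-of-word marker; the names build_trie and contains_any are made up for this example.

```python
END = ''  # marker key; never collides with a real character, which has length 1

def build_trie(keywords):
    """Build a trie of nested dicts, one level per character."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node[END] = True  # a keyword ends at this node
    return root

def contains_any(text, trie):
    """True if any keyword stored in the trie occurs somewhere in text."""
    n = len(text)
    for start in range(n):
        node = trie
        i = start
        # Walk down the trie for as long as the text keeps matching.
        while i < n and text[i] in node:
            node = node[text[i]]
            if END in node:
                return True
            i += 1
    return False

trie = build_trie(['烟', '玉溪', '黄鹤楼软'])
print(contains_any('买玉溪', trie))
print(contains_any('黄鹤楼', trie))
```

Each query is scanned once per start position, but all keywords are checked together during that scan instead of one by one.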
