How to implement line-by-line deduplication of text in Python

WBOY
2016-12-05 13:27:13

Requirement:

Each line contains a number after /promotion/. Lines with the same number are treated as the same line, and only one copy of each is kept.

Approach:

Use a dictionary together with string splitting.

Create an empty dictionary.

Read the text line by line and take the first field of each line (the part before the first space) as the key. For each line, look the key up in the dictionary. If it is not there, store the line under that key. If it is already there, the line is a duplicate and is skipped. This keeps exactly one line for each group of duplicates.
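To make the splitting step concrete, here is a minimal sketch of how the key could be taken from a single line. The helper name `promotion_key` is hypothetical, and the sketch assumes the path and the rest of the line are separated by a space, as in the sample input.

```python
def promotion_key(line):
    """Return the part of the line before the first space, e.g. '/promotion/25113'."""
    # Hypothetical helper: assumes the fields are separated by a space.
    return line.split(' ', 1)[0]

a = "/promotion/25113 LandingPage/mhd\n"
b = "/promotion/25199 com/LandingPage\n"

print(promotion_key(a))                      # /promotion/25113
print(promotion_key(a) == promotion_key(a))  # True: same number, same row
print(promotion_key(a) == promotion_key(b))  # False: different numbers
```

Two lines count as duplicates exactly when this key is equal, regardless of what follows the first space.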

The input text (saved as 1.txt) is as follows:

/promotion/232 utm_source
/promotion/237 LandingPage/borrowExtend/? ;
/promotion/25113 LandingPage/mhd
/promotion/25113 LandingPage/mhd
/promotion/25199 com/LandingPage
/promotion/254 LandingPage/mhd/mhd4/? ;
/promotion/259 LandingPage/ydy/? ;
/promotion/25113 LandingPage/mhd
/promotion/25199 com/LandingPage
/promotion/25199 com/LandingPage

The program is as follows:

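line_dict_uniq = dict()
with open('1.txt', 'r') as fd:
    for line in fd:
        # The key is the first space-separated field, e.g. "/promotion/232".
        key = line.split(' ')[0]
        # Linear scan over the stored keys: correct, but slow for large files.
        if key not in list(line_dict_uniq.keys()):
            line_dict_uniq[key] = line
# print(...) with a single argument works in both Python 2 and Python 3.
print(line_dict_uniq)
print(len(line_dict_uniq))
# This prints the unique lines (each duplicated line appears only once).
# In practice you would write this result to a file; that code is omitted here.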
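The file-writing step that the author leaves out could look roughly like the sketch below. The function name `write_unique_lines` and the output file name `1_uniq.txt` are assumptions made for illustration, and the demo runs on a throwaway copy of a few sample lines. It relies on dictionaries preserving insertion order (Python 3.7 and later), so the output follows first-seen order.

```python
import os
import tempfile

def write_unique_lines(src_path, dst_path):
    """Copy src_path to dst_path, keeping the first line seen for each key."""
    line_dict_uniq = {}
    with open(src_path, 'r') as fd:
        for line in fd:
            key = line.split(' ')[0]
            if key not in line_dict_uniq:
                line_dict_uniq[key] = line
    with open(dst_path, 'w') as out:
        # Values come back in insertion order on Python 3.7+.
        out.writelines(line_dict_uniq.values())
    return len(line_dict_uniq)

# Demo on a temporary directory so no real file is overwritten.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, '1.txt')
dst = os.path.join(tmp_dir, '1_uniq.txt')
with open(src, 'w') as f:
    f.write("/promotion/232 utm_source\n"
            "/promotion/25113 LandingPage/mhd\n"
            "/promotion/25113 LandingPage/mhd\n"
            "/promotion/25199 com/LandingPage\n")

print(write_unique_lines(src, dst))  # 3
```

Returning the count keeps the same information the original program prints with `len(line_dict_uniq)`.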

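The program above is inefficient: for every input line it scans the entries already stored in the dictionary, so the work per line grows with the number of entries. Testing the key against the dictionary's keys, ideally as a direct hash lookup, is faster: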

line_dict_uniq = dict()
with open('1.txt', 'r') as fd:
    for line in fd:
        key = line.split(' ')[0]
        # Membership test on the dict itself is a hash lookup, not a scan.
        if key not in line_dict_uniq:
            line_dict_uniq[key] = line
print(line_dict_uniq)
print(len(line_dict_uniq))
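If the file is too large to keep every unique line in memory, one variation is to remember only the keys in a set and emit each line as soon as it is first seen. This is a sketch of that alternative rather than the method used above; `iter_unique` is a hypothetical name.

```python
def iter_unique(lines):
    """Yield each line whose first space-separated field has not been seen before."""
    seen = set()
    for line in lines:
        key = line.split(' ')[0]
        if key not in seen:
            seen.add(key)
            yield line

sample = [
    "/promotion/25113 LandingPage/mhd",
    "/promotion/25199 com/LandingPage",
    "/promotion/25113 LandingPage/mhd",
    "/promotion/25199 com/LandingPage",
]
for line in iter_unique(sample):
    print(line)
```

Because it is a generator, it also accepts an open file object directly, for example `iter_unique(fd)` inside a `with open(...)` block.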

