The file looks like this:
(data pair) (info)
----------------- ------------------------
1 2 3 4 5
----------------- ------------------------
pr333 sd23a2 thisisa 1001 1005
pr333 sd23a2 sentence 1001 1005
pr33w sd11aa we 1022 1002
pr33w sd11aa have 1022 1002
pr33w sd11aa adream 1033 1002
......
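Columns 1 and 2 together form a data pair.
When two rows share the same data pair, compare the remaining columns; wherever the values differ, join them with | so that the rows merge into a single row.
The result should look like this: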
pr333 sd23a2 thisisa|sentence 1001 1005
pr33w sd11aa we|have|adream 1022|1033 1002
....
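I'm a beginner and don't know how to do this. The only idea I had was to use a dictionary, but that doesn't seem to work. Any help would be appreciated.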
阿神2017-04-17 17:52:17
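To preserve the order of the output you need OrderedDict: use an OrderedDict for the keys to keep the row order, and a list for the trailing fields to keep their order. If the order of the trailing fields does not matter, a set is a better choice.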
import re
from collections import OrderedDict

datas = OrderedDict()
with open('info.txt') as f:
    for line in f:
        # only data rows start with 'pr'; skip headers and blank lines
        if not line.strip().startswith('pr'):
            continue
        items = re.split(r'\s+', line.strip())
        key = ' '.join(items[:2])
        if key not in datas:
            datas[key] = [[item] for item in items[2:]]
        else:
            for item, data in zip(items[2:], datas[key]):
                if item not in data:  # only record values not seen yet in this column
                    data.append(item)

for key, value in datas.items():
    print(key, *map('|'.join, value))
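If the order of the merged values really does not matter, each column can be collected into a set instead of a list. The following is only a sketch of that variant: the function name `merge_unordered` and the inline sample rows are assumptions made for illustration, and `sorted()` is used purely so the printed result is deterministic.

```python
from collections import OrderedDict

def merge_unordered(lines):
    """Group rows by their first two columns, collecting the rest in sets."""
    datas = OrderedDict()  # keeps keys in order of first appearance
    for line in lines:
        items = line.split()
        if len(items) < 3:
            continue  # skip blank or malformed lines
        key = ' '.join(items[:2])
        if key not in datas:
            datas[key] = [set() for _ in items[2:]]
        for item, column in zip(items[2:], datas[key]):
            column.add(item)  # a set silently ignores duplicates
    return datas

sample = [
    'pr333 sd23a2 thisisa 1001 1005',
    'pr333 sd23a2 sentence 1001 1005',
    'pr33w sd11aa we 1022 1002',
]
merged = merge_unordered(sample)
for key, columns in merged.items():
    # sorted() only makes the output order predictable
    print(key, *('|'.join(sorted(column)) for column in columns))
```

The membership check disappears because `set.add` already deduplicates, at the cost of losing the original order of the values.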
阿神2017-04-17 17:52:17
Let me explain the considerations behind this code.
The first is order. There are two kinds of order here: the order of the output lines, and the order of the values after they are merged. Observe that:
pr333 sd23a2 thisisa 1001 1005
pr333 sd23a2 sentence 1001 1005
pr33w sd11aa we 1022 1002
pr33w sd11aa have 1022 1002
pr33w sd11aa adream 1033 1002
becomes:
pr333 sd23a2 thisisa|sentence 1001 1005
pr33w sd11aa we|have|adream 1022|1033 1002
The order of the output lines must be preserved: pr333 comes before pr33w.
The order of the merged values must be preserved: thisisa comes before sentence.
This means the data types we use must maintain insertion order.
The second is speed. Membership tests on sequence types are linear searches, so for efficiency it is better to use a mapping type.
Taking these considerations together, OrderedDict is a good choice, as moling3650 said. It solves the ordering of the output lines. For the merged values, however, only the keys are needed and the values go unused, so OrderedDict is a slightly wasteful fit. Since the standard library currently has no OrderedSet, we make do with it.
For more about OrderedDict, see OrderedDict.
There is also a third-party OrderedSet library: OrderedSet.
Alternatively, you can implement one yourself; see OrderedSet (Python recipe).
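As a rough illustration of rolling your own, the sketch below wraps a plain dict, which preserves insertion order in Python 3.7 and later. The class name `SimpleOrderedSet` and its methods are hypothetical, and it is far less complete than the recipe or the third-party library mentioned above.

```python
class SimpleOrderedSet:
    """A tiny ordered set: unique items, kept in insertion order."""

    def __init__(self, iterable=()):
        # dict keys are unique and remember the order they were inserted in
        self._items = dict.fromkeys(iterable)

    def add(self, item):
        self._items.setdefault(item, None)

    def __contains__(self, item):
        return item in self._items  # hash lookup, not a linear search

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


words = SimpleOrderedSet()
for word in ['we', 'have', 'we', 'adream', 'have']:
    words.add(word)
print('|'.join(words))  # prints we|have|adream
```

This is the same trick the answer's code relies on: the mapping's keys carry the data, and the values are ignored.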
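Finally, linkse7en makes a very good point: for file-processing problems like this one, reading, processing, and writing in a single pass is both efficient (the file is traversed only once; for discussion, see moling's point in the comments) and light on resources (each line is output immediately, so nothing has to be kept in memory). However, because rows with the same data pair may not be adjacent, this answer still spends a little more memory to stay robust.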
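For comparison, here is a sketch of the process-as-you-read approach. It assumes that rows sharing a data pair are always adjacent, which is exactly the assumption this answer declines to make; the function name `merge_adjacent` and the sample rows are hypothetical.

```python
from itertools import groupby

def merge_adjacent(lines):
    """Yield one merged row per run of adjacent lines sharing a data pair."""
    rows = (line.split() for line in lines if line.strip())
    for pair, group in groupby(rows, key=lambda items: tuple(items[:2])):
        # transpose the remaining columns of this run of rows
        columns = zip(*(items[2:] for items in group))
        # dict.fromkeys drops duplicates while keeping first-seen order
        merged = ['|'.join(dict.fromkeys(column)) for column in columns]
        yield ' '.join(pair + tuple(merged))

sample = [
    'pr333 sd23a2 thisisa 1001 1005',
    'pr333 sd23a2 sentence 1001 1005',
    'pr33w sd11aa we 1022 1002',
    'pr33w sd11aa have 1022 1002',
    'pr33w sd11aa adream 1033 1002',
]
for row in merge_adjacent(sample):
    print(row)
```

Only one run of rows is held in memory at a time. If the same data pair showed up again later in the input, it would be emitted as a second, separate row, which is why the dictionary-based version is the safer choice.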
Code (Python 3):
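from collections import OrderedDict

data = OrderedDict()

DPAIR = slice(0, 2)    # the columns forming the data pair (used as the key)
MSG = slice(2, None)   # the remaining columns

with open('data.txt', 'r') as reader:
    for line in reader:
        line = line.strip()
        items = tuple(line.split())
        if not items:
            continue  # skip blank lines, which would otherwise create an empty key
        # one OrderedDict per remaining column, used as an ordered set
        msgs = data.setdefault(items[DPAIR], [OrderedDict() for msg in items[MSG]])
        for idx, msg in enumerate(msgs):
            msg.setdefault(items[MSG][idx], None)

for (dp1, dp2), msgs in data.items():
    print(dp1, dp2, *['|'.join(msg.keys()) for msg in msgs])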
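A few notes on the code (my version may not be the best, but I have some takeaways to share).
First, the use of the slice class.
As Python programmers, we are all familiar with slicing sequence types.
items[start:stop:step]
can also be written as:
items[slice(start, stop, step)]
# examples
items[:5]   # can be written as items[slice(0, 5)]
items[7:]   # can be written as items[slice(7, None)]
What is the benefit?
We can use this to give slices names. Taking the code for this question as an example, the data pair and the other fields could be extracted with:
items = tuple(line.split())
items[0:2]  # the data pair used as the key
items[2:]   # the other fields
But this is not very clear to read. We can give the two ranges names instead:
DPAIR = slice(0, 2)
MSG = slice(2, None)
items[DPAIR]  # the data pair used as the key
items[MSG]    # the other fields
This gives us a more elegant and readable way to pull values out of items.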
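To make the point concrete, here is a small runnable sketch showing that a named slice is an ordinary object that can be reused on any sequence type. The sample row and the variable names `pair` and `rest` are chosen only for illustration.

```python
DPAIR = slice(0, 2)    # the data pair
MSG = slice(2, None)   # everything after the data pair

items = tuple('pr33w sd11aa adream 1033 1002'.split())
pair = items[DPAIR]
rest = items[MSG]
print(pair)
print(rest)

# the same slice object works on lists and strings too
print(list(items)[DPAIR])
print('abcdef'[MSG])

# a slice exposes its bounds as attributes
print(DPAIR.start, DPAIR.stop, DPAIR.step)
```

Because the bounds live in one place, changing the column layout later means editing two constants rather than hunting for every `[0:2]` and `[2:]`.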
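Second, setdefault. This method is quite handy. For example:
dic.setdefault(key, default_value)
If key exists in the dictionary (or another compatible mapping type), it returns dic[key]. Otherwise, it inserts the new key-value pair dic[key] = default_value into the dictionary and returns default_value.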
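A short sketch of that behavior, using a made-up key and values: the first call inserts the default and returns it, and the second call returns the object that is already stored.

```python
dic = {}

# key is missing: [] is inserted and returned
first = dic.setdefault('pr333 sd23a2', [])
first.append('thisisa')

# key is present: the stored list is returned, the new [] is discarded
second = dic.setdefault('pr333 sd23a2', [])
second.append('sentence')

print(dic)              # {'pr333 sd23a2': ['thisisa', 'sentence']}
print(first is second)  # True
```

Note that the default argument is evaluated on every call, even when the key already exists, so an expensive default is built and thrown away each time.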
The last thing I want to share is unpacking nested tuples:
for (a, b), c, d in [((1, 2), 3, 4)]:
    print(a, b, c, d)  # prints 1 2 3 4
This technique makes it easy to take nested tuples apart.
Thank you all for not complaining that I talk too much...
怪我咯2017-04-17 17:52:17
I think pandas makes this much more convenient:
import pandas as pd
df = pd.read_csv('example.txt', sep=' ', header=None)
df = df.astype(str)  # convert numbers to strings
grouped = df.groupby([0, 1], sort=False)  # sort=False keeps the original row order
result = grouped.agg(lambda x: '|'.join(x.unique()))  # unique() drops repeats, keeping first-seen order
Four lines solve the problem.
I saved the data as example.txt first.
高洛峰2017-04-17 17:52:17
from collections import defaultdict

a = '''
pr333 sd23a2 thisisa 1001 1005
pr333 sd23a2 sentence 1001 1005
pr33w sd11aa we 1022 1002
pr33w sd11aa have 1022 1002
pr33w sd11aa adream 1033 1002
'''

data = defaultdict(dict)
keys = []
for line in a.split('\n'):
    if not line:
        continue
    items = line.split()
    key = ' '.join(items[:2])
    keys.append(key)
    for i, item in enumerate(items[2:]):
        data[key][i] = data[key].get(i, []) + [item]

# unique keys, in order of first appearance
for key in sorted(list(set(keys)), key=keys.index):
    value = data[key]
    print(key, end=' ')
    for i in sorted(value.keys()):
        # unique values for this column, in order of first appearance
        vs = list(set(value[i]))
        vs.sort(key=value[i].index)
        print('|'.join(vs), end=' ')
    print()