Python: how do I merge lines in a file that share duplicate parts?

The file looks like this:

   (data pair)           (info)
-----------------  ------------------------
  1         2         3        4       5
-----------------  ------------------------
pr333    sd23a2    thisisa    1001    1005
pr333    sd23a2    sentence    1001    1005
pr33w    sd11aa    we    1022    1002
pr33w    sd11aa    have    1022    1002
pr33w    sd11aa    adream    1033    1002
......

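Columns 1 and 2 together form a data pair.

If the first two columns are the same, check whether the remaining columns are the same; where they differ, join the values together and merge the rows into one line.

The result should look like this: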

pr333    sd23a2    thisisa|sentence    1001    1005
pr33w    sd11aa    we|have|adream    1022|1033    1002
....

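I'm a beginner and don't know how to do this. All I could think of was a dictionary, but that doesn't seem to work. Any help would be appreciated.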

大家讲道理  2805 days ago  1133


  • 阿神

    阿神2017-04-17 17:52:17

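    If you want to preserve the order of the output, you need OrderedDict: use an OrderedDict for the keys to keep the lines in order, and a list for the remaining fields to keep their values in order. If the order of those values does not matter, a set is a better choice.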

    import re
    from collections import OrderedDict
    
    datas = OrderedDict()
    
    with open('info.txt') as f:
        for line in f:
            if not line.strip().startswith('pr'):
                continue
            items = re.split(r'\s+', line.strip())
            key = ' '.join(items[:2])
            if key not in datas:
                datas[key] = [[item] for item in items[2:]]
            else:
                for item, data in zip(items[2:], datas[key]):
                    if item not in data:  # skip values already recorded for this column
                        data.append(item)
    
    for key, value in datas.items():
        print(key, *map('|'.join, value))

  • 阿神

    阿神2017-04-17 17:52:17

    Let me explain the considerations behind this code.


    The first is order. There are two kinds of order here: the order of the output lines, and the order of the items after they are merged. Observe that:

    pr333    sd23a2    thisisa    1001    1005
    pr333    sd23a2    sentence    1001    1005
    pr33w    sd11aa    we    1022    1002
    pr33w    sd11aa    have    1022    1002
    pr33w    sd11aa    adream    1033    1002

    becomes:

    pr333 sd23a2 thisisa|sentence 1001 1005
    pr33w sd11aa we|have|adream 1022|1033 1002

    1. The order of the output lines matters: pr333 comes before pr33w

    2. The order of the merged items matters: thisisa comes before sentence

    This means the data types we use must preserve order.


    The second is speed. Searching a sequence type is a linear scan, so for efficiency it is better to use a mapping type.

    Weighing these considerations, OrderedDict is a good choice, as moling3650 said. It solves the problem of the output line order. For merging the items, however, only the keys are needed and not the values, so using OrderedDict there is a bit of a waste; but the standard library currently has no OrderedSet, so I make do with it.

    1. For more about OrderedDict, see OrderedDict

    2. There is in fact a third-party OrderedSet library: OrderedSet
      Or you can implement one yourself; see OrderedSet (Python recipe)
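
As a rough illustration of using OrderedDict as a stand-in for an ordered set, here is a minimal sketch. The sample values are made up for illustration; only the keys carry information, and the values are placeholders.

```python
from collections import OrderedDict

# Hypothetical values for one column, in the order they were read
values = ['1022', '1022', '1033', '1022']

seen = OrderedDict()
for v in values:
    seen.setdefault(v, None)   # only the key matters; None is a placeholder value

merged = '|'.join(seen)        # iterating the mapping yields keys in insertion order
print(merged)                  # 1022|1033
```

Membership tests on the mapping are hash lookups rather than linear scans, which is the speed argument made above. On Python 3.7 and later a plain dict also preserves insertion order, so `dict.fromkeys(values)` should behave the same way here.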


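    Finally, linkse7en makes a very good point: for this kind of file-processing problem, if you can write as you read and process as you read, that is definitely efficient (only one pass over the file is needed) (for discussion, see moling's point in the comments) and saves resources (output happens immediately, so no space is wasted storing data). However, since rows with the same data pair may not be adjacent, I still spend a little more memory to keep the solution robust.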
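
For comparison, the read-and-output-as-you-go idea could be sketched roughly as follows. This is not the approach used in this answer: it assumes that rows sharing a data pair are always adjacent, and the function name `merge_adjacent` and the sample rows are only for illustration. It also relies on dict preserving insertion order (Python 3.7+).

```python
from itertools import groupby

def merge_adjacent(lines):
    """Yield merged rows, assuming rows with the same data pair are adjacent."""
    rows = (line.split() for line in lines if line.strip())
    for pair, group in groupby(rows, key=lambda items: tuple(items[:2])):
        # Transpose the remaining fields so each tuple holds one column's values
        columns = zip(*(items[2:] for items in group))
        # dict.fromkeys drops repeats while keeping first-seen order
        merged = ['|'.join(dict.fromkeys(col)) for col in columns]
        yield ' '.join(list(pair) + merged)

sample = [
    'pr333 sd23a2 thisisa 1001 1005',
    'pr333 sd23a2 sentence 1001 1005',
    'pr33w sd11aa we 1022 1002',
    'pr33w sd11aa have 1022 1002',
    'pr33w sd11aa adream 1033 1002',
]
result = list(merge_adjacent(sample))
for row in result:
    print(row)
```

Only one group is held in memory at a time, but if the same data pair shows up again later in the file it would be emitted as a separate line, which is exactly the robustness concern raised above.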


    Code (Python3):

    from collections import OrderedDict
    
    data = OrderedDict()
    
    DPAIR = slice(0,2)
    MSG = slice(2,None)
    
    with open('data.txt', 'r') as reader:
        for line in reader:
            items = tuple(line.split())
            if not items:  # skip blank lines
                continue
    
            # One OrderedDict per column, used as an ordered set of the values seen so far
            msgs = data.setdefault(items[DPAIR], [OrderedDict() for _ in items[MSG]])
            for idx, msg in enumerate(msgs):
                msg.setdefault(items[MSG][idx], None)
    
    for (dp1, dp2), msgs in data.items():
        print(dp1, dp2, *['|'.join(msg.keys()) for msg in msgs])

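    A few notes on the code as well (mine may not be the best way to write it, but there are some lessons worth sharing).

    First, the use of the slice class.

    As Python programmers, we should all be familiar with slicing sequence types.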

    items[start:stop:step]

    can in fact be written as:

    items[slice(start, stop, step)]
    
    # example
    items[:5]  can be written as  items[slice(0,5)]
    items[7:]  can be written as  items[slice(7,None)]

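    So what is the benefit?

    We can use this to give slices names. Taking the code for this question as an example, the data pair and the other data could be extracted with:

    items = tuple(line.split())
    items[0:2]  # the data pair used as the key
    items[2:]   # the remaining data fields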

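    But this is not very clear to read. We can give the two ranges names instead:

    DPAIR = slice(0,2)
    MSG = slice(2,None)
    items[DPAIR] # the data pair used as the key
    items[MSG]   # the remaining data fields

    This lets us pull values out of items in a more elegant and readable way.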
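
To check that named slices behave like the literal slice syntax, here is a small runnable sketch using one sample row from the question. The variable names `pair` and `rest` are just for illustration.

```python
DPAIR = slice(0, 2)      # the data pair used as the key
MSG = slice(2, None)     # the remaining data fields

items = tuple('pr333    sd23a2    thisisa    1001    1005'.split())

pair = items[DPAIR]      # same as items[0:2]
rest = items[MSG]        # same as items[2:]

print(pair)              # ('pr333', 'sd23a2')
print(rest)              # ('thisisa', '1001', '1005')
```

Because slicing a tuple returns a tuple, `items[DPAIR]` is hashable and can be used directly as a dictionary key, which is what the code above depends on.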


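    Next is setdefault, a very handy method. For example:

    dic.setdefault(key, default_value)

    If the key exists in the dictionary (or another compatible mapping type), dic[key] is returned; otherwise the pair dic[key] = default_value is inserted into the dictionary and default_value is returned.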
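
A short sketch of both branches of setdefault, using a made-up dictionary named `groups`:

```python
groups = {}

# Key is absent: the default list is inserted and returned
first = groups.setdefault('pr333', [])
first.append('thisisa')

# Key is present: the existing list is returned and the new default is ignored
second = groups.setdefault('pr333', ['unused'])
second.append('sentence')

print(groups)   # {'pr333': ['thisisa', 'sentence']}
```

Both calls return the same list object, so appending through either name updates the entry stored in the dictionary.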


    The last thing I want to share is unpacking nested tuples:

    for (a, b), c, d in [((1, 2), 3, 4)]:
        print(a, b, c, d)  # prints 1 2 3 4

    This technique makes it easy to take nested tuples apart.

    Thank you everyone for not saying I talk too much...

  • 怪我咯

    怪我咯2017-04-17 17:52:17

    I feel like it’s much more convenient to use pandas

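    import pandas as pd
    df = pd.read_csv('example.txt', sep=r'\s+', header=None)  # split on runs of whitespace
    df = df.astype(str)  # convert numbers to strings
    grouped = df.groupby([0, 1], sort=False)  # sort=False keeps the original group order
    result = grouped.agg(lambda x: '|'.join(x.drop_duplicates()))  # join each column's distinct values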

    Four lines solve the problem
    I saved the document as example.txt first

  • 高洛峰

    高洛峰2017-04-17 17:52:17

    from collections import defaultdict
    
    a = '''
    pr333 sd23a2 thisisa 1001 1005
    pr333 sd23a2 sentence 1001 1005
    pr33w sd11aa we 1022 1002
    pr33w sd11aa have 1022 1002
    pr33w sd11aa adream 1033 1002
    '''
    
    data = defaultdict(dict)
    keys = []
    
    for line in a.split('\n'):
        if not line:
            continue
        items = line.split()
        key = ' '.join(items[:2])
        keys.append(key)
        for i, item in enumerate(items[2:]):
            data[key][i] = data[key].get(i, []) + [item]
    for key in sorted(set(keys), key=keys.index):
        value = data[key]
        print(key, end=' ')
        for i in sorted(value.keys()):
            vs = list(set(value[i]))
            vs.sort(key=value[i].index)
            print('|'.join(vs), end=' ')
        print()
    
