python - Questions about NumPy array operations

['000001_2017-03-17.csv', '000001_2017-03-20.csv',
 '000002_2017-03-21.csv', '000002_2017-03-22.csv',
 '000003_2017-03-23.csv', '000004_2017-03-24.csv']

I have a NumPy array with tens of thousands of elements in total. I want to keep only the leading number (such as 000001) from each element and remove duplicates, so that each number appears once. The result should be ['000001','000002','000003','000004'].
Apart from a for loop, is there a more efficient way?

Asked by ringa_lee, 2667 days ago

Replies (3)

  • 迷茫, 2017-06-30 09:58:09

    Here is a NumPy approach:

    python3

    >>> import numpy as np
    >>> a = np.array(['000001_2017-03-17.csv', '000001_2017-03-20.csv',
    ...               '000002_2017-03-21.csv', '000002_2017-03-22.csv',
    ...               '000003_2017-03-23.csv', '000004_2017-03-24.csv'])
    >>> # Take the part before the first '_' and store it as a 6-byte string
    >>> b = np.unique(np.fromiter(map(lambda x: x.split('_')[0], a), '|S6'))
    >>> b
    array([b'000001', b'000002', b'000003', b'000004'],
          dtype='|S6')

    Here '|S6' stores each string in 6 bytes, which is why the results above print as bytes (b'...').
    '<U6' stores each string as 6 little-endian Unicode characters.

    You can also write it with np.frompyfunc:

    >>> b = np.array(np.unique(np.frompyfunc(lambda x:x[:6],1,1)(a)),dtype='<U6')
    >>> b
    array(['000001', '000002', '000003', '000004'], 
          dtype='<U6')
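Both snippets above still call a Python lambda once per element. As a possible alternative, NumPy's vectorized string helpers can do the split. The sketch below is an illustration, not part of the answer above: it assumes every name contains an underscore when using np.char.partition, and assumes the number is always exactly 6 characters when using the astype('<U6') truncation. Whether either is faster on real data would need to be measured.

```python
import numpy as np

a = np.array(['000001_2017-03-17.csv', '000001_2017-03-20.csv',
              '000002_2017-03-21.csv', '000002_2017-03-22.csv',
              '000003_2017-03-23.csv', '000004_2017-03-24.csv'])

# np.char.partition splits each element at the first '_' and returns an
# array of shape (n, 3): text before, the separator, text after.
prefixes = np.char.partition(a, '_')[:, 0]
codes_by_partition = np.unique(prefixes)

# Assumption: the number is always exactly 6 characters long.
# Casting to '<U6' truncates every string to its first 6 characters.
codes_by_truncation = np.unique(a.astype('<U6'))

print(codes_by_partition.tolist())
print(codes_by_truncation.tolist())
```

np.unique returns the values sorted, so both variants give the numbers in ascending order as Unicode strings rather than bytes.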

  • 学习ing, 2017-06-30 09:58:09

    Building on the approaches in the other two answers (@agree and accept, @xiaojieluoff):

    If the number always occupies the first six characters, the fastest variant is the first one below (a set comprehension with slicing).

    import time

    lst = ['000001_2017-03-17.csv', '000001_2017-03-20.csv', '000002_2017-03-21.csv', '000002_2017-03-22.csv', '000003_2017-03-23.csv', '000004_2017-03-24.csv'] * 1000000

    # Set comprehension with slicing (labeled 'dic', although it builds a set)
    start = time.time()
    data = {_[:6] for _ in lst}
    print('dic: {}'.format(time.time() - start))

    # set() over a generator expression
    start = time.time()
    data = set(_[:6] for _ in lst)
    print('set: {}'.format(time.time() - start))

    # set() over map with a lambda
    start = time.time()
    data = set(map(lambda _: _[:6], lst))
    print('map:{}'.format(time.time() - start))

    # set.add called from a list comprehension (used only for its side effect)
    start = time.time()
    data = set()
    [data.add(_[:6]) for _ in lst]
    print('for:{}'.format(time.time() - start))
    
    Elapsed time (seconds):
    dic: 0.72798705101
    set: 0.929664850235
    map:1.89214396477
    for:1.76194214821
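Single time.time() measurements like the ones above can be noisy. As a rough sketch (not the method used in this answer), the standard-library timeit module can repeat each variant and report the best run. The helper name best_time, the function names, and the smaller multiplier are assumptions made for illustration.

```python
import timeit

base = ['000001_2017-03-17.csv', '000001_2017-03-20.csv',
        '000002_2017-03-21.csv', '000002_2017-03-22.csv',
        '000003_2017-03-23.csv', '000004_2017-03-24.csv']
lst = base * 10000  # smaller than the original test so the sketch runs quickly


def by_set_comprehension(items):
    return {s[:6] for s in items}


def by_generator(items):
    return set(s[:6] for s in items)


def by_map(items):
    return set(map(lambda s: s[:6], items))


def best_time(func, items, repeat=3):
    """Hypothetical helper: best wall-clock time of `repeat` single runs."""
    return min(timeit.repeat(lambda: func(items), number=1, repeat=repeat))


variants = [by_set_comprehension, by_generator, by_map]
results = [func(lst) for func in variants]  # check that all variants agree
for func in variants:
    print('{}: {:.4f} s'.format(func.__name__, best_time(func, lst)))
```

Taking the minimum of several runs reduces the influence of other processes on the machine; the absolute numbers will differ from the ones reported above.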
    

  • 某草草, 2017-06-30 09:58:09

    Use map with an anonymous function (lambda):

    lists = ['000001_2017-03-17.csv', '000001_2017-03-20.csv','000002_2017-03-21.csv','000002_2017-03-22.csv','000003_2017-03-23.csv', '000004_2017-03-24.csv']
    
    data = list(set(map(lambda x:x.split('_')[0], lists)))
    
    print(data)

    Output (a set is unordered, so the order may vary between runs):

    ['000003', '000004', '000001', '000002']
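Because a set does not preserve order, the list above comes back shuffled. If a stable order matters, two common options are sketched below: sorted() for ascending order, or dict.fromkeys() to keep the order of first appearance (guaranteed for dict from Python 3.7 on). This is an addition for illustration, not part of the original answer.

```python
lists = ['000001_2017-03-17.csv', '000001_2017-03-20.csv',
         '000002_2017-03-21.csv', '000002_2017-03-22.csv',
         '000003_2017-03-23.csv', '000004_2017-03-24.csv']

prefixes = [name.split('_')[0] for name in lists]

# Ascending order
sorted_codes = sorted(set(prefixes))

# Order of first appearance: dict keys are unique and keep insertion order
first_seen_codes = list(dict.fromkeys(prefixes))

print(sorted_codes)
print(first_seen_codes)
```

For the sample data both lines print ['000001', '000002', '000003', '000004'], since the input already happens to be sorted.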

    Running the following code shows that, with 6 million elements, map is about 0.6 s faster than the for loop:

    import time

    lists = ['000001_2017-03-17.csv', '000001_2017-03-20.csv', '000002_2017-03-21.csv', '000002_2017-03-22.csv', '000003_2017-03-23.csv', '000004_2017-03-24.csv'] * 1000000

    # time.clock() was removed in Python 3.8; time.perf_counter() replaces it here
    map_start = time.perf_counter()
    map_data = list(set(map(lambda x: x.split('_')[0], lists)))
    map_end = time.perf_counter() - map_start
    print('map run time: {}'.format(map_end))

    for_start = time.perf_counter()
    data = set()
    for k in lists:
        data.add(k.split('_')[0])
    for_end = time.perf_counter() - for_start
    print('for run time: {}'.format(for_end))

    Output:

    map run time: 2.36173
    for run time: 2.9405870000000003

    If the test data is expanded to 60 million elements, the gap becomes more obvious in absolute terms (about 3.5 s):

    map run time: 29.620203
    for run time: 33.132621
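Part of the cost in both variants is the per-element work done in Python: the lambda call and the full split. Two things that may be worth trying are sketched below. They are not from the original answer and have not been measured against the timings above; the variable names and the small multiplier are chosen for illustration.

```python
from operator import itemgetter

lists = ['000001_2017-03-17.csv', '000001_2017-03-20.csv',
         '000002_2017-03-21.csv', '000002_2017-03-22.csv',
         '000003_2017-03-23.csv', '000004_2017-03-24.csv'] * 1000

# str.partition stops at the first '_' and always returns a 3-tuple,
# so it also tolerates names that contain no underscore.
by_partition = {name.partition('_')[0] for name in lists}

# Assumption: the prefix is always exactly 6 characters long.
# itemgetter(slice(0, 6)) does the slicing in a C-implemented callable,
# so map() needs no lambda.
first_six = itemgetter(slice(0, 6))
by_itemgetter = set(map(first_six, lists))

print(sorted(by_partition))
print(sorted(by_itemgetter))
```

Both expressions produce the same set of four numbers for the sample data; which one is faster depends on the Python version and the data, so it is worth timing them on the real array.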
    
