How can I optimize this Python code (text search) for efficiency?

#!/usr/bin/env python
# -*- coding: utf-8 -*-
infile2 = open('genemark.gff3', 'r')
infile1 = set(line1.strip() for line1 in open('1.txt', 'r'))
for line in infile2:
    line = line.strip().split()
    if line[2] == 'gene':
        chr, start, end = line[0], int(line[3]), int(line[4])
        for line1 in infile1:
            line1 = line1.split()
            chr1, start1, end1 = line1[1], int(line1[2]), int(line1[3])
            if chr1 == chr:
                if start1 < start < end1:
                    print line1[0], line[-1]
                if start1 < end < end1:
                    print line1[0], line[-1]
                if start1 > start and end > end1:
                    print line1[0], line[-1]
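
Two things are worth noting about the three checks above: a pair can be printed twice when two conditions hold at once (for example, when both gene endpoints fall inside the same fragment), and the strict < comparisons miss overlaps that coincide exactly at an endpoint, which is the boundary issue that comes up later in the thread. Both are avoided by a single inclusive overlap test; a minimal sketch (the helper name is ours, not from the thread):

def overlaps(start, end, start1, end1):
    # inclusive intervals [start, end] and [start1, end1] overlap
    # iff each one starts no later than the other one ends
    return start1 <= end and start <= end1

# one test covers all three original cases, each reported once:
print(overlaps(5, 8, 4, 6))   # gene start falls inside the fragment -> True
print(overlaps(5, 8, 7, 9))   # gene end falls inside the fragment   -> True
print(overlaps(5, 8, 6, 7))   # fragment contained in the gene       -> True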

genemark.gff3 looks like this:

chr1D    GeneMark.hmm    gene    2705930    2711118    .    +    .    ID=1903228_g;Name=1903228_g
chr1D    GeneMark.hmm    mRNA    2705930    2711118    .    +    .    ID=1903228_t;Name=1903228_t;Parent=1903228_g

1.txt:

UN011157 chr1D 2705329 2706342 98.4 95.0 972 30 21 0
UN003843 chr1D 2705681 2721144 61.4 97.4 633 12 5 0

The original files are on Baidu Cloud for anyone interested: click to download, password: enu8

Combining the answers below, I now recommend two versions.
The first, based on the answers from @ferstar and @用筹兮用严, is the parallel version:

#!/usr/bin/env python
# encoding: utf-8

from collections import defaultdict
from multiprocessing import Pool, cpu_count
from functools import partial


def find_sth(f2, f1=None):
    # f2: one pre-split 'gene' line from genemark.gff3
    # f1: dict mapping chromosome -> set of (id, start, end) records from 1.txt
    start, end = int(f2[3]), int(f2[4])
    for uno1, start1, end1 in f1[f2[0]]:
        if (start1 <= start <= end1) or (start1 <= end <= end1) or (start1 >= start and end >= end1):
            with open("out.txt", "a") as fh:
                fh.write(uno1 + "\t" + f2[-1] + "\n")
                #print(uno1, f2[-1])


def main():
    with open('1.txt', 'r') as f1:
        infile1 = defaultdict(set)
        for uno1, chr1, start1, end1, *others in map(str.split, f1):
            infile1[chr1].add((uno1, int(start1), int(end1)))
    with open('genemark.gff3', 'r') as f2:
        infile2 = [x for x in map(str.split, f2) if x[2] == 'gene']
    pool = Pool(cpu_count())
    pool.map(partial(find_sth, f1=infile1), infile2)
    pool.close()
    pool.join()


if __name__ == "__main__":
    main()
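
One caveat with this version: all worker processes append to out.txt concurrently, so their writes can interleave. A common alternative, sketched here with the same data structures (this is not code from the thread), is to return the matches from each worker and write once in the parent:

def find_matches(f2, f1=None):
    # worker: return the matching lines instead of appending to a shared file
    start, end = int(f2[3]), int(f2[4])
    return [uno1 + "\t" + f2[-1] + "\n"
            for uno1, start1, end1 in f1[f2[0]]
            if start1 <= end and start <= end1]

# in main(), after building infile1 and infile2 as above:
#     results = pool.map(partial(find_matches, f1=infile1), infile2)
#     with open("out.txt", "w") as fh:
#         for hits in results:
#             fh.writelines(hits)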

The second is @citaret's version (single-core). On a single core it is no slower than the code above. At first the two produced slightly different results, with the parallel version more complete (this one missed 73 matches because of a boundary problem in the interval test; not knowing intervaltree well, I didn't know how to fix it at the time). The boundary problem has since been fixed, and the two versions now produce exactly the same results. Perfect!
As follows:


from collections import defaultdict
from intervaltree import Interval, IntervalTree

with open('1.txt') as f:
    d1 = defaultdict(list)
    xs = map(lambda x: x.strip().split(), f)
    for x in xs:
        y = (x[0], int(x[2]), int(x[3]))
        d1[x[1]].append(y)

    for k, v in d1.items():
        d1[k] = IntervalTree(Interval(s, e, u) for u, s, e in v)

with open('genemark.gff3') as f:
    for line in f:
        line = line.strip().split()
        if line[2] == 'gene':
            chr, start, end = line[0], int(line[3]), int(line[4])
            for start1, end1, un1 in d1[chr][start-1:end+1]:
                print(un1, line[-1])
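
The start-1:end+1 slice in the query is the boundary fix mentioned above: the intervaltree library treats intervals as half-open (begin inclusive, end exclusive), while GFF3 coordinates are 1-based and inclusive at both ends. A small demonstration, using the first fragment from 1.txt:

from intervaltree import Interval, IntervalTree

# Interval(begin, end) is half-open: it covers begin .. end - 1
t = IntervalTree([Interval(2705329, 2706342, 'UN011157')])

print(t[2706342:2706400])  # set(): position 2706342 itself is excluded
print(t[2706341:2706400])  # one hit: widening the window by one recovers
                           # the inclusive-coordinate match, hence start-1:end+1
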
PHP中文网 · 2889 days ago

Replies (6)

  • 巴扎黑 2017-04-18 09:18:12

    Performance-wise there is probably not much room for further optimization, but the code can be tidied up slightly:

    with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
        lines1 = [_.strip().split() for _ in f1]
        for line2 in f2:
            line2 = line2.strip().split()
            if line2[2] != 'gene':
                continue
    
            chr2, start2, end2 = line2[0], int(line2[3]), int(line2[4])
            for line1 in lines1:
                chr1, start1, end1 = line1[1], int(line1[2]), int(line1[3])
                if chr1 == chr2 and (start1 < start2 < end1 or start1 < end2 < end1 or start1 > start2 and end2 > end1):
                    print line1[0], line2[-1]

  • 阿神 2017-04-18 09:18:12

    Here are two suggestions:

    1. The code is nested too deeply. In a function, return as early as possible to reduce nesting; similarly, inside a loop, use continue.

    2. About performance

    for line1 in infile1:
        line1 = line1.split()
    

    Re-splitting the lines of file1 on every pass through the loop is very wasteful.

    Here is my modified version:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    infile2 = open('genemark.gff3', 'r')
    infile1 = {}
    for line1 in open('1.txt', 'r'):
        line1 = line1.strip().split()
        id, chr1, start1, end1 = line1[0], line1[1], int(line1[2]), int(line1[3])
        if not infile1.has_key(chr1):
            infile1[chr1] = []
        infile1[chr1].append({"start": start1, "end": end1, "id": id})
    
    for line in infile2:
        line = line.strip().split()
    
        if line[2] != 'gene':
            continue
    
        chr, start, end = line[0], int(line[3]), int(line[4])
    
        if not infile1.has_key(chr):
            continue
    
        for i in infile1[chr]:
            if i['start'] < start < i['end']:
                print i['id'], line[-1]
            if i['start'] < end < i['end']:
                print i['id'], line[-1]
            if i['start'] > start and i['end'] > end:
                print i['id'], line[-1]
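
    A side note: dict.has_key() exists only in Python 2. The same grouping can be written portably (Python 2 and 3) with collections.defaultdict; a sketch:

    from collections import defaultdict

    infile1 = defaultdict(list)
    with open('1.txt') as f:
        for line1 in f:
            uid, chr1, start1, end1 = line1.split()[:4]
            # missing keys get a fresh list automatically, so no has_key() check
            infile1[chr1].append({"start": int(start1), "end": int(end1), "id": uid})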
    

  • PHPz 2017-04-18 09:18:12

    Trade space for time: build a list from genemark.gff3 and a chromosome-keyed dictionary from 1.txt. The implementation:

    from collections import defaultdict
    
    with open('genemark.gff3') as f:
        ls = f.readlines()
        xs = map(lambda x: x.strip().split(), ls)
        t2 = (x for x in xs if x[2] == 'gene')
    
    with open('1.txt') as f:
        d1 = defaultdict(list)
        ls = f.readlines()
        xs = map(lambda x: x.strip().split(), ls)
        for x in xs:
            d1[x[1]].append(x)
    
    for line in t2:
        chr, start, end = line[0], int(line[3]), int(line[4])
        if chr in d1:
            for line1 in d1[chr]:
                chr1, start1, end1 = line1[1], int(line1[2]), int(line1[3])
                if start1 < start < end1:
                    print line1[0], line[-1]
                if start1 < end < end1:
                    print line1[0], line[-1]
                if start1 > start and end > end1:
                    print line1[0], line[-1]

    Modified version v2, eliminates int() in the inner loop and simplifies the output.

    from collections import defaultdict
    
    with open('1.txt') as f:
        d1 = defaultdict(list)
        xs = map(lambda x: x.strip().split(), f)
        for x in xs:
            y = (x[0], int(x[2]), int(x[3]))
            d1[x[1]].append(y)
    
    with open('genemark.gff3') as f:
        for line in f:
            line = line.strip().split()
            chr, start, end = line[0], int(line[3]), int(line[4])
            for un1, start1, end1 in d1[chr]:
                if start < end1 and end > start1:
                    print un1, line[-1]

    v3: Looking at the problem more carefully, the main loop finds every fragment in a collection that intersects a given fragment. First, look at the collection sizes:

    chr1D 7359
    chr2D 9219
    chr2B 9486
    chr2A 8986
    chr6B 7178
    chr6A 6446
    chr6D 6093
    chr4A 7543
    chr4B 7086
    chr4D 6316
    ...
    

    Each collection holds 6,000-10,000 fragments, so linear traversal is inefficient. An interval tree can instead retrieve all fragments intersecting a given one quickly. The implementation:

    from collections import defaultdict
    from intervaltree import Interval, IntervalTree
    
    with open('1.txt') as f:
        d1 = defaultdict(list)
        xs = map(lambda x: x.strip().split(), f)
        for x in xs:
            y = (x[0], int(x[2]), int(x[3]))
            d1[x[1]].append(y)
    
        for k, v in d1.items():
            d1[k] = IntervalTree(Interval(s, e, u) for u, s, e in v)
    
    with open('genemark.gff3') as f:
        for line in f:
            line = line.strip().split()
            if line[2] != 'gene': continue
            chr, start, end = line[0], int(line[3]), int(line[4])
            for start1, end1, un1 in d1[chr][start:end]:
                print un1, line[-1]

    Timing result: building the interval trees takes about 10 seconds, but the intersection step becomes roughly 100 times faster.
    intervaltree reference: https://pypi.python.org/pypi/...
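
    The build-versus-query trade-off is easy to measure. A minimal, self-contained timing sketch on synthetic fragments (not the thread's files):

    import random
    import time
    from intervaltree import Interval, IntervalTree

    random.seed(0)
    # ~8000 random fragments, comparable to one chromosome's collection above
    frags = [(random.randrange(10**6), random.randrange(1, 1000)) for _ in range(8000)]

    t0 = time.time()
    tree = IntervalTree(Interval(s, s + w) for s, w in frags)   # one-time build
    t1 = time.time()
    for _ in range(1000):
        q = random.randrange(10**6)
        tree[q:q + 5000]                                        # O(log n + m) lookup
    t2 = time.time()
    print("build: %.2fs, 1000 queries: %.2fs" % (t1 - t0, t2 - t1))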

  • 伊谢尔伦 2017-04-18 09:18:12

    Don’t you need to close the file after opening it?

  • PHP中文网 2017-04-18 09:18:12

    from collections import defaultdict
    
    with open('1.txt', 'r') as f1, open('genemark.gff3', 'r') as f2:
        infile1 = defaultdict(set)
        for uno1, chr1, start1, end1, *others in map(str.split, f1):
            infile1[chr1].add((uno1, int(start1), int(end1)))
        infile2 = filter(lambda x: x[2] == 'gene', map(str.split, f2))
        for chr, start, end, info in map(lambda x: (x[0], int(x[3]), int(x[4]), x[-1]), infile2):
            for uno1, start1, end1 in infile1[chr]:
                if start1 < start < end1 or start1 < end < end1 or (start1 > start and end > end1):
                    print(uno1, info)

    The parallelization proposed by @ferstar on the 6th floor is the right direction, but there was a problem with the code...
    Here it is, revised:

    #!/usr/bin/env python
    # encoding: utf-8
    
    from collections import defaultdict
    from multiprocessing import Pool, cpu_count
    from functools import partial
    
    
    def find_sth(line, f1=None):
        line = line.split()
        if line[2] != 'gene':
            return
        start, end = int(line[3]), int(line[4])
        for uno1, start1, end1 in f1[line[0]]:
            if start1 < start < end1 or start1 < end < end1 or (start1 > start and end > end1):
                print(uno1, line[-1])
    
    
    def main():
        pool = Pool(cpu_count())
        with open('1.txt', 'r') as f1, open('genemark.gff3', 'r') as f2:
            infile1 = defaultdict(set)
            for uno1, chr1, start1, end1, *others in map(str.split, f1):
                infile1[chr1].add((uno1, int(start1), int(end1)))
            pool.map(partial(find_sth, f1=infile1), f2)
            #fw.writelines(filter(lambda x: x is not None, map(lambda x: x.get(), [pool.apply_async(func, (line,)) for line in f2])))
        pool.close()
        pool.join()
    
    
    if __name__ == "__main__":
        main()

  • 伊谢尔伦 2017-04-18 09:18:12

    I noticed something interesting: everyone replied very enthusiastically, but what about the actual results? Out of boredom I ran a small test. The process is as follows:

    Since the sample data in the question is only two lines, I doubled 1.txt and genemark.gff3 repeatedly up to 4,000 lines each:

    (qiime) [ngs@cluster ~]$ wc -l 1.txt 
    4000 1.txt
    (qiime) [ngs@cluster ~]$ wc -l genemark.gff3 
    4000 genemark.gff3

    The scripts are named by floor: the asker's code is hi.py, the first-floor answer's code is hi1.py, and so on.

    • First, the asker's own code:

    (qiime) [ngs@cluster ~]$ time python hi.py > hi.txt
    real    0m0.049s
    user    0m0.042s
    sys    0m0.007s
    (qiime) [ngs@cluster ~]$ wc -l hi.txt 
    6000 hi.txt

    It looks like the output contains duplicates.

    • The first-floor answer:

    time python hi1.py > hi1.txt
    
    real    0m21.727s
    user    0m21.171s
    sys    0m0.547s
    (qiime) [ngs@cluster ~]$ wc -l hi1.txt 
    8000000 hi1.txt

    This one is way off.

    • The second-floor answer:

    (qiime) [ngs@cluster ~]$ time python hi2.py > hi2.txt
    
    real    0m16.326s
    user    0m14.550s
    sys    0m1.044s
    (qiime) [ngs@cluster ~]$ wc -l hi2.txt 
    12000000 hi2.txt
    • The third-floor answer:

    (qiime) [ngs@cluster ~]$ time python hi3.py > hi3.txt
    
    real    0m27.079s
    user    0m26.281s
    sys    0m0.786s
    (qiime) [ngs@cluster ~]$ wc -l hi3.txt
    12000000 hi3.txt

    The third-floor result matches the second floor's, but it is more than 10 seconds slower.

    • The fourth-floor answer (apologies to the fourth-floor answerer; this one is Python 3 code):

    (py3) [ngs@cluster ~]$ time python hi4.py > hi4.txt
    
    real    0m0.074s
    user    0m0.064s
    sys    0m0.010s
    (py3) [ngs@cluster ~]$ wc -l hi4.txt 
    4000 hi4.txt

    Sure enough, progress comes from discussion. This result is correct.

    Summary

    So the asker's code produces duplicated results, and the fourth-floor answer is the correct one.

    My plan: make small changes to the fourth-floor code so it runs in parallel.

    There was a problem with my version (notably, find_sth below re-reads and re-parses 1.txt for every single gene); @yongchixiyongyan has posted the corrected parallel code above. I'm leaving mine unchanged for reference by later readers.

    Straight to the code (Python 3):

    from collections import defaultdict
    import multiprocessing
    
    
    def find_sth(x):
        with open('1.txt', 'r') as f1:
            infile1 = defaultdict(set)
            for uno1, chr1, start1, end1, *others in map(str.split, f1):
                infile1[chr1].add((uno1, int(start1), int(end1)))
            chr, start, end, info = x[0], int(x[3]), int(x[4]), x[-1]
            for uno1, start1, end1 in infile1[chr]:
                if start1 < start < end1 or start1 < end < end1 or (start1 > start and end > end1):
                    print(uno1, info)
    
    
    def main():
        with open('genemark.gff3', 'r') as fh:
            lst = [x for x in map(str.split, fh) if x[2] == 'gene']
        pool = multiprocessing.Pool(multiprocessing.cpu_count())
        pool.map(find_sth, lst)
        pool.close()
        pool.join()
    
    if __name__ == "__main__":
        main()

    Now let's look at the runtime:

    (py3) [ngs@cluster ~]$ time python hi_new.py > hi_new.txt
    real    0m3.033s
    user    0m31.952s
    sys    0m0.219s
    
    (py3) [ngs@cluster ~]$ wc -l hi_new.txt 
    4000 hi_new.txt

    In wall-clock time it is actually much slower here (4,000 rows is only a few hundred KB, so process startup and per-task overhead presumably dominate). The asker should test with the real data: the larger the input, the more visible the advantage of parallel processing.

    PS: I'd guess the asker's real data is in the MB or even GB range; at that scale, parallel processing is the way to go.
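
    If the per-task overhead is what hurts at small scale, Pool.map's chunksize parameter batches many items into each worker round-trip. A toy sketch (the squaring function merely stands in for find_sth):

    from multiprocessing import Pool, cpu_count

    def work(x):
        return x * x  # stand-in for the real per-gene lookup

    if __name__ == "__main__":
        with Pool(cpu_count()) as pool:
            # each dispatch now carries 1000 items, amortizing pickling and IPC
            results = pool.map(work, range(100000), chunksize=1000)
        print(len(results))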

    Source data and result address link: http://pan.baidu.com/s/1hrSZQuS Password: u93n
