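When rows share the same first column (the scaffold column), keep only the rows with the largest number xx in the AS field (AS:i:xx). If several rows tie for the largest number, keep all of them.
For example:
Input file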
scaffold_010679_1AL.2 16 chr1A 429400034 119 3272M * GACACAAGAGACTCTTTG * AS:i:3268 XS:i:2147 XF:i:0 XE:i:29 NM:i:1
scaffold_010679_1AL.2 16 chr1A 429400034 119 3272M * GACACAAGAGACTCTTTG * AS:i:3268 XS:i:2147 XF:i:0 XE:i:29 NM:i:1
scaffold_010679_1AL.2 16 chr1A 429400034 119 3272M * GACACAAGAGACTCTTTG * AS:i:1268 XS:i:2147 XF:i:0 XE:i:29 NM:i:1
scaffold_010679_1AL.3 16 chr1A 429397743 19 599S1730M1I279M * 0 0 TGCCGAGGTTTTTGA * AS:i:1998 XS:i:1877 XF:i:3 XE:i:20 NM:i:2 XN:i:1
scaffold_010679_1AL.3 16 chr1A 429397743 19 599S1730M1I279M * 0 0 TGCCGAGGTTTTTGA * AS:i:1098 XS:i:1877 XF:i:3 XE:i:20 NM:i:2 XN:i:1
Output file
scaffold_010679_1AL.2 16 chr1A 429400034 119 3272M * GACACAAGAGACTCTTTG * AS:i:3268 XS:i:2147 XF:i:0 XE:i:29 NM:i:1
scaffold_010679_1AL.2 16 chr1A 429400034 119 3272M * GACACAAGAGACTCTTTG * AS:i:3268 XS:i:2147 XF:i:0 XE:i:29 NM:i:1
scaffold_010679_1AL.3 16 chr1A 429397743 19 599S1730M1I279M * 0 0 TGCCGAGGTTTTTGA * AS:i:1998 XS:i:1877 XF:i:3 XE:i:20 NM:i:2 XN:i:1
迷茫 2017-04-18 10:36:19
# coding: utf-8
from itertools import groupby

def as_score(line):
    # Column counts differ between rows, so find the score by its AS:i: prefix
    return next(int(f[5:]) for f in line.split() if f.startswith('AS:i:'))

with open('a.txt') as f:
    data = [line for line in f if line.strip()]

# Group rows by the first column (groupby needs rows with the same key to be adjacent)
for key, group in groupby(data, key=lambda l: l.split()[0]):
    rows = list(group)
    # Compare scores as integers and keep every row that ties for the maximum
    best = max(as_score(r) for r in rows)
    for r in rows:
        if as_score(r) == best:
            print(r.rstrip('\n'))
ringa_lee 2017-04-18 10:36:19
awk '{n=gensub(".*AS:i:([0-9]+).*","\\1","g")+0}n>=k[$1]{c[$1]=n==k[$1]?c[$1]"\n"$0:$0;k[$1]=n}END{for(i in c)print c[i]}' file
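The one-liner above relies on gensub, which is a gawk extension, and on `for(i in c)`, whose iteration order is not guaranteed. For anyone without gawk, the same single-pass idea could be sketched in Python roughly as follows. The function name keep_best and the choice to skip lines that have no AS:i: tag are assumptions made for illustration, not part of the original answer. It relies on dicts preserving insertion order (Python 3.7+).

```python
import re

# Matches the alignment-score tag, e.g. "AS:i:3268"
AS_TAG = re.compile(r'\bAS:i:(-?\d+)')

def keep_best(lines):
    """Per first-column key, keep every line whose AS:i: score ties for the maximum."""
    best = {}  # key -> (top score so far, [lines that have that score])
    for line in lines:
        line = line.rstrip('\n')
        m = AS_TAG.search(line)
        if not line.strip() or m is None:
            continue  # assumption: ignore blank lines and lines without an AS tag
        key, score = line.split()[0], int(m.group(1))
        if key not in best or score > best[key][0]:
            best[key] = (score, [line])      # new maximum: drop earlier lines
        elif score == best[key][0]:
            best[key][1].append(line)        # tie: keep this line as well
    # Keys come out in order of first appearance; rows keep their input order
    return [l for _, kept in best.values() for l in kept]

if __name__ == '__main__':
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            print('\n'.join(keep_best(f)))
```

Unlike a groupby-based version, this does not need rows with the same scaffold name to be adjacent, at the cost of holding the kept lines in memory.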
伊谢尔伦 2017-04-18 10:36:19
grep "`sort -r -t "*" -k 3 b.txt | head -1 |awk -F "*" '{split($3,a," ");print a[1]}'
`" b.txt
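Approach: split each line into three columns on the asterisk *, sort by the third column in descending order, take the first line, extract its AS:i: value (the largest), and grep the file for it to get the result.
I didn't read the question carefully, my mistake. The result is wrong: this finds a single maximum for the whole file rather than one per scaffold.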