PHPz2017-04-18 10:01:45
Write an idea
candidates = [
'GGGAGCAGGCAAGGACTCTG',
'GCTCGGGCTTGTCCACAGGA',
'...',
# 被你看出来啦,这些其实人类基因的片段
]
bg_db = [
'CTGCTGACGGGTGACACCCA',
'AGGAACTGGTGCTTGATGGC',
'...',
# 这个更多,有十亿左右
]
Because your data is actually very unique, it can be streamlined here.
Because all strings are 20 characters in length and consist of ATCG
four characters. Then they can be converted into integers for comparison.
The binary representation is as follows
A ==> 00
T ==> 01
C ==> 10
G ==> 11
Because the length of a string is fixed and each character can be represented by 2 bits, each string can be expressed as a 40
位的整数。可以表示为32+8
的形式,也可以直接使用64
位整形。建议使用C
language.
Let’s talk about comparison.
Because we need to find all records with a difference of less than or equal to 4 for each candidate in bg_db
, so we only need to do ^
bitwise XOR of two integers Operation, the binary 1 in the result does not exceed 8, and these 8 1s can only be divided into 4 groups at most
, it is possible that it meets the requirements (00^11 =11,10^01=11). 每一个candidate在bg_db中与之差异小于等于4的所有记录
,所以只要两个整数一做^
按位异或操作,结果中二进制中1不超过8个,且这不超过8个1最多只能分为4个组
的才有可能是符合要求的(00^11=11,10^01=11)。
把结果的40
个比特位分作20个组,那么就是说最多只有4
个组为b01
b10
b11
这三个值,其余的全部为b00
。
那么比较算法就很好写了。
可以对每个字节(四个组)获取其中有几个组是为三个非零值的,来简介获取整体的比较结果。
因为每个字节只有256
种可能的值,而符合条件的值只有Then the comparison algorithm is easy to write. 3^4=81
Divide the 40
bits of the result into 20 groups, which means that there are only 4
groups at most b01
b10< /code>
b11
These three values, the rest are all b00
.
You can obtain for each byte (four groups) how many groups have three non-zero values to briefly obtain the overall comparison result.
256
possible values 3^4=81
🎜, so the result can be stored first Get up and get it. 🎜Here is a function to get how many non-zero groups there are in the result. 🎜
/*****************下面table中值的生成******//**
int i;
for( i=0;i<256;++i){
int t =0;
t += (i&0x01 || i&0x02)?1:0;
t += (i&0x04 || i&0x08)?1:0;
t += (i&0x10 || i&0x20)?1:0;
t += (i&0x40 || i&0x80)?1:0;
printf("%d,",t);
if(i%10 ==9){putchar('\n');}
}
********************************************//
int table[] = {
0,1,1,1,1,2,2,2,1,2,
2,2,1,2,2,2,1,2,2,2,
2,3,3,3,2,3,3,3,2,3,
3,3,1,2,2,2,2,3,3,3,
2,3,3,3,2,3,3,3,1,2,
2,2,2,3,3,3,2,3,3,3,
2,3,3,3,1,2,2,2,2,3,
3,3,2,3,3,3,2,3,3,3,
2,3,3,3,3,4,4,4,3,4,
4,4,3,4,4,4,2,3,3,3,
3,4,4,4,3,4,4,4,3,4,
4,4,2,3,3,3,3,4,4,4,
3,4,4,4,3,4,4,4,1,2,
2,2,2,3,3,3,2,3,3,3,
2,3,3,3,2,3,3,3,3,4,
4,4,3,4,4,4,3,4,4,4,
2,3,3,3,3,4,4,4,3,4,
4,4,3,4,4,4,2,3,3,3,
3,4,4,4,3,4,4,4,3,4,
4,4,1,2,2,2,2,3,3,3,
2,3,3,3,2,3,3,3,2,3,
3,3,3,4,4,4,3,4,4,4,
3,4,4,4,2,3,3,3,3,4,
4,4,3,4,4,4,3,4,4,4,
2,3,3,3,3,4,4,4,3,4,
4,4,3,4,4,4
};
int getCount(uint64_t cmpresult)
{
uint8_t* pb = &cmpresult; // 这里假设是小段模式,且之前比较结果是存在低40位
return table[pb[0]]+table[pb[1]]+table[pb[2]]+table[pb[3]]+table[pb[4]];
}
阿神2017-04-18 10:01:45
First of all, your time estimate is completely wrong. This kind of large-scale data processing needs to run tens of thousands of items and last for more than ten seconds before it can be used for multiplication to calculate the total time. If only one item is counted, this time will be almost They are all the overhead of the initialization process, not the key IO and CPU overhead
The following text
The four possibilities of ACTG are equivalent to 2 bits. It is too wasteful to use one character to represent one gene position. One character is 8 bits and can hold 4 gene positions
Even if you don’t use any algorithm and just write your 20 genes into binary form, you can save 5 times the time
In addition, after looping 20 times, the number of CPU instructions is 20*n, and n is estimated to be at least 3. But for binary, the XOR operation for comparison is directly a CPU instruction, and the number of instructions is 1
巴扎黑2017-04-18 10:01:45
I don’t know much about the algorithm, but speaking from experience, complex algorithms take longer and are not as fast as this simple and crude one
You can consider multi-threading and clustering to process data
By the way, the Hamming distance seems to be able to calculate this
伊谢尔伦2017-04-18 10:01:45
Also no algorithm is used, brute force solution is written in C
On my machine (CPU: Core 2 Duo E7500, RAM: 4G, OS: Fedora 19), test results
candidates bg.db cost
10000 1000000 318758165微秒
500 1000000 14950302微秒
If you switch to the subject's 24-core CPU, the performance will be improved by 20 times, and then add 48 machines to operate together. 5000W operations will take 15 seconds, and the time will be
10000000 * 1000000000 / 500 / 1000000 * 15 / 20 / 48 / 3600 / 24 = 3.616898 days
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <sys/time.h>
#define START_CC(flag) \
struct timeval st_##flag##_beg; \
gettimeofday(&st_##flag##_beg, 0)
#define STOP_CC(flag) \
struct timeval st_##flag##_end; \
gettimeofday(&st_##flag##_end, 0)
#define PRINT_CC(flag) \
double cost_##flag = 0.0L; \
cost_##flag = (double)(st_##flag##_end.tv_sec - st_##flag##_beg.tv_sec); \
cost_##flag = cost_##flag * 1000000L + (double)(st_##flag##_end.tv_usec - st_##flag##_beg.tv_usec); \
printf(#flag" cost time %.6lf microsecond.\n", cost_##flag);
#define GENEORDER_CODE_LENGTH 20 + 1
typedef struct GeneOrder
{
char code[GENEORDER_CODE_LENGTH];
}GeneOrder, *GeneOrderPtr;
typedef struct GOArray
{
size_t capacity;
size_t length;
GeneOrderPtr data;
}GOArray;
GOArray createGOAarray(size_t capacity)
{
GOArray goa;
goa.capacity = capacity;
goa.length = 0;
goa.data = (GeneOrderPtr)malloc(capacity * sizeof(GeneOrder));
return goa;
}
void destroyGOArray(GOArray* goa)
{
if (goa->capacity > 0) {
free(goa->data);
}
}
bool readGOFile(char const *file, GOArray *goarray)
{
FILE* fp = NULL;
if ((fp = fopen(file, "r+")) == NULL) {
return false;
}
char buff[64];
while (fgets(buff, 64, fp) != NULL) {
if (goarray->length < goarray->capacity) {
memcpy(goarray->data[goarray->length].code,
buff,
GENEORDER_CODE_LENGTH * sizeof(char)
);
goarray->data[goarray->length].code[GENEORDER_CODE_LENGTH - 1] = 'gcc -Wall -Wextra -o ccs main.c -std=c99 -Os && ./ccs candidate.list bg.db
';
goarray->length ++;
} else {
fclose(fp);
return true;
}
}
fclose(fp);
return true;
}
int main(int argc, char* argv[])
{
(void)argc;
GOArray condgo = createGOAarray(10000);
GOArray bggo = createGOAarray(1000000);
printf("loading ...\n");
START_CC(loading);
if (!readGOFile(argv[1], &condgo) || !readGOFile(argv[2], &bggo)) {
destroyGOArray(&condgo);
destroyGOArray(&bggo);
return -1;
}
STOP_CC(loading);
PRINT_CC(loading);
int count = 0;
START_CC(compare);
for (size_t i = 0;i < 500;i ++) {
const GeneOrderPtr gop = condgo.data + i;
for (size_t j = 0;j < bggo.length;j ++) {
const GeneOrderPtr inner_gop = bggo.data + j;
int inner_count = 0;
for (size_t k = 0;k < 20;k ++) {
if (gop->code[k] != inner_gop->code[k]) {
if (++inner_count > 4) {
break;
}
}
}
if (inner_count <= 4) {
#ifdef DBGPRINT
printf("%d %s - %s\n", i, gop->code, inner_gop->code);
#endif
count++;
}
}
}
STOP_CC(compare);
PRINT_CC(compare);
printf("result = %d\n", count);
destroyGOArray(&condgo);
destroyGOArray(&bggo);
return 0;
}
Compile parameters & run
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <sys/time.h>
#include <pthread.h>
#define START_CC(flag) \
struct timeval st_##flag##_beg; \
gettimeofday(&st_##flag##_beg, 0)
#define STOP_CC(flag) \
struct timeval st_##flag##_end; \
gettimeofday(&st_##flag##_end, 0)
#define PRINT_CC(flag) \
double cost_##flag = 0.0L; \
cost_##flag = (double)(st_##flag##_end.tv_sec - st_##flag##_beg.tv_sec); \
cost_##flag = cost_##flag * 1000000L + (double)(st_##flag##_end.tv_usec - st_##flag##_beg.tv_usec); \
printf(#flag " cost time %.6lf microsecond.\n", cost_##flag);
#define GENEORDER_CODE_LENGTH 20 + 1
typedef struct GeneOrder {
char code[GENEORDER_CODE_LENGTH];
} GeneOrder, *GeneOrderPtr;
typedef struct GOArray {
size_t capacity;
size_t length;
GeneOrderPtr data;
} GOArray;
GOArray createGOAarray(size_t capacity)
{
GOArray goa;
goa.capacity = capacity;
goa.length = 0;
goa.data = (GeneOrderPtr)malloc(capacity * sizeof(GeneOrder));
return goa;
}
void destroyGOArray(GOArray* goa)
{
if (goa->capacity > 0) {
free(goa->data);
}
}
bool readGOFile(char const* file, GOArray* goarray)
{
FILE* fp = NULL;
if ((fp = fopen(file, "r+")) == NULL) {
return false;
}
char buff[64];
while (fgets(buff, 64, fp) != NULL) {
if (goarray->length < goarray->capacity) {
memcpy(goarray->data[goarray->length].code, buff,
GENEORDER_CODE_LENGTH * sizeof(char));
goarray->data[goarray->length].code[GENEORDER_CODE_LENGTH - 1] = 'gcc -Wall -Wextra -o ccs main.c -std=c99 -O3 -lpthread && ./ccs candidate.list bg.db
';
goarray->length++;
}
else {
fclose(fp);
return true;
}
}
fclose(fp);
return true;
}
typedef struct ProcessST {
GOArray* pcond;
GOArray* pbg;
size_t beg;
size_t end; // [beg, end)
} ProcessST;
void* processThread(void* parg)
{
ProcessST* ppst = (ProcessST*)parg;
GOArray* pcond = ppst->pcond;
GOArray* pbg = ppst->pbg;
int count = 0;
for (size_t i = ppst->beg; i < ppst->end; i++) {
const GeneOrderPtr gop = pcond->data + i;
for (size_t j = 0; j < pbg->length; j++) {
const GeneOrderPtr inner_gop = pbg->data + j;
int inner_count = 0;
for (size_t k = 0; k < 20; k++) {
if (gop->code[k] != inner_gop->code[k]) {
if (++inner_count > 4) {
break;
}
}
}
if (inner_count <= 4) {
#ifdef DBGPRINT
printf("%d %s - %s\n", i, gop->code, inner_gop->code);
#endif
count++;
}
}
}
return (void*)count;
}
int main(int argc, char* argv[])
{
(void)argc;
GOArray condgo = createGOAarray(10000);
GOArray bggo = createGOAarray(1000000);
printf("loading ...\n");
START_CC(loading);
if (!readGOFile(argv[1], &condgo) || !readGOFile(argv[2], &bggo)) {
destroyGOArray(&condgo);
destroyGOArray(&bggo);
return -1;
}
STOP_CC(loading);
PRINT_CC(loading);
size_t range[] = { 0, 250, 500 };
pthread_t thr[2] = { 0 };
ProcessST pst[2];
START_CC(compare);
for (size_t i = 0; i < 2; i++) {
pst[i].pcond = &condgo;
pst[i].pbg = &bggo;
pst[i].beg = range[i];
pst[i].end = range[i + 1];
pthread_create(&thr[i], NULL, processThread, &pst[i]);
}
int count = 0;
int* ret = NULL;
for (size_t i = 0; i < 2; i++) {
pthread_join(thr[i], (void**)&ret);
count += (int)ret;
}
STOP_CC(compare);
PRINT_CC(compare);
printf("result = %d\n", count);
destroyGOArray(&condgo);
destroyGOArray(&bggo);
return 0;
}
If you change it to multi-threading, the speed will be faster. On my machine, the performance improvement is not very big, but newer CPUs have hyper-threading technology, and the speed is expected to be better. . . 2
个线程简单使用500条candidates
测试,速度可以提升到9040257微秒
,线程增加到4
rrreee
rrreee
巴扎黑2017-04-18 10:01:45
Sorry, I saw someone else replying today.
Looking at the question carefully, I found that I thought it was just a match.
So I proposed to use ac automatic machine.
But the purpose of the question is to find the difference in the sequence.
This is to find the edit distance between the two.
wiki: Edit distance
wiki: Ravenstein distance
When I used to brush OJ, I used DP (dynamic programming) to find the minimum number of edits to convert a string into another string.
for(i:0->len)
for(j:0->len)
str1[i]==str2[j] ? cost=0 : cost=1
dp[i,j]=min(
dp[i-1, j ] + 1, // 刪除
dp[i , j-1] + 1, // 插入
dp[i-1, j-1] + cost // 替換
)
For example:
str1 "abcdef"
str2 "acddff"
str2 converted to str1
Insert b Count once
Delete d Count once
Modify f Count once
For the subject’s ATCG gene sequence, do you only need to find the modified one?
However, how should it be calculated like this
ATCGATCG
TCGATCGA
?
If you just find modifications, just compare str1[i] and str2[i] directly.
for(i:0->len)
if(str1[i]!=str2[i] )++count;
Inspired by @rockford.
We can preprocess the raw data.
candidates
GGGAGCAGGCAAGGACTCTG
A5 T2 C4 G9
Extra data after processing A5 T2 C4 G9
String inbg_db
CTGCTGACGGGTGACACCCA
A4 T3 C7 G6
Extra data after processing A4 T3 C7 G6
A5 -> A4 is recorded as -1
T2 -> T3 is recorded as +1
C4 -> C7 is recorded as +3
G9 -> G6 is recorded as -3
Obviously A can only become TCG if modified.
Similarly, we only need to count all + or all -
to know at least how many differences they have.
Anything greater than 4 does not need to be compared.
By first comparing the pre-processed additional data, and then performing the comparison through a single comparison algorithm.
(work overtime on Saturday, write this after get off work)
天蓬老师2017-04-18 10:01:45
Your individual tasks are determined. What you need is to send these tasks to workers. Such calculations are not performed synchronously in a single process.
In fact, it is equivalent to you having [a, b] and [c, d] to compare, your task is
[a, c]
[a, d]
[b, c]
[b, d]
If you are synchronous serial, the time you need is 4 * single time
If you have 4 CPUs or 4 machines in parallel, your time is almost single time
So calculations like genomes are basically completed using multi-core parallel tasks on large machines. Basically, the reference principles are the principles of the paper Google MapReduce
黄舟2017-04-18 10:01:45
I’m not good at algorithms, but for a large amount of data like yours, a computer is definitely not enough to compare. For CPU-intensive data tasks like yours, I agree with what others have said about using a cluster or multi-process method to calculate, which is what we Use the map-reduce model to calculate
map is mapping. You first map each of your candidates to bg_db one by one to form a data structure like this(candidates,bg_db)
Make it into a queue and then hand it over to different servers. Each server uses multiple processes to do it. This is the only way to calculate, but the amount of data you have is too large. Find a way to allocate your tasks and calculate in parallel,
PHP中文网2017-04-18 10:01:45
You can try to use a dictionary tree to save all strings. Then you can use the method of traversing the dictionary tree when querying.
When traversing the tree, you can maintain a sequence of current nodes. This sequence stores the number of mismatches between the currently traversed node and the corresponding node.
When traversing the next node, try to go down all the nodes in the current sequence and form a new node sequence.
The advantage is that the current bits of many strings can be compared together, which can save some time. Since there are not many choices for each position and the mismatch is not large, there should be no excessive expansion of the current node sequence. (This is a guess... I haven’t verified it too carefully...)
def match(candidate, root):
nset = [[],[]]
currentId = 0
nset[currentId].append((root, 0))
for ch in candidate:
nextId = 1 - currentId
for item in nset[currentId]:
node, mismatch = item
for nx in node.child:
if nx.key == ch:
nset[nextId].append((nx, mismatch))
else:
if mismatch:
nset[nextId].append((nx, mismatch - 1))
nset[currentId].clear()
currentId = 1 - currentId
return nset[currentId]
The above code is a rough idea. It would be much faster if written in C++.
The entire process can be done using clusters for distributed computing.
PHP中文网2017-04-18 10:01:45
Visually the questioner does not have multiple machines for him to calculate.
I have a simple idea. Calculate the sum of the alphabetical numbers of each string (A:0, T:1, C:2, G:3). First calculate the two The difference between the ordinal sum of the string cannot exceed 12 at most. Four A's become four G's. If the difference is less than 12, it will be processed.
It's just a rough idea. The specific weight can be set separately. In theory, it can Much faster.
There is another algorithm that calculates the string edit distance (the minimum number of edits to add, delete, and modify one string to another string). I can’t remember it at the moment. You can check it yourself.