
c++ - Deduplicating a massive set of hashes on a single machine

Single-machine environment: a roughly 1TB hard disk is full of MD5 hashes, with duplicates among them. What is the fastest way to kick out the duplicates? Memory is limited to, say, 512MB.

This is a problem I actually ran into. I asked it on Zhihu, and surprisingly the moderators closed it, saying I was "asking others to solve a personal problem for me".
https://www.zhihu.com/questio...

I'm copying my Zhihu reply over here.

Someone actually reported it as "asking others to complete a personal task", which is ridiculous. This is not an interview question, nor some personal errand; it is a problem I genuinely ran into. I have an approach that is simple to implement, but I estimate it would take about a week to finish running.

I have 2TB in total: 1TB is the data, and the other 1TB is used to store the results.

I don't have any particularly good ideas right now, only the simplest optimization:
splitting into files.

Partition A holds the 1TB of data; partition B is empty.

An MD5 hash uses the characters 0-9 and a-f (16 possibilities per position) and is 32 characters long.

Step 1: bucketing
Split into two levels:

Level 1: 0-f, 16 directories in total
Level 2: 0-f, 16 files per directory

That gives us 256 files in total.

Sequentially read hashes from partition A. The first character of each hash selects the directory,
and the second character selects the file within it. Any hash from partition A is therefore mapped
to exactly one of the 256 files by its first two characters.

After a full sequential pass over the data on partition A, we end up with 256 files.
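A minimal C++ sketch of this bucketing pass, assuming one 32-character lowercase hex hash per line; for simplicity it writes 256 flat files instead of 16 directories × 16 files, and the paths ("A/hashes.txt", "B/xx.txt") are placeholders for illustration:

    // Bucketing pass: read one 32-char hex md5 per line from partition A and
    // append it to one of 256 files chosen by its first two hex characters.
    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        std::ifstream in("A/hashes.txt");              // the 1TB input, one hash per line
        std::vector<std::ofstream> buckets(256);
        for (int i = 0; i < 256; ++i) {
            char name[32];
            std::snprintf(name, sizeof(name), "B/%02x.txt", i);
            buckets[i].open(name, std::ios::app);
        }

        // '0'-'9','a'-'f' -> 0..15
        auto hexval = [](char c) { return c <= '9' ? c - '0' : c - 'a' + 10; };

        std::string line;
        while (std::getline(in, line)) {
            if (line.size() < 2) continue;
            int idx = hexval(line[0]) * 16 + hexval(line[1]);
            buckets[idx] << line << '\n';              // route by the first two characters
        }
    }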

Step 2
The problem now reduces to deduplicating each of the 256 smaller files.

For each of these smaller files (a sketch follows this list):

  1. Read the file into memory in chunks and heap-sort each chunk; a single sequential pass then removes duplicates within the chunk;

  2. Merge two already-sorted chunks while removing duplicates; this is essentially the "merge two sorted
    linked lists" problem and is easy to do;

  3. Repeating steps 1 and 2 deduplicates all 256 small files; this gives the final result.
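A rough sketch of the per-bucket dedup described above, assuming each chunk fits in memory; std::sort stands in for the heap sort of step 1, and the merge follows the sorted-list pattern of step 2:

    // Per-bucket dedup: (1) sort one in-memory chunk and drop duplicates,
    // (2) merge two already-sorted, deduplicated runs like sorted lists.
    #include <algorithm>
    #include <string>
    #include <vector>

    // Step 1: sort a chunk and remove duplicates in a single pass.
    void sort_dedup_chunk(std::vector<std::string>& chunk) {
        std::sort(chunk.begin(), chunk.end());
        chunk.erase(std::unique(chunk.begin(), chunk.end()), chunk.end());
    }

    // Step 2: merge two sorted runs, keeping each value exactly once.
    std::vector<std::string> merge_dedup(const std::vector<std::string>& a,
                                         const std::vector<std::string>& b) {
        std::vector<std::string> out;
        size_t i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            std::string v;
            if (j == b.size() || (i < a.size() && a[i] <= b[j])) v = a[i++];
            else v = b[j++];
            if (out.empty() || out.back() != v) out.push_back(v);  // skip duplicates
        }
        return out;
    }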

That is the simple approach I came up with, but I estimate it could take about a week to run... far too long.

高洛峰 · 2715 days ago

6 replies

  • 黄舟 2017-04-17 14:44:33

    I have done something similar before: removing duplicate sequences from hundreds of gigabytes of DNA data, which feels a lot like this problem (assuming your file has one hash per line). I used a buffsize of 30G and it ran on the cluster for about a day. I have no idea how long it would take with your 512MB...

    # -u: keep unique lines only; -S: main-memory buffer size (e.g. 30G); -o: output file
    sort -u -S buffsize -o unique_file file

  • 巴扎黑 2017-04-17 14:44:33

    I'm not sure I'm reading the 1TB of data correctly. The space requirement I calculated for your method is much larger than your limit, but the time is much shorter than your estimate.
    Treating MD5 as optimally stored, each MD5 occupies $$\frac{32\log_{2}16}{8}=16$$ bytes on disk, so the entire 1TB disk holds roughly $$6\times10^{10}$$ MD5s.
    With your method, in the average case each bucket occupies 1TB/256 ≈ 4GB, well beyond the 512MB memory limit. What's more, the distribution of the first two characters of an MD5 is not necessarily even, so this value may be even larger.
    As for the computation time, my answer comes out to about 16 hours. Even allowing for the IO overhead and the bucketing pass, it should not take more than two days.
    Of course, in theory the bucketing is not really necessary: a plain external sort followed by a linear dedup pass is more convenient. The complexity of your method is $$O(nk\log_{2}\frac{n}{k})$$ (with $$k=256$$), while the direct method is $$O(n\log_{2}n)$$, so in theory the latter is also lower in complexity. But with factors such as disk IO, I can't say which one is actually faster.

  • ringa_lee 2017-04-17 14:44:33

    Is it possible to use Hadoop directly~


    An MD5 hash is 128 bits; if even one bit differs, the two values are not duplicates.
    So there is no need for an overly complicated comparison algorithm; just extract part of the hash for a first comparison.
    For example, compare only the lower 64 bits of each hash value. That already filters out most non-duplicates.

    It is best to have two hard drives to avoid read/write contention.
    Use the second, empty drive as scratch space to mark which values are duplicates, instead of copying hash values off drive A.
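    If I read the suggestion correctly, the low 64 bits can serve as a coarse first-pass key: hashes whose keys differ cannot be duplicates, and only colliding keys need a full 128-bit comparison. A small sketch under that assumption (not necessarily the commenter's exact scheme):

        // Coarse first-pass key: the low 64 bits (last 16 hex chars) of an MD5.
        // Hashes with different keys cannot be duplicates; only colliding keys
        // need the full 128-bit comparison afterwards.
        #include <cstdint>
        #include <string>

        uint64_t low64_key(const std::string& md5_hex) {   // expects 32 hex chars
            return std::stoull(md5_hex.substr(16, 16), nullptr, 16);
        }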

  • PHPz 2017-04-17 14:44:33

    1. Questions like this come up very often, e.g. as interview and written-test questions, so a search will usually turn up more detailed answers.
    2. For deduplicating hashes, the algorithm to use is a Bloom filter.

  • 怪我咯 2017-04-17 14:44:33

    You need to use a Bloom filter for this.
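    A minimal sketch of the Bloom-filter idea suggested here and above; the bit-array size and the way probe positions are derived are arbitrary illustrations, and with ~6×10^10 hashes squeezed into a few hundred MB of bits the false-positive rate would be high, so anything the filter flags as "maybe seen" still needs an exact check before being discarded:

        // Minimal Bloom filter; probe positions come from slices of the MD5 itself.
        // False positives are possible, so "maybe seen" hits need exact verification;
        // false negatives are not possible.
        #include <bitset>
        #include <cstddef>
        #include <cstdint>
        #include <memory>
        #include <string>

        class BloomFilter {
            static constexpr std::size_t M = 1ULL << 30;   // 2^30 bits = 128 MB
            std::unique_ptr<std::bitset<M>> bits_ = std::make_unique<std::bitset<M>>();
        public:
            // Returns true if md5_hex was possibly seen before, false if definitely new.
            bool test_and_set(const std::string& md5_hex) {          // 32 hex chars
                bool maybe_seen = true;
                for (int k = 0; k < 4; ++k) {                        // 4 probes, 8 hex chars each
                    uint64_t h = std::stoull(md5_hex.substr(k * 8, 8), nullptr, 16);
                    std::size_t pos = h % M;
                    if (!bits_->test(pos)) maybe_seen = false;
                    bits_->set(pos);
                }
                return maybe_seen;
            }
        };

    (Usage: call test_and_set on each hash while streaming through the data; a false result means the hash is definitely new and can be written straight to the output, while a true result only means it might be a duplicate.)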

  • 怪我咯 2017-04-17 14:44:33

    Let me give you an algorithm that ignores IO speed for now:

    1. Divide the data evenly into 1024 * 2 or 1024 * 2.5 pieces; O(N)

    2. Sort the hash values within each piece, e.g. with quicksort; O(NlogN)

    3. Build a min-heap and open 1024 * 2.5 file read streams, one per file;

    4. Initially, read one md5 from each of the 1024 * 2.5 files and push them into the heap, recording which file each md5 came from;

    5. Pop the top of the heap and write it to the output file (that md5 is done), but remember its value;

    6. Using the source file recorded in step 4, read the next md5 from that file (one md5 may come from multiple files; if so, any of them will do), push it into the heap, and record which file the new md5 came from;

    7. Pop the new top of the heap and compare it with the previous top remembered in step 5. If they are equal, discard it; if they differ, write it to the file and update the remembered value, then go back to step 6. Repeat until all data is processed. O(NlogN)

    Total average complexity: O(NlogN). (See the sketch below.)
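    A condensed sketch of steps 3-7, assuming each piece has already been sorted into a text file with one hash per line; the chunk file names are placeholders, and opening this many streams at once may hit OS limits (see the questions below):

        // k-way merge of pre-sorted chunk files with a min-heap, writing each
        // distinct md5 exactly once. File names ("chunk_0.txt", ...) are placeholders.
        #include <fstream>
        #include <functional>
        #include <queue>
        #include <string>
        #include <utility>
        #include <vector>

        int main() {
            const int k = 2560;                            // e.g. 1024 * 2.5 pieces
            std::vector<std::ifstream> files;
            for (int i = 0; i < k; ++i)
                files.emplace_back("chunk_" + std::to_string(i) + ".txt");

            // Heap entry: (md5, index of the file it came from); ordered as a min-heap.
            using Entry = std::pair<std::string, int>;
            std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

            std::string line;
            for (int i = 0; i < k; ++i)                    // step 4: one value per file
                if (std::getline(files[i], line)) heap.push({line, i});

            std::ofstream out("unique.txt");
            std::string last;                              // previously written value
            bool have_last = false;
            while (!heap.empty()) {                        // steps 5-7
                auto [val, src] = heap.top();
                heap.pop();
                if (!have_last || val != last) {           // write only when the value changes
                    out << val << '\n';
                    last = val;
                    have_last = true;
                }
                if (std::getline(files[src], line))        // step 6: refill from the same file
                    heap.push({line, src});
            }
        }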

    There are a few open questions. First, I don't know whether 0.5 GB of data can be held in memory at once, since I have never allocated such a large array. Second, a trie could be used for the per-file bookkeeping of step 4, which would be faster. Third, I don't know whether thousands of file streams can be open for reading at the same time; I have not tried it, and if not, you could open fewer at a time. Fourth, will the IO be too slow? For example, reading one md5 at a time across 1TB might be unacceptably slow.

    If the IO is too slow, this method may run for several days.
