
c++ - Deduplicating a massive set of hashes on a single machine

Single-machine environment: a roughly 1TB hard disk is full of MD5 hashes, with duplicates among them. What is the fastest way to kick out the duplicates? Memory is limited to, say, 512MB.

This is a problem I actually ran into. I asked it on Zhihu, and surprisingly the moderators closed it, saying I was "asking others to solve a personal problem for me".
https://www.zhihu.com/questio...

I'm copying my Zhihu reply over here.

Someone actually reported it as "asking others to complete a personal task", which is ridiculous. This is not an interview question, nor some personal errand; it is a problem I genuinely ran into. I have an approach that is simple to implement, but I estimate it would take about a week to finish running.

I have 2TB in total: 1TB is the data, and the other 1TB is used to store the results.

I don't have any particularly good ideas right now, only the simplest optimization:
splitting into files.

Partition A holds the 1TB of data; partition B is empty.

An MD5 hash uses the characters 0-9 and a-f (16 possibilities per position) and is 32 characters long.

Step 1: bucketing
Split into two levels:

Level 1: 0-f, 16 directories in total
Level 2: 0-f, 16 files per directory

That gives us 256 files in total.

Sequentially read hashes from partition A. The first character of each hash selects the directory,
and the second character selects the file within it. Any hash from partition A is therefore mapped
to exactly one of the 256 files by its first two characters.

After a full sequential pass over the data on partition A, we end up with 256 files.
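A minimal C++ sketch of this bucketing pass, assuming one 32-character lowercase hex hash per line; for simplicity it writes 256 flat files instead of 16 directories × 16 files, and the paths ("A/hashes.txt", "B/xx.txt") are placeholders for illustration:

    // Bucketing pass: read one 32-char hex md5 per line from partition A and
    // append it to one of 256 files chosen by its first two hex characters.
    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        std::ifstream in("A/hashes.txt");              // the 1TB input, one hash per line
        std::vector<std::ofstream> buckets(256);
        for (int i = 0; i < 256; ++i) {
            char name[32];
            std::snprintf(name, sizeof(name), "B/%02x.txt", i);
            buckets[i].open(name, std::ios::app);
        }

        // '0'-'9','a'-'f' -> 0..15
        auto hexval = [](char c) { return c <= '9' ? c - '0' : c - 'a' + 10; };

        std::string line;
        while (std::getline(in, line)) {
            if (line.size() < 2) continue;
            int idx = hexval(line[0]) * 16 + hexval(line[1]);
            buckets[idx] << line << '\n';              // route by the first two characters
        }
    }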

Step 2
The problem now reduces to deduplicating each of the 256 smaller files.

For each of these smaller files (a sketch follows this list):

  1. Read the file into memory in chunks and heap-sort each chunk; a single sequential pass then removes duplicates within the chunk;

  2. Merge two already-sorted chunks while removing duplicates; this is essentially the "merge two sorted
    linked lists" problem and is easy to do;

  3. Repeating steps 1 and 2 deduplicates all 256 small files; this gives the final result.
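A rough sketch of the per-bucket dedup described above, assuming each chunk fits in memory; std::sort stands in for the heap sort of step 1, and the merge follows the sorted-list pattern of step 2:

    // Per-bucket dedup: (1) sort one in-memory chunk and drop duplicates,
    // (2) merge two already-sorted, deduplicated runs like sorted lists.
    #include <algorithm>
    #include <string>
    #include <vector>

    // Step 1: sort a chunk and remove duplicates in a single pass.
    void sort_dedup_chunk(std::vector<std::string>& chunk) {
        std::sort(chunk.begin(), chunk.end());
        chunk.erase(std::unique(chunk.begin(), chunk.end()), chunk.end());
    }

    // Step 2: merge two sorted runs, keeping each value exactly once.
    std::vector<std::string> merge_dedup(const std::vector<std::string>& a,
                                         const std::vector<std::string>& b) {
        std::vector<std::string> out;
        size_t i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            std::string v;
            if (j == b.size() || (i < a.size() && a[i] <= b[j])) v = a[i++];
            else v = b[j++];
            if (out.empty() || out.back() != v) out.push_back(v);  // skip duplicates
        }
        return out;
    }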

That is the simple approach I came up with, but I estimate it could take about a week to run... far too long.

高洛峰 · 2715 days ago

6 replies

  • 黄舟 2017-04-17 14:44:33

    I have done something similar before: removing duplicate sequences from hundreds of gigabytes of DNA data, which feels a lot like this problem (assuming your file has one hash per line). I used a buffsize of 30G and it ran on the cluster for about a day. I have no idea how long it would take with your 512MB...

    # -u: keep unique lines only; -S: main-memory buffer size (e.g. 30G); -o: output file
    sort -u -S buffsize -o unique_file file

  • 巴扎黑 2017-04-17 14:44:33

    I'm not sure I'm reading the 1TB of data correctly. The space requirement I calculated for your method is much larger than your limit, but the time is much shorter than your estimate.
    Treating MD5 as optimally stored, each MD5 occupies $$\frac{32\log_{2}16}{8}=16$$ bytes on disk, so the entire 1TB disk holds roughly $$6\times10^{10}$$ MD5s.
    With your method, in the average case each bucket occupies 1TB/256 ≈ 4GB, well beyond the 512MB memory limit. What's more, the distribution of the first two characters of an MD5 is not necessarily even, so this value may be even larger.
    As for the computation time, my answer comes out to about 16 hours. Even allowing for the IO overhead and the bucketing pass, it should not take more than two days.
    Of course, in theory the bucketing is not really necessary: a plain external sort followed by a linear dedup pass is more convenient. The complexity of your method is $$O(nk\log_{2}\frac{n}{k})$$ (with $$k=256$$), while the direct method is $$O(n\log_{2}n)$$, so in theory the latter is also lower in complexity. But with factors such as disk IO, I can't say which one is actually faster.

  • ringa_lee 2017-04-17 14:44:33

    Is it possible to use Hadoop directly~


    An MD5 hash is 128 bits; if even one bit differs, the two values are not duplicates.
    So there is no need for an overly complicated comparison algorithm; just extract part of the hash for a first comparison.
    For example, compare only the lower 64 bits of each hash value. That already filters out most non-duplicates.

    It is best to have two hard drives to avoid read/write contention.
    Use the second, empty drive as scratch space to mark which values are duplicates, instead of copying hash values off drive A.
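    If I read the suggestion correctly, the low 64 bits can serve as a coarse first-pass key: hashes whose keys differ cannot be duplicates, and only colliding keys need a full 128-bit comparison. A small sketch under that assumption (not necessarily the commenter's exact scheme):

        // Coarse first-pass key: the low 64 bits (last 16 hex chars) of an MD5.
        // Hashes with different keys cannot be duplicates; only colliding keys
        // need the full 128-bit comparison afterwards.
        #include <cstdint>
        #include <string>

        uint64_t low64_key(const std::string& md5_hex) {   // expects 32 hex chars
            return std::stoull(md5_hex.substr(16, 16), nullptr, 16);
        }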

  • PHPz 2017-04-17 14:44:33

    1. Questions like this come up very often, e.g. as interview and written-test questions, so a search will usually turn up more detailed answers.
    2. For deduplicating hashes, the algorithm to use is a Bloom filter.

  • 怪我咯 2017-04-17 14:44:33

    You need to use a Bloom filter for this.
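    A minimal sketch of the Bloom-filter idea suggested here and above; the bit-array size and the way probe positions are derived are arbitrary illustrations, and with ~6×10^10 hashes squeezed into a few hundred MB of bits the false-positive rate would be high, so anything the filter flags as "maybe seen" still needs an exact check before being discarded:

        // Minimal Bloom filter; probe positions come from slices of the MD5 itself.
        // False positives are possible, so "maybe seen" hits need exact verification;
        // false negatives are not possible.
        #include <bitset>
        #include <cstddef>
        #include <cstdint>
        #include <memory>
        #include <string>

        class BloomFilter {
            static constexpr std::size_t M = 1ULL << 30;   // 2^30 bits = 128 MB
            std::unique_ptr<std::bitset<M>> bits_ = std::make_unique<std::bitset<M>>();
        public:
            // Returns true if md5_hex was possibly seen before, false if definitely new.
            bool test_and_set(const std::string& md5_hex) {          // 32 hex chars
                bool maybe_seen = true;
                for (int k = 0; k < 4; ++k) {                        // 4 probes, 8 hex chars each
                    uint64_t h = std::stoull(md5_hex.substr(k * 8, 8), nullptr, 16);
                    std::size_t pos = h % M;
                    if (!bits_->test(pos)) maybe_seen = false;
                    bits_->set(pos);
                }
                return maybe_seen;
            }
        };

    (Usage: call test_and_set on each hash while streaming through the data; a false result means the hash is definitely new and can be written straight to the output, while a true result only means it might be a duplicate.)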

  • 怪我咯 2017-04-17 14:44:33

    Let me give you an algorithm that ignores IO speed for now:

    1. Divide the data evenly into 1024 * 2 or 1024 * 2.5 pieces; O(N)

    2. Sort the hash values within each piece, e.g. with quicksort; O(NlogN)

    3. Build a min-heap and open 1024 * 2.5 file read streams, one per file;

    4. Initially, read one md5 from each of the 1024 * 2.5 files and push them into the heap, recording which file each md5 came from;

    5. Pop the top of the heap and write it to the output file (that md5 is done), but remember its value;

    6. Using the source file recorded in step 4, read the next md5 from that file (one md5 may come from multiple files; if so, any of them will do), push it into the heap, and record which file the new md5 came from;

    7. Pop the new top of the heap and compare it with the previous top remembered in step 5. If they are equal, discard it; if they differ, write it to the file and update the remembered value, then go back to step 6. Repeat until all data is processed. O(NlogN)

    Total average complexity: O(NlogN). (See the sketch below.)
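    A condensed sketch of steps 3-7, assuming each piece has already been sorted into a text file with one hash per line; the chunk file names are placeholders, and opening this many streams at once may hit OS limits (see the questions below):

        // k-way merge of pre-sorted chunk files with a min-heap, writing each
        // distinct md5 exactly once. File names ("chunk_0.txt", ...) are placeholders.
        #include <fstream>
        #include <functional>
        #include <queue>
        #include <string>
        #include <utility>
        #include <vector>

        int main() {
            const int k = 2560;                            // e.g. 1024 * 2.5 pieces
            std::vector<std::ifstream> files;
            for (int i = 0; i < k; ++i)
                files.emplace_back("chunk_" + std::to_string(i) + ".txt");

            // Heap entry: (md5, index of the file it came from); ordered as a min-heap.
            using Entry = std::pair<std::string, int>;
            std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

            std::string line;
            for (int i = 0; i < k; ++i)                    // step 4: one value per file
                if (std::getline(files[i], line)) heap.push({line, i});

            std::ofstream out("unique.txt");
            std::string last;                              // previously written value
            bool have_last = false;
            while (!heap.empty()) {                        // steps 5-7
                auto [val, src] = heap.top();
                heap.pop();
                if (!have_last || val != last) {           // write only when the value changes
                    out << val << '\n';
                    last = val;
                    have_last = true;
                }
                if (std::getline(files[src], line))        // step 6: refill from the same file
                    heap.push({line, src});
            }
        }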

    There are a few open questions. First, I don't know whether 0.5 GB of data can be held in memory at once, since I have never allocated such a large array. Second, a trie could be used for the per-file bookkeeping of step 4, which would be faster. Third, I don't know whether thousands of file streams can be open for reading at the same time; I have not tried it, and if not, you could open fewer at a time. Fourth, will the IO be too slow? For example, reading one md5 at a time across 1TB might be unacceptably slow.

    If the IO is too slow, this method may run for several days.
