
Algorithm - Detecting in Python whether data already exists

The current implementation is a dict with 5 million entries, where each key is a 40-character hash.
The task is to check whether a given hash already exists in it, but this approach uses too much memory.

I was going to try replacing it with a Bloom filter, but adding new data to one feels a bit awkward. Is there another algorithm of this kind that I could use?

==== Another way to describe it ====
Each time a hash arrives, look it up in the collection: if it is already there, stop; if not, add it to the collection and keep going, repeating this for every hash.

The problem: memory / efficiency
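
For reference, a minimal sketch of the check-then-insert loop described above, using a plain set (seen and already_seen are illustrative names, not from the question):

    # Check-then-insert with a plain set: this is essentially the current
    # approach, and with 5 million 40-character strings it is exactly the
    # memory problem being asked about.
    seen = set()

    def already_seen(h):
        """Return True if the hash was seen before; otherwise record it."""
        if h in seen:          # O(1) average lookup
            return True
        seen.add(h)            # not seen yet: remember it and continue
        return False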

大家讲道理 · 2759 days ago

Replies (10)

  • 天蓬老师 2017-04-17 13:19:05

    Because the hash is 40 hexadecimal digits, it can be converted into an integer and the membership recorded in a bitmap of buckets. This should save memory, and a lookup costs far less than O(n).
    My code:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    SHIFT = 5    # each bucket holds 2**SHIFT bits; use 6 for 64-bit buckets
    MASK = 0x1F  # low SHIFT bits select the bit inside a bucket (0x3F for SHIFT = 6)

    class BitBucket(object):
        def __init__(self):
            self._unique_key_count = 0   # number of distinct keys stored
            self._total_key_count = 0    # number of keys added in total
            self._bit = {}               # bucket index -> integer of bit flags

        def set(self, value):
            """Record the hash; return 1 if it was already present, else 0."""
            value = self._translate(value)
            self._total_key_count += 1

            if not self._has_key(value):
                self._unique_key_count += 1
                key = value >> SHIFT
                self._bit[key] = self._bit.get(key, 0) | (1 << (value & MASK))
                return 0
            return 1

        def exist(self, value):
            value = self._translate(value)
            return bool(self._has_key(value))

        def clear(self, value):
            value = self._translate(value)
            if self._has_key(value):
                self._unique_key_count -= 1
                self._total_key_count -= 1

                key = value >> SHIFT
                self._bit[key] &= ~(1 << (value & MASK))
                return True
            return False

        def get_total_count(self):
            return self._total_key_count

        def get_bit_count(self):
            return self._unique_key_count

        def _has_key(self, value):
            key = value >> SHIFT
            return self._bit.get(key, 0) & (1 << (value & MASK))

        def _translate(self, value):
            # Parse the 40 hexadecimal digits directly into an integer.
            # (Remapping 'a'-'f' onto '1'-'6' would collide with the real digits 1-6.)
            return int(value.lower(), 16)

    if __name__ == '__main__':
        bitBucket = BitBucket()
        bitBucket.set("a" * 40)
        print(bitBucket.exist("a" * 40))
        print(bitBucket.exist("b" * 40))

        bitBucket.clear("a" * 40)

        import hashlib

        for i in range(1, 27):
            sha1 = hashlib.sha1()
            sha1.update(chr(i).encode())
            bitBucket.set(sha1.hexdigest())

        print(bitBucket.get_total_count())
        print(bitBucket.get_bit_count())

        count = 0
        for i in range(1, 30):
            sha1 = hashlib.sha1()
            sha1.update(chr(i).encode())
            if bitBucket.exist(sha1.hexdigest()):
                count += 1

        assert count == bitBucket.get_bit_count()
    

    Alternatively, you could use a trie (dictionary tree); implementing it in C++ would improve the speed and memory further. A rough sketch of the idea follows.
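
    A trie sketch in pure Python, using nested dicts keyed on single hex characters (purely illustrative, not the C++ implementation suggested above):

    # Toy trie over 40-character hex strings built from nested dicts.
    # Shared prefixes are stored only once, which is where a trie can save
    # memory compared with storing every full string separately.
    def trie_add(root, s):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node['$'] = True                  # end-of-key marker

    def trie_contains(root, s):
        node = root
        for ch in s:
            if ch not in node:
                return False
            node = node[ch]
        return '$' in node

    root = {}
    trie_add(root, "a1b2" * 10)
    print(trie_contains(root, "a1b2" * 10))   # True
    print(trie_contains(root, "f" * 40))      # False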

  • PHP中文网 2017-04-17 13:19:05

    A Bloom filter introduces a certain false-positive rate; whether that is acceptable depends on your project. If it is, this is the best choice.

    If not, use a trie; Marisa is recommended because it is very space-efficient.
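
    A minimal sketch with the marisa-trie Python bindings (this assumes the third-party marisa_trie package is installed; the hashes list stands in for the real 5 million keys):

    import marisa_trie

    # Build a static, very compact trie from the existing hashes.
    # marisa_trie.Trie is immutable once built, so newly arriving hashes have
    # to be collected separately (e.g. in a small set) and merged in a rebuild.
    hashes = ["a" * 40, "b" * 40]           # illustrative stand-in data
    trie = marisa_trie.Trie(hashes)

    print("a" * 40 in trie)                 # True
    print("c" * 40 in trie)                 # False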

  • 怪我咯 2017-04-17 13:19:05

    My first reaction is to use a tuple of the keys, but I don't know how efficient that is. Can you try it?

    #!/usr/bin/env python3
    # Duplicate keys in the literal are collapsed, so only unique keys remain.
    data = {"a": 1, "b": 2, "c": 3, "d": 4, "a": 5, "c": 6}

    print(data.keys())   # dict_keys(['a', 'b', 'c', 'd'])

    The result should be a tuple of the unique hash keys.
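
    For membership testing, note that a dict or set gives O(1) average lookups while a tuple is scanned in O(n); a small sketch with illustrative names:

    # Membership checks: dict/set are O(1) on average, tuple/list are O(n).
    existing = {"a" * 40: 1, "b" * 40: 2}      # stand-in for the 5M-entry dict

    key_set = set(existing)                    # the same keys, as a set
    print("a" * 40 in key_set)                 # True, O(1) average
    print("c" * 40 in key_set)                 # False

    key_tuple = tuple(existing)                # the same keys, as a tuple
    print("a" * 40 in key_tuple)               # True, but via an O(n) scan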

  • 天蓬老师 2017-04-17 13:19:05

    Definitely a Bloom filter: simple to implement, small memory footprint and, most importantly, efficient.
    A Java version is also available.
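
    For reference, a minimal pure-Python Bloom filter sketch (the bit count and number of hash functions are illustrative, not tuned for 5 million entries):

    import hashlib

    class BloomFilter(object):
        """Toy Bloom filter: k bit positions derived from one SHA-1 digest."""
        def __init__(self, num_bits=50000000, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = 0                  # a Python int used as a big bitmap

        def _positions(self, item):
            digest = hashlib.sha1(item.encode()).digest()
            base = int.from_bytes(digest, "big")
            # Derive k positions from different slices of the 160-bit digest.
            return [(base >> (i * 8)) % self.num_bits for i in range(self.num_hashes)]

        def add(self, item):
            for p in self._positions(item):
                self.bits |= 1 << p

        def __contains__(self, item):
            return all((self.bits >> p) & 1 for p in self._positions(item))

    bf = BloomFilter()
    bf.add("a" * 40)
    print("a" * 40 in bf)    # True
    print("b" * 40 in bf)    # very likely False; false positives are possible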

  • 大家讲道理 2017-04-17 13:19:05

    The approach in the link below can be used for reference: https://github.com/qiwsir/algorithm/blob/master/same_element_in_list.md

  • 伊谢尔伦 2017-04-17 13:19:05

    Assume the 5 million entries are in a dictionary source_dict and the hashes to be checked are in a list hash_list. Then:
    result = [item for item in hash_list if item in source_dict]

    source_dict has to be loaded into memory first. If that takes too much memory, you can call source_dict.keys() to get the list of keys first, say source_keys, and then:
    result = [item for item in hash_list if item in source_keys]

    Since a dictionary lookup is O(1) while a list lookup is O(n), and the data here is 5 million entries, the first method is recommended.

  • 天蓬老师 2017-04-17 13:19:05

    You could try solving this with MapReduce; see:
    Implementing MapReduce with multiprocessing
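
    A minimal sketch of spreading the lookups over worker processes with multiprocessing, in the spirit of a map step plus a collection step (source_hashes and hash_list are illustrative names):

    from multiprocessing import Pool

    source_hashes = {"a" * 40, "b" * 40}        # stand-in for the 5M known hashes
    hash_list = ["a" * 40, "c" * 40, "b" * 40]  # hashes to check

    def check(h):
        # "Map" step: each worker tests one hash against the shared set
        # (inherited by or re-created in the worker processes).
        return h, h in source_hashes

    if __name__ == "__main__":
        with Pool(4) as pool:
            results = pool.map(check, hash_list)
        found = [h for h, ok in results if ok]  # "reduce" step: collect the hits
        print(found)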

  • 高洛峰 2017-04-17 13:19:05

    Use the bsddb module. Although it is not part of the standard library, it is a commonly used Python module.

    bucket = bsddb.btopen(None)            # temporary in-memory B-tree database
    

    or

    bucket = bsddb.hashopen(dbfilename)    # hash-table database stored on disk
    

    When storing on disk, objects can also be pickled and used directly as keys.
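
    A minimal usage sketch, assuming Python 2's bundled bsddb (on Python 3 the third-party bsddb3/berkeleydb packages play the same role); the file name is illustrative:

    import bsddb

    # On-disk hash table: the 5 million hashes stay out of RAM between runs.
    bucket = bsddb.hashopen("hashes.db")

    h = "a" * 40
    if not bucket.has_key(h):    # membership test goes to the on-disk index
        bucket[h] = ""           # the value is unused; only key presence matters

    print(bucket.has_key(h))     # True
    bucket.close()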

  • 大家讲道理 2017-04-17 13:19:05

    Idea: Python's object model means it will never be as memory-frugal as C; every str carries extra per-object overhead.

    • If the data must stay in memory, consider Redis (see the sketch below); it is a good choice both for the algorithm and for memory.
    • If the data can live on disk, bsddb should be a good choice.

    In the end this is an architecture question; these days there is rarely a need to hand-roll the algorithm yourself.
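
    A minimal Redis sketch using a set and the redis-py client (this assumes a locally running Redis server and the third-party redis package; the key name is illustrative):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    h = "a" * 40
    # SADD / SISMEMBER are O(1), and the 5M hashes live outside the Python process.
    if not r.sismember("seen_hashes", h):
        r.sadd("seen_hashes", h)

    print(r.sismember("seen_hashes", h))   # True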

  • 迷茫 2017-04-17 13:19:05

    If it is a 40-hex-digit hash (my guess is SHA-1), that is somewhat wasteful for only 5 million records.

    In other words, rather than indexing the full space of 40-hex-digit strings, it is better to think about how to index a collection that is only 5 million strings in size.
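
    One possible reading of this, as a hedged sketch: since only about 5 million items need to be told apart, each 160-bit hash can be stored as a much shorter integer, e.g. its first 16 hex digits (64 bits), which keeps accidental collisions extremely unlikely while shrinking the index (the truncation length is an illustrative choice, not stated in the answer):

    # Store 64-bit integer prefixes instead of full 40-character strings.
    # For 5 million random SHA-1 values, the chance that any two share the same
    # 64-bit prefix is on the order of 1e-6 (birthday bound), usually negligible.
    seen = set()

    def add_hash(h):
        seen.add(int(h[:16], 16))     # keep only the first 16 hex digits, as an int

    def has_hash(h):
        return int(h[:16], 16) in seen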
