The current implementation is a dict with 5 million entries, whose keys are 40-character hashes.
All it does is check whether a given hash exists, but this approach uses too much memory.
I am thinking of replacing it with a Bloom filter, but adding new entries to one seems a bit awkward. Is there another algorithm of this kind I could use?
==== Another way to put it ====
Each time a hash arrives, look it up in the collection: if it is already there, stop; if not, add it to the collection and continue. Repeat for every hash.
The problem: memory / efficiency.
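For reference, a minimal sketch of the workflow described above using a plain set (the names seen and handle are illustrative, not from the original code):

seen = set()

def handle(hash_hex):
    # Check-then-add: stop if this 40-char hex hash was seen before.
    if hash_hex in seen:
        return False        # already processed, stop
    seen.add(hash_hex)      # first occurrence: remember it and continue
    return True

With 5 million 40-character strings, the per-string object overhead of the set/dict is exactly what makes this memory-hungry.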
天蓬老师2017-04-17 13:19:05
Since the hash is 40 hexadecimal digits, I map the letters to digits and then convert the whole string into an integer before storing it. That should save memory, and the lookup cost should be below O(n).
My code:
#!/usr/bin/env python
#-*- coding:utf-8 -*-

SHIFT = 5    # 5 on a 32-bit machine; 6 on a 64-bit machine
MASK = 0x1F  # 0x1F on a 32-bit machine; 0x3F on a 64-bit machine

class BitBucket(object):
    def __init__(self):
        self._unique_key_count = 0  # number of distinct keys
        self._total_key_count = 0   # number of keys added in total
        self._bit = {}
        self._map = {'a': '1', 'b': '2', 'c': '3', 'd': '4', 'e': '5', 'f': '6'}

    def set(self, value):
        """Set the bit for value; return 1 if it was already set, else 0."""
        value = self._translate(value)
        self._total_key_count += 1
        if not self._has_key(value):
            self._unique_key_count += 1
            key = value >> SHIFT
            self._bit[key] = self._bit.get(key, 0) | (1 << (value & MASK))
            return 0
        return 1

    def exist(self, value):
        value = self._translate(value)
        if self._has_key(value):
            return True
        return False

    def clear(self, value):
        value = self._translate(value)
        if self._has_key(value):
            self._unique_key_count -= 1
            self._total_key_count -= 1
            key = value >> SHIFT
            self._bit[key] = self._bit[key] & (~(1 << (value & MASK)))
            return True
        return False

    def get_total_count(self):
        return self._total_key_count

    def get_bit_count(self):
        return self._unique_key_count

    def _has_key(self, value):
        key = value >> SHIFT
        return self._bit.get(key, 0) & (1 << (value & MASK))

    def _translate(self, value):
        value = value.lower()
        return long(''.join([self._map.get(c, c) for c in value]))


if __name__ == '__main__':
    bitBucket = BitBucket()
    bitBucket.set("a" * 40)
    print bitBucket.exist("a" * 40)
    print bitBucket.exist("b" * 40)

    bitBucket.clear("a" * 40)

    import hashlib
    for i in range(1, 27):
        a = chr(i)
        sha1 = hashlib.sha1()
        sha1.update(a)
        bitBucket.set(sha1.hexdigest())

    print bitBucket.get_total_count()
    print bitBucket.get_bit_count()

    count = 0
    for i in range(1, 30):
        a = chr(i)
        sha1 = hashlib.sha1()
        sha1.update(a)
        if bitBucket.exist(sha1.hexdigest()):
            count += 1
    assert count == bitBucket.get_bit_count()
Or you could consider a trie (dictionary tree) for this. It is best implemented in C++, which would improve both efficiency and memory use!
PHP中文网2017-04-17 13:19:05
If you use a Bloom filter, it introduces a certain false-positive rate, so it depends on whether your project can accept that. If it can, this is the best choice.
If not, just use a trie; Marisa is recommended because it is very space-efficient.
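As a rough illustration, assuming the marisa-trie package from PyPI (one Python binding for Marisa), membership tests against a static key set could look like this:

import marisa_trie

# Build a compact, read-only trie over the 40-char hex hashes.
# The two literals below are placeholders for the 5 million real keys.
hashes = [u'deadbeef' * 5, u'cafebabe' * 5]
trie = marisa_trie.Trie(hashes)

print(u'deadbeef' * 5 in trie)   # True
print(u'00' * 20 in trie)        # False

Note that the trie is static once built, so it suits a preloaded reference set better than the add-as-you-go workflow in the question.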
怪我咯2017-04-17 13:19:05
The first reaction is to use a tuple, but I am not sure how efficient that is; you could try it:

#!/usr/bin/env python3
data = {"a": 1, "b": 2, "c": 3, "d": 4, "a": 5, "c": 6}
t = tuple(data.keys())

t should then be a tuple of the unique hash keys.
天蓬老师2017-04-17 13:19:05
Definitely go with a Bloom filter: simple to implement, small memory footprint and, most importantly, efficient.
Java version
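For the Python side, here is a minimal pure-Python sketch of the idea (the bit-array size, probe count and double-hashing scheme are illustrative choices, not taken from this answer); since the keys are already 40-char hex hashes, their own bits can be reused instead of hashing again:

class BloomFilter(object):
    # 2**26 bits (8 MB) with 7 probes gives roughly a 0.2% false-positive
    # rate for 5 million keys (back-of-envelope estimate).
    def __init__(self, num_bits=1 << 26, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, hex_hash):
        # The key is itself a 160-bit hash: split it into two halves and
        # combine them (double hashing) to derive the probe positions.
        value = int(hex_hash, 16)
        h1, h2 = value >> 80, (value & ((1 << 80) - 1)) | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, hex_hash):
        for pos in self._positions(hex_hash):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, hex_hash):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(hex_hash))

bf = BloomFilter()
bf.add("a" * 40)
print("a" * 40 in bf)   # True
print("b" * 40 in bf)   # almost certainly False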
大家讲道理2017-04-17 13:19:05
The method in the link below is for reference: https://github.com/qiwsir/algorithm/blob/master/same_element_in_list.md
伊谢尔伦2017-04-17 13:19:05
Assume the 5-million-entry data set is a dictionary source_dict and what needs to be checked is a list hash_list. Then:
result = [item for item in hash_list if item in source_dict]
source_dict has to be loaded into memory first. If that uses too much memory, you can first take the key list with source_dict.keys(); call it source_keys, then:
result = [item for item in hash_list if item in source_keys]
Considering that a dictionary lookup is O(1) while a list lookup is O(n), and the data here is 5 million entries, the first approach is recommended.
天蓬老师2017-04-17 13:19:05
You can try using MapReduce to solve it; for reference, see:
Implementing MapReduce with multiprocessing
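A rough sketch of that map/reduce shape using the standard multiprocessing module (the chunk size, worker count and the name check_chunk are illustrative; it assumes a fork-based platform where workers inherit source_dict copy-on-write):

from multiprocessing import Pool

source_dict = {}   # the 5 million known hashes (placeholder)
hash_list = []     # the hashes to be checked (placeholder)

def check_chunk(chunk):
    # "map" step: each worker reports which hashes already exist
    return [h for h in chunk if h in source_dict]

if __name__ == '__main__':
    chunk_size = 100000
    chunks = [hash_list[i:i + chunk_size]
              for i in range(0, len(hash_list), chunk_size)]
    pool = Pool(4)
    partial_results = pool.map(check_chunk, chunks)
    pool.close()
    pool.join()
    # "reduce" step: merge the per-chunk results
    found = [h for part in partial_results for h in part]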
高洛峰2017-04-17 13:19:05
Use the bsddb module. Although it is not a standard-library module, it is still a common Python module.
bucket = bsddb.btopen(None)
or
bucket = bsddb.hashopen(dbfilename)
When backed by a file on disk, objects to be stored can also be pickled and then used directly as keys.
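A small usage sketch under Python 2 (the file name hashes.db is a placeholder, and has_key is assumed as the membership test; bsddb keys and values must be byte strings):

import bsddb

bucket = bsddb.hashopen('hashes.db')    # disk-backed hash table
for h in ('a' * 40, 'b' * 40):          # placeholder 40-char hex hashes
    bucket[h] = ''                      # only the key matters here

print bucket.has_key('a' * 40)          # truthy: present
print bucket.has_key('c' * 40)          # falsy: absent
bucket.close()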
大家讲道理2017-04-17 13:19:05
Thoughts: because of Python's object model, Python will never be as memory-frugal as C; every str carries extra per-object overhead.
In the end, what really needs thinking about is the architecture; these days there is rarely any need to hand-roll this kind of algorithm yourself.
迷茫2017-04-17 13:19:05
If it is a 40-character hexadecimal hash (my guess is SHA-1), that is rather wasteful for only 5 million records.
In other words, instead of indexing by the full 40-character hex string, it is better to think about how to index a set that is only 5 million strings in size.
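One hedged way to act on that observation: keep only a fixed-length prefix of each hash as an integer (the 64-bit prefix below is an illustrative choice; with 5 million keys the chance of a prefix collision is tiny, though not zero):

# Store only the first 16 hex chars (64 bits) of each hash as a plain int.
# 5 million ints in a set take far less memory than 5 million 40-char strings.
seen = set()

def add_hash(hex_hash):
    seen.add(int(hex_hash[:16], 16))

def has_hash(hex_hash):
    return int(hex_hash[:16], 16) in seen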