The Bloom filter is a space-efficient data structure used to determine whether an element belongs to a set. It uses several hash functions and a bit array to check membership quickly, at the cost of occasional false positives. It suits scenarios where a large number of elements must be checked quickly, such as URL duplicate detection.
PHP data structure: clever use of Bloom filters for efficient set membership lookup
Introduction
The Bloom filter is a highly space-efficient data structure used to determine whether an element belongs to a set. It uses several hash functions and a bit array to test efficiently whether the element is present.
Principle
The Bloom filter initializes a bit array in which every position is 0. To add an element, the element is hashed with several hash functions, each hash value is used as an index into the bit array, and the bit at each of those positions is set to 1.
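To make the indexing step concrete, here is a minimal sketch. The helper name bloomIndexes() is hypothetical, and salting crc32() with a counter to simulate several hash functions follows the approach used in the implementation later in this article. It assumes 64-bit PHP 7.0 or later.

```php
<?php
// Hypothetical helper: map an element to $numHashes positions
// in a bit array of $size bits.
function bloomIndexes(string $element, int $size, int $numHashes): array
{
    $indexes = [];
    for ($i = 0; $i < $numHashes; $i++) {
        // Appending $i to the input simulates $numHashes different hash functions.
        $indexes[] = abs(crc32($element . $i)) % $size;
    }
    return $indexes;
}

// A tiny 16-bit filter: every position starts at 0.
$bits = array_fill(0, 16, 0);

// Adding 'apple' sets the bit at each of its positions to 1.
foreach (bloomIndexes('apple', 16, 3) as $index) {
    $bits[$index] = 1;
}

echo implode('', $bits), PHP_EOL;
```

Two of the three hash values may land on the same position, so an element sets at most three bits here, not necessarily exactly three.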
If an element belongs to the set, every position that its hash values point to is 1. The converse does not hold: for an element that was never added, all of those positions may already have been set by other elements, so the filter reports it as present. This is called a false positive. A negative answer, by contrast, is always correct.
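How often false positives occur depends on the number of bits m, the number of hash functions k, and the number of inserted elements n. The standard approximation is (1 - e^(-kn/m))^k. The sketch below applies it, together with the usual sizing formulas. The function names are hypothetical and not part of the article's class, and the approximation assumes the hash functions spread elements uniformly.

```php
<?php
// Approximate false-positive probability after inserting $n elements
// into a filter of $m bits with $k hash functions: (1 - e^(-k*n/m))^k.
function estimateFalsePositiveRate(int $m, int $k, int $n): float
{
    return (1 - exp(-$k * $n / $m)) ** $k;
}

// Number of bits needed to hold $n elements at a target false-positive rate $p.
function optimalSize(int $n, float $p): int
{
    return (int) ceil(-$n * log($p) / (log(2) ** 2));
}

// Number of hash functions that minimizes the false-positive rate.
function optimalNumHashes(int $m, int $n): int
{
    return max(1, (int) round($m / $n * log(2)));
}

$n = 1000000;                   // expected number of elements
$m = optimalSize($n, 0.01);     // target: about 1% false positives
$k = optimalNumHashes($m, $n);

printf(
    "bits: %d, hashes: %d, estimated rate: %.4f\n",
    $m,
    $k,
    estimateFalsePositiveRate($m, $k, $n)
);
```

For one million elements and a 1% target, this gives roughly 9.6 million bits (about 1.2 MB if packed 8 bits per byte) and 7 hash functions.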
Implementation
class BloomFilter
{
    // Filter size (number of bits)
    private $size;

    // Number of hash functions
    private $numHashes;

    // Hash functions
    private $hashFunctions;

    // Bit array backing the filter
    private $filter;

    // Initialize the Bloom filter
    public function __construct($size, $numHashes)
    {
        $this->size = $size;
        $this->numHashes = $numHashes;
        $this->initHashFunctions();
        $this->filter = array_fill(0, $this->size, 0);
    }

    // Initialize the hash functions: crc32 salted with the function's index
    private function initHashFunctions()
    {
        $this->hashFunctions = [];
        for ($i = 0; $i < $this->numHashes; $i++) {
            $this->hashFunctions[] = function ($key) use ($i) {
                return abs(crc32($key . $i));
            };
        }
    }

    // Add an element to the filter
    public function add($element)
    {
        foreach ($this->hashFunctions as $hashFunction) {
            $index = $hashFunction($element) % $this->size;
            $this->filter[$index] = 1;
        }
    }

    // Check whether an element may be in the filter
    public function isExists($element)
    {
        foreach ($this->hashFunctions as $hashFunction) {
            $index = $hashFunction($element) % $this->size;
            if ($this->filter[$index] === 0) {
                // At least one bit is unset, so the element was never added
                return false;
            }
        }

        // Every bit for this element is set, but this may be a false positive
        return true;
    }
}
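The class above stores each bit as a separate PHP array element, which costs far more than one bit of memory per position. As a sketch of one possible alternative, the variant below packs the bits into a binary string, 8 positions per byte. The class name PackedBloomFilter is hypothetical, the hashing scheme is copied from the class above, and the code assumes 64-bit PHP 7.4 or later (for typed properties).

```php
<?php
// Hypothetical variant: stores the bit array as a binary string, 8 bits per byte.
class PackedBloomFilter
{
    private int $size;
    private int $numHashes;
    private string $bits;

    public function __construct(int $size, int $numHashes)
    {
        $this->size = $size;
        $this->numHashes = $numHashes;
        // Enough zero bytes to hold $size bits; every position starts at 0.
        $this->bits = str_repeat("\0", intdiv($size + 7, 8));
    }

    // Positions for an element, using crc32 salted with the hash index.
    private function indexes(string $element): array
    {
        $indexes = [];
        for ($i = 0; $i < $this->numHashes; $i++) {
            $indexes[] = abs(crc32($element . $i)) % $this->size;
        }
        return $indexes;
    }

    public function add(string $element): void
    {
        foreach ($this->indexes($element) as $index) {
            $byte = intdiv($index, 8);
            // Set the bit for this position inside its byte.
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($index % 8)));
        }
    }

    public function isExists(string $element): bool
    {
        foreach ($this->indexes($element) as $index) {
            $byte = ord($this->bits[intdiv($index, 8)]);
            if (($byte & (1 << ($index % 8))) === 0) {
                return false; // an unset bit means the element was never added
            }
        }
        return true; // all bits set: present, or a false positive
    }
}

$filter = new PackedBloomFilter(1024, 3);
$filter->add('apple');
var_dump($filter->isExists('apple')); // bool(true): added elements are always found
```

With this layout a 1024-bit filter occupies 128 bytes of string data, and the public interface stays the same as in the array-based class.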
Practical case: detecting URL duplication
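Goal: Use a Bloom filter to quickly detect which of a large number of URLs have already been crawled.
Implementation:
Call the add() method for each crawled URL to add it to the filter.
For each newly discovered URL, call the isExists() method to quickly check whether it already exists in the filter. If it exists, the URL is skipped; otherwise, the new URL is added to the filter.
Advantages: The filter needs far less memory than storing every URL, and each check takes only a few hash computations. The trade-off is that a false positive occasionally causes a URL that has not been crawled to be skipped.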
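The check-then-add flow described above can be sketched as a single operation. The class SeenUrls and its method markIfNew() are hypothetical names introduced for this illustration, and the example.com URLs are placeholders. The hashing repeats the crc32 scheme from the implementation section so that the snippet runs on its own, and it assumes 64-bit PHP 7.4 or later.

```php
<?php
// Hypothetical wrapper for a crawler: remembers URLs in a Bloom filter.
class SeenUrls
{
    private int $size;
    private int $numHashes;
    private array $filter;

    public function __construct(int $size, int $numHashes)
    {
        $this->size = $size;
        $this->numHashes = $numHashes;
        $this->filter = array_fill(0, $size, 0);
    }

    // Returns true if the URL is new (and records it),
    // false if it was probably seen before.
    public function markIfNew(string $url): bool
    {
        $isNew = false;
        for ($i = 0; $i < $this->numHashes; $i++) {
            $index = abs(crc32($url . $i)) % $this->size;
            if ($this->filter[$index] === 0) {
                // An unset bit proves the URL was never added before.
                $isNew = true;
                $this->filter[$index] = 1;
            }
        }
        return $isNew;
    }
}

$seen = new SeenUrls(10000, 3);

// Placeholder queue: the first URL appears twice.
$queue = [
    'https://example.com/',
    'https://example.com/about',
    'https://example.com/',
];

$toCrawl = [];
foreach ($queue as $url) {
    if ($seen->markIfNew($url)) {
        $toCrawl[] = $url; // not seen before: schedule it for crawling
    }
}

print_r($toCrawl);
```

A repeated URL is always skipped, because all of its bits are already set. A URL that has never been seen can occasionally be skipped as well, when its bits were all set by other URLs, so the filter should be sized generously for the expected number of URLs.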