PHP data structure: clever use of Bloom filters to achieve efficient collection retrieval

WBOY
Original
2024-06-01 16:04:05

A Bloom filter is a space-efficient probabilistic data structure for testing whether an element belongs to a set. It uses several hash functions and a bit array to answer membership queries quickly, at the cost of occasional false positives. It suits scenarios where a large number of elements must be checked quickly, such as detecting duplicate URLs.

Introduction

A Bloom filter is a highly space-efficient data structure for determining whether an element belongs to a set. It combines several hash functions with a bit array to check membership without storing the elements themselves.

Principle

A Bloom filter starts with a bit array in which every position is 0. To add an element, the element is hashed with several hash functions; each hash value, taken modulo the array size, gives an index into the bit array, and the bit at that index is set to 1.

If an element belongs to the set, every position its hash values point to is 1, so a lookup that finds a 0 at any of those positions proves the element was never added. The reverse does not hold: all of the positions for an element that was never added may already have been set to 1 by other elements. This is called a false positive.
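
The rule above can be tried on a deliberately tiny filter. The following is a minimal sketch, assuming a 16-bit array and 3 hash functions; the helper names positions() and mightContain() are hypothetical and exist only for this illustration. The hashing mirrors the salted crc32 approach used later in this article.

```php
<?php
// Illustrative only: a 16-bit filter with 3 hash functions.
$size = 16;
$numHashes = 3;
$bits = array_fill(0, $size, 0);

// Hypothetical helper: derive one bit position per hash function by
// salting crc32 with the function's index.
function positions(string $key, int $numHashes, int $size): array
{
    $result = [];
    for ($i = 0; $i < $numHashes; $i++) {
        $result[] = abs(crc32($key . $i)) % $size;
    }
    return $result;
}

// Hypothetical helper: report "possibly present" only if every position is 1.
function mightContain(array $bits, string $key, int $numHashes, int $size): bool
{
    foreach (positions($key, $numHashes, $size) as $p) {
        if ($bits[$p] === 0) {
            return false; // a single clear bit proves the key was never added
        }
    }
    return true; // all bits set: present, or a false positive
}

// Adding an element sets every one of its positions to 1.
foreach (positions('apple', $numHashes, $size) as $p) {
    $bits[$p] = 1;
}

var_dump(mightContain($bits, 'apple', $numHashes, $size)); // bool(true): an added key is always found
```

With only 16 bits, a key that was never added has a realistic chance of landing entirely on bits that 'apple' already set, which is exactly the false-positive case described above.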

Implementation

class BloomFilter {

    // Filter size (number of bits)
    private $size;

    // Number of hash functions
    private $numHashes;

    // Hash functions
    private $hashFunctions;

    // Bit array backing the filter
    private $filter;

    // Initialize the Bloom filter
    public function __construct($size, $numHashes) {
        $this->size = $size;
        $this->numHashes = $numHashes;
        $this->initHashFunctions();
        $this->filter = array_fill(0, $this->size, 0);
    }

    // Build the hash functions: crc32 salted with each function's index
    private function initHashFunctions() {
        $this->hashFunctions = [];
        for ($i = 0; $i < $this->numHashes; $i++) {
            $this->hashFunctions[] = function ($key) use ($i) {
                return abs(crc32($key . $i));
            };
        }
    }

    // Add an element to the filter
    public function add($element) {
        foreach ($this->hashFunctions as $hashFunction) {
            $index = $hashFunction($element) % $this->size;
            $this->filter[$index] = 1;
        }
    }

    // Check whether an element may be in the filter
    public function isExists($element) {
        foreach ($this->hashFunctions as $hashFunction) {
            $index = $hashFunction($element) % $this->size;
            if ($this->filter[$index] === 0) {
                // A clear bit means the element was definitely never added
                return false;
            }
        }
        // Every bit for this element is set, but this may still be a false positive
        return true;
    }
}

Practical case: detecting URL duplication

Goal: Use a Bloom filter to quickly detect whether a URL has already been crawled, even when the number of URLs is large.

Implementation:

  1. Initialize the Bloom filter, set the size and number of hash functions.
  2. Call the add() method for each crawled URL to add it to the filter.
  3. When encountering a new URL, call the isExists() method to check whether it may already be in the filter. If it returns true, skip the URL (accepting that a small fraction of skips are false positives); otherwise, add the URL to the filter and crawl it.
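
The three steps above can be sketched as follows. This is a self-contained illustration rather than the article's class: PackedBloomFilter is a hypothetical variant that keeps the same add() and isExists() interface but stores 8 filter bits per byte in a string, because a PHP array of integers spends far more than one bit of memory per entry. The filter size, hash count, and example.com URLs are assumptions chosen for the demo.

```php
<?php
// Hypothetical compact variant; not part of the article's code.
final class PackedBloomFilter
{
    private int $size;
    private int $numHashes;
    private string $bits; // each byte holds 8 filter bits

    public function __construct(int $size, int $numHashes)
    {
        $this->size = $size;
        $this->numHashes = $numHashes;
        $this->bits = str_repeat("\0", intdiv($size + 7, 8));
    }

    // Same salted-crc32 scheme as the article's hash functions.
    private function index(string $key, int $i): int
    {
        return abs(crc32($key . $i)) % $this->size;
    }

    public function add(string $element): void
    {
        for ($i = 0; $i < $this->numHashes; $i++) {
            $pos = $this->index($element, $i);
            $byte = intdiv($pos, 8);
            // Set the single bit for this position inside its byte.
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($pos % 8)));
        }
    }

    public function isExists(string $element): bool
    {
        for ($i = 0; $i < $this->numHashes; $i++) {
            $pos = $this->index($element, $i);
            if ((ord($this->bits[intdiv($pos, 8)]) & (1 << ($pos % 8))) === 0) {
                return false; // definitely never added
            }
        }
        return true; // added before, or a false positive
    }
}

// Step 1: initialize the filter (assumed sizing: 2^20 bits = 128 KiB, 7 hashes).
$seen = new PackedBloomFilter(1 << 20, 7);

// Made-up crawl queue containing one repeated URL.
$queue = [
    'https://example.com/',
    'https://example.com/about',
    'https://example.com/',
];

$toCrawl = [];
foreach ($queue as $url) {
    // Step 3: skip URLs the filter reports as already seen.
    if ($seen->isExists($url)) {
        continue;
    }
    // Step 2: record the URL, then hand it to the crawler.
    $seen->add($url);
    $toCrawl[] = $url;
}

print_r($toCrawl); // the repeated URL is listed only once
```

Because a skipped URL might be a false positive, this pattern fits crawlers that can tolerate occasionally missing a page; where that is not acceptable, a positive answer can be confirmed against an exact store.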

Advantages:

  • Space efficient: The filter stores only bits, not the elements themselves, so its size is fixed when it is created and does not grow as elements are added. The false-positive rate does rise as the filter fills up.
  • Fast retrieval: A lookup computes a fixed number of hashes and checks that many bits, so it never needs to traverse the collection.
  • Tunable error rate: Bloom filters produce some false positives (but never false negatives), and the bit-array size and number of hash functions can be adjusted to bring the error rate down to an acceptable level.
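
To make the tuning concrete, here is a small sketch based on the standard approximations for Bloom filters: for n expected elements and a target false-positive rate p, a bit-array size of m = -n ln(p) / (ln 2)^2 and a hash count of k = (m / n) ln 2 are the usual starting points. The function name bloomParameters() is hypothetical, and the formulas assume well-distributed, independent hash functions, which salted crc32 only approximates.

```php
<?php
// Hypothetical helper: suggest filter parameters for a target false-positive rate.
function bloomParameters(int $expectedItems, float $falsePositiveRate): array
{
    // m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    $size = (int) ceil(-$expectedItems * log($falsePositiveRate) / (M_LN2 ** 2));

    // k = (m / n) * ln 2, rounded to the nearest whole hash function
    $numHashes = max(1, (int) round($size / $expectedItems * M_LN2));

    return ['size' => $size, 'numHashes' => $numHashes];
}

// One million URLs at a 1% false-positive rate:
// roughly 9.6 million bits (about 1.2 MB when packed) and 7 hash functions.
$params = bloomParameters(1000000, 0.01);
print_r($params);
```

The returned values can be passed as the $size and $numHashes arguments of the BloomFilter constructor.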
