Home >Database >Redis >An in-depth analysis of Bloom Filter in Redis

An in-depth analysis of Bloom Filter in Redis

青灯夜游forward: 2021-12-23 09:57:373099browse

How to avoid cache penetration? The following article will take you to learn about the Bloom Filter (Bloom Filter), a powerful tool in Redis to avoid cache penetration. I hope it will be helpful to you!

Overview

Bloom Filter (Bloom Filter) is a data structure, invented by Burton Howard Bloom in 1970 proposed in the year. It's actually a long binary vector and a series of random mapping functions. [Related recommendations: Redis Video Tutorial]

Bloom filters can be used to efficiently retrieve whether an element is in a collection. Its advantage is that space efficiency and query time are far better than ordinary algorithms. Its disadvantage is that it has a certain misrecognition rate and is difficult to delete (generally not supported and requires additional implementation).

The false positive rate means that it can be judged that the element is definitely not in the set, it can be judged that the element may be in the set, but it cannot be judged that the element must be in the set.

The reason why Bloom filter is efficient is that it is a probabilistic data structure that can confirm that the element is definitely not in the set, or the element may be in the set. The reason why it is possible is that it has a certain misrecognition rate, making it impossible to be 100% sure that the element must be in the set.

Question introduction

In daily work, there is a common need to determine whether an element is in a set. For example, the following scenario

Given an IP blacklist database, check whether the specified IP is in the blacklist?
When receiving emails, how to determine whether an email address is spam?
In word processing software, check whether an English word is spelled correctly?

When encountering this kind of problem, our intuition usually tells us that we should use a data structure like a set to implement it. For example, first store all IPs in the IP blacklist database into a set, and then check whether the specified IP exists in the set. If it exists, it means that the IP hits the blacklist.

Use a piece of code to simulate the storage and checking of the IP blacklist library.

public class IPBlackList {

	public static void main(String[] args) {
		Set<String> set = new HashSet<>();
		set.add("192.168.1.1");
		set.add("192.168.1.2");
		set.add("192.168.1.4");
		System.out.println(set.contains("192.168.1.1"));
		System.out.println(set.contains("192.168.1.2"));
		System.out.println(set.contains("192.168.1.3"));
		System.out.println(set.contains("192.168.1.4"));
	}
}

The interior of the collection is usually implemented using a hash table. The advantage is that the query is very efficient, but the disadvantage is that it consumes more storage space.

Generally when the amount of data is relatively small, we will use collections for storage. Trading space for time can improve query efficiency while occupying less space.

However, when the amount of data stored is relatively large, consuming a lot of space will become a problem. Because these data are usually stored in process memory to speed up query efficiency. The memory of the machine is usually limited and must be used as efficiently as possible.

On the other hand, hash tables need to be balanced in terms of space and efficiency. Storing the same number of elements, if the hash table capacity is smaller, the probability of conflict is higher, and more time will be spent to resolve conflicts, thus affecting performance.

The emergence of Bloom Filter (Bloom Filter) can solve this problem very well. On the one hand, it can store data with less memory, and on the other hand, it can achieve very efficient query performance.

Basic Principle

First, create a binary vector and set all bits to 0
Then, select K hash functions are used to hash the elements K times and calculate the bit index of the vector.

Add element

When adding an element to the set, K hash functions are applied to the elements respectively to generate K values as a subscript and sets the corresponding bit of the vector to 1.

Checking elements

If you want to check whether an element exists in the set, use the same hashing method to generate K subscripts and check whether the corresponding bits of the vector are All are 1.

If all are 1, the element is likely to be in the set; otherwise (as long as 1 or more bits are 0), the element is definitely not in the set.

Demo

Suppose there is a Bloom filter with a capacity of 15 bits and 2 hash functions used, as shown below.

|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15| |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |0||||||||||||||||||

Add a string a, hash twice to get the following Marked as 4 and 10, mark the bits corresponding to 4 and 10 from 0 to 1.

|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15| |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |||||1|||||1|||||||||

Add another string b, hashed 2 times The subscript is obtained as 11, and the bit corresponding to 11 is marked from 0 to 1.

|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15| |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |||||1||||||1|1|||||||

Add another string c, hash twice to get the subscripts 11 and 12, and mark the corresponding bits from 0 to 1.

|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15| |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |||||1|||||1|1|1||||||

Finally add a string sam, 2 The subscripts 0 and 7 are obtained through the second hash, and the corresponding bits are marked from 0 to 1.

|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15| |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |1||||1|||1|||1|1|1|||||||

Above, we added 4 strings, each string was done 2 times In the hash, the corresponding 2 bits are marked as 1, and ultimately there are 6 bits marked as 1 instead of 8 bits.

This shows that the positions obtained by hashing different elements may overlap. The more elements there are, the higher the probability of overlap. If there are two elements in overlapping positions, we cannot judge that either element must be in the set.

If you want to check whether an element exists in the set, you only need to hash it twice in the same way, and search the corresponding bits in the Bloom filter for the two subscripts obtained. If the corresponding 2 bits are not all 1, then the element is definitely not in the set. If the corresponding 2 bits are all 1, it means that the element may be in the set or it may not exist.

For example, check whether the string b exists in the set, and the two subscripts obtained by hashing are both 11. Check and find that the bit corresponding to 11 is 1. However, this does not mean that b must be in the set. This is because the hashed subscript of the string c also contains 11. It is possible that the string c is in the set, but b does not exist. This causes misidentification. Also called a false positive.

Check the string foo again, the subscripts obtained by the hash are 8 and 13 respectively, and the corresponding bits are all 0. Therefore, the string foo is definitely not in the set.

Mathematical Principle

The mathematical principle behind the Bloom filter is

The probability of two completely random numbers colliding is very small, so it can be used in a very small Under the condition of low false recognition rate, a large amount of information is stored in a small amount of space.

Misrecognition rate

The misrecognition rate (FPP, false positive probabilistic) is calculated as follows.

Assume that the Bloom filter size is m bits, n elements are stored, and k times the hash function is used to calculate the storage location.

Add 1 element, then the probability that any bit is 1 is 1/m, and the probability that any bit is 0 is 1-1/m;
Add 1 element and perform k hashes, then the probability that any bit is 0 is (1-1/m)^k, and any bit is The probability of 1 is 1-(1-1/m)^k;
If n elements are added, the probability that any bit is 0 is (1-1 /m)^kn, the probability that any bit is 1 is 1-(1-1/m)^kn;
If a new element is added, In a Bloom filter that already has n elements, the probability that any bit is already 1 is the same as above, and the probability is 1-(1-1/m)^kn. Therefore, the probability that all k bits are 1 is: (1-(1-1/m)^kn)^k, which is the misrecognition rate of newly inserted elements.

When the n value is relatively large, $(1-(1-1/m)^{kn})^k$ is approximately equal to $ (1-e^{-kn/m})^k$

Assume that the bloom filter size m is 16 bits, k is 8, and the storage element n is 1, then the probability of misidentification (false positive) is five out of ten thousand (5/10000).

At this time, m/n=16, in fact this means that 1 element uses 16 bits of space to store.

Therefore, when k is the same, m/n has the same misrecognition rate when it is 16/1, 32/2, 64/4, etc.

WebsiteBloom Filters - the math gives the false recognition rate table of Bloom filters, which can easily check and deal with different m, n, k False recognition rate at a given value.

Solving the false recognition rate

Common methods to solve the false recognition rate include

Whitelist
Backtracking confirmation

Whitelist

A common way to solve the false recognition rate is to establish a smaller whitelist to store those who may Misidentified data.

Take spam filtering as an example. Suppose we have a spam library that filters out spam when receiving emails.

At this time, you can first store this spam library in the Bloom filter. When receiving emails, you can first filter out most normal emails through the Bloom filter efficiently.

For a small number of hit (possibly) spam emails, some of them may be normal emails.

再创建一个白名单库，当在布隆过滤器中判断可能为垃圾邮件时，通过查询白名单来确认是否为正常邮件。

对于没在白名单中的邮件，默认会被移动到垃圾箱。通过人工识别的方式，当发现垃圾箱中存在正常邮件的时候，将其移入白名单。

回源确认

很多时候，使用布隆过滤器是为了低成本，高效率的拦截掉大量数据不在集合中的场景。例如

Google Bigtable，Apache HBase以及Apache Cassandra和PostgreSQL 使用 Bloom 过滤器来减少对不存在的行或列的磁盘查找。避免进行昂贵的磁盘查找，可大大提高数据库查询操作的性能。
在谷歌浏览器用于使用布隆过滤器来识别恶意URL的网页浏览器。首先会针对本地 Bloom 过滤器检查所有 URL，只有在 Bloom 过滤器返回肯定结果的情况下，才对执行的 URL 进行全面检查（如果该结果也返回肯定结果，则用户会发出警告）。
拦截掉大量非IP黑名单请求，对于少量可能在黑名单中的IP，再查询一次黑名单库。

这是布隆过滤器非常典型的应用场景，先过滤掉大部分请求，然后只处理少量不明确的请求。

这个方法，和白名单库的区别是，不需要再另外建立一套库来处理，而是使用本来就已经存在的数据和逻辑。

例如 Google Bigtable 查询数据行本来就是需要查的，只不过使用布隆过滤器拦截掉了大部分不必要的请求。而 IP 是否为黑名单也是需要查询的，同样是先使用布隆过滤器来拦截掉大部分IP。

而上面垃圾邮件的处理，对于可能为垃圾邮件的情况，不是通过完整的垃圾邮件库再查询一次进行确认，而是用增加白名单来进行判断的方式。因为通常来说，白名单库会更小，便于缓存。

这里所说的回源，实际上是对可能被误识别的请求，最后要回到数据源头或逻辑确认一次。

Guava中的布隆容器的实现

Guava 中就提供了一种 Bloom Filter 的实现。

guava包引入

要使用 Bloom Filter，需要引入 guava 包

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>23.0</version>
</dependency>

误判率测试

下面对布隆容器的误判率进行测试，分2步

往过滤器中放一百万个数，然后去验证这一百万个数是否能通过过滤器
另外找一万个数，去检验漏网之鱼的数量

/**
 * 测试布隆过滤器(可用于redis缓存穿透)
 * 
 * @author 敖丙
 */
public class TestBloomFilter {

    private static int total = 1000000;
    private static BloomFilter<Integer> bf = BloomFilter.create(Funnels.integerFunnel(), total);
//    private static BloomFilter<Integer> bf = BloomFilter.create(Funnels.integerFunnel(), total, 0.001);

    public static void main(String[] args) {
        // 初始化1000000条数据到过滤器中
        for (int i = 0; i < total; i++) {
            bf.put(i);
        }

        // 匹配已在过滤器中的值，是否有匹配不上的
        for (int i = 0; i < total; i++) {
            if (!bf.mightContain(i)) {
                System.out.println("有坏人逃脱了~~~");
            }
        }

        // 匹配不在过滤器中的10000个值，有多少匹配出来
        int count = 0;
        for (int i = total; i < total + 10000; i++) {
            if (bf.mightContain(i)) {
                count++;
            }
        }
        System.out.println("误伤的数量：" + count);
    }

}

运行结果

误伤的数量：320

运行结果表示，遍历这一百万个在过滤器中的数时，都被识别出来了。一万个不在过滤器中的数，误伤了320个，错误率是0.03左右。

Bloom Filter 源码分析

public static <T> BloomFilter<T> create(Funnel<? super T> funnel, int expectedInsertions) {
        return create(funnel, (long) expectedInsertions);
    }  

    public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions) {
        return create(funnel, expectedInsertions, 0.03); // FYI, for 3%, we always get 5 hash functions
    }

    public static <T> BloomFilter<T> create(
          Funnel<? super T> funnel, long expectedInsertions, double fpp) {
        return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64);
    }

    static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
     ......
    }

Bloom Filter 一共四个 create 方法，不过最终都是走向第4个。看一下每个参数的含义

funnel：数据类型(一般是调用Funnels工具类中的)
expectedInsertions：期望插入的值的个数
fpp：错误率(默认值为0.03)
strategy：哈希算法 Bloom Filter的应用

更多编程相关知识，请访问：编程视频！！

The above is the detailed content of An in-depth analysis of Bloom Filter in Redis. For more information, please follow other related articles on the PHP Chinese website!

Statement：

This article is reproduced at:juejin.cn. If there is any infringement, please contact admin@php.cn delete

Previous article：In-depth understanding of special data types in Redis: cardinality statistics, bitmaps, geographical locationNext article：In-depth understanding of special data types in Redis: cardinality statistics, bitmaps, geographical location

See more

An in-depth analysis of Bloom Filter in Redis

Overview

Question introduction

Basic Principle

Add element

Checking elements

Demo

Mathematical Principle

Misrecognition rate

Solving the false recognition rate

Whitelist

回源确认

Guava中的布隆容器的实现

guava包引入

误判率测试

Bloom Filter 源码分析

Related articles