1. Introduction
Client: Does this key exist?
Server: It definitely does not exist / It may exist
A Bloom filter is a compact probabilistic data structure that supports efficient insertion and membership queries. When we want to check whether a key exists in a collection, a Bloom filter quickly tells us one of two things: the key definitely does not exist, or the key may exist. Compared with traditional data structures such as List, Set, and Map, it takes far less space and stays fast, but a positive answer is probabilistic rather than exact.
Bloom filters are used only to test membership in a collection. The classic use is to avoid expensive disk (or network) lookups for keys that do not exist. A lookup takes O(k) time, where k is the number of hash functions. Because k is fixed when the filter is created, this is constant with respect to the number of stored elements, so testing for non-existence is very fast.
2. Application scenarios
2.1 Cache penetration
To improve access efficiency, we often put some data in a Redis cache. A query can then be served from the cache without reading the database, which effectively improves performance.
When querying, first check whether the data is in the cache. If it is, return it directly from the cache.
If it is not, the data must be fetched from the database and then put into the cache. If a large number of requests miss the cache, for example requests for keys that exist in neither the cache nor the database, every one of them reaches the database, and the resulting load can bring the database down. This is cache penetration. With a Bloom filter in front, a request for a key that does not exist can be rejected immediately, before it touches the cache or the database.
2.2 Determining whether an item exists in massive data
HBase stores a very large amount of data. To determine whether a certain row key or column exists, it can consult a Bloom filter, which answers quickly. A positive answer carries a certain false positive rate, but a negative answer is always accurate: if the filter reports that a key does not exist, it really does not exist.
3. Problems with HashMap
A HashMap determines whether an element exists very efficiently: storing each value as a HashMap key gives O(1) average lookup time.
However, if the amount of data is very large (for example, hundreds of millions of records), a HashMap consumes a very large amount of memory, because it stores every key in full. Loading that much data into memory at once is often impractical.
4. How a Bloom filter works
A Bloom filter is a bit array (a binary vector) of length m.
Each element of the array is either 0 or 1, and all elements start at 0.
To insert a key, apply k mutually independent hash functions to it, take each result modulo the array length m, and set the bit at each resulting position to 1 (the blue cells in the diagram).
Recording every key in the array this way gives us the Bloom filter.
5. Querying an element with the Bloom filter
Given a key, apply the same k hash functions to it to obtain k positions.
Check whether all k positions are blue (set to 1). If any one of them is not blue, the key definitely does not exist.
If all of them are blue, the key may exist (a Bloom filter can produce false positives).
False positives arise because, when many keys are inserted into a relatively small array, most positions end up painted blue. When a key that was never inserted is checked, all of its positions may happen to have been set blue by other keys, and the filter wrongly concludes that the key is in the collection.
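The insert and query steps can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the class name `BloomFilter`, the sizes, and the sample words are assumptions, cells are kept in a plain list for readability rather than packed into bits, and the k hash functions are simulated by prefixing the key with an index before hashing with SHA-256.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                # number of cells
        self.k = k                # number of hash functions
        self.cells = [0] * m      # 0 = white, 1 = "blue"

    def positions(self, key):
        # One position per simulated hash function, each taken modulo m.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, key):
        for pos in self.positions(key):
            self.cells[pos] = 1

    def may_exist(self, key):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.cells[pos] == 1 for pos in self.positions(key))


bf = BloomFilter(m=64, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

# Inserted keys are always reported as possibly present (no false negatives).
inserted_ok = all(bf.may_exist(w) for w in ["apple", "banana", "cherry"])

# An empty filter rejects everything; a fully blue one accepts everything.
empty = BloomFilter(m=64, k=3)
saturated = BloomFilter(m=64, k=3)
saturated.cells = [1] * saturated.m
```

The saturated filter is the extreme case of a false positive: every query finds all of its positions blue, even for keys that were never inserted.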
6. Can it be deleted?
Traditional Bloom filters do not support deletion, because a bit set to 1 may be shared by several keys, and clearing it could silently remove other keys as well. A variant called the Counting Bloom filter replaces each bit with a small counter. It supports element deletion and can also be used to test whether an element's count is strictly below a certain threshold. The article on the principle and implementation of the Counting Bloom Filter covers this in great detail and is worth reading.
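To make the idea of deletion concrete, here is a minimal sketch of a counting variant. It assumes one integer counter per position instead of one bit; the class name `CountingBloomFilter`, the sizes, and the SHA-256 based position function are illustrative choices, not taken from any particular library.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.counters = [0] * m   # a counter per position instead of a single bit

    def _positions(self, key):
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, key):
        for pos in self._positions(key):
            self.counters[pos] += 1

    def may_exist(self, key):
        return all(self.counters[pos] > 0 for pos in self._positions(key))

    def remove(self, key):
        # Only decrement when the key looks present. Removing a key that was
        # never added (a false positive) would corrupt the counters, so callers
        # should only remove keys they know they inserted.
        if not self.may_exist(key):
            return False
        for pos in self._positions(key):
            self.counters[pos] -= 1
        return True


cbf = CountingBloomFilter()
cbf.add("alice")
cbf.add("bob")
cbf.remove("alice")           # "bob" is still reported as possibly present
```

The price of deletion is memory: each position now needs several bits instead of one.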
7. How to choose the number of hash functions and the length of the Bloom filter
Obviously, if the Bloom filter is too small, all of its bits will soon be 1, and every query will return "may exist", which defeats the purpose of filtering. The length of the filter directly affects the false positive rate: the longer the filter, the lower the rate.
The number of hash functions is also a trade-off. The more hash functions there are, the faster bits are set to 1 and the slower each operation becomes; but with too few, the false positive rate rises.
As can be seen from the above figure, increasing the number of hash functions k greatly reduces the error rate p, up to the optimal k for a given m and n. Beyond that point, extra hash functions fill the array faster and p rises again.
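The effect of k can be checked numerically. The sketch below uses the standard approximation p ≈ (1 - e^(-kn/m))^k for the false positive rate; the values of m and n (10 bits per element) are arbitrary illustrative choices.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate false positive rate for m bits, n elements, k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, n = 10_000, 1_000   # illustrative sizes: 10 bits per inserted element
rates = {k: false_positive_rate(m, n, k) for k in range(1, 16)}
best_k = min(rates, key=rates.get)

for k, p in rates.items():
    marker = "  <- lowest" if k == best_k else ""
    print(f"k={k:2d}  p={p:.4f}{marker}")
```

With these sizes the rate falls quickly as k grows from 1, bottoms out near k = (m/n) ln 2, and then climbs again as the array fills up.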
So we need to settle the values of m and k. Given the size of the filter m, the number of hash functions k, and the number of inserted elements n, the false positive rate p is approximately:
$$p \approx \left(1 - e^{-kn/m}\right)^{k}$$
Based on this, how do we choose k and m values that suit the business? If we specify the acceptable false positive rate p and the expected number of elements n, both can be calculated.
Formula:
$$m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n} \ln 2$$
k is the number of hash functions, m is the Bloom filter length, n is the number of inserted elements, and p is the false positive rate.
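As a worked example, the sizing step can be written as a small helper. It assumes the standard results m = -n ln p / (ln 2)^2 and k = (m / n) ln 2; the function name `bloom_parameters` and the sample inputs are hypothetical.

```python
import math


def bloom_parameters(n, p):
    """Return (m, k): filter length in bits and number of hash functions
    for n expected elements and a target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))   # k must be a whole number >= 1
    return m, k


m, k = bloom_parameters(n=1_000_000, p=0.01)
print(f"m = {m} bits (about {m / 8 / 1_000_000:.2f} MB), k = {k}")
```

For one million elements at a 1% false positive rate this works out to roughly 9.6 bits per element, or about 1.2 MB in total, far less than storing the keys themselves.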
As for how to derive these formulas, I have published an article on Zhihu about it. Read it if you are interested; otherwise, just remember the formulas above.
One more important point. Since the whole purpose of a Bloom filter is fast lookup, we cannot afford a slow hash function. Cryptographic hash functions (e.g. SHA-1, MD5) are a poor choice for Bloom filters because they are comparatively slow. Better choices among the faster hash functions are murmur, the FNV family, Jenkins hashing, and HashMix.
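As an illustration of a cheap non-cryptographic hash, here is a sketch of 64-bit FNV-1a together with one common way to derive k positions from two hash values (double hashing, position_i = (h1 + i * h2) mod m). Double hashing is an assumption of this sketch rather than something the text above prescribes, and the function names are hypothetical. A pure-Python loop is of course slow; the point is the structure, not the speed.

```python
FNV64_OFFSET = 0xCBF29CE484222325   # FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001B3         # FNV 64-bit prime
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data, seed=FNV64_OFFSET):
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = seed
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


def positions(key, k, m):
    """Derive k array positions from two base hashes of the key."""
    data = key.encode("utf-8")
    h1 = fnv1a_64(data)
    h2 = fnv1a_64(data, seed=h1) | 1   # second hash, forced odd so it is never 0
    return [(h1 + i * h2) % m for i in range(k)]


print(positions("hello", k=7, m=1000))
```

Computing only two base hashes per key keeps the cost of an insert or lookup low even when k is large.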
8. More application scenarios
You can warn users who enter a weak password by checking it against a Bloom filter of known weak passwords.
You can use bloom filters to prevent users from visiting malicious websites.
Instead of querying a SQL database to check whether a user with a specific email exists, you can first do a cheap lookup in a Bloom filter. If the filter says the email does not exist, you are done. If it says the email may exist, you still need to query the database to confirm. The same approach works for "username already taken" checks.
You can keep a Bloom filter keyed on your visitors' IP addresses to check whether a visitor is a "returning user" or a "new user". A few new visitors misclassified as returning users do little harm.
You can also implement spell checking by storing dictionary words in a Bloom filter.
The above is the detailed content of "What is the algorithm formula for Redis bloom filter size?" on the PHP Chinese website.
