
How to implement a more comprehensive Golang version of the cuckoo filter

藏色散人 · 2021-03-11

This article, from the Go language tutorial column, describes how I implemented a more comprehensive Golang version of the cuckoo filter. I hope it helps anyone who needs it!

"Determining whether a value is in a huge set" (hereafter collectively referred to as set membership testing) is a common data-processing problem. In past experience, if a certain false positive rate is acceptable, Bloom filters were the first choice, but now we have a better option: cuckoo filters.
A recent business requirement called for a filter. After some research, I found that in our scenario the cuckoo filter is more cost-effective than the Bloom filter.
To settle the final technology selection, I read the original paper. Once I had decided on the cuckoo filter, I found that there was almost no comprehensive Golang implementation: the current high-star implementations on GitHub all have some flaws and do not maximize space utilization. So, referring to the original paper and its reference implementation, I ported and optimized a Golang library. The details are below.
The code is here; stars, usage, contributions, and bug reports are all welcome: github.com/linvon/cuckoo-filter

cuckoo filter

There are already many introductory articles about cuckoo filters online, so I won't go into much detail here; I will only mention the key points needed for what follows.

If you want to know more details, you can refer to the original paper, or check out my Chinese translation of it.

What is a cuckoo filter?

It is a filter built on the cuckoo hashing algorithm: essentially a cuckoo hash table that stores fingerprints (short hashes) of the inserted items.

If you understand Bloom filters, you know their principle: map each item through several hash functions to positions in a bit array, set those bits on insert, and check them on query to decide whether the item exists.

The cuckoo filter instead hashes the item, takes a fixed number of bits from its hash value (the fingerprint), and stores that in an array of buckets. A query checks whether an equal fingerprint exists in the item's candidate buckets.
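To make this concrete, here is a minimal sketch of how a fingerprint and a bucket index might be derived from an item. This is not the library's actual code; the function names, the FNV-64 hash, and the bit widths are illustrative assumptions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fingerprint derives an fBits-bit fingerprint from an item's hash.
func fingerprint(data []byte, fBits uint) uint32 {
	h := fnv.New64a()
	h.Write(data)
	// Keep only the low fBits bits; reserve 0 to mean "empty slot".
	fp := uint32(h.Sum64()) & ((1 << fBits) - 1)
	if fp == 0 {
		fp = 1
	}
	return fp
}

// bucketIndex maps an item's hash to one of numBuckets buckets.
// numBuckets must be a power of two so the mask works.
func bucketIndex(data []byte, numBuckets uint) uint {
	h := fnv.New64a()
	h.Write(data)
	return uint(h.Sum64()>>32) & (numBuckets - 1)
}

func main() {
	fp := fingerprint([]byte("hello"), 8)
	idx := bucketIndex([]byte("hello"), 1024)
	fmt.Printf("fingerprint=%d bucket=%d\n", fp, idx)
}
```

On a query, the same two computations are repeated and the candidate buckets are scanned for an equal fingerprint.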

Why choose cuckoo filter?

Both store hash values, essentially multi-bit hashes. Why is the cuckoo filter better?

  • First, because the cuckoo hash table is more compact, it can save more space.

  • Second, a Bloom filter query computes several different hash functions, while the cuckoo filter needs only one hash, so its query efficiency is very high.

  • Third, the cuckoo filter supports deletion, while the Bloom filter does not.

The advantages are clear, but what are the disadvantages compared with the Bloom filter?

  • The cuckoo filter uses a backup candidate bucket scheme: the candidate bucket and the preferred bucket can each be derived from the other by XORing the bucket index with (a hash of) the stored fingerprint. This correspondence requires the number of buckets to be a power of two.
  • When a Bloom filter inserts, it computes the hashes and writes the bits directly, while a cuckoo filter may find that the computed position already holds a fingerprint. It then has to kick the resident fingerprint out into its candidate bucket. As the table fills, conflicts become more and more likely and insertions take longer and longer, so its insertion performance is much worse than the Bloom filter's.
  • Inserting duplicate elements: inserting a duplicate into a Bloom filter has no effect beyond re-setting bits that are already set. A cuckoo filter, however, will kick out existing values, so there is an upper limit on how many times the same element can be inserted.
  • Deletion in the cuckoo filter is not perfect: besides the limit on repeated insertion above, deletion has a related problem. Deletion is only reliable when the same hash value was inserted exactly once. If an element is deleted without ever having been inserted, another item may be accidentally deleted (for the same reason false positives occur); if an element was inserted multiple times, each delete removes only one copy, so you must either know how many times it was inserted, or run the deletion in a loop until it fails.
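The "delete in a loop until it fails" idea can be sketched in a few lines. The `Filter` interface below is hypothetical, written only for illustration; the real library's API may differ, and `countingFilter` is a toy stand-in, not a cuckoo filter:

```go
package main

import "fmt"

// Filter is a minimal hypothetical interface for illustration only.
type Filter interface {
	Add(item []byte) bool
	Contain(item []byte) bool
	Delete(item []byte) bool
}

// deleteAll removes every stored copy of item by looping until a
// Delete attempt fails, and reports how many copies were removed.
func deleteAll(f Filter, item []byte) int {
	n := 0
	for f.Delete(item) {
		n++
	}
	return n
}

// countingFilter is a toy stand-in used only to demonstrate deleteAll.
type countingFilter struct{ counts map[string]int }

func (c *countingFilter) Add(item []byte) bool     { c.counts[string(item)]++; return true }
func (c *countingFilter) Contain(item []byte) bool { return c.counts[string(item)] > 0 }
func (c *countingFilter) Delete(item []byte) bool {
	if c.counts[string(item)] == 0 {
		return false
	}
	c.counts[string(item)]--
	return true
}

func main() {
	f := &countingFilter{counts: map[string]int{}}
	f.Add([]byte("x"))
	f.Add([]byte("x"))
	f.Add([]byte("x"))
	fmt.Println(deleteAll(f, []byte("x"))) // prints 3
}
```

Note that on a real cuckoo filter this loop carries the false-deletion risk described above if a colliding fingerprint from another item is present.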

With the advantages and disadvantages listed, let's summarize. For set membership testing, most scenarios are read-heavy and write-light, and repeated insertions are meaningless. Although deletion in the cuckoo filter is imperfect, it is better than nothing, and the filter also offers better query and storage efficiency. In most cases, it is the more cost-effective choice.

Practical Guide

Detailed Implementation

Let's start with the concepts. A cuckoo filter consists of many buckets, each of which stores fingerprints of the inserted items, i.e. a fixed number of bits taken from each item's hash.

The filter has n buckets, with n calculated from the number of items to be stored. A hash algorithm determines which bucket an item goes into; each additional hash algorithm gives an item one more candidate bucket. When an insertion collides, the currently stored fingerprint is kicked out into one of its candidate buckets. In theory, the more hash algorithms, the higher the space utilization, but in actual testing k=2 hash functions already achieve a utilization of up to 98%.
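The kicking procedure can be sketched as follows. This is an illustrative simplification, not the library's code: slot value 0 marks an empty slot, `i1`/`i2` are the item's two candidate buckets, and `altIndex` must map each (index, fingerprint) pair to the other candidate:

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	bucketSize = 4   // b: fingerprints per bucket
	maxKicks   = 500 // give up (filter "full") after this many evictions
)

// insert sketches cuckoo insertion with eviction.
func insert(buckets [][bucketSize]uint32, i1, i2 uint, fp uint32,
	altIndex func(i uint, fp uint32) uint) bool {
	// Try an empty slot in either candidate bucket first.
	for _, i := range []uint{i1, i2} {
		for j := range buckets[i] {
			if buckets[i][j] == 0 {
				buckets[i][j] = fp
				return true
			}
		}
	}
	// Both candidates full: evict a random resident fingerprint and
	// move it to its own alternate bucket, repeating up to maxKicks.
	i := i1
	for k := 0; k < maxKicks; k++ {
		j := rand.Intn(bucketSize)
		buckets[i][j], fp = fp, buckets[i][j]
		i = altIndex(i, fp)
		for j := range buckets[i] {
			if buckets[i][j] == 0 {
				buckets[i][j] = fp
				return true
			}
		}
	}
	return false // filter considered full
}

func main() {
	const numBuckets = 8 // power of two, so XOR stays in range
	buckets := make([][bucketSize]uint32, numBuckets)
	alt := func(i uint, fp uint32) uint { return (i ^ uint(fp)) % numBuckets }
	fmt.Println(insert(buckets, 1, alt(1, 42), 42, alt)) // prints true
}
```

As the text says, the longer this eviction chain gets on a nearly full table, the slower insertion becomes, which is exactly the cuckoo filter's weak spot.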

Each bucket stores multiple fingerprints, as many as the bucket size allows, and fingerprints of different items may map to the same bucket. The larger the bucket, the higher the space utilization, but also the more fingerprints each query has to scan within a bucket, so the probability of a false positive rises. The fingerprint size then has to grow to keep the conflict rate down and maintain the target false positive rate.

The paper mentions several parameters required to implement a cuckoo filter, mainly:

  • Number of hash functions (k): the number of hashes; 2 is enough
  • Bucket size (b): how many fingerprints each bucket stores
  • Fingerprint size (f): how many bits of each item's hash the fingerprint stores

Chapter 5 of the paper uses experimental data to show how to choose the most appropriate construction parameters, and we can draw the following conclusions:

  • The filter cannot be filled to 100%; there is a maximum load factor α, so the storage allocated to each item is f/α bits.
  • With the total filter size held constant, larger buckets give a higher load factor, i.e. higher space utilization, but more fingerprints per bucket mean a higher collision probability on queries; to keep the same false positive rate, larger buckets require larger fingerprints.

On this theoretical basis, the relevant experimental findings are:

  • With k=2 hash functions, the load factor α is 50% when the bucket size b=1 (i.e. a plain direct-mapped hash table), but rises to 84%, 95% and 98% for bucket sizes b=2, 4 and 8 respectively.
  • To guarantee a false positive rate r, we need $2b/2^f \leq r$, so the fingerprint size is approximately $f \geq log_2(2b/r) = log_2(1/r) + log_2(2b)$, and the amortized cost per item is $C \leq [log_2(1/r) + log_2(2b)]/α$.
  • The experimental data show that when r > 0.002, two entries per bucket yield slightly better results than four entries per bucket; when r falls to 0.00001, four entries per bucket minimize space.
  • Using semi-sorted buckets saves 1 bit of storage per item, but it only works for filters with b=4.
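These formulas are easy to evaluate in code. The sketch below (the function names are mine, not the library's) computes the fingerprint size and the amortized bits per item from b, r and α:

```go
package main

import (
	"fmt"
	"math"
)

// fingerprintBits returns the smallest f satisfying 2b/2^f <= r,
// i.e. f = ceil(log2(2b/r)), per the paper's false-positive bound.
func fingerprintBits(b int, r float64) int {
	return int(math.Ceil(math.Log2(2 * float64(b) / r)))
}

// bitsPerItem is the amortized cost f/α for load factor alpha.
func bitsPerItem(f int, alpha float64) float64 {
	return float64(f) / alpha
}

func main() {
	// Example: b=4 (α≈0.95) at a 1% target false positive rate.
	f := fingerprintBits(4, 0.01)
	fmt.Printf("f=%d bits, cost=%.2f bits/item\n", f, bitsPerItem(f, 0.95))
	// prints: f=10 bits, cost=10.53 bits/item
}
```

For comparison, a Bloom filter at r=0.01 needs about 1.44·log2(1/0.01) ≈ 9.6 bits per item, which is why semi-sorting (saving 1 bit per item) matters for the cuckoo filter's competitiveness at moderate r.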

In this way we can decide how to choose the parameters when constructing our cuckoo filter:

First, we use two hash functions; that is enough to achieve sufficient space utilization. Based on the false positive rate we need, we choose the bucket size; of course, the choice of b is not absolute: even with r > 0.002 you can use b=4 in order to enable semi-sorted buckets. We can then compute from b the fingerprint size f needed to reach the target false positive rate, and all filter parameters are determined.

Comparing the cost above with the Bloom filter's $1.44log_2(1/r)$ bits per item, we find that with semi-sorting enabled the cuckoo filter is smaller when r < 0.03; without semi-sorting, this degrades to about r < 0.003.

Some advanced explanations

Optimization of hash algorithm

Although we specified that two hash algorithms are needed, in the actual implementation one hash algorithm is enough, because the paper describes an alternative way to compute the candidate bucket: the second bucket index can be computed by XORing the first index with a hash of the fingerprint stored there. And if you worry that we would still have to hash the fingerprint and the location separately, we can simply use one algorithm that produces a 64-bit hash, using the high 32 bits for the bucket index and the low 32 bits for the fingerprint.
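A sketch of this trick follows (illustrative names, FNV chosen arbitrarily). The key property is that `altIndex` is an involution: applying it twice returns the original index, so either bucket can recover the other without storing anything extra. The bucket count must be a power of two for the masked XOR to stay in range:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numBuckets = 1 << 16 // must be a power of two for the XOR trick

// indexAndFingerprint derives both values from one 64-bit hash:
// high 32 bits pick the bucket, low 32 bits form the fingerprint.
func indexAndFingerprint(data []byte, fpBits uint) (uint, uint32) {
	h := fnv.New64a()
	h.Write(data)
	sum := h.Sum64()
	i := uint(sum>>32) & (numBuckets - 1)
	fp := uint32(sum) & ((1 << fpBits) - 1)
	if fp == 0 {
		fp = 1 // reserve 0 for "empty slot"
	}
	return i, fp
}

// altIndex computes the other candidate bucket. Hashing the fingerprint
// before the XOR spreads alternates across the whole table.
func altIndex(i uint, fp uint32) uint {
	h := fnv.New32a()
	h.Write([]byte{byte(fp), byte(fp >> 8), byte(fp >> 16), byte(fp >> 24)})
	return (i ^ uint(h.Sum32())) & (numBuckets - 1)
}

func main() {
	i1, fp := indexAndFingerprint([]byte("example"), 12)
	i2 := altIndex(i1, fp)
	fmt.Println(altIndex(i2, fp) == i1) // prints true: the mapping is an involution
}
```

Because only the fingerprint (not the original item) is needed to compute the alternate bucket, evicted fingerprints can always be relocated, which is what makes partial-key cuckoo hashing work.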

Why can semi-sorted buckets only be used when b=4?

The essence of semi-sorting is to take four bits of each fingerprint, which can be written as one hexadecimal digit, so the four-bit parts of a bucket's b fingerprints form b hex digits. If we enumerate all possible sorted arrangements of those digits in order, each bucket's contents can be stored as an index into that enumeration, and the actual stored values are recovered by looking the index up.

We can count all the possible sorted arrangements with the following functions:

func getNum(base, k, b, f int, cnt *int) {
    // Enumerate non-decreasing sequences of b fingerprint parts,
    // each part being an f-bit value.
    for i := base; i < 1<<f; i++ {
        if k+1 < b {
            getNum(i, k+1, b, f, cnt)
        } else {
            *cnt++
        }
    }
}

func getNextPow2(n uint64) uint {
    n--
    n |= n >> 1
    n |= n >> 2
    n |= n >> 4
    n |= n >> 8
    n |= n >> 16
    n |= n >> 32
    n++
    return uint(n)
}

func getNumOfKindAndBit(b, f int) {
    cnt := 0
    getNum(0, 0, b, f, &cnt)
    fmt.Printf("Num of kinds: %v, Num of needed bits: %v\n",
        cnt, math.Log2(float64(getNextPow2(uint64(cnt)))))
}

When b=4 there are 3876 arrangements in total, which is less than 4096, meaning 12 bits suffice to index every arrangement. Storing the four 4-bit parts directly would need 4×4=16 bits, so 4 bits are saved per bucket, i.e. one bit per fingerprint.

It turns out that when b=2, the number of stored bits is the same whether or not semi-sorting is enabled, so it is pointless; and if b is larger, the index table to be stored expands rapidly, which costs a great deal of query performance. Therefore, b=4 is the most cost-effective choice.

In addition, four-bit fingerprint parts are chosen for the encoding because they map directly to hexadecimal digits, which is convenient for storage.

Parameter selection when using semi-sorting

When using semi-sorting, you should ensure that $ceil(b(f-1)/8) < ceil(bf/8)$; otherwise, the space occupied is the same whether or not semi-sorting is used.
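The point of the ceiling condition is that buckets are stored in whole bytes, so saving 1 bit per fingerprint only helps when it removes a whole byte from the bucket. A quick check (illustrative helper, not from the library):

```go
package main

import "fmt"

// semiSortHelps reports whether the 1-bit-per-fingerprint saving of
// semi-sorting actually shrinks the bucket's byte footprint, i.e.
// whether ceil(b(f-1)/8) < ceil(bf/8).
func semiSortHelps(b, f int) bool {
	ceilDiv := func(x, y int) int { return (x + y - 1) / y }
	return ceilDiv(b*(f-1), 8) < ceilDiv(b*f, 8)
}

func main() {
	fmt.Println(semiSortHelps(4, 8)) // prints false: 28 bits→4 bytes vs 32 bits→4 bytes
	fmt.Println(semiSortHelps(4, 9)) // prints true: 32 bits→4 bytes vs 36 bits→5 bytes
}
```

(With a bit-packed table rather than byte-aligned buckets, the saving applies regardless; this check matters for byte-aligned storage.)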

Filter size selection

The total number of buckets in the filter must be a power of two, so when setting the filter size, try to satisfy $size/α ~=(<) 2^n$, where size is the number of items you want the filter to store. If necessary, choose a smaller filter and use multiple filters to achieve the target capacity.
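A small sketch of that sizing rule (names are illustrative; the library's constructor may compute this differently):

```go
package main

import (
	"fmt"
	"math"
)

// nextPow2 rounds n up to the next power of two.
func nextPow2(n uint64) uint64 {
	n--
	n |= n >> 1
	n |= n >> 2
	n |= n >> 4
	n |= n >> 8
	n |= n >> 16
	n |= n >> 32
	n++
	return n
}

// numBuckets picks a power-of-two bucket count able to hold maxItems
// items with bucket size b at load factor alpha.
func numBuckets(maxItems, b uint64, alpha float64) uint64 {
	need := uint64(math.Ceil(float64(maxItems) / alpha / float64(b)))
	return nextPow2(need)
}

func main() {
	// 1M items, b=4, α=0.955 → 262144 buckets = 1,048,576 slots.
	fmt.Println(numBuckets(1000000, 4, 0.955)) // prints 262144
}
```

If the rounded-up capacity overshoots badly (e.g. you need just over a power of two), splitting the load across several smaller filters wastes less space, as the text suggests.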

Golang implementation

This part is mainly related to the Golang library

After looking through the Golang implementations of cuckoo filters on GitHub, I found that the existing ones have some shortcomings:

  • Most libraries fix b and f, which fixes the false positive rate as well; adaptability is poor.
  • All libraries measure f in bytes, so it can only be adjusted in multiples of 8, which makes tuning the false positive rate inconvenient.
  • No library implements semi-sorted buckets, which greatly reduces the advantage over Bloom filters.

Because my own scenario needs better space efficiency and a custom false positive rate, I ported the paper's reference implementation and made some optimizations, mainly:

  • Support for adjustable parameters

  • Support for semi-sorted buckets

  • Space compressed into a compact bit array, storing fingerprints bit by bit

  • Support for binary serialization


Statement:
This article is reproduced from learnku.com. If there is any infringement, please contact admin@php.cn to have it deleted.