How to use Redis's Bloomfilter to remove duplicates during the crawler process
This article describes how to use Redis's Bloomfilter to remove duplicates. The approach uses both Bloomfilter's capacity for massive-scale deduplication and Redis's persistence. Hopefully it will be helpful to readers who need it.
"Removal" is a skill that is often used in daily work. It is even more commonly used in the crawler field and is of average scale. All are relatively large. Two points need to be considered for deduplication: the amount of data to be deduplicated and the speed of deduplication. In order to maintain a fast deduplication speed, deduplication is generally performed in memory.
When the amount of data is small, it can be deduplicated directly in memory; for example, in Python you can use set().
When the deduplication data needs to be persisted, Redis's set data structure can be used.
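A minimal sketch of both approaches, assuming a local Redis at the default port (the key name seen_urls is made up for illustration):

```python
import redis

# In-memory dedup with a plain Python set
seen = set()
url = 'http://www.baidu.com'
is_new = url not in seen
seen.add(url)

# Persisted dedup with a Redis set; sadd returns 1 if the member
# was newly added, 0 if it was already present
r = redis.Redis(host='localhost', port=6379, db=0)
is_new_redis = r.sadd('seen_urls', url) == 1
```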
When the amount of data is larger, a hash algorithm can compress a long string down to 16/32/40 characters, after which the two methods above can be used for deduplication.
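For example, hashlib can compress an arbitrarily long string to a fixed-length digest (Python 3 syntax; the sample URL is made up):

```python
from hashlib import md5, sha1

long_url = 'http://www.baidu.com/s?wd=' + 'x' * 2000
print(md5(long_url.encode('utf-8')).hexdigest())   # 32 hex characters
print(sha1(long_url.encode('utf-8')).hexdigest())  # 40 hex characters
```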
When the amount of data reaches hundreds of millions (or even billions or tens of billions) of records, memory is limited and "bits" must be used for deduplication to meet the demand. Bloomfilter maps each deduplication object to a few memory bits and uses the 0/1 values of those bits to determine whether the object already exists.
However, a Bloomfilter that runs in one machine's memory is inconvenient to persist (everything is lost if the machine goes down) and inconvenient for unified deduplication across distributed crawlers. If the Bloomfilter's memory can instead be allocated on Redis, both problems are solved.
```python
# encoding=utf-8
import redis
from hashlib import md5


class SimpleHash(object):
    def __init__(self, cap, seed):
        self.cap = cap
        self.seed = seed

    def hash(self, value):
        ret = 0
        for i in range(len(value)):
            ret += self.seed * ret + ord(value[i])
        return (self.cap - 1) & ret


class BloomFilter(object):
    def __init__(self, host='localhost', port=6379, db=0, blockNum=1, key='bloomfilter'):
        """
        :param host: the host of Redis
        :param port: the port of Redis
        :param db: which db in Redis
        :param blockNum: one blockNum for about 90,000,000; if you have more strings for filtering, increase it.
        :param key: the key's name in Redis
        """
        self.server = redis.Redis(host=host, port=port, db=db)
        self.bit_size = 1 << 31  # a Redis String holds at most 512MB; 256MB is used here
        self.seeds = [5, 7, 11, 13, 31, 37, 61]
        self.key = key
        self.blockNum = blockNum
        self.hashfunc = []
        for seed in self.seeds:
            self.hashfunc.append(SimpleHash(self.bit_size, seed))

    def isContains(self, str_input):
        if not str_input:
            return False
        m5 = md5()
        m5.update(str_input.encode('utf-8'))
        str_input = m5.hexdigest()
        ret = True
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            ret = ret & self.server.getbit(name, loc)
        return ret

    def insert(self, str_input):
        m5 = md5()
        m5.update(str_input.encode('utf-8'))
        str_input = m5.hexdigest()
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            self.server.setbit(name, loc, 1)


if __name__ == '__main__':
    # The first run prints "not exists!"; subsequent runs print "exists!".
    bf = BloomFilter()
    if bf.isContains('http://www.baidu.com'):  # check whether the string already exists
        print('exists!')
    else:
        print('not exists!')
    bf.insert('http://www.baidu.com')
```
How does the Bloomfilter algorithm work? There are many explanations of bit-based deduplication online. Simply put: take several seeds and request a block of memory. Each seed, hashed together with a string, maps to one bit of that memory; if all of the mapped bits are 1, the string is judged to already exist. Insertion works the same way: all of the mapped bits are set to 1.
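A minimal in-memory sketch of this idea (illustrative only; the class name TinyBloom and its parameters are made up here, and the real Redis-backed version appears above):

```python
# A minimal, self-contained Bloom filter: k seeds, one bit array.
class TinyBloom(object):
    def __init__(self, size_bits=1 << 20, seeds=(5, 7, 11, 13, 31, 37, 61)):
        self.size = size_bits
        self.seeds = seeds
        self.bits = bytearray(size_bits // 8)

    def _positions(self, s):
        # one bit position per seed, via a simple multiplicative hash
        for seed in self.seeds:
            h = 0
            for ch in s:
                h = seed * h + ord(ch)
            yield h % self.size

    def add(self, s):
        for pos in self._positions(s):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, s):
        # True only if every mapped bit is 1 (may be a false positive)
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(s))


bf = TinyBloom()
bf.add('http://www.baidu.com')
print('http://www.baidu.com' in bf)  # True
print('http://www.php.cn' in bf)     # False (with high probability)
```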
It should be noted that the Bloomfilter algorithm has a false-positive probability: there is some chance that a string that does not exist will be misjudged as already existing. This probability depends on the number of seeds, the amount of memory requested, and the number of deduplication objects: let m be the memory size in bits, n the number of deduplication objects, and k the number of seeds. For example, the code above requests 256MB, which is 2^31 bits (1 << 31).
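For reference, the standard estimate of a Bloom filter's false-positive rate is p ≈ (1 − e^(−kn/m))^k, minimized when k ≈ (m/n)·ln 2; plugging in the code's values (m = 2^31 bits, k = 7 seeds, n = 90,000,000 objects per block) gives p of roughly 7 × 10^-5.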
Bloomfilter deduplication based on Redis actually uses Redis's String data structure, but a single Redis String can be at most 512MB, so if the amount of deduplication data is large, multiple deduplication blocks need to be requested (blockNum in the code is the number of blocks).
The code uses MD5 to compress each string to 32 hex characters (hashlib.sha1() could also be used, giving 40 characters). This serves two purposes. First, Bloomfilter tends to make mistakes when hashing very long strings, often misjudging them as already existing; after compression this problem goes away. Second, the compressed characters range over 0~f, 16 possibilities in total; the first two characters are taken and used to assign the string to a deduplication block according to blockNum.
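A small sketch of that routing step, mirroring the code above (blockNum=2 is an arbitrary example value):

```python
from hashlib import md5

blockNum = 2
digest = md5('http://www.baidu.com'.encode('utf-8')).hexdigest()
# first two hex characters give a number in 0-255; modulo blockNum
# picks which Redis key (deduplication block) the string belongs to
name = 'bloomfilter' + str(int(digest[0:2], 16) % blockNum)
print(name)  # e.g. 'bloomfilter0' or 'bloomfilter1'
```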
Bloomfilter deduplication based on Redis uses both Bloomfilter's capacity for massive-scale deduplication and Redis's persistence, and being based on Redis also makes unified deduplication across distributed machines convenient. In use, estimate the amount of data to be deduplicated in advance, and adjust the number of seeds and blockNum accordingly (fewer seeds make deduplication faster but raise the false-positive rate).
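A hedged sketch of how a crawler loop might use the class above; fetch() is a hypothetical stand-in for the real download logic:

```python
bf = BloomFilter(blockNum=1, key='bloomfilter')

for url in ['http://www.baidu.com', 'http://www.php.cn']:
    if bf.isContains(url):
        continue          # already crawled, skip it
    # page = fetch(url)   # hypothetical download step
    bf.insert(url)        # mark as seen only after a successful fetch
```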