Implementing a proxy IP pool for Python crawlers

高洛峰

Feb 11, 2017 pm 01:09 PM


At work I build distributed deep-web crawlers, and we run a stable proxy pool service that supplies working proxies to thousands of crawlers. It makes sure each crawler gets a valid proxy IP for the site it targets, so the crawlers run quickly and reliably. What we build at work cannot be open-sourced, of course. But I had an itch in my spare time, so I decided to build a simple proxy pool service from free resources.

1. Questions

Where do the proxy IPs come from?
When I first taught myself crawling I had no proxy IPs, so I scraped them from free proxy sites such as Xici Proxy (西刺代理) and Kuaidaili (快代理). Some of those proxies did work. If you have a better proxy source, you can plug that in instead. Collecting free proxies is simple: fetch the page -> extract with regex/XPath -> save.
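The fetch, extract, save pipeline can be sketched roughly as follows. This is a minimal illustration of the extraction step only, assuming the proxy site lists the IP and port in adjacent table cells; the sample HTML, the regex, and the function name `extract_proxies` are made up for this example and do not come from the project.

```python
import re

# Matches an IPv4 address in one table cell followed by a port in the next cell.
# The cell layout is an assumption; real proxy sites differ and change often.
PROXY_CELL_RE = re.compile(
    r"<td>\s*(\d{1,3}(?:\.\d{1,3}){3})\s*</td>\s*<td>\s*(\d{1,5})\s*</td>"
)


def extract_proxies(html):
    """Return unique 'ip:port' strings found in the HTML, in page order."""
    seen = set()
    proxies = []
    for ip, port in PROXY_CELL_RE.findall(html):
        # Skip anything that cannot be a real address or port.
        if any(int(part) > 255 for part in ip.split(".")):
            continue
        if not 0 < int(port) < 65536:
            continue
        proxy = "{}:{}".format(ip, port)
        if proxy not in seen:
            seen.add(proxy)
            proxies.append(proxy)
    return proxies


# Hypothetical page fragment standing in for a downloaded proxy list.
SAMPLE_HTML = """
<table>
  <tr><td>10.0.0.1</td><td>8080</td><td>HTTP</td></tr>
  <tr><td>10.0.0.2</td><td>3128</td><td>HTTP</td></tr>
  <tr><td>10.0.0.1</td><td>8080</td><td>HTTP</td></tr>
  <tr><td>999.0.0.1</td><td>80</td><td>HTTP</td></tr>
</table>
"""

print(extract_proxies(SAMPLE_HTML))  # ['10.0.0.1:8080', '10.0.0.2:3128']
```

The duplicate row and the impossible address are dropped before saving, which keeps obviously bad entries out of the store.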

How to ensure proxy quality?
Most free proxy IPs do not work; otherwise nobody would sell paid ones (although in practice many paid proxies are unstable or dead too). So collected proxies cannot be used as-is. Write a checker that keeps requesting a stable website through each proxy to see whether it works. Because checking is slow, run it with multiple threads or asynchronously.
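A multi-threaded checker along these lines might look like the sketch below. It uses only the standard library (`urllib.request` and `concurrent.futures`) rather than whatever the project itself uses. The test URL, the timeout, and the names `check_proxy` and `filter_working` are assumptions for illustration.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed test target; any stable site you are allowed to hit would do.
TEST_URL = "http://httpbin.org/ip"


def check_proxy(proxy, timeout=5):
    """Return True if TEST_URL can be fetched through the 'ip:port' proxy."""
    handler = urllib.request.ProxyHandler({"http": "http://" + proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(TEST_URL, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        # Timeouts, refused connections and bad responses all mean "unusable".
        return False


def filter_working(proxies, checker=check_proxy, max_workers=20):
    """Check proxies concurrently and return the ones that passed, in order."""
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(checker, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

Passing the checker in as a parameter keeps the concurrency logic testable without network access, for example `filter_working(candidates, checker=my_fake_check)`.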

How to store the collected proxies?
Here I have to recommend SSDB, a high-performance NoSQL database that supports multiple data structures and can stand in for Redis. It supports queues, hashes, sets, and key-value pairs, and handles terabyte-scale data. It is a very good intermediate store for distributed crawlers.

How to make the proxies easy for crawlers to use?
Make it a service. Python has plenty of web frameworks; pick one and write an API for the crawlers to call. This has several benefits: a crawler that finds a proxy unusable can delete it through the API, and a crawler that finds the pool running low can trigger a refresh. Feedback from the crawlers themselves is more reliable than the checker alone.
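To make the idea of a pool-as-a-service concrete, here is a small sketch built on the standard library's WSGI interface instead of any particular web framework. The in-memory `POOL`, the sample addresses, and the `/get_all/` path are assumptions for illustration; `/get/` and `/delete/?proxy=...` mirror the URLs used in the client example later in this article.

```python
import json
import random
from urllib.parse import parse_qs

# In-memory stand-in for the proxy store, with made-up sample addresses.
POOL = {"10.0.0.1:8080", "10.0.0.2:3128"}


def app(environ, start_response):
    """Tiny WSGI app exposing /get/, /get_all/ and /delete/?proxy=ip:port."""
    path = environ.get("PATH_INFO", "/")
    query = parse_qs(environ.get("QUERY_STRING", ""))
    status = "200 OK"
    if path == "/get/":
        # Hand out one random proxy, or an empty body if the pool is empty.
        body = random.choice(sorted(POOL)) if POOL else ""
    elif path == "/get_all/":
        body = json.dumps(sorted(POOL))
    elif path == "/delete/":
        # A crawler reports a dead proxy; removing an unknown one is a no-op.
        POOL.discard(query.get("proxy", [""])[0])
        body = "ok"
    else:
        status, body = "404 Not Found", "unknown endpoint"
    start_response(status, [("Content-Type", "text/plain; charset=utf-8")])
    return [body.encode("utf-8")]
```

To try it locally, the app can be served with `wsgiref.simple_server.make_server("127.0.0.1", 5000, app).serve_forever()`. A real deployment would read from the shared database rather than a module-level set.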

2. Proxy pool design

The proxy pool consists of four parts:

ProxyGetter:
Proxy acquisition interface. There are currently 5 free proxy sources; each call fetches the latest proxies from these 5 sites and puts them into the DB. You can add more acquisition interfaces yourself;

DB:
Stores the proxy IPs. Currently only SSDB is supported. I personally think SSDB is a good Redis alternative, and if you have not used it before, it is very simple to install;

Schedule:
Scheduled tasks that periodically check the proxies in the DB and delete the unavailable ones. They also fetch the latest proxies through ProxyGetter and put them into the DB;

ProxyApi:
The external interface of the proxy pool. Since the pool's functionality is still fairly simple, I spent two hours reading up on Flask and happily settled on it. It provides get/delete/refresh and other endpoints that crawlers can call directly.
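The way these four parts fit together can be sketched as below. This is not the project's code: `MemoryStore` is an in-memory stand-in for the SSDB-backed DB part, the getters and the validator are passed in as plain callables, and all class and method names other than get/delete/refresh/get_all are assumptions.

```python
import random


class MemoryStore:
    """In-memory stand-in for the DB part (the article uses SSDB)."""

    def __init__(self):
        self._proxies = set()

    def put(self, proxy):
        self._proxies.add(proxy)

    def delete(self, proxy):
        self._proxies.discard(proxy)

    def get_all(self):
        return sorted(self._proxies)


class ProxyManager:
    """Glue between the getter functions, the validator and the store."""

    def __init__(self, store, getters, validator):
        self.store = store
        self.getters = getters      # callables, each returning "ip:port" strings
        self.validator = validator  # callable: proxy -> bool

    def refresh(self):
        """ProxyGetter step: pull from every source, keep only working proxies."""
        for getter in self.getters:
            for proxy in getter():
                if self.validator(proxy):
                    self.store.put(proxy)

    def validate_all(self):
        """Schedule step: re-check stored proxies and drop the dead ones."""
        for proxy in self.store.get_all():
            if not self.validator(proxy):
                self.store.delete(proxy)

    def get(self):
        """Return one random stored proxy, or None if the pool is empty."""
        proxies = self.store.get_all()
        return random.choice(proxies) if proxies else None

    def delete(self, proxy):
        self.store.delete(proxy)

    def get_all(self):
        return self.store.get_all()
```

In this layout the API layer only needs to call `get`, `delete`, `refresh`, and `get_all`, while the scheduled task calls `refresh` and `validate_all` on a timer.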

[Figure: proxy pool design]

3. Code module

Python's high-level data structures, dynamic typing, and dynamic binding make it well suited to rapid application development and to use as a glue language connecting existing software components. Building this proxy IP pool in Python is also very simple. The code is divided into 6 modules:

Api: API-related code. The API is currently implemented with Flask, and the code is very simple. Client requests go to Flask, and Flask calls the implementation in ProxyManager, including get/delete/refresh/get_all;

DB: database-related code. The database is currently SSDB. The code uses the factory pattern so that other database types can be added later;

Manager: the class that implements get/delete/refresh/get_all and the other interfaces. For now the proxy pool only manages proxies. More features may come later, such as binding proxies to crawlers or binding proxies to accounts;

ProxyGetter: code for fetching proxies. It currently scrapes free proxies from five sites: Kuaidaili (快代理), Daili66 (代理66), Youdaili (有代理), Xici Proxy (西刺代理), and guobanjia. In testing, these five sites together yield only sixty or seventy usable proxies from each day's updates. You can also extend it with your own proxy sources;

Schedule: scheduled-task code. For now it only refreshes the pool periodically and verifies the available proxies, using multiple processes;

Util: shared utility modules and functions, including GetConfig (a class that reads the configuration file Config.ini), ConfigParse (a subclass that overrides ConfigParser to make it case-sensitive), Singleton (a singleton implementation), LazyProperty (lazy evaluation of class properties), and so on;

Other files: the configuration file Config.ini holds the database settings and the proxy acquisition interface settings. You can add a new proxy acquisition method in GetFreeProxy and register it in Config.ini to enable it.
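The helpers listed under Util are common Python patterns, and a rough sketch of each follows. These are generic versions written for illustration and are not copied from the project: the class name `CaseSensitiveConfigParser`, the `Settings` example, and the sample section and option names are all assumptions.

```python
from configparser import ConfigParser


class LazyProperty:
    """Non-data descriptor: compute the value once, then cache it on the instance."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is None:
            return self
        value = self.func(instance)
        # Shadow the descriptor so later lookups hit the instance attribute.
        setattr(instance, self.func.__name__, value)
        return value


class Singleton(type):
    """Metaclass that returns the same instance for every call of a class."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class CaseSensitiveConfigParser(ConfigParser):
    """ConfigParser lower-cases option names by default; keep them as written."""

    def optionxform(self, optionstr):
        return optionstr


class Settings(metaclass=Singleton):
    """Hypothetical config holder combining the three helpers."""

    loads = 0  # counts how often the config text is actually parsed

    @LazyProperty
    def parser(self):
        Settings.loads += 1
        parser = CaseSensitiveConfigParser()
        # Made-up config content; a real setup would call parser.read("Config.ini").
        parser.read_string("[ProxyGetter]\nfreeProxyFirst = 1\n")
        return parser
```

With this arrangement, `Settings()` always returns the same object, and the configuration is parsed only on the first access to `Settings().parser`.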

4. Installation

Download the code:

git clone git@github.com:jhao104/proxy_pool.git

Or download the zip file directly from https://github.com/jhao104/proxy_pool

Install dependencies:

pip install -r requirements.txt

Start:

The scheduled task and the API need to be started separately.
First configure your SSDB in Config.ini.

In the Schedule directory:
python ProxyRefreshSchedule.py

In the Api directory:
python ProxyApi.py

5. Usage

After the scheduled task starts, it fetches proxies through every acquisition method, verifies them, and stores them in the database. By default this repeats every 20 minutes. A minute or two after the task starts, you can see the refreshed, usable proxies in SSDB.
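The repeat-every-20-minutes behaviour can be illustrated with a simple loop. This is a single-process sketch under assumed names (`run_periodically`, `max_runs`, `sleep`), whereas the project's own scheduler is described as multi-process, so treat it only as an outline of the timing logic.

```python
import time


def run_periodically(task, interval_seconds=20 * 60, max_runs=None, sleep=time.sleep):
    """Call task(), wait interval_seconds, and repeat.

    max_runs=None loops forever; a number stops after that many runs.
    Returns the number of completed runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            task()
        except Exception as exc:
            # One failed cycle should not kill the scheduler.
            print("refresh cycle failed: {}".format(exc))
        runs += 1
        # Only wait if another run is coming.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

The `sleep` parameter exists so the loop can be exercised without actually waiting, for example by passing `sleep=waits.append` and inspecting the recorded intervals.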

After starting ProxyApi.py, you can fetch proxies through the API from a browser. The original post showed browser screenshots of three pages here: the index page, the get page, and the get_all page.

Using it in a crawler: wrap the API in functions and call them directly from your crawler code, for example:

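import requests

def get_proxy():
    # Ask the local proxy pool API for one proxy, returned as "ip:port" text
    return requests.get("http://127.0.0.1:5000/get/").text.strip()

def delete_proxy(proxy):
    # Tell the pool to drop a proxy that no longer works
    requests.get("http://127.0.0.1:5000/delete/?proxy={}".format(proxy))

# your spider code

def spider():
    # ....
    proxy = get_proxy()
    # Map both schemes, otherwise an https:// URL would bypass the proxy
    proxies = {"http": "http://{}".format(proxy), "https": "http://{}".format(proxy)}
    requests.get('https://www.example.com', proxies=proxies)
    # ....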
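Building on the two wrapper functions, a crawler can retry with a fresh proxy and report dead ones back to the pool. The sketch below shows one way to do that; `fetch_with_retry` and its parameters are hypothetical names, and the fetch, get, and delete steps are passed in as callables so the retry logic does not depend on any particular HTTP library.

```python
def fetch_with_retry(url, fetch, get_proxy, delete_proxy, max_tries=3):
    """Try url through up to max_tries proxies, discarding each proxy that fails.

    fetch(url, proxy) should return the response or raise on failure.
    """
    last_error = None
    for _ in range(max_tries):
        proxy = get_proxy()
        if not proxy:
            break  # the pool is empty, nothing left to try
        try:
            return fetch(url, proxy)
        except Exception as exc:
            last_error = exc
            delete_proxy(proxy)  # report the bad proxy back to the pool
    raise RuntimeError("no working proxy for {}".format(url)) from last_error
```

With the functions above, a call could look like `fetch_with_retry(url, lambda u, p: requests.get(u, proxies={"http": "http://" + p}, timeout=10), get_proxy, delete_proxy)`.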

6. Finally

I wrote this in a hurry, so the features and code are fairly simple; I will improve them when I have time. If you like it, give it a star on GitHub. Thanks!
