
python - How to build a proxy pool for a web crawler

To keep my crawler from getting its IP banned, the tutorials I found online say I should build a proxy pool. Paid proxies are expensive, but fortunately there are quite a few websites that offer free proxies, so I plan to write a crawler to collect these free IPs.

Strategy steps

  1. Search various search engines with seed keywords such as "proxy IP" to collect candidate URLs

  2. Crawl the candidate URLs and store the proxy addresses they list

  3. Verify the proxy addresses and put the working ones into the proxy pool (a minimal check is sketched right after this list)
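
For step 3, one minimal way to check a proxy is to route a test request through it. This is only a sketch, assuming the requests library and an echo endpoint such as httpbin.org/ip; the function name, timeout, and placeholder addresses are illustrative.

    # Minimal sketch of step 3: keep a proxy only if a simple GET through it succeeds.
    import requests

    TEST_URL = "http://httpbin.org/ip"     # any stable page that echoes the request works

    def is_proxy_alive(proxy, timeout=5.0):
        """Return True if the proxy completes a simple GET within the timeout."""
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            return requests.get(TEST_URL, proxies=proxies, timeout=timeout).status_code == 200
        except requests.RequestException:
            return False

    # Keep only the addresses scraped in step 2 that actually work.
    candidates = ["1.2.3.4:8080", "5.6.7.8:3128"]   # placeholder addresses
    pool = [p for p in candidates if is_proxy_alive(p)]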

Difficulties

  1. How to verify and maintain these proxy addresses

  2. How to know which proxies suit which websites (availability, response time)

  3. Efficiency (I wrote a simple verification script before, but it was extremely slow)

Does anyone have good ideas for solving these problems?

巴扎黑 · 2761 days ago

3 replies

  • 怪我咯 · 2017-04-18 09:08:55

    Let me roughly describe what I did. I happened to do the same thing before; I also needed proxies back then, so I wrote my own crawler to fetch and update them automatically.

    As for the proxy sources, I didn't let the crawler pick websites on its own. Instead, I manually screened several sites that provide free proxies and then wrote a crawler for each of them.

    Regarding the difficulties you mentioned:

    1. For verification: addresses crawled for the first time are checked immediately to see whether they work, and usable ones are stored in a database or otherwise persisted. Because free proxies are unreliable, you also have to re-check the stored proxies regularly. I set up a scheduled task directly on the uWSGI server that re-checks every half hour and fetches new proxies every hour; crontab or another scheduler works just as well (a sketch of points 1 and 2 follows after this list);

    2. Simply use the captured proxies to access the websites you actually need to crawl. If you need different proxies for different sites, verify against each target site and store the relevant verification information (availability, response time) together with the proxy;

    3. Efficiency is easy to deal with. Network verification is an I/O-bound task, so coroutines, multithreading, or multiprocessing all work; Python's GIL does not stop multithreading from improving the efficiency of I/O-bound tasks.
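
    A rough sketch of points 1 and 2, assuming the requests library: each proxy is validated against the site you actually want to crawl, the result (availability and response time) is recorded, and the check is repeated on a fixed interval. The in-memory dict, the target URL, and the 30-minute interval are stand-ins for whatever persistence layer and schedule you actually use.

    import time
    import requests

    CHECK_INTERVAL = 30 * 60               # seconds between re-validation rounds (assumption)
    TARGET_URL = "https://example.com/"    # replace with the site you actually crawl

    # proxy -> {"ok": bool, "latency": float, "checked_at": float}; stands in for a real database
    pool = {}

    def validate(proxy, timeout=5.0):
        """Check one proxy against the target site and record availability and latency."""
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        start = time.time()
        try:
            ok = requests.get(TARGET_URL, proxies=proxies, timeout=timeout).status_code == 200
        except requests.RequestException:
            ok = False
        pool[proxy] = {"ok": ok, "latency": time.time() - start, "checked_at": time.time()}

    def revalidation_loop(proxies_to_check):
        """Re-check every known proxy on a schedule (crontab or a uWSGI timer also works)."""
        while True:
            for proxy in proxies_to_check:
                validate(proxy)
            time.sleep(CHECK_INTERVAL)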

    multithreading-spider: I previously used multithreading plus a queue to build a simple proxy crawler; the demo under src is a concrete example. It follows a simple producer-consumer model: the crawler that scrapes proxy addresses acts as the producer, the crawler that verifies proxy availability acts as the consumer, and it can display the progress of individual tasks. A sketch of the same pattern follows below.
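
    This is not the actual multithreading-spider code, just a sketch of the same producer-consumer idea: one producer thread pushes scraped proxy addresses onto a queue, and several consumer threads pop them off and verify availability. The test URL, worker count, and placeholder addresses are assumptions.

    import queue
    import threading
    import requests

    task_q = queue.Queue()
    verified = []                 # proxies that passed the check
    lock = threading.Lock()

    def producer(candidates):
        """Stands in for the crawler that scrapes proxy addresses."""
        for proxy in candidates:
            task_q.put(proxy)

    def consumer():
        """Pop proxies off the queue and verify them until the queue drains."""
        while True:
            try:
                proxy = task_q.get(timeout=3)
            except queue.Empty:
                return
            try:
                resp = requests.get("http://httpbin.org/ip",
                                    proxies={"http": f"http://{proxy}",
                                             "https": f"http://{proxy}"},
                                    timeout=5)
                if resp.status_code == 200:
                    with lock:
                        verified.append(proxy)
            except requests.RequestException:
                pass
            finally:
                task_q.task_done()

    if __name__ == "__main__":
        candidates = ["1.2.3.4:8080", "5.6.7.8:3128"]   # placeholder addresses
        threading.Thread(target=producer, args=(candidates,)).start()
        workers = [threading.Thread(target=consumer) for _ in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print("usable proxies:", verified)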

  • PHP中文网 · 2017-04-18 09:08:55

    You can try this: a Python-based proxy pool that automatically captures proxy resources from the Internet and is easy to extend.
    https://github.com/WiseDoge/P...

  • ringa_lee · 2017-04-18 09:08:55

    You can take a look at this project: https://github.com/jhao104/pr...

    Open source proxy pool service
