
python crawler Scrapy uses proxy configuration

高洛峰 (Original)
2016-10-17 13:56:57

When crawling website content, the most common problem is that the site restricts or bans IP addresses as an anti-crawling measure. The best workaround is to rotate IPs while crawling, i.e. to route requests through proxies.

Let's look at how to configure a proxy in Scrapy and crawl through it.

1. Create a new file named "middlewares.py" in the Scrapy project directory:

# base64 is needed only if the proxy requires authentication
import base64

# Downloader middleware that routes requests through a proxy
class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines only if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up HTTP Basic authentication for the proxy.
        # Note: base64.encodestring is deprecated and appends a trailing
        # newline that corrupts the header; use b64encode on bytes instead.
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
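Since the introduction recommends rotating IPs, here is a minimal sketch of a middleware that picks a random proxy for each request. This goes beyond the original article: the proxy URLs in `PROXIES` and the `basic_auth_header` helper are illustrative assumptions, not values from the source.

```python
import base64
import random

# Hypothetical proxy pool -- replace with your own proxy URLs
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

class RandomProxyMiddleware(object):
    """Downloader middleware that assigns a random proxy to each request."""
    def process_request(self, request, spider):
        # Pick a different proxy per request so bans hit only one IP
        request.meta['proxy'] = random.choice(PROXIES)

def basic_auth_header(user, password):
    """Build a Proxy-Authorization header value for an authenticated proxy."""
    creds = ("%s:%s" % (user, password)).encode()
    return 'Basic ' + base64.b64encode(creds).decode()
```

To use it, register `RandomProxyMiddleware` in DOWNLOADER_MIDDLEWARES in place of `ProxyMiddleware` (step 2 below), keeping the same priority.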

2. Add the following to the project configuration file (./pythontab/settings.py):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'pythontab.middlewares.ProxyMiddleware': 100,
}

The lower priority number (100) makes ProxyMiddleware run first, so the proxy address and credentials are already set on the request before Scrapy's built-in HttpProxyMiddleware processes it.
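Note that the `scrapy.contrib` module path matches the Scrapy versions current when this article was written; in Scrapy 1.0 and later the built-in middlewares were moved out of `contrib`, so a newer project would use a settings fragment like this (the new module path is the only change):

```python
# settings.py for Scrapy >= 1.0, where the contrib package was renamed
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'pythontab.middlewares.ProxyMiddleware': 100,
}
```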

