
Detailed explanation of how Python crawlers use proxy to crawl web pages

高洛峰 (Original) · 2017-03-19 14:43:46 · 2010 views

Proxy types: transparent proxies, anonymous proxies, distorting proxies, and high-anonymity (elite) proxies. This article covers how Python crawlers use proxies, and includes a simple proxy pool class that makes it easier to handle all kinds of tricky crawling problems at work.
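As a starting point, here is a minimal sketch of the kind of proxy pool class mentioned above. The proxy addresses are placeholders, not real proxies, and the class only shows the core idea: pick a random live proxy, and drop any proxy that fails.

```python
import random

class ProxyPool:
    """Minimal proxy-pool sketch: rotates over a list of proxy URLs
    and drops ones reported as failed. Addresses are placeholders."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def get(self):
        # Return a random live proxy in the format urllib/requests expect.
        if not self.proxies:
            raise RuntimeError("proxy pool is empty")
        return random.choice(self.proxies)

    def remove(self, proxy):
        # Call this when a proxy times out or keeps returning errors.
        if proxy in self.proxies:
            self.proxies.remove(proxy)

pool = ProxyPool(["http://127.0.0.1:3128", "http://127.0.0.1:2080"])
p = pool.get()
pool.remove(p)  # pretend p failed and drop it from the pool
```

A real pool would also re-test dead proxies periodically and refill itself from a proxy source, but the rotate-and-drop loop above is the core of it.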

urllib module uses proxy

Using a proxy with urllib/urllib2 is somewhat involved: first build a ProxyHandler, use it to build an opener that opens web pages, and then install that opener so subsequent requests go through it.

The proxy format is "http://127.0.0.1:80"; if the proxy requires a username and password, it is "http://user:password@127.0.0.1:80".

import urllib.request

proxy = "http://127.0.0.1:80"
# Create a ProxyHandler object
proxy_support = urllib.request.ProxyHandler({'http': proxy})
# Create an opener object from the handler
opener = urllib.request.build_opener(proxy_support)
# Install the opener so urlopen uses the proxy
urllib.request.install_opener(opener)
# Open a URL (timeout is in seconds)
r = urllib.request.urlopen('http://youtube.com', timeout=500)
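Note that install_opener changes the proxy globally for the whole process. If you only want some requests to use the proxy, a common alternative is to call the opener directly instead of installing it; a sketch, with a placeholder proxy address:

```python
import urllib.request

# Placeholder proxy with basic-auth credentials; replace with your own.
proxy = "http://user:password@127.0.0.1:80"
proxy_support = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
opener = urllib.request.build_opener(proxy_support)

# Use the opener directly rather than install_opener(), so the rest of
# the process keeps the default (no-proxy) behaviour:
# r = opener.open("http://youtube.com", timeout=30)
```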

requests module uses proxy

Using a proxy with requests is much simpler than with urllib. The example below uses a single proxy; if you make many requests, you can use a Session to share the configuration.

To use a proxy for an individual request, pass the proxies argument to any request method:

import requests
proxies = {
  "http": "http://127.0.0.1:3128",
  "https": "http://127.0.0.1:2080",
}
r = requests.get("http://youtube.com", proxies=proxies)
print(r.text)
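For repeated requests, a Session avoids passing proxies to every call and also reuses TCP connections. A short sketch; the proxy addresses are the same placeholders as above:

```python
import requests

# Configure the proxies once on the session; every request made through
# it will use them (addresses are placeholders).
session = requests.Session()
session.proxies = {
    "http": "http://127.0.0.1:3128",
    "https": "http://127.0.0.1:2080",
}
# All session requests now go via the proxies:
# r = session.get("http://youtube.com")
```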

You can also configure the proxy through the environment variables HTTP_PROXY and HTTPS_PROXY.

export HTTP_PROXY="http://127.0.0.1:3128"
export HTTPS_PROXY="http://127.0.0.1:2080"
python
>>> import requests
>>> r=requests.get("http://youtube.com")
>>> print(r.text)

If your proxy requires HTTP Basic Auth, use the http://user:password@host/ syntax:

proxies = {
    "http": "http://user:pass@127.0.0.1:3128/",
}

Using a proxy in Python is very simple; the hard part is finding a proxy that is stable and reliable. If you have any questions, please leave a comment.
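Since proxy reliability is the main pain point, a small checker helps: try fetching a test URL through the proxy and treat any error or timeout as failure. This is a hedged helper, not part of the original article; the test URL and timeout are assumptions you can change.

```python
import requests

def check_proxy(proxy, test_url="http://httpbin.org/ip", timeout=5):
    """Return True if `proxy` can fetch test_url within `timeout` seconds."""
    try:
        r = requests.get(
            test_url,
            proxies={"http": proxy, "https": proxy},
            timeout=timeout,
        )
        return r.status_code == 200
    except requests.RequestException:
        # Connection errors, timeouts, bad proxies all land here.
        return False
```

Run this periodically over your proxy list and keep only the proxies that pass; combined with the pool class above, it gives you a self-cleaning proxy pool.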

