
10 Ways to Crawl Web Resources with Python 3

Y2J (original), 2017-05-11 11:04:14, 2964 views

Over the past two days I have been learning how to crawl web resources with Python 3 and came across quite a few approaches, so here are some notes.

1. The simplest way

import urllib.request

# urlopen() returns a response object; read() gives the body as bytes
response = urllib.request.urlopen('http://python.org/')
html = response.read()
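Since read() returns bytes, the body usually has to be decoded before it can be handled as text. The following is a minimal sketch, not part of the original notes: it assumes the server declares its charset in the Content-Type header and falls back to UTF-8 otherwise. A data: URL stands in for a real page so that the example runs without a network connection.

```python
import urllib.request

# A data: URL stands in for a real page so the sketch runs offline.
url = "data:text/html;charset=utf-8,%3Ch1%3Ecaf%C3%A9%3C%2Fh1%3E"

# The with-block closes the response when we are done with it.
with urllib.request.urlopen(url) as response:
    raw = response.read()  # bytes, not str
    # Use the charset announced in Content-Type, or assume UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"
    html = raw.decode(charset)

print(type(raw).__name__, html)
```

With a real site, only the url value would change; the decoding step stays the same.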

2. Using a Request object

import urllib.request

# Build the request first, then pass it to urlopen()
req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()

3. Sending data

#! /usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}

# In Python 3 the request body must be bytes, so encode the urlencoded string
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode("utf8"))
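To see what actually gets sent, it can help to inspect the encoded form data before making the request. This is a small offline sketch; the field values are hypothetical and no request is sent. It shows that urlencode() returns a str, that the body must be converted to bytes, and that attaching a body switches the request method to POST.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields, for illustration only.
values = {"act": "login", "login[email]": "user@example.com"}

query = urllib.parse.urlencode(values)  # str, with special characters escaped
data = query.encode("ascii")            # bytes, as Request needs for a body

# Building the Request does not contact the server.
req = urllib.request.Request("http://localhost/login.php", data=data)

print(query)
print(req.get_method())  # a request with a body defaults to POST
```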

4. Sending data and headers

#! /usr/bin/env python3

import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}
headers = {'User-Agent': user_agent}

# In Python 3 the request body must be bytes, so encode the urlencoded string
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()

print(the_page.decode("utf8"))
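Headers set on a Request can be checked before anything is sent. The sketch below is an illustration with a hypothetical User-Agent string; it relies on the fact that Request normalizes header names with str.capitalize(), so a header passed as 'User-Agent' is stored under 'User-agent'.

```python
import urllib.request

user_agent = "Mozilla/5.0 (compatible; example-crawler)"  # hypothetical value

# Headers can be passed at construction time or added afterwards.
req = urllib.request.Request(
    "http://localhost/login.php",
    headers={"User-Agent": user_agent},
)
req.add_header("Referer", "http://www.python.org/")

# Look headers up under the capitalized form that Request stores.
stored = req.get_header("User-agent")
missing = req.get_header("User-Agent")  # None: the stored key is 'User-agent'

print(stored)
print(missing)
print(sorted(req.header_items()))
```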

5. HTTP errors

#! /usr/bin/env python3

import urllib.error
import urllib.request

req = urllib.request.Request('http://www.python.org/fish.html')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    # The error object doubles as a response: it has a status code and a body
    print(e.code)
    print(e.read().decode("utf8"))
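The same pattern can be tried without depending on a remote site. The following sketch assumes a throwaway local server (the NotFoundHandler class is hypothetical) that answers every GET with 404, and an opener built with an empty ProxyHandler so that proxy environment variables do not interfere with the localhost request.

```python
import http.server
import threading
import urllib.error
import urllib.request


class NotFoundHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        body = b"no such page"
        self.send_response(404)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


# Port 0 lets the operating system pick a free port.
server = http.server.HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# An empty proxy mapping makes the opener connect directly.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

try:
    opener.open(f"http://127.0.0.1:{server.server_port}/fish.html", timeout=5)
except urllib.error.HTTPError as e:
    # The exception carries the status line and the response body.
    status, reason, body = e.code, e.reason, e.read().decode("utf-8")
finally:
    server.shutdown()
    server.server_close()

print(status, reason, body)
```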

6. Exception handling 1

#! /usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://twitter.com/")
try:
    response = urlopen(req)
# HTTPError is a subclass of URLError, so it has to be caught first
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))
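The order of the except clauses matters because HTTPError derives from URLError. The sketch below illustrates this offline: describe() is a hypothetical helper, and the exceptions are constructed by hand with made-up URLs and messages rather than raised by a real request.

```python
from email.message import Message
from urllib.error import HTTPError, URLError


def describe(exc):
    """Turn a urllib exception into a short message, most specific type first."""
    if isinstance(exc, HTTPError):  # the server answered with an error status
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, URLError):   # no usable answer (DNS failure, refused, ...)
        return f"unreachable: {exc.reason}"
    return f"unexpected: {exc!r}"


# Exceptions built by hand so the sketch runs offline (values are hypothetical).
http_err = HTTPError("http://example.com/missing", 404, "Not Found", Message(), None)
url_err = URLError("connection refused")

print(issubclass(HTTPError, URLError))
print(describe(http_err))
print(describe(url_err))
```

If the URLError check came first, it would also match every HTTPError and the status code would never be reported.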

7. Exception handling 2

#! /usr/bin/env python3

from urllib.request import Request, urlopen
from urllib.error import URLError

req = Request("http://twitter.com/")
try:
    response = urlopen(req)
except URLError as e:
    # In Python 3 an HTTPError carries both 'code' and 'reason',
    # so test for 'code' first to tell the two cases apart
    if hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))

8. HTTP authentication

#! /usr/bin/env python3
 
import urllib.request
 
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
 
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://cms.tetx.com/"
password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')
 
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
 
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
 
# use the opener to fetch a URL
a_url = "https://cms.tetx.com/"
x = opener.open(a_url)
print(x.read())
 
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
 
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)
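Two of the ten methods, using a proxy and setting a timeout, have no code in this copy of the notes. The following is a minimal sketch of how they might look with urllib, not the author's original code. The proxy address, the fetch() helper and the silent local socket are all assumptions made for illustration. As a global alternative to the per-request timeout argument, socket.setdefaulttimeout() can also be used.

```python
import socket
import urllib.error
import urllib.request

# Hypothetical proxy address; replace it with a proxy you actually run.
PROXY_URL = "http://127.0.0.1:3128"

# An opener that routes http:// requests through the proxy.
proxy_opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY_URL})
)

# An empty mapping disables proxies, including ones from environment variables.
direct_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(opener, url, timeout=5.0):
    """Return the body as bytes, or None if the request fails or times out."""
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as e:
        # Connection problems arrive as URLError; a stalled read raises
        # socket.timeout directly.
        print("Request failed:", e)
        return None


# Offline demo: a local socket that accepts connections but never answers,
# so the read has to time out.
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen(1)
port = silent.getsockname()[1]
try:
    result = fetch(direct_opener, f"http://127.0.0.1:{port}/", timeout=0.5)
finally:
    silent.close()

print(result)
```

To send a request through the proxy instead, pass proxy_opener to fetch(); this only works if a proxy is really listening at PROXY_URL.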
