# -*- coding: utf-8 -*-
import re
import urllib2

page = 1
url = u'http://math.xmu.edu.cn/' + str(page)
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile(r'<article class="home_news_l">.*?<p>(.*?)</p>.*?<p>(.*?)</p></article>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        # The pattern has two groups, so each match is a tuple of two strings
        for text in item:
            print text.encode('utf-8')
except urllib2.URLError as e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
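I can open this website in a browser, but the crawler always gets a 404. I am sending headers too, so I don't know where the problem is. Thank you.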
巴扎黑2017-04-17 17:45:42
The URL you built is http://math.xmu.edu.cn/1
That URL does not exist in the first place, so the server returns 404.
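To see the difference, it may help to print the URL before requesting it. The sketch below is only an illustration: `build_url` is a hypothetical helper, not part of the original script, and it assumes the only address known to work is the site root that opens in the browser. It makes no network request.

```python
BASE_URL = 'http://math.xmu.edu.cn/'


def build_url(page=None):
    """Hypothetical helper: return the site root, or root + page number.

    Appending a page number reproduces the URL from the question;
    whether the site has any paginated path at all is not known here.
    """
    if page is None:
        return BASE_URL
    return BASE_URL + str(page)


# The URL the script in the question actually requests (the one that 404s)
print(build_url(1))
# The site root, which is the address that opens in a browser
print(build_url())
```

If the goal is just the home page, requesting the root without `str(page)` appended is probably what was intended.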