# -*- coding: utf-8 -*-
import scrapy
from weather.items import WeatherItem
from scrapy.http import Request


class WeatherSpider(scrapy.Spider):
    name = "myweather"
    allowed_domains = ["http://xjh.haitou.cc/nj/uni-21"]
    start_urls = ["http://xjh.haitou.cc/nj/uni-21/page-2"]
    url = "http://xjh.haitou.cc"

    def parse(self, response):
        item = WeatherItem()
        preachs = response.xpath('//table[@id="mainInfoTable"]/tbody/tr')
        for preach in preachs:
            item['corp'] = preach.xpath('.//p[@class="text-success company"]/text()').extract()
            item['date'] = preach.xpath('.//span[@class="hold-ymd"]/text()').extract()
            item['location'] = preach.xpath('.//td[@class="text-ellipsis"]/span/text()').extract()
            item['click'] = preach.xpath('.//td[@class="text-right"]/text()').extract()
            yield item

        nextlink = response.xpath('//li[@class="next"]/a/@href').extract()
        if nextlink:
            link = nextlink[0]
            print "##############"
            print self.url + link
            print "##############"
            yield Request(self.url + link, callback=self.parse)
##############
http://xjh.haitou.cc/nj/uni-21/page-3
##############
2015-10-23 22:05:57 [scrapy] DEBUG: Filtered offsite request to 'xjh.haitou.cc': <GET http://xjh.haitou.cc/nj/uni-21/page-3>
2015-10-23 22:05:57 [scrapy] INFO: Closing spider (finished)
2015-10-23 22:05:57 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 261,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 10508,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 10, 23, 14, 5, 57, 9032),
'item_scraped_count': 20,
'log_count/DEBUG': 23,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2015, 10, 23, 14, 5, 56, 662979)}
2015-10-23 22:05:57 [scrapy] INFO: Spider closed (finished)
怪我咯 2017-04-17 16:13:17
Just fix your allowed_domains and start_urls (and, to keep the code clean, drop the url = "http://xjh.haitou.cc" attribute; it is unnecessary). After that, whenever a next-page link is found, keep crawling with:

yield scrapy.Request(response.urljoin(nextlink[0]), callback=self.parse)

The modified code is below. I won't go into the underlying reasons in detail; see the official documentation (a short sketch of what the offsite check actually does follows the code).
class WeatherSpider(scrapy.Spider):
    name = "myweather"
    allowed_domains = ["xjh.haitou.cc"]
    start_urls = ["http://xjh.haitou.cc/nj/uni-21"]

    def parse(self, response):
        item = WeatherItem()
        preachs = response.xpath('//table[@id="mainInfoTable"]/tbody/tr')
        for preach in preachs:
            item['corp'] = preach.xpath('.//p[@class="text-success company"]/text()').extract()
            item['date'] = preach.xpath('.//span[@class="hold-ymd"]/text()').extract()
            item['location'] = preach.xpath('.//td[@class="text-ellipsis"]/span/text()').extract()
            item['click'] = preach.xpath('.//td[@class="text-right"]/text()').extract()
            yield item

        nextlink = response.xpath('//li[@class="next"]/a/@href').extract()
        if nextlink:
            yield scrapy.Request(response.urljoin(nextlink[0]), callback=self.parse)
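
Why the original spider stalled: Scrapy's OffsiteMiddleware builds a regex from allowed_domains and matches it against each request's hostname only, so a full URL such as "http://xjh.haitou.cc/nj/uni-21" can never match the bare hostname "xjh.haitou.cc", and every follow-up request gets filtered. A simplified reconstruction of that check (Python 3 syntax; not the exact Scrapy source):

import re
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Simplified version of what OffsiteMiddleware does: build one regex
    # from allowed_domains and search it against the request's hostname
    # only -- never against the full URL.
    host_regex = re.compile(
        r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains))
    hostname = urlparse(url).hostname or ''
    return not host_regex.search(hostname)

next_page = "http://xjh.haitou.cc/nj/uni-21/page-3"
print(is_offsite(next_page, ["http://xjh.haitou.cc/nj/uni-21"]))  # True  -> filtered
print(is_offsite(next_page, ["xjh.haitou.cc"]))                   # False -> followed

With allowed_domains fixed, the same run now walks through all the pages: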
2015-10-26 15:59:58 [scrapy] INFO: Closing spider (finished)
2015-10-26 15:59:58 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2247,
'downloader/request_count': 7,
'downloader/request_method_count/GET': 7,
'downloader/response_bytes': 71771,
'downloader/response_count': 7,
'downloader/response_status_count/200': 7,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 10, 26, 7, 59, 58, 975394),
'item_scraped_count': 132,
'log_count/DEBUG': 139,
'log_count/INFO': 7,
'request_depth_max': 6,
'response_received_count': 7,
'scheduler/dequeued': 7,
'scheduler/dequeued/memory': 7,
'scheduler/enqueued': 7,
'scheduler/enqueued/memory': 7,
'start_time': datetime.datetime(2015, 10, 26, 7, 59, 56, 500595)}
2015-10-26 15:59:58 [scrapy] INFO: Spider closed (finished)
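
The other change, response.urljoin(), is what makes the pagination robust: it resolves the extracted href against the URL of the page that was just downloaded (Scrapy delegates to the standard library's urljoin), so it handles both absolute paths and relative links, unlike the hand-rolled self.url + link concatenation. A quick sketch of the behavior (Python 3 syntax):

from urllib.parse import urljoin  # response.urljoin(href) is essentially urljoin(response.url, href)

base = "http://xjh.haitou.cc/nj/uni-21/page-2"
print(urljoin(base, "/nj/uni-21/page-3"))  # http://xjh.haitou.cc/nj/uni-21/page-3
print(urljoin(base, "page-3"))             # http://xjh.haitou.cc/nj/uni-21/page-3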
A sample of the scraped data:
1:{"date": ["2015-10-26 12:00"], "corp": ["大通证券股份有限公司"], "location": ["教一-508"], "click": ["159"]}
2:{"date": ["2015-10-26 14:00"], "corp": ["Goa大象设计"], "location": ["四牌楼校区中大院309"], "click": ["497"]}
3:{"date": ["2015-10-26 14:00"], "corp": ["中国建筑西南勘察设计研究院有限公司"], "location": ["四牌楼校区中山院111"], "click": ["403"]}
4:{"date": ["2015-10-26 14:00"], "corp": ["苏州桑泰海洋仪器研发有限责任公司"], "location": ["四牌楼校区中山院201"], "click": ["624"]}
5:{"date": ["2015-10-26 14:00"], "corp": ["大唐电信科技股份有限公司"], "location": ["四牌楼校区致知堂"], "click": ["1031"]}
6:{"date": ["2015-10-26 14:00"], "corp": ["华信咨询设计研究院有限公司"], "location": ["教六403"], "click": ["373"]}
7:{"date": ["2015-10-26 14:00"], "corp": ["山石网科通信技术有限公司"], "location": ["九龙湖校区教四302"], "click": ["573"]}
8:{"date": ["2015-10-26 18:30"], "corp": ["北京凯晨置业有限公司"], "location": ["四牌楼校区榴园宾馆逸夫科技馆"], "click": ["254"]}
9:{"date": ["2015-10-26 18:30"], "corp": ["中国建筑国际集团有限公司"], "location": ["四牌楼校区礼东101"], "click": ["237"]}
10:{"date": ["2015-10-26 18:30"], "corp": ["无锡华润微电子有限公司"], "location": ["四牌楼校区群贤楼三楼报告厅"], "click": ["607"]}
11:{"date": ["2015-10-26 19:00"], "corp": ["上海斐讯数据通信技术有限公司"], "location": ["教一208"], "click": ["461"]}
.....
.....
129:{"date": ["2015-11-16 14:00"], "corp": ["人本集团有限公司"], "location": ["大学生活动中心322多功能厅"], "click": ["26"]}
130:{"date": ["2015-11-17 18:30"], "corp": ["仲量联行测量师事务所(上海)有限公司"], "location": ["九龙湖大学生活动中心324报"], "click": ["19"]}
131:{"date": ["2015-11-18 15:30"], "corp": ["厦门中骏集团有限公司"], "location": ["四牌楼榴园新华厅"], "click": ["63"]}
132:{"date": ["2015-11-19 14:00"], "corp": ["理士国际技术有限公司"], "location": ["九龙湖大学生活动中心322报"], "click": ["22"]}