
python - Scrapy auto-pagination: after following the link to the next page, the spider stops on its own

# -*- coding: utf-8 -*-
import scrapy
from weather.items import WeatherItem
from scrapy.http import Request


class WeatherSpider(scrapy.Spider):
    name = "myweather"
    allowed_domains = ["http://xjh.haitou.cc/nj/uni-21"]
    start_urls = ["http://xjh.haitou.cc/nj/uni-21/page-2"]

    url="http://xjh.haitou.cc"

    def parse(self, response):
        item = WeatherItem()
        preachs=response.xpath('//table[@id="mainInfoTable"]/tbody/tr')
        for preach in preachs:
            item['corp']=preach.xpath('.//p[@class="text-success company"]/text()').extract()
            item['date']=preach.xpath('.//span[@class="hold-ymd"]/text()').extract()
            item['location']=preach.xpath('.//td[@class="text-ellipsis"]/span/text()').extract()
            item['click']=preach.xpath('.//td[@class="text-right"]/text()').extract()
            yield item

        nextlink=response.xpath('//li[@class="next"]/a/@href').extract()

        if nextlink:
            link=nextlink[0]
            print "##############"
            print self.url+link
            print "##############"

            yield Request(self.url+link,callback=self.parse )
##############
http://xjh.haitou.cc/nj/uni-21/page-3
##############
2015-10-23 22:05:57 [scrapy] DEBUG: Filtered offsite request to 'xjh.haitou.cc': <GET http://xjh.haitou.cc/nj/uni-21/page-3>
2015-10-23 22:05:57 [scrapy] INFO: Closing spider (finished)
2015-10-23 22:05:57 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 261,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 10508,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 23, 14, 5, 57, 9032),
 'item_scraped_count': 20,
 'log_count/DEBUG': 23,
 'log_count/INFO': 7,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 10, 23, 14, 5, 56, 662979)}
2015-10-23 22:05:57 [scrapy] INFO: Spider closed (finished)
Asked by 伊谢尔伦

2 replies

  • 怪我咯 (2017-04-17 16:13:17)

    Just fix your allowed_domains and start_urls. allowed_domains must contain bare domain names, not full URLs; with "http://xjh.haitou.cc/nj/uni-21" in it, the OffsiteMiddleware can't match the domain of the page-3 request and filters it out, which is exactly the "Filtered offsite request" line in your log. You can also delete the url="http://xjh.haitou.cc" attribute (it is unnecessary) and build the next-page URL with response.urljoin instead of string concatenation:
    yield scrapy.Request(response.urljoin(nextlink[0]), callback=self.parse)

    The modified code follows; for the details, see the official documentation.

    class WeatherSpider(scrapy.Spider):
        name = "myweather"
        allowed_domains = ["xjh.haitou.cc"]
        start_urls = ["http://xjh.haitou.cc/nj/uni-21"]
        def parse(self, response):
            item = WeatherItem()
            preachs=response.xpath('//table[@id="mainInfoTable"]/tbody/tr')
            for preach in preachs:
                item['corp']=preach.xpath('.//p[@class="text-success company"]/text()').extract()
                item['date']=preach.xpath('.//span[@class="hold-ymd"]/text()').extract()
                item['location']=preach.xpath('.//td[@class="text-ellipsis"]/span/text()').extract()
                item['click']=preach.xpath('.//td[@class="text-right"]/text()').extract()
                yield item
    
            nextlink=response.xpath('//li[@class="next"]/a/@href').extract()
    
            if nextlink:
                yield scrapy.Request(response.urljoin(nextlink[0]),callback=self.parse )
    2015-10-26 15:59:58 [scrapy] INFO: Closing spider (finished)
    2015-10-26 15:59:58 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2247,
     'downloader/request_count': 7,
     'downloader/request_method_count/GET': 7,
     'downloader/response_bytes': 71771,
     'downloader/response_count': 7,
     'downloader/response_status_count/200': 7,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 10, 26, 7, 59, 58, 975394),
     'item_scraped_count': 132,
     'log_count/DEBUG': 139,
     'log_count/INFO': 7,
     'request_depth_max': 6,
     'response_received_count': 7,
     'scheduler/dequeued': 7,
     'scheduler/dequeued/memory': 7,
     'scheduler/enqueued': 7,
     'scheduler/enqueued/memory': 7,
     'start_time': datetime.datetime(2015, 10, 26, 7, 59, 56, 500595)}
    2015-10-26 15:59:58 [scrapy] INFO: Spider closed (finished)

    Part of the data is as follows:
    1:{"date": ["2015-10-26 12:00"], "corp": ["Datong Securities Co., Ltd."], "location": ["Jiaoyi-508"], "click": ["159"]}
    2:{"date": ["2015-10-26 14:00"], "corp": ["Goa Elephant Design"], "location": ["309, Zhongdayuan, Sipailou Campus"], "click": ["497"]}
    3:{"date": ["2015-10-26 14:00"], "corp": ["China Southwest Architectural Survey and Design Institute Co., Ltd."], "location": ["111, Zhongshan Institute, Sipailou Campus"], "click": ["403"]}
    4:{"date": ["2015-10-26 14:00"], "corp": ["Suzhou Suntai Marine Instrument R&D Co., Ltd."], "location": ["201, Sun Yat-sen University, Sipailou Campus"], "click": ["624"]}
    5:{"date": ["2015-10-26 14:00"], "corp": ["Datang Telecom Technology Co., Ltd."], "location": ["Zhizhi Hall, Sipailou Campus"], "click": ["1031"]}
    6:{"date": ["2015-10-26 14:00"], "corp": ["Huaxin Consulting Design Institute Co., Ltd."], "location": ["Jiaoliu 403"], "click": ["373"]}
    7:{"date": ["2015-10-26 14:00"], "corp": ["山石Netcom Communication Technology Co., Ltd."], "location": ["Jiulong Lake Campus Teaching 4 302"], "click": ["573"]}
    8:{"date": ["2015-10-26 18:30"], "corp": ["Beijing Kaichen Real Estate Co., Ltd."], "location": ["Yifu Science and Technology Museum, Liuyuan Hotel, Sipailou Campus"], "click": ["254"]}
    9:{"date": ["2015-10-26 18:30"], "corp": ["China Construction International Group Co., Ltd."], "location": ["Lidong 101, Sipailou Campus"], "click": ["237"]}
    10:{"date": ["2015-10-26 18:30"], "corp": ["Wuxi China Resources Microelectronics Co., Ltd."], "location": ["Lecture Hall on the third floor of Qunxian Building, Sipailou Campus"], "click": ["607"]}
    11:{"date": ["2015-10-26 19:00"], "corp": ["Shanghai Feixun Data Communication Technology Co., Ltd."], "location": ["Jiaoyi 208"], "click": ["461"]}
    ....
    ....
    129:{"date": ["2015-11-16 14:00"], "corp": ["Renben Group Co., Ltd."], "location": ["College Student Activity Center 322 Multi-Function Hall"], "click": ["26"]}
    130:{"date": ["2015-11-17 18:30"], "corp": ["Jones Lang LaSalle Surveyors (Shanghai) Co., Ltd."], "location": ["Jiulong Lake Student Activity Center 324 News"], "click": ["19"]}
    131:{"date": ["2015-11-18 15:30"], "corp": ["Xiamen Zhongjun Group Co., Ltd."], "location": ["Sipailou Liuyuan Xinhua Hall"], "click": ["63"]}
    132:{"date": ["2015-11-19 14:00"], "corp": ["Leoch International Technology Co., Ltd."], "location": ["Jiulong Lake Student Activity Center 322 Newspaper"], "click": ["22"]}
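    A side note on why response.urljoin works here: it resolves the extracted href against the current page's URL (it behaves like Python's urllib.parse.urljoin with response.url as the base), so both root-relative and relative links come out as absolute URLs without manual string concatenation. A minimal sketch, using a hypothetical page URL from this site:

```python
from urllib.parse import urljoin

# response.urljoin(href) behaves like urljoin(response.url, href).
base = "http://xjh.haitou.cc/nj/uni-21/page-2"  # hypothetical current page URL

# A root-relative href resolves against the scheme and host:
print(urljoin(base, "/nj/uni-21/page-3"))  # http://xjh.haitou.cc/nj/uni-21/page-3

# A relative href replaces the last path segment:
print(urljoin(base, "page-3"))             # http://xjh.haitou.cc/nj/uni-21/page-3
```

    Either form of href on the page therefore yields the same absolute next-page URL, which the scheduler can then enqueue.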

  • 迷茫 (2017-04-17 16:13:17)

    Here is a suggested reference:

    Reference link
