
python - Why can I crawl this page directly with requests, but not with scrapy?

import requests
import lxml.etree

class job51():
    def __init__(self):
        self.session = requests.Session()
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
            'Cookie': ''
        }

    def start(self):
        html = self.session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php", headers=self.headers)
        self.parse(html)

    def parse(self, response):
        tree = lxml.etree.HTML(response.text)
        resume_url = tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href')
        print(resume_url[0])

This crawls the result I want, namely the resume URL. But with scrapy, using the same headers, the page seems to be stuck on the login page?

from scrapy import Request, Selector, Spider

class job51(Spider):
    name = "job51"
    # allowed_domains = ["my.51job.com"]
    start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"]
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
        'Cookie': ''
    }

    def start_requests(self):
        yield Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)

    def parse(self, response):
        selector = Selector(response)
        print("<<<<<<<<<<<<<<<<<<<<<", response.text)
        resume_url = selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href')
        print(">>>>>>>>>>>>", resume_url)

The output:

scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'}
2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened
2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None)
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
<<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>
>>>>>>>>>>>> []
2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse
    yield Request(resume_url[0],headers=self.headers,callback=self.getResume)
  File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 628,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 5743,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)}
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)


All answers (3)

  • 阿神 2017-04-18 10:35:18

    The log shows a 404. Check whether redirects are disabled in your scrapy settings.

  • PHP中文网 2017-04-18 10:35:18


    From the log you can see that the crawler written with scrapy is being redirected to the login page, which is why the error is reported. I suggest capturing packets for both the requests run and the scrapy run, checking the response content, and confirming that the request headers are exactly identical. My guess is that either the cookie has expired, or scrapy fails to send the cookie this way. I'm not especially familiar with scrapy, but the problem is most likely the cookie.

  • 迷茫 2017-04-18 10:35:18

    Most likely the actual request headers used by your session request already carry a cookie, so compare them against the headers mentioned above.

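The header comparison suggested in the last answer can be done without any packet capture: prepare a request through a `requests.Session` (without sending it) and print the headers it would actually transmit, `Cookie` included. This is a minimal sketch; `sid` is a placeholder cookie name standing in for whatever the real logged-in session carries.

```python
import requests

# Build a session that already holds a cookie, the way a logged-in
# session would ("sid"/"abc" are placeholders).
session = requests.Session()
session.cookies.set("sid", "abc")

# Prepare -- but do not send -- the same GET the question uses.
req = requests.Request(
    "GET",
    "http://my.51job.com/cv/CResume/CV_CResumeManage.php",
    headers={"User-Agent": "Mozilla/5.0"},
)
prepared = session.prepare_request(req)

# These are the headers requests would really send; the session's
# cookies have been merged into a Cookie header at this point.
print(prepared.headers)
```

Comparing this output against the request headers Scrapy logs (e.g. from `response.request.headers` in the callback) shows exactly which header, if any, differs between the two runs.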