I'm planning to scrape Baidu Tieba. The idea is to collect the link of every post on each list page, then follow those links and extract the content of each post. The code that extracts the post links from one list page is already written, but I've found that the crawl stops after only 3 pages. What is the problem? Here is my code:
#coding:utf-8
import scrapy

class TiebaSpider(scrapy.Spider):
    name = "tiebapost"
    start_urls = [
        'http://tieba.baidu.com/f?kw=%E6%B8%A1%E8%BE%B9%E9%BA%BB%E5%8F%8B&ie=utf-8&pn=0'
    ]

    def parse(self, response):
        # Open in append mode: 'w+' truncates the file on every page,
        # so only the most recent page's links would survive.
        output = open('e:/scrapy_tutorial/link.txt', 'a')
        count = 0
        for post in response.css('p.j_th_tit'):
            post_link = post.css('a.j_th_tit::attr(href)').extract()
            output.write('http://tieba.baidu.com' + post_link[0] + '\n')
            count += 1
            print u"Extracted link:", post_link
        print u'Total:', count, u'links'
        output.close()
        next_page = response.css('a.pagination-item::attr(href)').extract_first()
        if next_page is not None:
            # The next-page href may be relative or protocol-relative, so
            # resolve it against the current page URL before requesting it.
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
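One thing worth checking in the code above: the next-page href extracted from the list page is often site-relative or protocol-relative rather than absolute, and passing such a value straight to scrapy.Request can fail or request the wrong URL. Scrapy's response.urljoin resolves it against the current page. The same resolution can be sketched standalone with the standard library (assuming Python 3; the /p/... path below is a made-up example, not a real post):

```python
from urllib.parse import urljoin

# Base URL of the list page currently being parsed.
base = 'http://tieba.baidu.com/f?kw=test&ie=utf-8&pn=0'

# A site-relative post link resolves to an absolute URL on the same host.
print(urljoin(base, '/p/1234567890'))
# -> http://tieba.baidu.com/p/1234567890

# A protocol-relative next-page link inherits the base URL's scheme.
print(urljoin(base, '//tieba.baidu.com/f?kw=test&pn=50'))
# -> http://tieba.baidu.com/f?kw=test&pn=50
```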
PHP中文网 2017-04-18 10:12:11
When tieba.baidu.com is crawled in bulk, it starts returning 403 or other non-200 response codes. At that point the page cannot be opened, so there is no next-page link and the crawl simply ends. I suggest having a look at this document on the problems encountered.
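If 403 responses are indeed what is cutting the crawl short, Scrapy settings can soften the rate-limiting. A minimal sketch, assuming the values below (they are illustrative defaults, not guarantees that tieba.baidu.com will accept the requests):

```python
# Sketch of Scrapy settings that often reduce 403s from rate-limiting sites.
ANTI_403_SETTINGS = {
    # Send a browser-like User-Agent instead of Scrapy's default one.
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    # Wait between requests so the site is less likely to block the crawler.
    'DOWNLOAD_DELAY': 2,
    # Let AutoThrottle adapt the delay to how the server responds.
    'AUTOTHROTTLE_ENABLED': True,
    # Retry 403/429 a few times in case the block is transient.
    'RETRY_HTTP_CODES': [403, 429, 500, 502, 503, 504],
}

# In a spider these would go in the class body, e.g.:
# class TiebaSpider(scrapy.Spider):
#     custom_settings = ANTI_403_SETTINGS
```

Even with these settings a determined site can still block the crawler, but a slower, browser-like crawl usually gets noticeably further than the defaults.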