cari

Rumah  >  Soal Jawab  >  teks badan

python - scrapy pipeline报错求助

由于不太清楚传输的机制,卡在SCRAPY传输的这个问题上近半个月,翻阅了好多资料,还是不懂,基础比较差所以上来求助各位老师!
不涉及自定义就以SCRAPY默认的格式为例
spider return的东西需要什么样的格式?
dict?{a:1,b:2,.....}
还是[{a:1,aa:11},{b:2,bb:22},{......}]
return的东西传去哪了?
是不是下面代码的item?

class pipeline :
    def process_item(self, item, spider):

我真的是很菜,但是我很想学希望能得到各位老师的帮助!下面是我的代码,希望能指出缺点

spider:

# -*- coding: utf-8 -*-
import scrapy
from pm25.items import Pm25Item
import re


class InfospSpider(scrapy.Spider):
    name = "infosp"
    allowed_domains = ["pm25.com"]
    start_urls = ['http://www.pm25.com/rank/1day.html', ]

    def parse(self, response):
        item = Pm25Item()
        re_time = re.compile("\d+-\d+-\d+")
        date = response.xpath("/html/body/p[4]/p/p/p[2]/span").extract()[0] #单独解析出DATE
        # items = []

        selector = response.selector.xpath("/html/body/p[5]/p/p[3]/ul[2]/li") #从response里确立解析范围
        for subselector in selector: #通过范围逐条解析
            try: #防止[0]报错
                rank = subselector.xpath("span[1]/text()").extract()[0] 
                quality = subselector.xpath("span/em/text()")[0].extract()
                city = subselector.xpath("a/text()").extract()[0]
                province = subselector.xpath("span[3]/text()").extract()[0]
                aqi = subselector.xpath("span[4]/text()").extract()[0]
                pm25 = subselector.xpath("span[5]/text()").extract()[0]
            except IndexError:
                print(rank,quality,city,province,aqi,pm25)

            item['date'] = re_time.findall(date)[0]
            item['rank'] = rank
            item['quality'] = quality
            item['province'] = city
            item['city'] = province
            item['aqi'] = aqi
            item['pm25'] = pm25
            # items.append(item)

            yield item #这里不懂该怎么用,出来的是什么格式,
                       #有的教程会return items,所以希望能得到指点

pipeline:

import time

class Pm25Pipeline(object):

    def process_item(self, item, spider):
        today = time.strftime("%y%m%d",time.localtime())
        fname = str(today) + ".txt"

        with open(fname,"a") as f:
            for tmp in item: #不知道这里是否写的对,
                             #个人理解是spider return出来的item是yiled dict
                             #[{a:1,aa:11},{b:2,bb:22},{......}]
                f.write(tmp["date"] + '\t' +
                        tmp["rank"] + '\t' +
                        tmp["quality"] + '\t' +
                        tmp["province"] + '\t' +
                        tmp["city"] + '\t' +
                        tmp["aqi"] + '\t' +
                        tmp["pm25"] + '\n'
                        )
            f.close()
        return item

items:

import scrapy

class Pm25Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    rank = scrapy.Field()
    quality = scrapy.Field()
    province = scrapy.Field()
    city = scrapy.Field()
    aqi = scrapy.Field()
    pm25 = scrapy.Field()
    pass

部分运行报错代码:

Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '30',
 'city': '新疆',
 'date': '2017-04-02',
 'pm25': '13 ',
 'province': '伊犁哈萨克州',
 'quality': '优',
 'rank': '357'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
 'city': '西藏',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '林芝',
 'quality': '优',
 'rank': '358'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '丽江',
 'quality': '优',
 'rank': '359'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '27',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '15 ',
 'province': '玉溪',
 'quality': '优',
 'rank': '360'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '26',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '10 ',
 'province': '楚雄州',
 'quality': '优',
 'rank': '361'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '24',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '迪庆州',
 'quality': '优',
 'rank': '362'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '22',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '9 ',
 'province': '怒江州',
 'quality': '优',
 'rank': '363'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-03 10:23:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 38229,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 3, 2, 23, 14, 972356),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 363,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 3, 2, 23, 13, 226730)}
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Spider closed (finished)

希望能到到各位老师的帮助再次感谢~!

天蓬老师天蓬老师2795 hari yang lalu617

membalas semua(4)saya akan balas

  • PHP中文网

    PHP中文网2017-04-18 10:33:25

    Tulis terus sahaja, tidak perlu gelung, item diproses secara individu, bukan senarai seperti yang anda fikirkan:

    import time
    
    class Pm25Pipeline(object):
    
        def process_item(self, item, spider):
            today = time.strftime("%y%m%d", time.localtime())
            fname = str(today) + ".txt"
    
            with open(fname, "a") as f:
                f.write(item["date"] + '\t' +
                        item["rank"] + '\t' +
                        item["quality"] + '\t' +
                        item["province"] + '\t' +
                        item["city"] + '\t' +
                        item["aqi"] + '\t' +
                        item["pm25"] + '\n'
                        )
            f.close()
            return item

    balas
    0
  • 伊谢尔伦

    伊谢尔伦2017-04-18 10:33:25

    Cari untuk: TypeError: indeks rentetan mestilah integer, ketahui masalahnya
    Cari bilangan baris dan selesaikan masalah

    balas
    0
  • 大家讲道理

    大家讲道理2017-04-18 10:33:25

    Item Scrapy adalah serupa dengan kamus python, dengan beberapa fungsi lanjutan.

    Reka bentuk Scrapy, setiap kali Item dijana, ia boleh dihantar ke saluran paip untuk diproses. for tmp in item yang anda tulis di dalamnya melingkari kekunci kamus item. Kekunci itu mestilah rentetan Jika anda menggunakan sintaks __getitem__, anda akan digesa untuk menggunakan nombor dan bukannya nombor.

    balas
    0
  • 高洛峰

    高洛峰2017-04-18 10:33:25

    Anda boleh menganggap item sebagai kamus, yang sebenarnya merupakan kelas terbitan daripada kelas dict. Anda melintasi terus pipeline ini dalam item, dan tmp yang diperoleh sebenarnya adalah kunci kamus dan jenisnya ialah rentetan, jadi tmp['pm25'] operasi jenis ini melaporkan TypeError:string类型的对象索引必须是int型.

    balas
    0
  • Batalbalas