python - Help with a Scrapy pipeline error

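Because I don't really understand how Scrapy passes data around, I have been stuck on this problem for nearly half a month. I have read a lot of material and still don't get it. My fundamentals are weak, so I'm asking here for help!
No customization is involved; take Scrapy's default format as the example.
What format does the spider need to return?
A dict? {a:1,b:2,.....}
Or [{a:1,aa:11},{b:2,bb:22},{......}]?
Where does the returned value go?
Is it the item in the code below?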

class pipeline :
    def process_item(self, item, spider):

I'm really a beginner, but I want to learn and hope to get your help! My code is below; please point out its shortcomings.

spider:

# -*- coding: utf-8 -*-
import scrapy
from pm25.items import Pm25Item
import re


class InfospSpider(scrapy.Spider):
    name = "infosp"
    allowed_domains = ["pm25.com"]
    start_urls = ['http://www.pm25.com/rank/1day.html', ]

    def parse(self, response):
        item = Pm25Item()
        re_time = re.compile(r"\d+-\d+-\d+")
        date = response.xpath("/html/body/p[4]/p/p/p[2]/span").extract()[0] # extract the date separately
        # items = []

        selector = response.selector.xpath("/html/body/p[5]/p/p[3]/ul[2]/li") # narrow the response down to the rows to parse
        for subselector in selector: # parse the rows one by one
            try: # guard against [0] raising IndexError
                rank = subselector.xpath("span[1]/text()").extract()[0] 
                quality = subselector.xpath("span/em/text()")[0].extract()
                city = subselector.xpath("a/text()").extract()[0]
                province = subselector.xpath("span[3]/text()").extract()[0]
                aqi = subselector.xpath("span[4]/text()").extract()[0]
                pm25 = subselector.xpath("span[5]/text()").extract()[0]
            except IndexError:
                print(rank,quality,city,province,aqi,pm25)

            item['date'] = re_time.findall(date)[0]
            item['rank'] = rank
            item['quality'] = quality
            item['province'] = city
            item['city'] = province
            item['aqi'] = aqi
            item['pm25'] = pm25
            # items.append(item)

            yield item # I don't understand how to use this or what format comes out;
                       # some tutorials return items instead, so I'd appreciate some guidance

pipeline:

import time

class Pm25Pipeline(object):

    def process_item(self, item, spider):
        today = time.strftime("%y%m%d",time.localtime())
        fname = str(today) + ".txt"

        with open(fname,"a") as f:
            for tmp in item: # not sure whether this is right;
                             # my understanding is that the spider yields a list of dicts:
                             # [{a:1,aa:11},{b:2,bb:22},{......}]
                f.write(tmp["date"] + '\t' +
                        tmp["rank"] + '\t' +
                        tmp["quality"] + '\t' +
                        tmp["province"] + '\t' +
                        tmp["city"] + '\t' +
                        tmp["aqi"] + '\t' +
                        tmp["pm25"] + '\n'
                        )
            f.close()
        return item

items:

import scrapy

class Pm25Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    rank = scrapy.Field()
    quality = scrapy.Field()
    province = scrapy.Field()
    city = scrapy.Field()
    aqi = scrapy.Field()
    pm25 = scrapy.Field()
    pass

Part of the error output:

Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '30',
 'city': '新疆',
 'date': '2017-04-02',
 'pm25': '13 ',
 'province': '伊犁哈萨克州',
 'quality': '优',
 'rank': '357'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
 'city': '西藏',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '林芝',
 'quality': '优',
 'rank': '358'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '丽江',
 'quality': '优',
 'rank': '359'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '27',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '15 ',
 'province': '玉溪',
 'quality': '优',
 'rank': '360'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '26',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '10 ',
 'province': '楚雄州',
 'quality': '优',
 'rank': '361'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '24',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '迪庆州',
 'quality': '优',
 'rank': '362'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '22',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '9 ',
 'province': '怒江州',
 'quality': '优',
 'rank': '363'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-03 10:23:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 38229,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 3, 2, 23, 14, 972356),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 363,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 3, 2, 23, 13, 226730)}
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Spider closed (finished)

I hope to get your help. Thanks again!

天蓬老师, 2787 days ago

Replies (4)

  • PHP中文网

    2017-04-18 10:33:25

    Write the fields directly; no loop is needed. Each item is passed to the pipeline individually, not as a list as you assumed:

    import time
    
    class Pm25Pipeline(object):
    
        def process_item(self, item, spider):
            today = time.strftime("%y%m%d", time.localtime())
            fname = str(today) + ".txt"
    
            with open(fname, "a") as f:
                f.write(item["date"] + '\t' +
                        item["rank"] + '\t' +
                        item["quality"] + '\t' +
                        item["province"] + '\t' +
                        item["city"] + '\t' +
                        item["aqi"] + '\t' +
                        item["pm25"] + '\n'
                        )
            return item
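
A possible variation on the pipeline above, as a minimal sketch rather than the answerer's code: Scrapy calls the optional open_spider and close_spider methods on a pipeline when they are defined, so the file can be opened once per crawl instead of once per item. The class name Pm25FilePipeline, the FIELDS tuple, and the out_dir argument are made up for this illustration, and every field is assumed to hold a string, as in the log above.

```python
import os
import time

# Field order taken from the question's pipeline.
FIELDS = ("date", "rank", "quality", "province", "city", "aqi", "pm25")


class Pm25FilePipeline(object):  # hypothetical name
    def __init__(self, out_dir="."):  # out_dir is an assumption for this sketch
        self.out_dir = out_dir
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        fname = time.strftime("%y%m%d", time.localtime()) + ".txt"
        path = os.path.join(self.out_dir, fname)
        self.file = open(path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # item is ONE record; index it by field name, with no loop over the item.
        self.file.write("\t".join(item[name] for name in FIELDS) + "\n")
        return item
```

Returning the item from process_item keeps it available to any later pipeline in ITEM_PIPELINES.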

  • 伊谢尔伦

    2017-04-18 10:33:25

    Search for "TypeError: string indices must be integers" to find out what the error means,
    then locate the line number given in the traceback and fix the problem there.

  • 大家讲道理

    2017-04-18 10:33:25

    A Scrapy Item behaves much like a Python dictionary, with some extra features.

    By design, Scrapy passes each Item to the pipeline as soon as the spider produces it. Your for tmp in item loop iterates over the keys of the item, and those keys are strings. Indexing a string with another string, as in tmp["date"] (the __getitem__ syntax), fails because string indices must be integers.
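
The behaviour described above can be reproduced without Scrapy. This is a small sketch in which a plain dict stands in for the item (an assumption made for the demo; the values are copied from the log in the question):

```python
# A plain dict stands in for the Scrapy item here.
item = {"date": "2017-04-02", "rank": "357", "pm25": "13 "}

# Iterating over a mapping yields its keys, not its values or sub-dicts.
keys = [tmp for tmp in item]

# Indexing one of those keys (a str) with another str reproduces the error.
try:
    keys[0]["pm25"]
    message = None
except TypeError as exc:
    message = str(exc)

# Indexing the item itself by field name works as intended.
line = "\t".join([item["date"], item["rank"], item["pm25"]])
```

After this runs, message holds the same "string indices must be integers" text seen in the traceback (newer Python versions append a little more detail), while line holds the tab-separated record.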

  • 高洛峰

    2017-04-18 10:33:25

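    You can treat an item as a dictionary; it implements the same mapping interface as dict. When you iterate over the item directly in the pipeline, each tmp you get is a key of that dictionary, which is a string, so tmp['pm25'] raises TypeError: string indices must be integers.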
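
On the question of what parse should hand back: a Scrapy callback may yield items one at a time or return an iterable of them, and each item (a dict or an Item object) reaches process_item individually either way. The sketch below imitates that flow with plain Python. parse_rows, process_item, and rows are hypothetical stand-ins for this illustration, not Scrapy APIs, and the sample values come from the log in the question.

```python
def parse_rows(rows, date):
    """Stand-in for a spider's parse(): yield one record per table row."""
    for rank, city, province in rows:
        # Build a fresh dict on every iteration so records do not share state.
        yield {"date": date, "rank": rank, "city": city, "province": province}


def process_item(item):
    """Stand-in for a pipeline's process_item(): receives a single record."""
    return "\t".join([item["date"], item["rank"], item["city"], item["province"]])


# Sample rows as (rank, city, province).
rows = [("357", "伊犁哈萨克州", "新疆"), ("358", "林芝", "西藏")]
lines = [process_item(item) for item in parse_rows(rows, "2017-04-02")]
```

Creating the record inside the loop, rather than once before it as in the question's spider, gives every yielded record its own object.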
