>  Q&A  >  본문

python - scrapy pipeline报错求助

由于不太清楚传输的机制,卡在SCRAPY传输的这个问题上近半个月,翻阅了好多资料,还是不懂,基础比较差所以上来求助各位老师!
不涉及自定义就以SCRAPY默认的格式为例
spider return的东西需要什么样的格式?
dict?{a:1,b:2,.....}
还是[{a:1,aa:11},{b:2,bb:22},{......}]
return的东西传去哪了?
是不是下面代码的item?

class pipeline :
    def process_item(self, item, spider):

我真的是很菜,但是我很想学希望能得到各位老师的帮助!下面是我的代码,希望能指出缺点

spider:

# -*- coding: utf-8 -*-
import scrapy
from pm25.items import Pm25Item
import re


class InfospSpider(scrapy.Spider):
    name = "infosp"
    allowed_domains = ["pm25.com"]
    start_urls = ['http://www.pm25.com/rank/1day.html', ]

    def parse(self, response):
        item = Pm25Item()
        re_time = re.compile("\d+-\d+-\d+")
        date = response.xpath("/html/body/p[4]/p/p/p[2]/span").extract()[0] #单独解析出DATE
        # items = []

        selector = response.selector.xpath("/html/body/p[5]/p/p[3]/ul[2]/li") #从response里确立解析范围
        for subselector in selector: #通过范围逐条解析
            try: #防止[0]报错
                rank = subselector.xpath("span[1]/text()").extract()[0] 
                quality = subselector.xpath("span/em/text()")[0].extract()
                city = subselector.xpath("a/text()").extract()[0]
                province = subselector.xpath("span[3]/text()").extract()[0]
                aqi = subselector.xpath("span[4]/text()").extract()[0]
                pm25 = subselector.xpath("span[5]/text()").extract()[0]
            except IndexError:
                print(rank,quality,city,province,aqi,pm25)

            item['date'] = re_time.findall(date)[0]
            item['rank'] = rank
            item['quality'] = quality
            item['province'] = city
            item['city'] = province
            item['aqi'] = aqi
            item['pm25'] = pm25
            # items.append(item)

            yield item #这里不懂该怎么用,出来的是什么格式,
                       #有的教程会return items,所以希望能得到指点

pipeline:

import time

class Pm25Pipeline(object):

    def process_item(self, item, spider):
        today = time.strftime("%y%m%d",time.localtime())
        fname = str(today) + ".txt"

        with open(fname,"a") as f:
            for tmp in item: #不知道这里是否写的对,
                             #个人理解是spider return出来的item是yiled dict
                             #[{a:1,aa:11},{b:2,bb:22},{......}]
                f.write(tmp["date"] + '\t' +
                        tmp["rank"] + '\t' +
                        tmp["quality"] + '\t' +
                        tmp["province"] + '\t' +
                        tmp["city"] + '\t' +
                        tmp["aqi"] + '\t' +
                        tmp["pm25"] + '\n'
                        )
            f.close()
        return item

items:

import scrapy

class Pm25Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    rank = scrapy.Field()
    quality = scrapy.Field()
    province = scrapy.Field()
    city = scrapy.Field()
    aqi = scrapy.Field()
    pm25 = scrapy.Field()
    pass

部分运行报错代码:

Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '30',
 'city': '新疆',
 'date': '2017-04-02',
 'pm25': '13 ',
 'province': '伊犁哈萨克州',
 'quality': '优',
 'rank': '357'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
 'city': '西藏',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '林芝',
 'quality': '优',
 'rank': '358'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '丽江',
 'quality': '优',
 'rank': '359'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '27',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '15 ',
 'province': '玉溪',
 'quality': '优',
 'rank': '360'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '26',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '10 ',
 'province': '楚雄州',
 'quality': '优',
 'rank': '361'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '24',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '11 ',
 'province': '迪庆州',
 'quality': '优',
 'rank': '362'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '22',
 'city': '云南',
 'date': '2017-04-02',
 'pm25': '9 ',
 'province': '怒江州',
 'quality': '优',
 'rank': '363'}
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
    tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-03 10:23:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 38229,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 3, 2, 23, 14, 972356),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 363,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 4, 3, 2, 23, 13, 226730)}
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Spider closed (finished)

希望能到到各位老师的帮助再次感谢~!

天蓬老师天蓬老师2741일 전567

모든 응답(4)나는 대답할 것이다

  • PHP中文网

    PHP中文网2017-04-18 10:33:25

    직접 작성하면 루프를 수행할 필요가 없습니다. 항목은 생각처럼 목록이 아닌 개별적으로 처리됩니다.

    으아악

    회신하다
    0
  • 伊谢尔伦

    伊谢尔伦2017-04-18 10:33:25

    검색: TypeError: 문자열 인덱스는 정수여야 합니다. 문제가 무엇인지 파악하세요
    행 수를 찾아 문제를 해결하세요

    회신하다
    0
  • 大家讲道理

    大家讲道理2017-04-18 10:33:25

    Scrapy의 Item은 일부 확장된 기능을 포함하여 Python 사전과 유사합니다.

    Scrapy의 설계에서는 항목이 생성될 때마다 처리를 위해 파이프라인으로 전달될 수 있습니다. 여기에 작성한 for tmp in item는 항목 사전의 키를 반복합니다. __getitem__ 구문을 사용하는 경우 숫자 대신 숫자를 사용하라는 메시지가 표시됩니다.

    회신하다
    0
  • 高洛峰

    高洛峰2017-04-18 10:33:25

    item는 실제로 dict 클래스의 파생 클래스인 사전으로 생각할 수 있습니다. pipeline에서 이 item을 직접 순회하는데, 얻은 tmp은 실제로 사전 키이고 유형도 문자열이므로 tmp['pm25'] 이런 연산이 TypeError:string类型的对象索引必须是int型에 보고됩니다.

    회신하다
    0
  • 취소회신하다