
Douban movie image crawling example

PHP中文网 (original article)
2017-06-20 15:26:40

1. A look at the result first

 

2. Install and use Scrapy

Official website:.

Installation command: pip install Scrapy

Once installation is complete, create a new project from the default template with the command: scrapy startproject xx

 

The diagram above shows how Scrapy's components work together. The meaning and role of each part are easy to look up (for example on Baidu), so I won't go into detail here. In general, the work consists of the following steps.

1) Configure settings. Other options can be set according to your own requirements; see the Scrapy documentation.

DEFAULT_REQUEST_HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.10 Safari/537.36'}
DOWNLOAD_TIMEOUT = 30
IMAGES_STORE = 'Images'
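
The relative IMAGES_STORE value above is later joined with file paths inside the pipeline, so the result depends on the directory the crawl is started from. As a small sketch (an assumption about how one might harden this, not the author's configuration), the path can be made absolute in the settings file:

```python
# settings.py (fragment): a sketch, not the article's exact file
import os

# Give up on a request after 30 seconds, as in the article.
DOWNLOAD_TIMEOUT = 30

# Assumption: resolve the image folder once, relative to the directory
# the crawl is started from, so later path joins are unambiguous.
IMAGES_STORE = os.path.abspath('Images')
```

With this variant the folder is still named Images, but code that reads IMAGES_STORE always receives a full path.
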

2) Define the item class, which plays the role of a model class. For example:

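import scrapy


class CnblogImageItem(scrapy.Item):
    image = scrapy.Field()      # URL of the image to download
    imagePath = scrapy.Field()  # path(s) of the downloaded file
    name = scrapy.Field()       # title, used as the file name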
3) Configure the downloader middleware. A downloader middleware customizes how a request is sent. Common examples are middleware that sets a proxy and middleware that renders pages with PhantomJS. Here, only the proxy middleware is used.

from scrapy.http import HtmlResponse
from selenium import webdriver


class GaoxiaoSpiderMiddleware(object):
    def process_request(self, request, spider):
        # Requests flagged 'img' are image downloads; let Scrapy handle them
        if len(request.flags) > 0 and request.flags[0] == 'img':
            return None
        driver = webdriver.PhantomJS()
        # Maximize the browser window
        driver.maximize_window()
        driver.get(request.url)
        content = driver.page_source
        driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=content)


class ProxyMiddleWare(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://175.155.24.103:808'
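
ProxyMiddleWare above hardcodes a single proxy address. A minimal sketch of choosing a proxy per request is shown below. The class name RandomProxyMiddleware, the PROXIES list, and its placeholder addresses are all assumptions made for illustration, and a stand-in request object is used so the snippet runs without Scrapy installed.

```python
import random
from types import SimpleNamespace

# Hypothetical placeholder proxies (documentation-range addresses, not real servers).
PROXIES = [
    'http://192.0.2.10:8080',
    'http://192.0.2.11:8080',
]


class RandomProxyMiddleware(object):
    """Sketch of a downloader middleware that picks a proxy for each request."""

    def __init__(self, proxies=PROXIES):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Same mechanism as ProxyMiddleWare: the proxy is stored in request.meta.
        request.meta['proxy'] = random.choice(self.proxies)
        return None  # returning None lets the request continue through the chain


# Stand-in for a Scrapy request, used only to try the class outside a crawl.
fake_request = SimpleNamespace(url='http://example.com', meta={})
RandomProxyMiddleware().process_request(fake_request, spider=None)
```

Inside a real project the class would be registered under DOWNLOADER_MIDDLEWARES in the same way as ProxyMiddleWare.
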
4) Write a pipeline. A pipeline processes the items passed on by the spider, for example saving them to Excel or a database, or downloading images. Here is my code for downloading images, built on Scrapy's official ImagesPipeline.

import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


class CnblogImagesPipeline(ImagesPipeline):
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def get_media_requests(self, item, info):
        image_url = item['image']
        if image_url != '':
            # Flag the request so the PhantomJS middleware skips it
            yield scrapy.Request(str(image_url), flags=['img'])

    def item_completed(self, result, item, info):
        image_path = [x["path"] for ok, x in result if ok]
        if image_path:
            # Rename the downloaded file after the item's name
            if item.get('name'):
                ext = os.path.splitext(image_path[0])[1]
                os.rename(os.path.join(self.IMAGES_STORE, image_path[0]),
                          os.path.join(self.IMAGES_STORE, item['name'] + ext))
            item["imagePath"] = image_path
        else:
            item['imagePath'] = ''
        return item
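
The rename step in item_completed uses the title directly as a file name, so a title containing a path separator or another reserved character would make os.rename fail. The sketch below isolates that step in a hypothetical helper, rename_image, which is not part of the original project, and tries it on a temporary directory that mimics the full/ subfolder ImagesPipeline writes to.

```python
import os
import re
import tempfile


def rename_image(store_dir, relative_path, name):
    """Rename a downloaded file to '<name><ext>' directly under store_dir.

    Hypothetical helper; returns the new path relative to store_dir.
    """
    ext = os.path.splitext(relative_path)[1]
    # Replace characters that are not allowed in file names on common systems.
    safe_name = re.sub(r'[\\/:*?"<>|]', '_', name).strip()
    new_relative = safe_name + ext
    os.rename(os.path.join(store_dir, relative_path),
              os.path.join(store_dir, new_relative))
    return new_relative


# Try it on a throwaway directory with one empty "downloaded" file.
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, 'full'))
open(os.path.join(store, 'full', 'abc123.jpg'), 'wb').close()
new_path = rename_image(store, 'full/abc123.jpg', 'Movie: Part 1/2')
```

A pipeline could call such a helper in place of the inline os.rename and store the returned path in item['imagePath'].
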
5) Write your own Spider class. A spider holds some configuration, issues the URL requests, and processes the response data. The downloader middleware and pipeline configuration could go in the settings file, but here I put them in each spider's custom_settings: the project contains several spiders that use different downloader middleware, so each one is configured separately.

# coding=utf-8
import sys
import scrapy
import gaoxiao.items
import json

# Python 2 only: force UTF-8 as the default string encoding
reload(sys)
sys.setdefaultencoding('utf-8')


class doubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    baseUrl = ''
    start = 0
    start_urls = [baseUrl + str(start)]
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'gaoxiao.middlewares.ProxyMiddleWare': 1,
            # 'gaoxiao.middlewares.GaoxiaoSpiderMiddleware': 544
        },
        'ITEM_PIPELINES': {
            'gaoxiao.pipelines.CnblogImagesPipeline': 1,
        }
    }

    def parse(self, response):
        # The response is JSON; each entry in 'subjects' describes one movie
        data = json.loads(response.text)['subjects']
        for i in data:
            item = gaoxiao.items.CnblogImageItem()
            if i['cover'] != '':
                item['image'] = i['cover']
                item['name'] = i['title']
            else:
                item['image'] = ''
            yield item
        # Request the next page of 20 entries until the offset reaches 400
        if self.start < 400:
            self.start += 20
            yield scrapy.Request(self.baseUrl + str(self.start), callback=self.parse)
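
The parse method does two things: it turns each entry of the JSON 'subjects' list into an item, and it advances the offset by 20 until it reaches 400. The sketch below reproduces both pieces in plain Python so they can be tried without a network connection. The function names and the sample payload are made up for illustration; only the cover and title field names come from the spider code.

```python
import json


def extract_items(body):
    """Return a list of item-like dicts built from a JSON response body."""
    items = []
    for subject in json.loads(body)['subjects']:
        if subject['cover'] != '':
            items.append({'image': subject['cover'], 'name': subject['title']})
        else:
            items.append({'image': ''})
    return items


def page_offsets(first=0, last=400, step=20):
    """Offsets the spider requests: first, first + step, ..., last."""
    return list(range(first, last + 1, step))


# Made-up sample payload with the same shape the spider expects.
sample = json.dumps({'subjects': [
    {'title': 'Example Movie', 'cover': 'http://example.com/a.jpg'},
    {'title': 'No Cover', 'cover': ''},
]})
items = extract_items(sample)
offsets = page_offsets()
```

With the default arguments the offsets run from 0 to 400 inclusive, which corresponds to 21 requests of 20 entries each.
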
