# Douban movie image crawling example
Official website: https://scrapy.org
Installation command: pip install Scrapy
## After installation, create a new project from the default template with the command: `scrapy startproject xx`
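For orientation, `scrapy startproject xx` lays out a skeleton like the one below (this follows the standard Scrapy project template; the `xx` name comes from the command above):

```
xx/
├── scrapy.cfg          # deploy configuration
└── xx/
    ├── __init__.py
    ├── items.py        # Item definitions (step 2 below)
    ├── middlewares.py  # downloader middlewares (step 3)
    ├── pipelines.py    # item pipelines (step 4)
    ├── settings.py     # project settings (step 1)
    └── spiders/        # spider classes (step 5)
        └── __init__.py
```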
[Figure: Scrapy architecture diagram]

The diagram above vividly illustrates Scrapy's operating mechanism. The meaning and function of each component can easily be looked up on Baidu, so I won't go into detail here. Generally, what we need to do comes down to the following steps.
1) Configure settings. For anything beyond the options below, consult the Scrapy documentation and configure according to your own requirements.
```python
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.10 Safari/537.36'
}
DOWNLOAD_TIMEOUT = 30
IMAGES_STORE = 'Images'
```
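This article sets only those three values. As a hedged aside, a few other documented Scrapy settings are commonly tuned in crawlers like this one; none of them appear in the original code, so treat this purely as an illustrative sketch:

```python
# Optional settings.py additions (illustrative only; not part of the original article)
ROBOTSTXT_OBEY = False     # the JSON endpoints crawled here are not regular pages
CONCURRENT_REQUESTS = 8    # limit parallel requests
DOWNLOAD_DELAY = 0.5       # pause (seconds) between requests to the same site
```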
2) Define the Item class, which is the equivalent of a Model class. For example:
```python
import scrapy


class CnblogImageItem(scrapy.Item):
    image = scrapy.Field()      # URL of the image to download
    imagePath = scrapy.Field()  # local path after download
    name = scrapy.Field()       # movie title, used to rename the file
```

3) Configure the downloader middleware. A downloader middleware customizes how requests are sent; typical examples are proxy middlewares and PhantomJS (headless browser) middlewares. Here we only use the proxy middleware.
```python
from selenium import webdriver
from scrapy.http import HtmlResponse


class GaoxiaoSpiderMiddleware(object):
    def process_request(self, request, spider):
        # Image-download requests (flagged 'img' by the pipeline) go through
        # the normal downloader; everything else is rendered with PhantomJS.
        if len(request.flags) > 0 and request.flags[0] == 'img':
            return None
        driver = webdriver.PhantomJS()
        # Maximize the browser window
        driver.maximize_window()
        driver.get(request.url)
        content = driver.page_source
        driver.quit()
        return HtmlResponse(request.url, encoding='utf-8', body=content)


class ProxyMiddleWare(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://175.155.24.103:808'
```

4) Write a pipeline. Pipelines process the items passed along from the Spider: saving to Excel or a database, downloading pictures, and so on. Below is my code for downloading images, built on Scrapy's official image-downloading framework (ImagesPipeline).
```python
import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


class CnblogImagesPipeline(ImagesPipeline):
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def get_media_requests(self, item, info):
        image_url = item['image']
        if image_url != '':
            # Flag the request so the PhantomJS middleware skips it
            yield scrapy.Request(str(image_url), flags=['img'])

    def item_completed(self, result, item, info):
        image_path = [x["path"] for ok, x in result if ok]
        if image_path:
            # Rename the downloaded file to the movie title
            if item['name'] is not None and item['name'] != '':
                ext = os.path.splitext(image_path[0])[1]
                os.rename(self.IMAGES_STORE + '/' + image_path[0],
                          self.IMAGES_STORE + '/' + item['name'] + ext)
            item["imagePath"] = image_path
        else:
            item['imagePath'] = ''
        return item
```

5) Write your own Spider class. The Spider configures the crawl, initiates URL requests, and processes the response data. The downloader middleware configuration and pipelines could be placed in the settings file, but here I put them in each spider's custom_settings: the project contains multiple spiders, and they use different downloader middlewares, so they are configured separately.
```python
# coding=utf-8
import sys
import json

import scrapy
import gaoxiao.items

# Python 2 workaround to default string handling to UTF-8
reload(sys)
sys.setdefaultencoding('utf-8')


class doubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    baseUrl = ''  # Douban JSON list endpoint (URL omitted in the original article)
    start = 0
    start_urls = [baseUrl + str(start)]
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'gaoxiao.middlewares.ProxyMiddleWare': 1,
            # 'gaoxiao.middlewares.GaoxiaoSpiderMiddleware': 544
        },
        'ITEM_PIPELINES': {
            'gaoxiao.pipelines.CnblogImagesPipeline': 1,
        }
    }

    def parse(self, response):
        data = json.loads(response.text)['subjects']
        for i in data:
            item = gaoxiao.items.CnblogImageItem()
            if i['cover'] != '':
                item['image'] = i['cover']
                item['name'] = i['title']
            else:
                item['image'] = ''
            yield item
        # Page through the list 20 entries at a time, up to offset 400
        if self.start < 400:
            self.start += 20
            yield scrapy.Request(self.baseUrl + str(self.start),
                                 callback=self.parse)
```
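To run this, first point `baseUrl` at a Douban list endpoint that returns JSON of the shape `parse()` expects (a top-level `subjects` array whose entries carry `cover` and `title` fields; the URL itself is omitted in the listing above). Then launch the spider from the project root (the directory containing `scrapy.cfg`) with `scrapy crawl douban`. Downloaded covers are saved under the `IMAGES_STORE` directory (`Images`) configured in step 1 and renamed to the movie title by the pipeline from step 4.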