
First Introduction to Scrapy: Crawling Images from Moko.cc in Practice

WBOY · Original · 2016-06-24

I have been studying the Scrapy crawler framework for the past two days and plan to write a crawler to practice with. What I usually do most is browse pictures — yes, that's right, artistic photos. I proudly believe that looking at more beautiful photos will definitely improve your aesthetics and make you an elegant programmer. O(∩_∩)O~ Just kidding. So, without further ado, let's get to the point and write an image crawler.

Design idea: the crawl target is the model photos on Moko.cc. Use CrawlSpider to extract the URL of each photo, write the extracted image URLs into a static HTML file for storage, and you can simply open that file to view the images. My environment is Windows 8.1, Python 2.7, Scrapy 0.24.4. I won't cover how to set up the environment; you can look that up yourself.

Referring to the official documentation, I summarized the four steps to build a crawler program:

  • Create a scrapy project
  • Define the element items that need to be extracted from the web page
  • Implement a spider class to complete the function of crawling URLs and extracting items through the interface
  • Implement an item pipeline class to complete the storage function of Items.
The next steps are simple; just follow them one by one. First, create a project in the terminal. Let's name the project moko: enter the command scrapy startproject moko, and Scrapy will create a moko directory under the current directory with some initial files in it (a sketch of the layout follows below). If you are interested in what the other files are for, check the documentation; here I will mainly introduce the files we use this time.
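For reference, this is roughly the layout that scrapy startproject generates (based on the Scrapy 0.24 documentation; the exact contents can vary slightly between versions):

    moko/
        scrapy.cfg            # deploy/config file
        moko/                 # the project's Python package
            __init__.py
            items.py          # Item definitions (step 2)
            pipelines.py      # Item pipelines (step 4)
            settings.py       # project settings
            spiders/          # spider code lives here (step 3)
                __init__.py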

Define the Item: define the data we want to capture in items.py:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class MokoItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        url = scrapy.Field()

The url here is the dict key used to store the final result; it will be explained later, and the field name is arbitrary (see the quick illustration below). For example, if we also needed to crawl the name of the picture's author, we could add name = scrapy.Field(), and so on.
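As a quick illustration (not one of the project files), an Item behaves much like a dict once its fields are declared; the URL below is made up:

    # -*- coding: utf-8 -*-
    # Quick illustration: a scrapy.Item is used like a dict of declared fields.
    from moko.items import MokoItem

    item = MokoItem()
    item['url'] = 'http://example.com/some.jpg'   # made-up URL
    print item['url']                             # dict-style access
    print dict(item)                              # {'url': 'http://example.com/some.jpg'}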
Next, we enter the spiders folder and create a Python file in it. Let's name it mokospider.py and add the core code that implements the spider.

The spider is a Python class that inherits from scrapy.contrib.spiders.CrawlSpider and has three required members:

    name: the identifier of this spider; it must be unique, and different crawlers must use different names.

    start_urls: a list of URLs; the spider starts crawling from these pages.

    parse(): the parsing method. When it is called, the Response object returned from each URL is passed in as the only parameter. It is responsible for parsing the response, extracting the matched data (into Items), and following further URLs.

    # -*- coding: utf-8 -*-
    # File name : spyders/mokospider.py
    # Author: Jhonny Zhang
    # mail: veinyy@163.com
    # create Time : 2014-11-29
    ###########################################################################

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from moko.items import MokoItem
    import re
    from scrapy.http import Request
    from scrapy.selector import Selector


    class MokoSpider(CrawlSpider):
        name = "moko"
        allowed_domains = ["moko.cc"]
        start_urls = ["http://www.moko.cc/post/aaronsky/list.html"]
        rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')),
                      callback='parse_img', follow=True),)

        def parse_img(self, response):
            urlItem = MokoItem()
            sel = Selector(response)
            for divs in sel.xpath('//div[@class="pic dBd"]'):
                img_url = divs.xpath('.//img/@src2').extract()[0]
                urlItem['url'] = img_url
                yield urlItem

Our project is named moko. allowed_domains restricts the crawler to moko.cc; it stipulates that the crawler only fetches pages under that domain. The crawler starts from http://www.moko.cc/post/aaronsky/list.html. Then we set the crawling rules (Rule); this is what makes CrawlSpider different from a basic spider. For example, we start crawling from page A; page A contains many hyperlinks, and the crawler follows the ones that match the configured rules, then repeats the process on the pages it reaches. The callback is the function invoked for each matched page. The reason I did not use the default name parse is that, according to the official documentation, parse may be called internally by the CrawlSpider framework, causing conflicts.

On the target page http://www.moko.cc/post/aaronsky/list.html there are many links to pictures, and each picture's link follows a pattern. For example, click one and it opens something like http://www.moko.cc/post/1052776.html; the http://www.moko.cc/post/ part is always the same, and what differs between links is the number at the end. So we fill in the rule with a regular expression: rules = (Rule(SgmlLinkExtractor(allow=('/post/\d*\.html')), callback='parse_img', follow=True),) means that, starting from the current page, every page whose URL matches /post/\d*\.html is crawled and handed to parse_img. A quick check of the pattern is sketched below.
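Just to sanity-check the pattern outside of Scrapy, a standalone sketch (the post URL is one of the examples above):

    # -*- coding: utf-8 -*-
    # Standalone check of the link pattern used in the Rule above.
    import re

    pattern = re.compile(r'/post/\d*\.html')

    print bool(pattern.search('http://www.moko.cc/post/1052776.html'))        # True  - a photo post
    print bool(pattern.search('http://www.moko.cc/post/aaronsky/list.html'))  # False - the list page itself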

Next, define the parsing function parse_img. This is the critical part. The parameter passed in is the response object the crawler gets back after opening a URL; the content of a response object is, simply put, one large string, and we have to filter out the part we need. How do we filter it? Haha, there is a powerful Selector class whose xpath() path expressions parse the content for us. Before parsing, you need to study the page in detail; the tool I used here is Firebug. The core markup captured from the page contains the image tags we are after.

What we need is the src2 attribute! It sits on the <img> tag inside the <div class="pic dBd"> block, which is exactly what the spider's XPath targets. First, instantiate urlItem, an object of the MokoItem() defined in items.py, and feed the response into the mighty Selector. I use a loop here and handle one URL per iteration, extracting the url with an XPath path expression; as for how to use XPath in general, look it up yourself. The result is stored in urlItem, and this is where the url field we defined in items.py comes in!
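To make the extraction concrete, here is a standalone sketch of the same XPath logic; the HTML fragment is a simplified, made-up stand-in for the real markup (the real page carries more attributes and surrounding tags):

    # -*- coding: utf-8 -*-
    # Standalone sketch of the XPath extraction used in parse_img.
    # The HTML below is a simplified stand-in for the real page.
    from scrapy.selector import Selector

    html = '''
    <div class="pic dBd">
        <img src="placeholder.gif" src2="http://img.example.com/photo1.jpg" />
    </div>
    <div class="pic dBd">
        <img src="placeholder.gif" src2="http://img.example.com/photo2.jpg" />
    </div>
    '''

    sel = Selector(text=html)
    for div in sel.xpath('//div[@class="pic dBd"]'):
        print div.xpath('.//img/@src2').extract()[0]
    # prints the two src2 URLs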

Then define the pipeline; this part takes care of storing our content.

    from moko.items import MokoItem


    class MokoPipeline(object):

        def __init__(self):
            self.mfile = open('test.html', 'w')

        def process_item(self, item, spider):
            text = '<img src="' + item['url'] + '" alt = "" />'
            self.mfile.writelines(text)

        def close_spider(self, spider):
            self.mfile.close()

     

Create a test.html file to store the results. Note that my process_item writes a bit of HTML markup, so that the images are displayed directly when the html file is opened. At the end, define a method that closes the file; it is called when the crawler finishes. A tiny illustration of what the pipeline ends up writing follows below.
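Purely to illustrate what ends up in test.html, a hypothetical hand-driven run (in a real crawl Scrapy calls these methods itself; the image URL is made up, and the script assumes it is run from the project root so the moko package is importable):

    # -*- coding: utf-8 -*-
    # Hand-driven illustration of the pipeline; Scrapy normally calls
    # process_item()/close_spider() itself during a crawl.
    from moko.items import MokoItem
    from moko.pipelines import MokoPipeline

    pipeline = MokoPipeline()                                 # opens test.html for writing
    item = MokoItem(url='http://img.example.com/demo.jpg')    # made-up URL
    pipeline.process_item(item, spider=None)                  # appends one <img ...> tag
    pipeline.close_spider(None)                               # closes the file
    # test.html now contains:
    # <img src="http://img.example.com/demo.jpg" alt = "" />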

Finally, configure settings.py:

    BOT_NAME = 'moko'

    SPIDER_MODULES = ['moko.spiders']
    NEWSPIDER_MODULE = 'moko.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'moko (+http://www.yourdomain.com)'

    ITEM_PIPELINES = {
        'moko.pipelines.MokoPipeline': 1,
    }
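One step worth spelling out: to actually run the spider, go to the project root (the directory containing scrapy.cfg) and enter scrapy crawl moko, where moko is the value of the spider's name attribute; when the crawl finishes, test.html holds the collected image tags.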

     


     

Finally, a screenshot of the result. Have fun, everyone ^_^

                   
