
Scrapy captures college news report examples

PHP中文网 · Original · 2017-06-21 10:47:11

Goal: crawl every news item from the official website of the Sichuan University School of Public Administration.

Experimental process

1. Determine the crawling target.
2. Formulate the crawling rules.
3. Write and debug the crawling rules.
4. Obtain the crawled data.

1. Determine the crawling target

The target this time is all the news and information published by the Sichuan University School of Public Administration, so we first need to understand how its official website is laid out.


Looking at the homepage, we find that we cannot capture all the news directly from it. We need to click "more" to enter the general news column.

This brings us to the news column itself, but it still does not meet our needs: the listing page only exposes the time, title and URL of each news item, not its content. To capture the content we have to go to each news detail page.


2. Formulate the crawling rules

From the analysis in part 1, capturing the full information for one news item means following its link from the news column to the news detail page. Let's click on one item to try it.

On the news detail page we can grab everything we need directly: title, time, content and URL.

Okay, now we have a clear idea of how to grab one news item. But how do we crawl all of them?

This is obviously not difficult for us.


We can see the page jump button at the bottom of the news column. Then we can grab all the news through the "Next Page" button.


Putting these ideas together gives an obvious crawling rule:

Collect all the news links under the news column, then follow each link to its news detail page to grab the content.
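The rule above amounts to a two-level traversal: each listing page leads to detail pages and to the next listing page. A minimal sketch of that traversal follows, using a made-up in-memory site instead of real HTTP requests. The `FAKE_SITE` dictionary, its keys and the `crawl` function are illustrative assumptions, not part of the spider written later in this article.

```python
from collections import deque

# Hypothetical in-memory "site": every listing page maps to the detail links
# it shows and to the next listing page (None on the last page).
FAKE_SITE = {
    "list?page=1": {"details": ["news/1", "news/2"], "next": "list?page=2"},
    "list?page=2": {"details": ["news/3"], "next": None},
}


def crawl(start):
    """Visit listing pages in order and collect every detail link found."""
    collected = []
    seen = {start}
    queue = deque([start])
    while queue:
        listing = FAKE_SITE[queue.popleft()]
        # Level 1: gather the detail links shown on this listing page.
        collected.extend(listing["details"])
        # Level 2: schedule the next listing page, if there is one.
        next_page = listing["next"]
        if next_page and next_page not in seen:
            seen.add(next_page)
            queue.append(next_page)
    return collected


print(crawl("list?page=1"))
```

In the real spider, Scrapy's scheduler plays the role of the queue and each detail link becomes a request with its own callback.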


3. Write and debug the crawling rules

To keep each debugging step as small as possible, I write and debug the crawler together, one piece at a time.

The crawler implements the following functional points:

1. Extract all the news links under the news column on one page.
2. Follow each extracted link to its news detail page and crawl the required data (mainly the news content).
3. Crawl all the news through a loop.

The corresponding knowledge points are:

1. Extracting the basic data from a single page.
2. Issuing a second crawl based on the data already crawled.
3. Crawling all the pages of the site through a loop.

Without further ado, let’s get started now.

3.1 Extract all the news links under the news column on one page

Analyzing the page source of the news column shows that each news entry sits in a div with the class "newsinfo_box cf", and that its link sits inside a nested div with the class "news_c fr". So we only need to point the crawler's selector at the "newsinfo_box cf" divs and extract the link from each one in a for loop.

Writing code

import scrapy


class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        # Each news entry on the listing page is a div with class "newsinfo_box cf"
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            # Build the absolute URL of the news detail page
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
Test, passed!
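To check the selector logic without running the spider, the same path expressions can be tried on a small fragment with the standard library. This is only a sketch under stated assumptions: the fragment below is invented (only the class names come from the spider code, and the `c=article&id=...` links are hypothetical), and `xml.etree.ElementTree` supports just a subset of XPath and needs well-formed markup, unlike Scrapy's selectors. `urllib.parse.urljoin` stands in for `response.urljoin`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Invented, simplified listing markup; only the class names mirror the spider.
LISTING_HTML = """
<div id="list">
  <div class="newsinfo_box cf">
    <div class="news_c fr"><h3><a href="index.php?c=article&amp;id=1">First</a></h3></div>
  </div>
  <div class="newsinfo_box cf">
    <div class="news_c fr"><h3><a href="index.php?c=article&amp;id=2">Second</a></h3></div>
  </div>
</div>
"""

BASE_URL = "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"


def extract_links(markup, base_url):
    """Return the absolute detail-page URL of every news box in the markup."""
    root = ET.fromstring(markup)
    links = []
    for box in root.findall(".//div[@class='newsinfo_box cf']"):
        anchor = box.find("div[@class='news_c fr']/h3/a")
        # Skip boxes without a link instead of joining None onto the base URL.
        if anchor is not None and anchor.get("href"):
            links.append(urljoin(base_url, anchor.get("href")))
    return links


links = extract_links(LISTING_HTML, BASE_URL)
print(links)
```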

3.2 Enter the news details through the crawled news link to crawl the required data (mainly news content)

Now that I have a set of URLs, I need to visit each one and capture the title, time and content. The implementation is simple: wherever the original code extracts a URL, it should request that URL and capture the corresponding data. So I only need to write a method that crawls the news detail page and pass it as the callback to scrapy.Request.

Writing code

# Crawling method for the news detail page
def parse_dir_contents(self, response):
    item = GgglxyItem()
    item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
    # Store the page URL rather than the Response object itself
    item['href'] = response.url
    item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
    data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
    # string(.) concatenates all the text inside the content div
    item['content'] = data[0].xpath('string(.)').extract()[0]
    yield item
Integrated into the original code, this gives:

import scrapy
from ggglxy.items import GgglxyItem


class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            # Call the news crawling method
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    # Crawling method for the news detail page
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        # Store the page URL rather than the Response object itself
        item['href'] = response.url
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item

Test, passed!
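The detail-page extraction can be tried in isolation with the standard library as well. The sketch below rests on assumptions: the markup, date and title are invented (only the class names come from the spider), a plain dict stands in for `GgglxyItem`, and `"".join(node.itertext())` plays the role of the XPath `string(.)` call. Unlike `data[0]` in the spider, which raises `IndexError` when the content div is missing, this version returns `None` for the missing field.

```python
import xml.etree.ElementTree as ET

# Invented detail-page markup; only the class names mirror the spider.
DETAIL_HTML = """
<div>
  <div class="detail_zy_title"><h1>Sample title</h1><p>2017-05-15</p></div>
  <div class="detail_zy_c pb30 mb30"><p>First paragraph.</p><p>Second <b>bold</b> part.</p></div>
</div>
"""


def parse_detail(markup, url):
    """Build a dict with the same four fields the spider stores in its item."""
    root = ET.fromstring(markup)
    title_box = root.find(".//div[@class='detail_zy_title']")
    body = root.find(".//div[@class='detail_zy_c pb30 mb30']")
    return {
        "date": title_box.findtext("p") if title_box is not None else None,
        "href": url,
        "title": title_box.findtext("h1") if title_box is not None else None,
        # Concatenate every text node under the content div, like string(.)
        "content": "".join(body.itertext()) if body is not None else None,
    }


item = parse_detail(DETAIL_HTML, "http://example.com/news/1")
print(item)
```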

3.3 Crawl all the news through a loop

Next we add a loop over the listing pages:
NEXT_PAGE_NUM = 1

NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
if NEXT_PAGE_NUM < 11:
    next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
    yield scrapy.Request(next_url, callback=self.parse)

Added to the original code:

import scrapy
from ggglxy.items import GgglxyItem

NEXT_PAGE_NUM = 1


class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            URL = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            yield scrapy.Request(URL, callback=self.parse_dir_contents)
        # Request the next listing page until page 10 has been requested
        global NEXT_PAGE_NUM
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        # Store the page URL rather than the Response object itself
        item['href'] = response.url
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
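The module-level `NEXT_PAGE_NUM` counter is shared by every call to `parse`. One possible alternative, sketched here with the standard library only, derives the next page number from the URL of the page that was just parsed. The `next_page_url` helper and its `last_page` parameter are hypothetical names introduced for this sketch; the limit of 10 pages matches the `NEXT_PAGE_NUM < 11` check in the code above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, last_page=10):
    """Return the URL of the following listing page, or None after the last one."""
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    # A URL without a page parameter is treated as page 1.
    page = int(query.get("page", ["1"])[0])
    if page >= last_page:
        return None
    query["page"] = [str(page + 1)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


start = "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"
print(next_page_url(start))
```

Inside `parse`, the result could be passed to `scrapy.Request(next_url, callback=self.parse)` whenever it is not `None`, which removes the need for the `global` statement.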

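Test:

The crawl captured 191 items, but the official website shows 193 news items, so two are missing.
Why? We notice that the log contains two errors.
Tracking down the problem: the college's news column turns out to contain two hidden secondary columns, and the URLs behind them are formed differently from ordinary news URLs. No wonder they could not be captured!
So we need a dedicated rule for the URLs of these two secondary columns. It is enough to add a check for whether a URL is a secondary column: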

if URL.find('type') != -1:
    yield scrapy.Request(URL, callback=self.parse)

Assembled into the original code:

import scrapy
from ggglxy.items import GgglxyItem

NEXT_PAGE_NUM = 1


class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            URL = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            if URL.find('type') != -1:
                # Secondary column: parse it as another listing page
                yield scrapy.Request(URL, callback=self.parse)
            else:
                # Ordinary news item: go to the news detail page
                yield scrapy.Request(URL, callback=self.parse_dir_contents)
        global NEXT_PAGE_NUM
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        # Store the page URL rather than the Response object itself
        item['href'] = response.url
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
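The secondary-column check in the spider is a plain substring test, so it also matches any URL that merely contains the letters "type". A stricter variant can be sketched with the standard library, under the assumption (not confirmed by this article) that secondary columns are marked by a `type` query parameter. The helper name `is_secondary_column` and all sample URLs below are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def is_secondary_column(url):
    """Return True only when the URL carries a `type` query parameter."""
    return "type" in parse_qs(urlsplit(url).query)


# Hypothetical URLs, used only to compare the two checks.
SAMPLES = [
    "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&type=2",
    "http://ggglxy.scu.edu.cn/index.php?c=article&id=42",
    "http://example.com/prototype?id=1",
]

for url in SAMPLES:
    substring_match = url.find("type") != -1
    print(url, substring_match, is_secondary_column(url))
```

For the third sample the substring test matches while the query-parameter test does not, which is the difference the stricter check is meant to show.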

Test:

The number of captured items rose from the previous 193 to 238, and the log no longer shows any errors, so our crawling rules are OK!

4. Obtain the crawled data

scrapy crawl news_info_2 -o 0016.json
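Once the crawl has finished, the exported file can be sanity-checked against the item counts quoted earlier. A minimal sketch follows, assuming the export is a JSON array of items (which is what `-o` with a `.json` file produces) and that each item carries the `href` and `content` fields set in `parse_dir_contents`. The `summarize` helper and the sample data are illustrative only.

```python
import json
import os
import tempfile
from collections import Counter


def summarize(path):
    """Count exported items and flag duplicate URLs and empty contents."""
    with open(path, encoding="utf-8") as fh:
        items = json.load(fh)
    href_counts = Counter(item.get("href") for item in items)
    return {
        "total": len(items),
        "duplicates": sorted(h for h, n in href_counts.items() if n > 1),
        "missing_content": [i.get("href") for i in items if not i.get("content")],
    }


# Demo on a tiny hand-written sample instead of real crawl output.
sample = [
    {"href": "http://example.com/a", "title": "A", "date": "2017-05-01", "content": "text"},
    {"href": "http://example.com/b", "title": "B", "date": "2017-05-02", "content": ""},
    {"href": "http://example.com/a", "title": "A", "date": "2017-05-01", "content": "text"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as fh:
        json.dump(sample, fh, ensure_ascii=False)
    summary = summarize(sample_path)

print(summary)
```

Running `summarize("0016.json")` on the real export would give the total to compare with the number of news items shown on the website.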
