Goal: crawl all the news items from the official website of the Sichuan University School of Public Administration.
Experimental process
1. Determine the crawling target.
2. Formulate the crawling rules.
3. Write and debug the crawling rules.
4. Obtain the crawled data.
1. Determine the crawling target
The target this time is all the news and information published by the Sichuan University School of Public Administration, so we first need to understand the layout of the school's official website.


2. Formulate the crawling rules
From the analysis in the first part, we might assume that to capture the details of a news item we have to click through from the news list page to the news detail page. Let's click on a news item and try it.

We found that the data we need can be grabbed directly on the news detail page: title, time, content, and URL.
Okay, now we have a clear idea of how to grab a single piece of news. But how do we crawl all the news content?
This is obviously not difficult.
So, to sort out the ideas, we arrive at an obvious crawling rule:
3. Write and debug the crawling rules
To keep the granularity of debugging the crawler as small as possible, I combine the writing and debugging steps.
In the crawler, I will implement the following functional points:
1. Crawl the basic data on one page.
2. Follow each crawled news link into the news detail page and crawl the required data (mainly the news content).
3. Crawl all the news through a loop.
The corresponding knowledge points are:
2. Making a second request based on data that has already been crawled.
3. Crawling all the data on the website through a loop.
Without further ado, let's get started.
3.1 Crawl all the news links under the news column on one page

From the source code of the news column, we found that each news entry sits in its own block, and the link to the detail page is inside that block.

So we only need to point the crawler's selector at that block (the div with class "newsinfo_box cf") and then grab the link from each one in a for loop.
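Write the code:

    import scrapy


    class News2Spider(scrapy.Spider):
        name = "news_info_2"
        start_urls = [
            "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
        ]

        def parse(self, response):
            # Each news entry on the list page is a div with class "newsinfo_box cf".
            for href in response.xpath("//div[@class='newsinfo_box cf']"):
                # Turn the entry's relative link into an absolute URL.
                url = response.urljoin(
                    href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first()
                )
                # Printing is added here only so the result of this step can be checked.
                print(url)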
Test, passed!
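
To see what the selector in 3.1 does without running a spider, the same idea can be tried on a small snippet with the standard library. This is only a sketch under assumptions: the HTML below is a hypothetical, simplified stand-in for the real list page, the link format (c=article&id=...) is made up, and xml.etree.ElementTree plus urllib.parse.urljoin are swapped in for Scrapy's selector and response.urljoin.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified stand-in for one list page (the real markup is larger).
SAMPLE_LIST_HTML = """
<div>
  <div class="newsinfo_box cf">
    <div class="news_c fr"><h3><a href="index.php?c=article&amp;id=1">First news</a></h3></div>
  </div>
  <div class="newsinfo_box cf">
    <div class="news_c fr"><h3><a href="index.php?c=article&amp;id=2">Second news</a></h3></div>
  </div>
</div>
"""

LIST_PAGE_URL = "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"


def extract_news_links(html, base_url):
    """Return the absolute detail-page URL of every news box in the snippet."""
    root = ET.fromstring(html)
    links = []
    # Same idea as the spider: find each news box, then the link inside it.
    for box in root.findall(".//div[@class='newsinfo_box cf']"):
        anchor = box.find("div[@class='news_c fr']/h3/a")
        if anchor is not None and anchor.get("href"):
            # urljoin plays the role of response.urljoin in the spider.
            links.append(urljoin(base_url, anchor.get("href")))
    return links


news_links = extract_news_links(SAMPLE_LIST_HTML, LIST_PAGE_URL)
for link in news_links:
    print(link)
```

The relative path in each box is resolved against the list page URL, which is why the spider calls response.urljoin before requesting a link.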

3.2 Follow each crawled news link into the news detail page and crawl the required data (mainly the news content)
Now that I have a set of URLs, I need to open each one and capture the title, time, and content. The implementation is quite simple: wherever the original code obtains a URL, it only needs to request that URL and extract the corresponding data. So I only need to write a parse method for the news detail page and invoke it through scrapy.Request.
Write the code:

    # Parse method for the news detail page
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        # Store the page URL rather than the whole response object.
        item['href'] = response.url
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        # string(.) joins the text of all nested elements of the content block.
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
After integrating it into the original code, we have:

    import scrapy
    from ggglxy.items import GgglxyItem


    class News2Spider(scrapy.Spider):
        name = "news_info_2"
        start_urls = [
            "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
        ]

        def parse(self, response):
            for href in response.xpath("//div[@class='newsinfo_box cf']"):
                url = response.urljoin(
                    href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first()
                )
                # Call the news detail parse method
                yield scrapy.Request(url, callback=self.parse_dir_contents)

        # Parse method for the news detail page
        def parse_dir_contents(self, response):
            item = GgglxyItem()
            item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
            # Store the page URL rather than the whole response object.
            item['href'] = response.url
            item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
            data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
            item['content'] = data[0].xpath('string(.)').extract()[0]
            yield item
Test, passed!
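
The detail-page extraction can also be illustrated outside Scrapy. The sketch below is an approximation under assumptions: the HTML is a hypothetical, simplified detail page, parse_detail is a made-up helper name, and ElementTree's itertext() stands in for the XPath expression string(.), which likewise collects the text of all nested elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a news detail page.
SAMPLE_DETAIL_HTML = """
<div>
  <div class="detail_zy_title"><h1>Sample title</h1><p>2017-05-15</p></div>
  <div class="detail_zy_c pb30 mb30"><p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p></div>
</div>
"""


def parse_detail(html, url):
    """Collect the same four fields the spider stores in its item."""
    root = ET.fromstring(html)
    title_box = root.find(".//div[@class='detail_zy_title']")
    body = root.find(".//div[@class='detail_zy_c pb30 mb30']")
    return {
        "title": title_box.findtext("h1"),
        "date": title_box.findtext("p"),
        "href": url,
        # itertext() walks every nested text node, similar to XPath string(.).
        "content": "".join(body.itertext()),
    }


# The URL is a placeholder, not a real news link.
item = parse_detail(SAMPLE_DETAIL_HTML, "http://example.com/news/1")
print(item["title"], item["date"])
print(item["content"])
```

Selecting only text() on the content block would miss text inside nested tags such as the bold word here, which is presumably why the spider uses string(.) for the content field.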

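3.3 Crawl all the news through a loop
Now we add a loop over the list pages:

    # At module level: the number of the list page currently being processed
    NEXT_PAGE_NUM = 1

    # Inside parse(), after the links of the current page have been handled
    NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
    if NEXT_PAGE_NUM < 11:
        next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
        yield scrapy.Request(next_url, callback=self.parse)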
Added to the original code:

    import scrapy
    from ggglxy.items import GgglxyItem

    NEXT_PAGE_NUM = 1


    class News2Spider(scrapy.Spider):
        name = "news_info_2"
        start_urls = [
            "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
        ]

        def parse(self, response):
            for href in response.xpath("//div[@class='newsinfo_box cf']"):
                URL = response.urljoin(
                    href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first()
                )
                yield scrapy.Request(URL, callback=self.parse_dir_contents)

            # Move on to the next list page until page 10 has been requested.
            global NEXT_PAGE_NUM
            NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
            if NEXT_PAGE_NUM < 11:
                next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
                yield scrapy.Request(next_url, callback=self.parse)

        def parse_dir_contents(self, response):
            item = GgglxyItem()
            item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
            # Store the page URL rather than the whole response object.
            item['href'] = response.url
            item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
            data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
            item['content'] = data[0].xpath('string(.)').extract()[0]
            yield item
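
To check which list pages the counter logic above ends up requesting, the arithmetic can be replayed outside Scrapy. The sketch below is only an illustration: the helper name list_page_urls and its parameters are made up here, and it assumes parse() runs exactly once per list page.

```python
LIST_URL_TEMPLATE = "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s"


def list_page_urls(start_page=1, stop_before=11):
    """Replay the spider's counter logic and collect the list pages it visits."""
    # The first page comes from start_urls, not from the counter.
    urls = [LIST_URL_TEMPLATE % start_page]
    next_page_num = start_page
    while True:
        # Same steps as in parse(): increment, then compare with the limit.
        next_page_num = next_page_num + 1
        if not next_page_num < stop_before:
            break
        urls.append(LIST_URL_TEMPLATE % next_page_num)
    return urls


pages = list_page_urls()
for page_url in pages:
    print(page_url)
```

Under that assumption the condition NEXT_PAGE_NUM < 11 covers pages 1 through 10, so the limit has to be raised if the news column grows beyond ten pages.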
Test:

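The number of items captured is 191, but the official website shows 193 news items, so two are missing.
Why? We notice two errors in the log:
Locating the problem: it turns out that the school's news column also has two hidden secondary columns:
For example:

The corresponding URL is

The URLs look different, so no wonder they could not be captured!
So we need a special rule for the URLs of these two secondary columns. We only need to add a check for whether a link points to a secondary column: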
    if URL.find('type') != -1:
        yield scrapy.Request(URL, callback=self.parse)
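
The substring test above matches the text "type" anywhere in the URL. A slightly stricter variant is sketched below, assuming that secondary-column links carry a query parameter named type; the article does not show the actual URLs, so the example URLs and the helper name is_secondary_column are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def is_secondary_column(url, param="type"):
    """Return True when the URL's query string contains the given parameter."""
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    return param in query


# Hypothetical URLs, for illustration only.
column_url = "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&type=1"
article_url = "http://ggglxy.scu.edu.cn/index.php?c=article&id=1"
tricky_url = "http://ggglxy.scu.edu.cn/index.php?c=article&id=1&title=prototype"

for candidate in (column_url, article_url, tricky_url):
    print(candidate, is_secondary_column(candidate))
```

The third URL shows the difference: a plain find('type') would treat it as a secondary column because of the word "prototype", while the query-string check would not.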
Assembled into the original code:

    import scrapy
    from ggglxy.items import GgglxyItem

    NEXT_PAGE_NUM = 1


    class News2Spider(scrapy.Spider):
        name = "news_info_2"
        start_urls = [
            "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
        ]

        def parse(self, response):
            for href in response.xpath("//div[@class='newsinfo_box cf']"):
                URL = response.urljoin(
                    href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first()
                )
                if URL.find('type') != -1:
                    # Secondary column: treat it as another list page.
                    yield scrapy.Request(URL, callback=self.parse)
                else:
                    # Ordinary news link: go to the detail page.
                    yield scrapy.Request(URL, callback=self.parse_dir_contents)

            # Move on to the next list page until page 10 has been requested.
            global NEXT_PAGE_NUM
            NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
            if NEXT_PAGE_NUM < 11:
                next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
                yield scrapy.Request(next_url, callback=self.parse)

        def parse_dir_contents(self, response):
            item = GgglxyItem()
            item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
            # Store the page URL rather than the whole response object.
            item['href'] = response.url
            item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
            data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
            item['content'] = data[0].xpath('string(.)').extract()[0]
            yield item
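Test:

We found that the number of items captured increased from the earlier 193 to 238, and the log no longer contains any errors, which shows that our crawling rules are OK!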
4. Obtain the crawled data

    scrapy crawl news_info_2 -o 0016.json
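
Once the export file exists, a quick check of its contents can confirm the item count seen in the log. The sketch below assumes the file is a JSON array of objects with the fields title, date, href, and content. The helper name summarize_items is made up, and the demo runs on a tiny hand-made file rather than on real crawl output.

```python
import json
import os
import tempfile


def summarize_items(path):
    """Return the item count and the href of every item with empty content."""
    with open(path, encoding="utf-8") as handle:
        items = json.load(handle)
    missing = [entry.get("href") for entry in items if not entry.get("content")]
    return len(items), missing


# Hand-made sample data with placeholder URLs, only to demonstrate the helper.
sample = [
    {"title": "A", "date": "2017-05-15", "href": "http://example.com/1", "content": "text"},
    {"title": "B", "date": "2017-05-16", "href": "http://example.com/2", "content": ""},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle)
    total, missing = summarize_items(sample_path)

print(total, missing)
```

For the real export, the call would be summarize_items("0016.json"), run from the directory where the crawl command was executed.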