RSS and Crawlers: A Detailed Explanation of How to Collect Data
Abstract: Before the value of data can be mined, it must first pass through collection, storage, analysis, and computation; obtaining comprehensive and accurate data is the foundation of mining its value. This installment of the CSDN Cloud Computing Club's "Big Data Story" starts with the most common data collection methods: RSS and search engine crawlers.
On December 30, a CSDN Cloud Computing Club event was held at 3W Coffee under the theme "RSS and Crawlers: The Story of Big Data, Starting with How to Collect Data." Before the value of data can be mined, it must first pass through collection, storage, analysis, and computation, and obtaining comprehensive, accurate data is the basis of that work. The data at hand may not bring immediate value to an enterprise or organization, but a far-sighted decision-maker should recognize that important data needs to be collected and preserved as early as possible: data is wealth. This installment of "Big Data Story" therefore starts with the most common data collection methods, RSS and search engine crawlers.
The event site was packed with people
First, Cui Kejun, general manager of the Library Division of Beijing Wanfang Software Co., Ltd., gave a talk titled "Initial Applications of Large-Scale RSS Aggregation and Website Downloading in Scientific Research." Cui Kejun has worked in the library and information industry for 12 years and has rich experience in data collection. He focused on RSS, an important means of information aggregation, and the technology for implementing it.
RSS (Really Simple Syndication) is a feed format specification used to syndicate sites that publish frequently updated content, such as blog posts, news, and audio or video excerpts. An RSS document contains the full or excerpted text of the items the user subscribes to, plus metadata such as publication dates and authorship.
Aggregating hundreds or even thousands of RSS feeds closely related to an industry makes it possible to quickly and comprehensively follow the latest developments in that industry; downloading a website's complete data and mining it makes it possible to trace the ins and outs of how a topic in that industry has developed.
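As a rough illustration of this kind of aggregation, the sketch below pulls several feeds and lists the newest items first. It is a minimal sketch, not part of the talk: it assumes the third-party feedparser package, and the feed URLs are placeholders.

```python
# Minimal sketch: aggregate several RSS feeds and surface the newest items first.
# Assumes the third-party `feedparser` package; feed URLs are placeholders.
import time
import feedparser

feed_urls = [
    "https://example.org/physics/rss.xml",   # hypothetical feed URL
    "https://example.com/hep-news/feed",     # hypothetical feed URL
]

entries = []
for url in feed_urls:
    feed = feedparser.parse(url)             # fetch and parse one feed
    for e in feed.entries:
        # published_parsed is a time.struct_time when the feed supplies a date
        published = e.get("published_parsed") or e.get("updated_parsed")
        if published:
            entries.append((time.mktime(published), e.get("title", ""), e.get("link", "")))

# Newest items across all feeds first: a quick scan of "the latest developments"
for ts, title, link in sorted(entries, reverse=True)[:20]:
    print(time.strftime("%Y-%m-%d", time.localtime(ts)), title, link)
```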
Cui Kejun, general manager of the Library Division of Beijing Wanfang Software Co., Ltd.
Taking the Institute of High Energy Physics as an example, Cui Kejun introduced the role of RSS in scientific research institutes. High-energy physics information monitoring targets peer institutions around the world: laboratories, industry societies, international associations, government agencies in charge of scientific research in various countries, key comprehensive scientific publications, and high-energy physics experimental projects and facilities. The types of information monitored include news, papers, conference reports, analyses and reviews, preprints, case studies, multimedia, books, recruitment notices, and so on.
The high-energy physics literature and information service is built on the open source content management system Drupal, the open source search technology Apache Solr, the PubSubHubbub protocol (developed by Google engineers) for subscribing to news in real time, and Amazon's OpenSearch. Together these form a high-energy physics information monitoring system that, unlike traditional RSS subscription and push, achieves near-real-time capture and active push of news matching any keyword, any category, or compound conditions.
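PubSubHubbub (later standardized as WebSub) is what turns polling into push: the subscriber registers a callback URL with a hub, and the hub delivers new entries as soon as the publisher pings it. Below is a minimal subscription sketch, assuming the Python requests package; the hub, topic, and callback URLs are placeholders, not details from the talk.

```python
# Minimal sketch: subscribe to a feed through a PubSubHubbub (WebSub) hub.
# Assumes the `requests` package; all URLs below are placeholders.
import requests

hub_url = "https://pubsubhubbub.appspot.com/"   # a public hub, as an example
subscription = {
    "hub.mode": "subscribe",
    "hub.topic": "https://example.org/physics/rss.xml",              # feed to monitor
    "hub.callback": "https://monitor.example.net/websub-callback",   # our endpoint
    "hub.verify": "async",
}

resp = requests.post(hub_url, data=subscription)
print(resp.status_code)   # 202 Accepted means the hub will verify the callback

# The hub later POSTs new entries to hub.callback as soon as the publisher pings it,
# which is what gives near-real-time capture instead of periodic RSS polling.
```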
Next, Cui Kejun shared his experience in using technologies such as Drupal, Apache Solr, PubSubHubbub and OpenSearch.
Next, Ye Shunping, architect in the Search Department of Yisou Technology and head of its crawler team, gave a talk titled "Web Search Crawler Timeliness System," covering the main goals and architecture of the timeliness system and the design of its sub-modules.
Ye Shunping, architect in the Search Department of Yisou Technology and head of its crawler team
The goals of a web crawler are high coverage, a low dead-link rate, and good timeliness. The goal of the crawler timeliness system is similar: mainly to achieve rapid and comprehensive inclusion of new web pages. The figure below shows the overall architecture of the timeliness system:
At the top is the RSS/sitemap subsystem; below it is the Webmain Scheduler, the scheduling system for web page crawling, followed by the timeliness module, the Vertical Scheduler. On the far left is the DNS service: crawling usually involves dozens or even hundreds of machines in the crawling cluster, and if each one resolved domain names on its own the pressure on DNS servers would be quite high, so a DNS service module usually provides resolution for the whole cluster. After pages are fetched, downstream data processing is performed.
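As a rough illustration of why a shared DNS service helps, the sketch below resolves each host once and caches the answer so fetchers reuse it instead of each issuing their own queries. The fixed TTL and in-process dictionary are simplifying assumptions, not the production design described in the talk.

```python
# Minimal sketch: cache DNS lookups in one place so crawl workers reuse answers.
# The fixed TTL is a simplification; a real service would honor per-record DNS TTLs.
import socket
import time

_cache = {}     # host -> (ip, expires_at)
_TTL = 300      # assume a fixed 5-minute cache lifetime

def resolve(host: str) -> str:
    now = time.time()
    hit = _cache.get(host)
    if hit and hit[1] > now:
        return hit[0]                        # served from cache, no DNS query
    ip = socket.gethostbyname(host)          # one real lookup per host per TTL
    _cache[host] = (ip, now + _TTL)
    return ip

print(resolve("example.com"))
print(resolve("example.com"))   # second call hits the cache
```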
The modules related to timeliness include the following:
RSS/sitemap subsystem: the timeliness system uses RSS and sitemaps by mining seeds, crawling them regularly, and parsing the publication time of each link, so that newer web pages are crawled and indexed first (see the sitemap sketch after this list).
Pan-crawling system: a well-designed general crawling system helps improve coverage of time-sensitive web pages, but its scheduling cycle needs to be kept as short as possible.
Seed scheduling system: at its core is a library of time-sensitive seeds and their associated information. The scheduling system continuously scans this database and dispatches seeds to the crawling cluster; after the cluster crawls them, links are extracted, processed, and sent out by category, so that each vertical channel obtains timely data.
Seed mining: relies on page parsing and other mining methods; seeds can be constructed from sitemaps and navigation bars, or based on page structure characteristics and page change patterns.
Seed update mechanism: record each seed's crawl history and outlink information, and periodically recalculate the seed's update cycle based on how its outlinks change (a sketch of one possible recalculation follows this list).
Crawling system and JavaScript parsing: crawl with a real browser and build the crawling cluster around browser-based fetching, or adopt an open source engine such as QtWebKit (see the rendering sketch below).
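To illustrate the RSS/sitemap subsystem's "parse the publication time, crawl newer pages first" step, here is a minimal sketch that reads a sitemap's lastmod fields and orders URLs newest-first. The sitemap URL and the requests dependency are assumptions for illustration only.

```python
# Minimal sketch: read a sitemap and return URLs ordered by lastmod, newest first,
# so a scheduler could crawl recently updated pages before older ones.
import xml.etree.ElementTree as ET
import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}   # standard sitemap namespace

def fresh_urls(sitemap_url: str):
    xml_text = requests.get(sitemap_url, timeout=10).text
    root = ET.fromstring(xml_text)
    pairs = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        pairs.append((lastmod, loc))
    # ISO-8601 lastmod strings sort chronologically; reverse puts the newest first
    return [loc for lastmod, loc in sorted(pairs, reverse=True)]

for u in fresh_urls("https://example.org/sitemap.xml")[:10]:   # placeholder URL
    print(u)
```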
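For the seed update mechanism, the talk names the idea of recalculating a seed's update cycle from its observed outlink changes but gives no formula; the sketch below uses simple exponential smoothing purely as an illustrative stand-in, not the speaker's method.

```python
# Minimal sketch: recompute a seed's recrawl period from observed update timestamps.
# Exponential smoothing is an assumed, illustrative choice.
def next_period(update_timestamps, current_period, alpha=0.3,
                min_period=300, max_period=86400):
    """update_timestamps: times (in seconds) when new outlinks appeared, ascending."""
    if len(update_timestamps) < 2:
        return current_period
    gaps = [b - a for a, b in zip(update_timestamps, update_timestamps[1:])]
    observed = sum(gaps) / len(gaps)                    # mean interval between updates
    smoothed = alpha * observed + (1 - alpha) * current_period
    return max(min_period, min(max_period, smoothed))   # clamp to sane bounds

# e.g. a seed that produced new links roughly every 10 minutes shortens its cycle
print(next_period([0, 600, 1180, 1800], current_period=3600))
```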
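For browser-based crawling of JavaScript-heavy pages, the talk mentions QtWebKit; as an illustration of the same idea (render the page, then extract the final HTML), the sketch below substitutes the Playwright package, which is an assumption on my part rather than the speaker's stack.

```python
# Minimal sketch: fetch a JavaScript-heavy page with a headless browser and return
# the DOM after scripts have run. Assumes the `playwright` package and its browsers
# are installed; the URL is a placeholder.
from playwright.sync_api import sync_playwright

def render(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")   # wait for scripts and requests to settle
        html = page.content()                      # DOM after JavaScript execution
        browser.close()
        return html

print(len(render("https://example.com")))
```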