The whole process of making a crawler with NodeJS
Today we'll work through alsotang's crawler tutorial and build a simple crawler for CNode.
Create project craelr-demo
We first create an Express project and then delete everything in the app.js file, since for now we don't need to display anything on the web. Alternatively, we can simply run npm install express in an empty folder to pull in the Express functionality we need.
Target website analysis
As shown in the picture, this is part of the div markup on the CNode homepage. We use this series of ids and classes to locate the information we need.
Use superagent to obtain source data
superagent is an HTTP library whose API feels much like jQuery's ajax. We use it to issue a GET request and print the result in the callback function.
Its res result is an object containing information about the target url; the page content itself lives mainly in res.text (a string).
Use cheerio to parse
cheerio works as a server-side jQuery. We first use its .load() to load the HTML, then filter elements with CSS selectors.
The result is an object; calling .each(function(index, element)) traverses each match, where element is the underlying HTML DOM element.
Outputting console.log($element.attr('title')); gives a title such as 广州 2014年12月06日 NodeParty 之 UC 场, while console.log($element.attr('href')); gives a relative url such as /topic/545c395becbcb78265856eb2. We then use the url.resolve() function of NodeJS to build the complete url.
Use eventproxy to concurrently crawl the content of each topic
The tutorial shows both the deeply nested (serial) approach and the counter approach. eventproxy takes an event-based (parallel) approach instead: when all the crawls have finished, eventproxy receives the completion events and automatically invokes your handler for you.
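The counter approach that eventproxy automates can be sketched in plain JavaScript; the function names below are illustrative, not from the tutorial. The idea is to collect one result per parallel task and fire a callback once all of them have arrived:

```javascript
// Hand-rolled counter pattern: `done` fires only after `total` results arrive.
function makeCollector(total, done) {
  var results = [];
  var finished = 0;
  return function collect(result) {
    results.push(result);
    finished += 1;
    if (finished === total) {
      done(results);
    }
  };
}

// Usage: pretend three "fetches" complete in some order.
var collect = makeCollector(3, function (all) {
  console.log('all done:', all);
});
collect('topic A');
collect('topic B');
collect('topic C'); // prints: all done: [ 'topic A', 'topic B', 'topic C' ]
```

eventproxy replaces this bookkeeping with events: you register a handler for N occurrences of an event and emit that event as each crawl finishes.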
The results are as follows
Extended Exercise (Challenge)
Get message username and points
In the source of an article page, find the class name used for a commenter: it is reply_author. Looking at the first matched element with console.log($('.reply_author').get(0)), we can see that everything we need is there.
First, let's crawl a single article and grab everything we need in one go.
We can capture the points information via https://cnodejs.org/user/username. On the user information page, $('.big').text().trim() is the points value. Use cheerio's .get(0) function to get the first matched element.
This only captures a single article; the code still needs to be adapted to handle the remaining 40 topics.