
How to implement http crawler in node

亚连
2018-06-12 15:04:12

This article walks through sample code for a simple HTTP crawler built with Node.js, shared here for reference.

At every moment, whether you are asleep or awake, massive amounts of data travel across the Internet, from client to server and from server to server. HTTP requests are how that data is fetched and submitted; in Node.js the http module exposes them through http.get and http.request. Below we write a small crawler that scrapes the chapter list from the Node.js tutorial page on Runoob (菜鸟教程).
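As a rough illustration of the difference between fetching and submitting data, here is a minimal sketch using only the built-in http module. The helper names getText and postText are hypothetical and not part of the original article; http.get is a convenience form of http.request that sets the method to GET and ends the request automatically, while a POST has to go through http.request with an explicit body.

```javascript
const http = require('http');

// Hypothetical helper: fetch a URL with GET and resolve with the body text.
// http.get() sets the method to GET and calls req.end() for us.
function getText(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      res.setEncoding('utf8');
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Hypothetical helper: submit text with POST and resolve with the reply.
// http.request() needs the method, the headers, and an explicit req.end().
function postText(url, text) {
  return new Promise((resolve, reject) => {
    const req = http.request(url, {
      method: 'POST',
      headers: {
        'Content-Type': 'text/plain; charset=utf-8',
        'Content-Length': Buffer.byteLength(text),
      },
    }, (res) => {
      res.setEncoding('utf8');
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    });
    req.on('error', reject);
    req.end(text); // write the body and finish the request
  });
}

// Example call (the crawler in this article only needs the GET side):
// getText('http://www.runoob.com/nodejs/nodejs-tutorial.html').then(console.log);
```

The crawler below uses the callback style of http.get directly; the Promise wrappers here are only one possible way to package the same calls.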

Crawl all the data on the home page of the Node.js tutorial

Create node-http.js with the following code. The comments explain each step.

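var http = require('http'); // load the http module
var url = 'http://www.runoob.com/nodejs/nodejs-tutorial.html'; // address of the Node.js tutorial page to crawl

http.get(url, function (res) {
  var html = '';

  // Decode chunks as UTF-8 text so that multi-byte characters split across chunks stay intact
  res.setEncoding('utf8');

  // The data event fires repeatedly; each chunk is appended to html until the response is complete
  res.on('data', function (data) {
    html += data;
  });

  // The end event fires once all data has arrived; print the HTML of the tutorial page
  res.on('end', function () {
    console.log(html);
  });
}).on('error', function (err) {
  console.log('Failed to fetch the Node.js tutorial page: ' + err.message);
});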

Running the script in the terminal shows that the full HTML of the page has been fetched:

G:\node\node-http> node node-http.js
<!Doctype html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta property="qc:admins" content="465267610762567726375" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Node.js 教程 | 菜鸟教程</title>
<link rel='dns-prefetch' href='//s.w.org' />
<link rel="canonical" href="http://www.runoob.com/nodejs/nodejs-tutorial.html" />
<meta name="keywords" content="Node.js 教程,node,Node.js,nodejs">
<meta name="description" content="Node.js 教程  简单的说 Node.js 就是运行在服务端的 JavaScript。 Node.js 是一个基于Chrome JavaScript 运行时建立的一个平台
。 Node.js是一个事件驱动I/O服务端JavaScript环境,基于Google的V8引擎,V8引擎执行Javascript的速度非常快,性能非常好。  谁适合阅读本教程? 如果你是一个前端程序员,你不懂得像PHP、Python或Ruby等动态编程语言,..">
<link rel="shortcut icon" href="//static.runoob.com/images/favicon.ico" mce_href="//static.runoob.com/images/favicon.ico" type="image/x-icon">
<link rel="stylesheet" href="/wp-content/themes/runoob/style.css?v=1.141" type="text/css" media="all" />
<link rel="stylesheet" href="//cdn.bootcss.com/font-awesome/4.7.0/css/font-awesome.min.css" media="all" />
<!--[if gte IE 9]><!-->
......
(Only part of the output is shown here; the full page is far too long to print.)
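One detail worth knowing when a response is accumulated with html += data: the body arrives in chunks, and a multi-byte UTF-8 character (common on a Chinese page like this one) can be split across two chunks. The sketch below does not touch the network; it uses a made-up sample string and a hypothetical split point to show how decoding each Buffer chunk separately differs from concatenating the raw bytes and decoding once. The function names are illustrative only.

```javascript
// Hypothetical sample text standing in for a response body.
const sample = '菜鸟教程';
const bytes = Buffer.from(sample, 'utf8'); // 12 bytes, 3 per character

// Simulate the network splitting the body in the middle of the second character.
const chunks = [bytes.subarray(0, 4), bytes.subarray(4)];

// Decode each chunk on its own, which is what `html += chunk` does
// when the chunks are Buffers (each one is converted with toString()).
function decodePerChunk(parts) {
  let html = '';
  for (const part of parts) {
    html += part;
  }
  return html;
}

// Concatenate the raw bytes first and decode a single time at the end.
function decodeOnce(parts) {
  return Buffer.concat(parts).toString('utf8');
}

console.log(decodePerChunk(chunks) === sample); // false: the split character is garbled
console.log(decodeOnce(chunks) === sample);     // true
```

Calling res.setEncoding('utf8') on the response is another way to avoid the problem, since the stream then decodes chunks with a stateful decoder before emitting them as strings.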

Raw HTML on its own is of little use, so the next step is to filter it. For this Node.js tutorial, I want to extract the chapter list so that I can pick the topics that interest me.

Before writing the code, we need to install the cheerio module. cheerio is a fast, flexible implementation of core jQuery designed specifically for the server, which makes it well suited to web crawlers. Its API closely mirrors jQuery's, so it is easy to pick up; see its documentation for a detailed introduction.

PS G:\node\node-http> npm install cheerio

Create node-http-more.js, the code is as follows:

var http = require('http'); // load the http module
var cheerio = require('cheerio'); // load the cheerio module
var url = 'http://www.runoob.com/nodejs/nodejs-tutorial.html'; // address of the Node.js tutorial page to crawl

// Filter the Node.js chapter list out of the page HTML
function filterNodeChapter(html) {
  // Load the crawled HTML into cheerio
  var $ = cheerio.load(html);
  // Select every chapter link in the left sidebar
  var nodeChapter = $('#leftcolumn a');
  // The result should have the following shape, so that we know the address and title of each chapter
  /**
   * [{id:, title:}]
   */
  var chapterData = [];
  nodeChapter.each(function () {
    // Read the address and title of each link
    var id = $(this).attr('href');
    var title = $(this).text();
    chapterData.push({
      id: id,
      title: title
    });
  });

  return chapterData;
}

// Print each chapter entry
function getChapterData(nodeChapter) {
  nodeChapter.forEach(function (item) {
    console.log(' 【 ' + item.id + ' 】' + item.title + '\n');
  });
}

http.get(url, function (res) {
  var html = '';

  // Decode chunks as UTF-8 text so that multi-byte characters split across chunks stay intact
  res.setEncoding('utf8');

  // The data event fires repeatedly; each chunk is appended to html until the response is complete
  res.on('data', function (data) {
    html += data;
  });

  // The end event fires once all data has arrived
  res.on('end', function () {
    // console.log(html);
    // Filter out the Node.js chapter list
    var nodeChapter = filterNodeChapter(html);

    // Loop over the extracted data and print it
    getChapterData(nodeChapter);
  });
}).on('error', function (err) {
  console.log('Failed to fetch the Node.js tutorial page: ' + err.message);
});

Running the script in the terminal prints the chapter list:

G:\node\node-http> node node-http-more.js
 【 /nodejs/nodejs-tutorial.html 】
Node.js 教程

 【 /nodejs/nodejs-install-setup.html 】
Node.js 安装配置

 【 /nodejs/nodejs-http-server.html 】
Node.js 创建第一个应用

 【 nodejs-npm.html 】 NPM 使用介绍

 【 nodejs-repl.html 】 Node.js REPL

 【 nodejs-callback.html 】 Node.js 回调函数

 【 nodejs-event-loop.html 】 Node.js 事件循环

 【 nodejs-event.html 】 Node.js EventEmitter

 【 nodejs-buffer.html 】 Node.js Buffer

 【 nodejs-stream.html 】 Node.js Stream

 【 /nodejs/nodejs-module-system.html 】
Node.js 模块系统
......
(The full list is omitted here; run the script yourself to see all of the results.)
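The output above mixes root-relative links (/nodejs/nodejs-install-setup.html) with page-relative ones (nodejs-npm.html), and some titles carry stray whitespace. If the chapter pages are to be fetched next, the links need to be resolved against the page URL first. The following is a minimal sketch using the built-in URL class; normalizeChapters is a hypothetical helper that is not part of the original code, and the exact whitespace in the sample titles is an assumption inferred from the line breaks in the output.

```javascript
// Page the links were scraped from (same URL as in the article).
const pageUrl = 'http://www.runoob.com/nodejs/nodejs-tutorial.html';

// Hypothetical helper: resolve each scraped href against the page URL
// and trim surrounding whitespace from the title.
function normalizeChapters(chapters, base) {
  return chapters.map(function (chapter) {
    return {
      url: new URL(chapter.id, base).href,
      title: chapter.title.trim(),
    };
  });
}

// Two entries modeled on the output above (whitespace is assumed).
const sampleChapters = [
  { id: '/nodejs/nodejs-install-setup.html', title: '\nNode.js 安装配置' },
  { id: 'nodejs-npm.html', title: ' NPM 使用介绍' },
];

console.log(normalizeChapters(sampleChapters, pageUrl));
```

Both sample links resolve to absolute URLs under http://www.runoob.com/nodejs/, which can then be passed to http.get in the same way as the tutorial page itself.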

That is everything I have put together on this topic; I hope it proves helpful.
