
Teach you how to parse html under nodejs

Y2J (original)
2017-05-22 10:20:01

Parsing of html in the nodejs environment

In the nodejs environment, we fetch and parse data from a picture-gallery website and use express to return that data to the client as json.
This article mainly addresses: 1. how to parse requested HTML with jquery; 2. an alternative function library for heavy jquery users in the nodejs environment; 3. how to send ajax requests under nodejs (an ajax request is itself just a request); 4. how to use cheerio for DOM operations, illustrated with a real case.

Readers need to install the npm module cheerio (the code in this article also uses the request and express modules).
The npm module nodemon is also recommended; it restarts nodejs programs automatically when the code changes.

Introduction

The basic requirements of the WeChat Mini Program platform are:
1. The data server must expose its service interface over the https protocol.
2. A WeChat mini program is not html5; it supports neither dom parsing nor window operations.
3. The test version can use third-party data service interfaces, but the official version does not allow them (here we mean the use of multiple third-party data interfaces).

On the APICLOUD platform, we can use html5 together with jquery and other libraries to parse dom data, which solves the problem of a data source that is not in json format (loading data with jquery in html5; I will go back and tidy up the test app I made while learning the apicloud platform API). On the WeChat mini program platform, however, there is basically no way to parse html elements. Before writing this article, I saw some answers online suggesting underscore instead of jquery for DOM parsing. I spent a long time on that approach and found that it still did not work smoothly.

Therefore, the solution proposed in this article addresses two problems:
1. Use your own server to provide your WeChat mini program with a conversion service for HTML data from third-party websites, parsing the needed elements out of the third-party HTML. On the nodejs platform, use the request module to fetch the data and the cheerio module to parse the html.
2. On the nodejs platform there is a jquery module, but using it still involves many problems. One widely copied post on the Internet gives a method for using jquery in a nodejs environment; in my own tests, I could not get it working smoothly enough to start writing code.

Accordingly, this article is organized as follows: 1. Analyze the data source; 2. Briefly introduce request; 3. Briefly introduce the common methods of the cheerio module; 4. Write code under nodejs that uses the express module to serve json data.

Data source analysis

Data list analysis

As in most tutorials of this kind, operations on third-party data sources are usually crawler cases, and the case in this article is no different. The target address of this article is: http://m.mmjpg.com.

The data source for this article is the site's ranking page.

Scrolling to the bottom of the page, we find a Load more button. After clicking it, the browser console shows that the browser sent a request to http://m.mmjpg.com/getmore.php?te=1&page=3.

We can open the above link (http://m.mmjpg.com/getmore.php?te=1&page=3) directly in a browser. What I saw was the real-time data at the time of writing, so readers may get different data when they visit it. Each entry in the page source contains the following: 1. Title; 2. Browsing address of all the pictures; 3. Preview image address; 4. Release time; 5. Number of likes.
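The "load more" address above differs from page to page only in its query string, so it can be assembled in code. The following is a minimal sketch using the URL class built into nodejs; buildListUrl is a hypothetical helper name, and the meaning of the te parameter is not documented by the site, so its default value here is simply copied from the request observed above.

```javascript
// Hypothetical helper (not from the original article): build the
// "load more" URL for a given page number.
// The default te = 1 is copied from the request seen in the browser console;
// what te actually means is an assumption left open here.
function buildListUrl(page, te = 1) {
  const url = new URL('http://m.mmjpg.com/getmore.php');
  url.searchParams.set('te', String(te));
  url.searchParams.set('page', String(page));
  return url.toString();
}

console.log(buildListUrl(3));
// prints "http://m.mmjpg.com/getmore.php?te=1&page=3"
```

Building the address this way keeps the page number a parameter instead of a hard-coded part of the string.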

Analysis of data details page

While loading more pages above, we obtained the list for page=3. Following one of its links, http://m.mmjpg.com/mm/958, opens the details page of the corresponding picture set.

As the details page shows, each page http://m.mmjpg.com/mm/958/N, where N is the picture's serial number, contains one picture, and the picture address is also very regular: http://img.mmjpg.com/2017/958/1.jpg. The rest is simple. We only need to know how many pictures the current picture set contains, and then we can splice together the addresses of all the pictures according to this rule. So, to obtain the data of the details page, we only need the first picture page. The main data to extract are the (1/N) counter and the src of the first img tag under the element whose class is content.
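The splicing rule described above can be sketched as a small function. This is only an illustration under the assumption that every picture in a set shares the directory and file extension of the first picture; buildImageUrls is a hypothetical name, not something the site or the article defines.

```javascript
// Hypothetical helper: derive all picture addresses of a set from the
// address of the first picture and the total number of pictures.
// Assumes the files are named 1.ext, 2.ext, ..., total.ext in one directory.
function buildImageUrls(firstSrc, total) {
  const prefix = firstSrc.substring(0, firstSrc.lastIndexOf('/') + 1); // up to the last "/"
  const suffix = firstSrc.substring(firstSrc.lastIndexOf('.'));        // e.g. ".jpg"
  const urls = [];
  for (let i = 1; i <= total; i++) {
    urls.push(prefix + i + suffix);
  }
  return urls;
}

console.log(buildImageUrls('http://img.mmjpg.com/2017/958/1.jpg', 3));
// prints the three addresses ending in /1.jpg, /2.jpg and /3.jpg
```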

That concludes the analysis of the data source. Other data sources can be analyzed with the same approach, so readers should be able to get the data they want.

request module introduction

request module makes http requests simpler.

The simplest example:

var request = require('request');
request('http://www.google.com', function (error, response, body) {
    if (!error && response.statusCode == 200) {
        console.log(body);
    }
});

Fetch an image from the Internet and save it locally

var fs = require('fs');
var request = require('request');
// The remote file is a jpg, so the local copy keeps the .jpg extension
request('http://n.sinaimg.cn/news/transform/20170211/F57R-fyamvns4810245.jpg').pipe(fs.createWriteStream('doodle.jpg'));

Upload the local file file.json to mysite.com/obj.json

fs.createReadStream('file.json').pipe(request.put('http://mysite.com/obj.json'));

Upload google.com/img.png to mysite.com/img.png

request.get('http://google.com/img.png').pipe(request.put('http://mysite.com/img.png'));

Submit the form to service.com/upload

var fs = require('fs');
var path = require('path');
var request = require('request');

var r = request.post('http://service.com/upload');
var form = r.form();
form.append('my_field', 'my_value');
// Buffer.from replaces the deprecated new Buffer(...) constructor
form.append('my_buffer', Buffer.from([1, 2, 3]));
form.append('my_file', fs.createReadStream(path.join(__dirname, 'doodle.png')));
form.append('remote_file', request('http://google.com/doodle.png'));

HTTP authentication

request.get('http://some.server.com/').auth('username', 'password', false);

Custom HTTP headers

// Headers such as User-Agent can be set in the options object.
var options = {
    url: 'https://api.github.com/repos/mikeal/request',
    headers: {
        'User-Agent': 'request'
    }
};

function callback(error, response, body) {
    if (!error && response.statusCode == 200) {
        var info = JSON.parse(body);
        console.log(info.stargazers_count + " Stars");
        console.log(info.forks_count + " Forks");
    }
}

request(options, callback);
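The request package has since been marked as deprecated by its maintainers, so a plain GET can also be sent with the http and https modules built into nodejs. The sketch below is not from the original article: fetchText is a hypothetical helper name, and it assumes the server answers with status 200 and UTF-8 text. It does not follow redirects.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: fetch a URL and resolve with the response body as a
// string, using only core nodejs modules.
function fetchText(url) {
  return new Promise((resolve, reject) => {
    // Pick the core module that matches the URL scheme
    const client = url.startsWith('https:') ? https : http;
    client.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status code: ' + res.statusCode));
        return;
      }
      res.setEncoding('utf8');
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}
```

Calling fetchText('http://m.mmjpg.com/getmore.php?te=1&page=3').then(console.log) would print the list HTML, provided the site still responds as described earlier.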

Cheerio module introduction

cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server.

The cheerio module is documented on the npm website: www.npmjs.com/package/cheerio

If you have trouble reading the English documentation, there is a Chinese introduction to the cheerio API in the nodejs Chinese community:

cnodejs.org/topic/5203a71844e76d216a727d2e

Comparison with jquery

Under nodejs, load the cheerio module with const cheerio = require('cheerio'); and then pass the html source string to cheerio's load function. Once it is loaded, we can parse the dom with exactly the same programming approach as in a jquery environment.

Since the cheerio module implements most of jquery's functions, this article does not go into further detail here.

How to obtain the data, explained with code

From the analysis above, we can see that our data source is not json but html. With jquery, dataType should therefore be set to text when sending the ajax request for this html.

Main process (jquery):

1. Send an ajax request, passing in the url and setting dataType.
2. Use $(data) to convert the data obtained by ajax into a jquery object.
3. Use jquery's find and get methods to locate the elements you need.
4. Use jquery's attr and html methods to obtain the required information.
5. Integrate the above information into a json string, or perform dom operations on your html, to complete the data loading.

Main process (nodejs):

1. Use request to send the request, passing in the url (request has no dataType option; the body arrives as a string by default).
2. Use cheerio.load(body) to convert the data obtained by the request into a cheerio object.
3. Use cheerio's find and get methods to locate the elements you need.
4. Use cheerio's attr and text methods to obtain the required information.
5. Integrate the above information into a json array, and use express's res.json to respond with json to the client (mini program or other app).

Go directly to the code

Getting and parsing the image list under nodejs

var express = require('express');
var request = require('request');
var cheerio = require('cheerio');

var router = express.Router();

/* GET home page. */
router.get('/', function (req, res, next) {
    res.render('index', {title: 'Express'});
});

/* GET the picture list page. */
router.get('/parser', function (req, res, next) {
    var url = 'http://m.mmjpg.com/getmore.php?te=0&page=3';
    request(url, function (error, response, body) {
        if (error || response.statusCode !== 200) {
            // Added guard: without it the client would wait forever on a failed request
            return res.status(502).json({error: 'Failed to fetch the list page'});
        }
        var $ = cheerio.load(body); // convert the response html into a cheerio object
        var json = []; // the json array to return
        // Iterate over all li tags of the list page; each li holds one entry
        $('li').each(function () {
            var a = $(this).find('h2').find('a'); // the a tag inside h2 carries the href and the title
            var img = $(this).find('img'); // preview image
            var info = $(this).find('.info').find('span').first(); // release time
            var like = $(this).find('.like'); // number of likes

            // Build the json array; each entry is wrapped in a one-element array, as in the original code
            json.push([{
                title: a.text(),
                href: a.attr('href'),
                image: img.attr('data-img'),
                timer: info.text(),
                like: like.text()
            }]);
        });
        // res.json sets the Content-Type header to application/json by itself
        res.json(json);
    });
});

/**
 * Extract "50" from a string such as "第(1/50)张"
 * @param {string} str
 * @returns {string}
 */
function getNumberFromString(str) {
    var start = str.indexOf('/');
    var end = str.indexOf(')');
    return str.substring(start + 1, end);
}

/* GET all pictures of one picture set. */
router.get('/details', function (req, res, next) {
    var url = 'http://m.mmjpg.com/mm/958';
    request(url, function (error, response, body) {
        if (error || response.statusCode !== 200) {
            // Added guard, see the list route above
            return res.status(502).json({error: 'Failed to fetch the details page'});
        }
        var $ = cheerio.load(body); // convert the response html into a cheerio object
        var img = $('.content').find('a').find('img');
        var imgSrc = img.attr('src'); // address of the first picture
        var title = img.attr('alt'); // title of the picture set
        var total = $('.contentpage').find('span').eq(1).text(); // e.g. "第(1/50)张"
        total = parseInt(getNumberFromString(total), 10); // extract 50 from "第(1/50)张"
        var imgPre = imgSrc.substring(0, imgSrc.lastIndexOf('/') + 1); // prefix of the picture address
        var imgFix = imgSrc.substring(imgSrc.lastIndexOf('.')); // file extension of the picture
        console.log(imgPre + '\t' + imgFix);
        // Build the json array
        var images = [];
        for (var i = 1; i <= total; i++) {
            images.push(imgPre + i + imgFix);
        }
        res.json([{title: title, images: images}]);
    });
});

module.exports = router;
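The getNumberFromString helper above relies on the positions of "/" and ")" in the counter text. As an alternative, the total can be pulled out with a regular expression that also returns a number instead of a string. This is only a sketch; extractTotal is a hypothetical name, and the "第(1/50)张" format is taken from the comments in the code above.

```javascript
// Hypothetical alternative to getNumberFromString: extract the total from a
// counter string such as "第(1/50)张".
// Returns a number, or null when the text does not contain "/<digits>)".
function extractTotal(label) {
  const match = /\/\s*(\d+)\s*\)/.exec(label);
  return match ? Number(match[1]) : null;
}

console.log(extractTotal('第(1/50)张')); // prints 50
console.log(extractTotal('no counter here')); // prints null
```

Returning null for unexpected text makes it easy for the route to answer with an error instead of building an empty or malformed picture list.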

Requesting the list route in a browser returns the json of the list page, and requesting the details route returns the json of the details page.

Both json responses have been checked with a format validator and are valid.

Code implemented with jquery on the apicloud platform

// Fetch the picture list
function loadData() {
    var url = 'http://m.mmjpg.com/getmore.php';
    $.ajax({
        url: url,
        method: 'get',
        dataType: 'text',
        data: {
            te: 0,
            page: 3
        },
        success: function (data) {
            if (data) {
                // Wrap the returned li fragments in a ul so that find("li") can match them
                var lis = $('<ul>' + data + '</ul>').find('li');
                lis.each(function () {
                    var a = $(this).find('h2 a');
                    var ahtml = a.html(); // title
                    var ahref = a.attr('href'); // link
                    var info = $(this).find('.info');
                    var date = info.find('span').first().html(); // release time
                    var like = info.find('.like').html(); // number of likes
                    var imgsrc = $(this).find('img').first().attr('data-img'); // preview image
                    // Next, decide how to display the data, for example with dom operations.
                });
            } else {
                alert('Failed to load data, please try again');
            }
        }
    });
} // end of loadData

// The jquery version for the picture details page is not provided here.

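Summary

This article mainly addressed: 1. how to parse requested HTML with jquery; 2. an alternative function library for heavy jquery users in the nodejs environment; 3. how to send ajax requests under nodejs (an ajax request is itself just a request); 4. how to use cheerio for DOM operations, illustrated with a real case.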
