search
HomeWeb Front-endHTML TutorialTeach you how to parse html under nodejs

Parsing of html in the nodejs environment

In the nodejs environment, obtain/parse/parse the data of the sister picture website and use express to json the client Return of data.
This article mainly solves: 1. The problem of how to parse the requested HTML with jquery; 2. The problem of alternative function libraries for heavy users of jquery in the nodejs environment; 3. The problem of how to send ajax requests under nodejs (ajax request, itself is a request request); 4. This article uses actual cases to introduce how to use cheerio to perform DOM operations.

Users need to install the npm module: cheerio
It is also recommended to use the npm module: nodemon, which can hot deploy nodejs programs

Introduction

WeChat Mini Program Platform The basic requirements are:
1. The data server must be a service interface of https protocol
2. The WeChat applet is not html5 and does not support dom parsing and window operations
3. The test version can use third-party data service interfaces, but the official version does not allow the use of third-party interfaces (of course, here we are talking about multiple third-party data interfaces).

Under the APICLOUD platform, we can use html5 with jquery and other class libraries to realize the parsing of dom data to solve the problem that the data source is not in json format ( Use jquery to load data in html5 under html, and go back and sort out the test app I made when I was learning the apicloud platform API), but under the WeChat mini program platform, there is basically no way to parse the html element . Before writing this article, I saw some answers on the Internet about using underscore instead of jquery for DOM analysis. I worked on it for a long time and found that it was still not that smooth.

Therefore, the solution proposed in this article is to solve two problems:
1. Use your own server to provide your own WeChat applet with HTML data conversion services from third-party websites, and convert the third-party HTML The elements parse out the elements they need. Under the nodejs platform, use the request module to complete the data request, and use the cheerio module to complete the html parsing.
2. Under the nodejs platform, although there is a jquery module, there are still many problems in using it. There is a post on the Internet that was copied by a website, giving a method of using jquery in a nodejs environment. After my actual test, I found that it was not possible to start writing code smoothly.

Therefore, the writing ideas of this article: 1. Analyze the data source; 2. Briefly introduce request; 3. Common methods of cheerio module A brief introduction; 4. Written under nodejs, using the express module to provide json data.

Data source analysis

Data list analysis

According to the routines of most programs, operations on third-party data sources are mostly crawler cases, so the case in this article should be the same It’s a blessing for homeboys. The target address of this article is: http://m.mmjpg.com.

Teach you how to parse html under nodejs

##The data source of this article is from

rankingpage. The ranking page looks like this,
enter description here

Sliding the scroll bar at the bottom, we can find a

Load more button , after clicking to load more, in the browser console, you can see that the browser sent a url for http://m.mmjpg.com/getmore.php ?te=1&page=3 request.
Teach you how to parse html under nodejs

The screenshot of the network request displayed by the browser console is as shown below:


enter description hereWe can Use a browser to open the above link (
http://m.mmjpg.com/getmore.php?te=1&page=3). This is what I browsed when I visited this link while writing this article. Real-time data obtained by the browser (readers may get different data from mine when accessing the browser). In the picture below, I have marked the data in the page source code, including the following content: 1. Title; 2. Browsing address of all
pictures; 3. Preview image address; 4. Release time; 5. Number of likes
enter description here

Analysis of data details page

When we accessed to load more pages above, we obtained the page of page=3 list and clicked the following linkhttp://m.mmjpg .com/mm/958, the corresponding title is Beautiful and pure girl’s natural G-cup big breasts pictures.

enter description here

From the above picture, you can see that each http://m.mmjpg.com/mm/958/<a href="http://www.php.cn/wiki/58.html" target="_blank">array</a> There is a picture on the page with serial number , and the address of this picture is also very standardized, which is http://img.mmjpg.com/2017/958/1.jpg. The next thing is very simple. We only need to know how many pictures there are in the current picture collection, and then we can splice the addresses of all pictures according to the rules. Here, to obtain the data of the details page, we only need to obtain the data of the first picture page. The main data obtained is the src of the first img tag under the p of (1/N) and class is content That's it.

Well, that’s all the introduction to data sources above. The analysis of other data sources follows the same idea. I believe that all readers will be able to get the data they want.

request module introduction

request module makes http requests simpler.

The simplest example:

var request = require(&#39;request&#39;); 
request(&#39;http://www.google.com&#39;, function (error, response, body) {
    if (!error && response.statusCode == 200) {
        console.log(body);
    }
})

Catch images from the Internet and save them locally

var fs=require(&#39;fs&#39;);var request=require(&#39;request&#39;);
request(&#39;http://n.sinaimg.cn/news/transform/20170211/F57R-fyamvns4810245.jpg&#39;).pipe(fs.createWriteStream(&#39;doodle.png&#39;));

Upload the local file.json file to mysite.com/obj. json

fs.createReadStream(&#39;file.json&#39;).pipe(request.put(&#39;http://mysite.com/obj.json&#39;))

Upload google.com/img.png to mysite.com/img.png

request.get(&#39;http://google.com/img.png&#39;).pipe(request.put(&#39;http://mysite.com/img.png&#39;))

Submit the form to service.com/upload

var r = request.post(&#39;http://service.com/upload&#39;)var form = r.form()
form.append(&#39;my_field&#39;, &#39;my_value&#39;)
form.append(&#39;my_buffer&#39;, new Buffer([1, 2, 3]))
form.append(&#39;my_file&#39;, fs.createReadStream(path.join(dirname, &#39;doodle.png&#39;))
form.append(&#39;remote_file&#39;, request(&#39;http://google.com/doodle.png&#39;))

HTTP authentication

request.get(&#39;http://some.server.com/&#39;).auth(&#39;username&#39;, &#39;password&#39;, false);

Customized HTTP header

//User-Agent之类可以在options对象中设置。var options = {
 url: &#39;https://api.github.com/repos/mikeal/request&#39;,
 headers: { &#39;User-Agent&#39;: &#39;request&#39;
 }
};function callback(error, response, body) {
 if (!error && response.statusCode == 200) { var info = JSON.parse(body);
 console.log(info.stargazers_count +"Stars");
 console.log(info.forks_count +"Forks");
}
}
request(options, callback);

Cheerio module introduction

cheerio is a fast, flexible and implemented jQuery core implementation specially customized for the server.

Cheerio is supported under the npm official website Introduction to the module:www.npmjs.com/package/cheerio

If you have problems reading the English literature, the Chinese introduction to the cheerio API under the nodejs Chinese community:

cnodejs.org/topic/5203a71844e76d216a727d2e

Comparison of jquery

In fact, if you are under nodejs, use

const cheerio = require('cheerio');To load the cheerio module in this way, use our html source string as a parameter and use the load function of cheerio If loaded, we can completely follow the programming ideas in the jquery environment to realize the parsing of the dom.

Since the

cheerio module implements most of the jquery functions, this article will not introduce too much here.

Combined with the code to tell how to obtain data

Through the above analysis, we can see that our data source is not

json, but html, For jquery, html should set dataType to text when sending an ajax request.

Main process (jquery):

1. Use ajax request, pass in url and set dataType
2. Use
$(data) to ajaxThe obtained data is converted into a jquery object. 3. Use the
find and get methods of jquery to find the element you need to obtain. 4. Use the
attr and html methods of jquery to obtain the required information. 5. Integrate the above information into a json string or perform
dom operations on your html before to complete the data loading.

Main process (nodejs):

1. Use requets to request, pass in the url and set the dataType
2. Use
cheerio.load(body) to request The obtained data is converted into a cheerio object. 3. Use the
find and get methods of cheerio to find the element you need to obtain. 4. Use the
attr and text methods of cheerio to obtain the required information. 5. Integrate the above information into a json string, and use
express's res.json to respond json to the client (mini program or other APP).

Go directly to the code

Getting and parsing the image list under nodejs

var express = require(&#39;express&#39;);var router = express.Router();var bodyParser = require("body-parser");var http = require(&#39;http&#39;);const cheerio = require(&#39;cheerio&#39;);/* GET home page. */router.get(&#39;/&#39;, function (req, res, next) {
    res.render(&#39;index&#39;, {title: &#39;Express&#39;});
});/* GET 妹子图列表  page. */router.get(&#39;/parser&#39;, function (req, res, next) {


    var json =new Array();    var url = `http://m.mmjpg.com/getmore.php?te=0&page=3`;
    var request = require(&#39;request&#39;);
    request(url, function (error, response, body) { if (!error && response.statusCode == 200) {        var $ = cheerio.load( body );//将响应的html转换成cheerio对象
        var $lis = $(&#39;li&#39;);//便利列表页的所有的li标签,每个li标签对应的是一条信息
        var json = new Array();//需要返回的json数组
        var index = 0;
        $lis.each(function () {
            var h2 = $(this).find("h2");//获取h2标签,方便获取h2标签里的a标签
            var a = $(h2).find("a");//获取a标签,是为了得到href和标题
            var img = $(this).find("img");//获取预览图
            var info =$($($(this).find(".info")).find("span")).get(0);//获取发布时间
            var like = $(this).find(".like");//获取点赞次数

            //生成json数组
            json[index] = new Array({"title":$(a).text(),"href":$(a).attr("href"),"image":$(img).attr("data-img"),"timer":$(info).text(),"like":$(like).text()});
            index++;
        })        //设置响应头
        res.header("contentType", "application/json");        //返回json数据
        res.json(json);
        }
    });

})
;/**
 * 从第(1/50)张这样的字符串中提取50出来
 * @param $str
 * @returns {string}
 */function getNumberFromString($str) {
    var start = $str.indexOf("/");    var end = $str.indexOf(")");    return $str.substring(start+1,end);
}/* GET 妹子图所有图片  page. */router.get(&#39;/details&#39;, function (req, res, next) {

    var json;    var url = `http://m.mmjpg.com/mm/958`;
    var request = require(&#39;request&#39;);
    request(url, function (error, response, body) {
        if (!error && response.statusCode == 200) {        var $ = cheerio.load( body );//将响应的html转换成cheerio对象

        var json = new Array();//需要返回的json数组
        var index = 0;        var img = $($(".content").find("a")).find("img");//每一次操作之后得到的对象都用转换成cheerio对象的
        var imgSrc = $(img).attr("src");//获取第一张图片的地址
        var title = $(img).attr("alt");//获取图片集的标题
        var total  =$($(".contentpage").find("span").get(1)).text();//获取‘第(1/50)张’
        total = getNumberFromString(total);//从例如`第(1/50)张`提取出50来
        var imgPre = imgSrc.substring(0,imgSrc.lastIndexOf("/")+1);//获取图片的地址的前缀
        var imgFix = imgSrc.substring(imgSrc.lastIndexOf("."));//获取图片的格式后缀名
        console.log(imgPre + "\t" + imgFix);        //生成json数组
        var images= new Array();        for(var i=1;i<=total;i++) {
            images[i-1] =imgPre+i+imgFix;
        }
        json = new Array({"title":title,"images":images});        //设置响应头
        res.header("contentType", "application/json");        //返回json数据
        res.json(json);
    }
    });

})
;

module.exports = router;

While browsing, get the json of the list page, the screenshot is as follows:

enter description here

Get the json of the details page, the screenshot is as follows:

enter description here

The json above has been format checked and is valid.

enter description here

apicloud平台下利用jquery实现的代码

//获取妹子图的列表function loadData() {
    url = &#39;http://m.mmjpg.com/getmore.php&#39;;
    $.ajax({
        url: tmpurl,
        method: &#39;get&#39;,
        dataType: "application/text",
        data:{
            te:0,
            page:3
        },
        success: function (data) {
            if (data) {
                ret = "<ul>" + ret + "</ul>";                var lis = $(ret).find("li");                var one = &#39;&#39;;
                $(lis).each(function () {
                    var a = $(this).find("h2 a");                    var ahtml = $(a).html();//标题
                    var ahref = $(a).attr(&#39;href&#39;);//链接
                    var info = $(this).find(".info");                    var date = $($(info).find("span").get(0)).html();                    var like = $($(info).find(".like")).html();                    var img = $(this).find("img").get(0);                    var imgsrc = $(img).attr(&#39;data-img&#39;);                    //接下来,决定如何对数据进行显示咯。如dom操作,直接显示。
                });
            } else {
                alert("数据加载失败,请重试");
            }
        }
    });
};//end of loadData//图片详情页的获取,就不再提供jquery版本的代码了

总结

本文主要解决了:1.jquery解析请求过来的html如何实现的问题;2.nodejs环境下jquery重度使用者的替代函数库的问题;3.nodejs下,如何发送ajax请求的问题(ajax请求,本身就是一个request请求);4. 本文用实际的案例来介绍了如何使用cheerio进行dom操作。

【相关推荐】

1. HTML免费视频教程

2. html实现消息按钮上的数量角标的实例详解

3. html中怎么样才能让JSON数据显示的方法介绍

4. 对HTTP Headers知识点的图文说明

5. XHTML中的超链接标签使用教程

The above is the detailed content of Teach you how to parse html under nodejs. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Vercel是什么?怎么部署Node服务?Vercel是什么?怎么部署Node服务?May 07, 2022 pm 09:34 PM

Vercel是什么?本篇文章带大家了解一下Vercel,并介绍一下在Vercel中部署 Node 服务的方法,希望对大家有所帮助!

node.js gm是什么node.js gm是什么Jul 12, 2022 pm 06:28 PM

gm是基于node.js的图片处理插件,它封装了图片处理工具GraphicsMagick(GM)和ImageMagick(IM),可使用spawn的方式调用。gm插件不是node默认安装的,需执行“npm install gm -S”进行安装才可使用。

火了!新的JavaScript运行时:Bun,性能完爆Node火了!新的JavaScript运行时:Bun,性能完爆NodeJul 15, 2022 pm 02:03 PM

今天跟大家介绍一个最新开源的 javaScript 运行时:Bun.js。比 Node.js 快三倍,新 JavaScript 运行时 Bun 火了!

nodejs中lts是什么意思nodejs中lts是什么意思Jun 29, 2022 pm 03:30 PM

在nodejs中,lts是长期支持的意思,是“Long Time Support”的缩写;Node有奇数版本和偶数版本两条发布流程线,当一个奇数版本发布后,最近的一个偶数版本会立即进入LTS维护计划,一直持续18个月,在之后会有12个月的延长维护期,lts期间可以支持“bug fix”变更。

聊聊Node.js中的多进程和多线程聊聊Node.js中的多进程和多线程Jul 25, 2022 pm 07:45 PM

大家都知道 Node.js 是单线程的,却不知它也提供了多进(线)程模块来加速处理一些特殊任务,本文便带领大家了解下 Node.js 的多进(线)程,希望对大家有所帮助!

node爬取数据实例:聊聊怎么抓取小说章节node爬取数据实例:聊聊怎么抓取小说章节May 02, 2022 am 10:00 AM

node怎么爬取数据?下面本篇文章给大家分享一个node爬虫实例,聊聊利用node抓取小说章节的方法,希望对大家有所帮助!

深入浅析Nodejs中的net模块深入浅析Nodejs中的net模块Apr 11, 2022 pm 08:40 PM

本篇文章带大家带大家了解一下Nodejs中的net模块,希望对大家有所帮助!

怎么获取Node性能监控指标?获取方法分享怎么获取Node性能监控指标?获取方法分享Apr 19, 2022 pm 09:25 PM

怎么获取Node性能监控指标?本篇文章来和大家聊聊Node性能监控指标获取方法,希望对大家有所帮助!

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
3 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Atom editor mac version download

Atom editor mac version download

The most popular open source editor

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

VSCode Windows 64-bit Download

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft

SAP NetWeaver Server Adapter for Eclipse

SAP NetWeaver Server Adapter for Eclipse

Integrate Eclipse with SAP NetWeaver application server.

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function