
How node crawls images from web pages (code attached)

不言 (original article), 2018-08-17 15:45:20

This article shows how to use node to crawl images from a web page and return them as JSON, with the full code attached.

Directory

  • Install node and download dependencies

  • Build a server

  • Request the page we want to crawl and return json

Install node

First install node. You can download it from the official node website, https://nodejs.org/zh-cn/. After installing it, run:

node -v

If the installation succeeded, the installed version number is printed.

Next, use node to print hello world. Create a new file named index.js and enter:

console.log('hello world')

Run this file:

node index.js

and hello world is printed to the console.

Build a server

Create a new folder named node.

Inside that folder, first install the express dependency:

npm install express

Then create a new file named demo.js in the same folder.

In demo.js, import the downloaded express and define a route:

const express = require('express');
const app = express();
// Respond to GET /index with a fixed string
app.get('/index', function(req, res) {
    res.end('111')
})
// Listen on port 8081 and log the address once the server is up
var server = app.listen(8081, function() {
    var host = server.address().address
    var port = server.address().port
    console.log("App listening at http://%s:%s", host, port)
})

Run node demo.js to start a simple service. Visiting http://localhost:8081/index in the browser then returns 111.
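For comparison, here is a minimal sketch of the same route written with only node's built-in http module, without express. The handleRequest name is made up for this illustration; the port and the '111' response come from demo.js above.

```javascript
const http = require('http');

// Hypothetical handler: mirrors the express route, GET /index answers with '111'.
function handleRequest(req, res) {
  // req.url may carry a query string, so compare only the path part.
  const path = new URL(req.url, 'http://localhost').pathname;
  if (req.method === 'GET' && path === '/index') {
    res.statusCode = 200;
    res.end('111');
  } else {
    res.statusCode = 404;
    res.end('Not Found');
  }
}

const server = http.createServer(handleRequest);
// To actually start it, call: server.listen(8081);
```

express adds routing, query parsing and helpers such as res.json on top of this, which is why the rest of the article keeps using it.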

Request the page we want to crawl

To request and parse the page, install three more dependencies:

npm install superagent
npm install superagent-charset
npm install cheerio

superagent is used to send requests. It is a lightweight, progressive ajax API that is readable and easy to learn, and in a nodejs environment it is built on the native nodejs request API. You can also send the request with the built-in http module instead.

superagent-charset converts the response from its original character encoding so that the crawled data is not garbled.

cheerio is a fast, flexible implementation of the jQuery core, built specifically for the server. After installing the dependencies, import them:

var superagent = require('superagent');
var charset = require('superagent-charset');
charset(superagent);
const cheerio = require('cheerio');
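To make the roles of superagent and superagent-charset more concrete, the sketch below does roughly the same two steps with built-in modules only: download the raw bytes, then decode them with an explicit encoding. The function names clientFor, fetchBytes and decodeBody are invented for this example and are not part of either library.

```javascript
const http = require('http');
const https = require('https');

// Pick the built-in client module that matches the URL scheme.
function clientFor(url) {
  return new URL(url).protocol === 'https:' ? https : http;
}

// Download the raw response body as a Buffer (no decoding yet).
function fetchBytes(url) {
  return new Promise((resolve, reject) => {
    clientFor(url)
      .get(url, (res) => {
        if (res.statusCode !== 200) {
          res.resume(); // discard the body so the socket is released
          reject(new Error('Request failed with status ' + res.statusCode));
          return;
        }
        const chunks = [];
        res.on('data', (chunk) => chunks.push(chunk));
        res.on('end', () => resolve(Buffer.concat(chunks)));
      })
      .on('error', reject);
  });
}

// Decode bytes using an explicit encoding label.
function decodeBody(bytes, encoding) {
  return new TextDecoder(encoding).decode(bytes);
}

// Decoding with the wrong label is what produces garbled text.
const bytes = Buffer.from('头像', 'utf8');
console.log(decodeBody(bytes, 'utf-8'));    // correct label, readable
console.log(decodeBody(bytes, 'utf-16le')); // wrong label, garbled
```

For the page crawled in this article the call would look like decodeBody(await fetchBytes(url), 'gb2312'). That label is only available when node is built with full ICU support, which is an assumption about your installation.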

After importing them, we request the target page, https://www.qqtn.com/tx/weixintx_1.html.

Declare the address variable:

const baseUrl = 'https://www.qqtn.com/'

With these settings in place, we can send the request. The complete code of demo.js is:

var superagent = require('superagent');
var charset = require('superagent-charset');
charset(superagent);
var express = require('express');
var baseUrl = 'https://www.qqtn.com/'; // base URL of the site to crawl; any site can be used
const cheerio = require('cheerio');
var app = express();
app.get('/index', function(req, res) {
    // Set the CORS response headers
    res.header("Access-Control-Allow-Origin", "*");
    res.header('Access-Control-Allow-Methods', 'PUT, GET, POST, DELETE, OPTIONS');
    // Send both allowed headers in one value: setting the same header name twice overwrites the first value
    res.header('Access-Control-Allow-Headers', 'X-Requested-With, Content-Type');
    // Category, e.g. 'weixin'
    var type = req.query.type;
    // Page number
    var page = req.query.page;
    type = type || 'weixin';
    page = page || '1';
    var route = `tx/${type}tx_${page}.html`
    // This page is encoded as gb2312, so use .charset('gb2312'); most pages are utf-8, where .charset('utf-8') works directly
    superagent.get(baseUrl + route)
        .charset('gb2312')
        .end(function(err, sres) {
            var items = [];
            if (err) {
                console.log('ERR: ' + err);
                // Send the error as text and use the same `data` key as the success response
                res.json({ code: 400, msg: String(err), data: items });
                return;
            }
            var $ = cheerio.load(sres.text);
            $('div.g-main-bg ul.g-gxlist-imgbox li a').each(function(idx, element) {
                var $element = $(element);
                var $subElement = $element.find('img');
                var thumbImgSrc = $subElement.attr('src');
                items.push({
                    title: $(element).attr('title'),
                    href: $element.attr('href'),
                    thumbSrc: thumbImgSrc
                });
            });
            res.json({ code: 200, msg: "", data: items });
        });
});
var server = app.listen(8081, function() {
    var host = server.address().address
    var port = server.address().port
    console.log("App listening at http://%s:%s", host, port)
})
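The handler in demo.js places req.query.type and req.query.page directly into the request path. A minimal sketch of validating them first is shown here. The buildRoute helper and its rules (lowercase letters for the type, a positive integer for the page) are assumptions for illustration, not something the target site documents.

```javascript
// Hypothetical helper: builds the route used in demo.js from validated query values.
function buildRoute(type, page) {
  // Accept only lowercase letters for the category, otherwise fall back to the default.
  const safeType = /^[a-z]+$/.test(type || '') ? type : 'weixin';
  // Accept only positive integers for the page, otherwise fall back to page 1.
  const pageNum = Number.parseInt(page, 10);
  const safePage = Number.isInteger(pageNum) && pageNum > 0 ? pageNum : 1;
  return `tx/${safeType}tx_${safePage}.html`;
}

console.log(buildRoute());            // defaults
console.log(buildRoute('qq', '3'));   // explicit values
console.log(buildRoute('../x', '-2')); // invalid input falls back to the defaults
```

Inside the route handler this would replace the template string with var route = buildRoute(req.query.type, req.query.page).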

Run demo.js and request http://localhost:8081/index. The service returns the crawled data as JSON.

A simple node crawler is now complete.
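The href and thumbSrc values are copied as-is from the page's attributes, so they may be relative or protocol-relative paths. Below is a sketch of turning them into absolute URLs with the built-in URL class before returning them. The toAbsolute and normalizeItems helpers and the sample item are invented for illustration; only baseUrl comes from demo.js.

```javascript
const baseUrl = 'https://www.qqtn.com/';

// Resolve one attribute value against the base URL; returns null if it is missing or invalid.
function toAbsolute(value, base) {
  if (!value) return null;
  try {
    return new URL(value, base).href;
  } catch (e) {
    return null;
  }
}

// Return a copy of the crawled items with absolute href and thumbSrc values.
function normalizeItems(items, base) {
  return items.map((item) => ({
    title: item.title || '',
    href: toAbsolute(item.href, base),
    thumbSrc: toAbsolute(item.thumbSrc, base),
  }));
}

// Made-up sample item in the same shape as the objects pushed in demo.js.
const sample = [
  { title: 'example', href: '/tx/1.html', thumbSrc: '//pic.example.com/a.jpg' },
];
console.log(normalizeItems(sample, baseUrl));
```

With absolute URLs in the JSON, a client can load the images directly without knowing which site they were crawled from.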

