Detailed explanation of how to use Node.js to develop a simple image crawling function-JS Tutorial-php.cn

Home

Web Front-end

JS Tutorial

Detailed explanation of how to use Node.js to develop a simple image crawling function

青灯夜游

Jun 30, 2022 pm 07:55 PM

nodejsnodejsnode

How to use Node for crawling? The following article will talk about using Node.js to develop a simple image crawling function. I hope it will be helpful to you!

Detailed explanation of how to use Node.js to develop a simple image crawling function

The main purpose of the crawler is to collect some specific data that is publicly available on the Internet. Using this data, we can analyze some trends and compare them, or train models for deep learning, etc. In this issue, we will introduce a node.js package specially used for web crawling - node-crawler, and we will use it to complete a simple crawler case to crawl Pictures on the web page and downloaded locally.

Text

node-crawler is a lightweight node.js crawler tool that takes into account both efficiency and Convenience, supports distributed crawler system, supports hard coding, and supports http front-level proxy. Moreover, it is entirely written by nodejs and inherently supports non-blocking asynchronous IO, which provides great convenience for the crawler's pipeline operation mechanism. At the same time, it supports quick selection of DOM (you can use jQuery syntax). It can be said to be a killer function for the task of grabbing specific parts of the web page. There is no need to hand-write regular expressions, which improves Reptile development efficiency.

Installation and introduction

We first create a new project and create index.js as the entry file.

Then install the crawler library node-crawler .

# PNPM
pnpm add crawler
# NPM
npm i -S crawler
# Yarn 
yarn add crawler

Then use require to introduce it.

// index.js
const Crawler = require("crawler");

Create an instance

// index.js
let crawler = new Crawler({
    timeout:10000,
    jQuery:true,
})
function getImages(uri) {
    crawler.queue({
        uri,
        callback: (err, res, done) => {
            if (err) throw err;
        }
    })
}

From now on we will start to write a method to get the image of the html page. crawler After instantiation, it is mainly placed in its queue for Write link and callback methods. This callback function will be called after each request is processed.

I would like to explain here that Crawler uses the request library, so the parameter list available for configuration in Crawler is request A superset of the parameters of the library, that is, all configurations in the request library are applicable to Crawler.

Element Capture

Maybe you also saw the jQuery parameter just now. You guessed it right, it can be captured using the syntax of jQuery DOM element.

// index.js
let data = []
function getImages(uri) {
    crawler.queue({
        uri,
        callback: (err, res, done) => {
            if (err) throw err;
            let $ = res.$;
            try {
                let $imgs = $("img");
                Object.keys($imgs).forEach(index => {
                    let img = $imgs[index];
                    const { type, name, attribs = {} } = img;
                    let src = attribs.src || "";
                    if (type === "tag" && src && !data.includes(src)) {
                        let fileSrc = src.startsWith(&#39;http&#39;) ? src : `https:${src}`
                        let fileName = src.split("/")[src.split("/").length-1]
                        downloadFile(fileSrc, fileName) // 下载图片的方法
                        data.push(src)
                    }
                });
            } catch (e) {
                console.error(e);
                done()
            }
            done();
        }
    })
}

You can see that $ was used to capture the img tag in the request. Then we use the following logic to process the link to the completed image and strip out the name so that it can be saved and named later. An array is also defined here, its purpose is to save the captured image address. If the same image address is found in the next capture, the download will not be processed repeatedly.

The following is the information printed using $("img") on the Nuggets homepage html:

Detailed explanation of how to use Node.js to develop a simple image crawling function

Download pictures

Before downloading, we need to install a nodejs package—— axios, yes you read that right, axios Not only provided to the front end, it can also be used by the back end. But because downloading pictures needs to be processed into a data stream, responseType is set to stream. Then you can use the pipe method to save the data flow file.

const { default: axios } = require("axios");
const fs = require(&#39;fs&#39;);

async function downloadFile(uri, name) {
    let dir = "./imgs"
    if (!fs.existsSync(dir)) {
        await fs.mkdirSync(dir)
    }
    let filePath = `${dir}/${name}`
    let res = await axios({
        url: uri,
        responseType: &#39;stream&#39;
    })
    let ws = fs.createWriteStream(filePath)
    res.data.pipe(ws)
    res.data.on("close",()=>{
        ws.close();
    })
}

Because there may be a lot of pictures, so if you want to put them in one folder, you need to determine whether there is such a folder. If not, create one. Then use the createWriteStream method to save the obtained data stream to the folder in the form of a file.

Then we can try it. For example, we capture the pictures under the html of the Nuggets homepage:

// index.js
getImages("https://juejin.cn/")

After execution, we can find that all the pictures in the static html have been captured.

node index.js

Detailed explanation of how to use Node.js to develop a simple image crawling function

Conclusion

At the end, you can also see that this code may not work SPA (Single Page Application). Since there is only one HTML file in a single-page application, and all the content on the web page is dynamically rendered, it remains the same. No matter what, you can directly handle its data request to collect the information you want. No.

One more thing to say is that many friends use request.js when processing requests to download images. Of course, this is possible and even requires less code, but I want to say What's more, this library has been deprecated in 2020. It is better to replace it with a library that has been updated and maintained.

Detailed explanation of how to use Node.js to develop a simple image crawling function

For more node-related knowledge, please visit: nodejs tutorial!

The above is the detailed content of Detailed explanation of how to use Node.js to develop a simple image crawling function. For more information, please follow other related articles on the PHP Chinese website!

Statement

This article is reproduced at:掘金社区. If there is any infringement, please contact admin@php.cn delete

From Websites to Apps: The Diverse Applications of JavaScriptApr 22, 2025 am 12:02 AM

JavaScript is widely used in websites, mobile applications, desktop applications and server-side programming. 1) In website development, JavaScript operates DOM together with HTML and CSS to achieve dynamic effects and supports frameworks such as jQuery and React. 2) Through ReactNative and Ionic, JavaScript is used to develop cross-platform mobile applications. 3) The Electron framework enables JavaScript to build desktop applications. 4) Node.js allows JavaScript to run on the server side and supports high concurrent requests.

Python vs. JavaScript: Use Cases and Applications ComparedApr 21, 2025 am 12:01 AM

Python is more suitable for data science and automation, while JavaScript is more suitable for front-end and full-stack development. 1. Python performs well in data science and machine learning, using libraries such as NumPy and Pandas for data processing and modeling. 2. Python is concise and efficient in automation and scripting. 3. JavaScript is indispensable in front-end development and is used to build dynamic web pages and single-page applications. 4. JavaScript plays a role in back-end development through Node.js and supports full-stack development.

The Role of C/C in JavaScript Interpreters and CompilersApr 20, 2025 am 12:01 AM

C and C play a vital role in the JavaScript engine, mainly used to implement interpreters and JIT compilers. 1) C is used to parse JavaScript source code and generate an abstract syntax tree. 2) C is responsible for generating and executing bytecode. 3) C implements the JIT compiler, optimizes and compiles hot-spot code at runtime, and significantly improves the execution efficiency of JavaScript.

JavaScript in Action: Real-World Examples and ProjectsApr 19, 2025 am 12:13 AM

JavaScript's application in the real world includes front-end and back-end development. 1) Display front-end applications by building a TODO list application, involving DOM operations and event processing. 2) Build RESTfulAPI through Node.js and Express to demonstrate back-end applications.

JavaScript and the Web: Core Functionality and Use CasesApr 18, 2025 am 12:19 AM

The main uses of JavaScript in web development include client interaction, form verification and asynchronous communication. 1) Dynamic content update and user interaction through DOM operations; 2) Client verification is carried out before the user submits data to improve the user experience; 3) Refreshless communication with the server is achieved through AJAX technology.

Understanding the JavaScript Engine: Implementation DetailsApr 17, 2025 am 12:05 AM

Understanding how JavaScript engine works internally is important to developers because it helps write more efficient code and understand performance bottlenecks and optimization strategies. 1) The engine's workflow includes three stages: parsing, compiling and execution; 2) During the execution process, the engine will perform dynamic optimization, such as inline cache and hidden classes; 3) Best practices include avoiding global variables, optimizing loops, using const and lets, and avoiding excessive use of closures.

Python vs. JavaScript: The Learning Curve and Ease of UseApr 16, 2025 am 12:12 AM

Python is more suitable for beginners, with a smooth learning curve and concise syntax; JavaScript is suitable for front-end development, with a steep learning curve and flexible syntax. 1. Python syntax is intuitive and suitable for data science and back-end development. 2. JavaScript is flexible and widely used in front-end and server-side programming.

Python vs. JavaScript: Community, Libraries, and ResourcesApr 15, 2025 am 12:16 AM

Python and JavaScript have their own advantages and disadvantages in terms of community, libraries and resources. 1) The Python community is friendly and suitable for beginners, but the front-end development resources are not as rich as JavaScript. 2) Python is powerful in data science and machine learning libraries, while JavaScript is better in front-end development libraries and frameworks. 3) Both have rich learning resources, but Python is suitable for starting with official documents, while JavaScript is better with MDNWebDocs. The choice should be based on project needs and personal interests.

See all articles