Detailed explanation of steps to use nodeJs crawler

php中世界最好的语言

May 21, 2018 pm 03:30 PM

This article walks through the steps for building a crawler with Node.js and the points to watch out for along the way, using a practical example.

Background

I recently decided to review the Node.js material I had read before, and wrote a few crawlers to pass the time. I ran into some problems while crawling and am recording them here for future reference.

Dependencies

cheerio, a library widely covered online, parses the crawled content; superagent sends the requests; log4js records the logs.

Log configuration

Without further ado, let’s go directly to the code:

const log4js = require('log4js');
log4js.configure({
 appenders: {
  cheese: {
   type: 'dateFile',
   filename: 'cheese.log',
   pattern: '-yyyy-MM-dd.log',
   // always include the date pattern in the log file name
   alwaysIncludePattern: true,
   maxLogSize: 1024,
   backups: 3 }
 },
 categories: { default: { appenders: ['cheese'], level: 'info' } }
});
const logger = log4js.getLogger('cheese');
logger.level = 'INFO';
module.exports = logger;

The module above exports a logger object. Business files require it and call logger.info() and similar functions to add log entries, and a new log file is created each day. Plenty of further information on log4js is available online.

Crawling content and processing

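 const fs = require('fs');
 const path = require('path');
 const cheerio = require('cheerio');
 const superagent = require('superagent');
 const EventProxy = require('eventproxy');
 const logger = require('./logger'); // the logger module shown above; this path is an assumption
 const ep = new EventProxy();

 // cityItemUrl, dirname and getDesInfos are defined elsewhere in the crawler
 superagent.get(cityItemUrl).end((err, res) => {
  if (err) {
   return console.error(err);
  }
  const $ = cheerio.load(res.text);
  // Parse the current page and collect the city links on it
  const cityInfoEle = $('.newslist1 li a');
  // Register the handler once, before the loop: it runs after
  // 'getDirInfoComplete' has been emitted cityInfoEle.length times
  ep.after('getDirInfoComplete', cityInfoEle.length, (dirInfos) => {
   const content = JSON.parse(fs.readFileSync(path.join(dirname, './imgs.json')));
   dirInfos.forEach((element) => {
    logger.info(`Current record: ${JSON.stringify(element)}`);
    Object.assign(content, element);
   });
   fs.writeFileSync(path.join(dirname, './imgs.json'), JSON.stringify(content));
  });
  cityInfoEle.each((idx, element) => {
   const $element = $(element);
   const sceneURL = $element.attr('href'); // page URL
   const sceneName = $element.attr('title'); // city name
   if (!sceneName) {
    return;
   }
   logger.info(`Parsed destination: ${sceneName}, URL: ${sceneURL}`);
   getDesInfos(sceneURL, sceneName); // fetch the city's detailed information
  });
 });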

Use superagent to request the page. Once the request succeeds, load the page content with cheerio, then find the target resources with jQuery-like selectors.

When multiple resources are loaded, use eventproxy to coordinate the events: emit one event each time a resource has been processed, then process the data once all events have fired.
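The "emit one event per resource, then handle everything once all events have fired" pattern can be pictured without any library. The following is only a minimal sketch of the idea, not eventproxy's implementation; the helper name afterAll and the sample data are hypothetical.

```javascript
// Hypothetical helper: collect `count` results, then call `done` once with all of them.
function afterAll(count, done) {
  const results = [];
  return function emit(data) {
    results.push(data);
    if (results.length === count) {
      done(results);
    }
  };
}

// Usage: three "resources" finish one by one; the callback runs after the third.
let collected = null;
const emit = afterAll(3, (items) => {
  collected = items;
});
emit({ beijing: 'a.jpg' });
emit({ shanghai: 'b.jpg' });
emit({ guangzhou: 'c.jpg' });
console.log(collected.length); // 3
```

One consequence of this design: the count must equal the number of events that are actually emitted. If some items are skipped before their event is emitted, the final callback never runs.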

That is the most basic crawler. The following sections cover areas that may cause problems or need special attention.

Read and write local files

Create folder

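const fs = require('fs');
const path = require('path');

// Recursively create dirname together with any missing parent folders
function mkdirSync(dirname) {
 if (fs.existsSync(dirname)) {
  return true;
 }
 // Make sure the parent folder exists first, then create this one
 if (mkdirSync(path.dirname(dirname))) {
  fs.mkdirSync(dirname);
  return true;
 }
 return false;
}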
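On Node.js 10.12 and later, fs.mkdirSync accepts a recursive option that creates missing parent folders itself, so the hand-written recursion is optional there. A minimal sketch, assuming such a Node.js version; the folder names are made up for illustration.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Illustrative target: a nested folder under a fresh temporary directory.
const base = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-'));
const target = path.join(base, 'imgs', 'beijing');

// Creates `imgs` and `imgs/beijing` in one call.
fs.mkdirSync(target, { recursive: true });
// With `recursive: true`, calling it again on an existing folder does not throw.
fs.mkdirSync(target, { recursive: true });

console.log(fs.existsSync(target)); // true
```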

Read and write files

   const content = JSON.parse(fs.readFileSync(path.join(dirname, './dir.json')));
   dirInfos.forEach((element) => {
    logger.info(`Current record: ${JSON.stringify(element)}`);
    Object.assign(content, element);
   });
   fs.writeFileSync(path.join(dirname, './dir.json'), JSON.stringify(content));
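The read, merge and write step above throws if the JSON file does not exist yet or is empty. A more forgiving version might look like the sketch below; mergeIntoJsonFile is a hypothetical helper name, and the temporary file and sample data are for illustration only.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: merge each object in `items` into the JSON object stored in `file`.
function mergeIntoJsonFile(file, items) {
  let content = {};
  if (fs.existsSync(file)) {
    const text = fs.readFileSync(file, 'utf8').trim();
    if (text) {
      content = JSON.parse(text);
    }
  }
  items.forEach((item) => Object.assign(content, item));
  fs.writeFileSync(file, JSON.stringify(content, null, 2));
  return content;
}

// Usage with a throwaway file: the first call creates it, the second merges into it.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-'));
const jsonFile = path.join(tmpDir, 'dir.json');
mergeIntoJsonFile(jsonFile, [{ beijing: ['a.jpg'] }]);
const merged = mergeIntoJsonFile(jsonFile, [{ shanghai: ['b.jpg'] }]);
console.log(Object.keys(merged)); // [ 'beijing', 'shanghai' ]
```

Because Object.assign overwrites existing keys, a later record with the same key replaces the earlier one, which matches the behavior of the original snippet.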

Batch download resources

Downloaded resources may include pictures, audio, etc.

Use Bagpipe to cap the number of asynchronous downloads running at the same time:

const Bagpipe = require('bagpipe');
const bagpipe = new Bagpipe(10);
  bagpipe.push(downloadImage, url, dstpath, (err, data) => {
   if (err) {
    console.log(err);
    return;
   }
   console.log(`[${dstpath}]: ${data}`);
  });

Download each resource with request, and write the file through a stream:

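function downloadImage(src, dest, callback) {
 // Reject anything that is not an http(s) URL; always call back so the queue slot is released
 if (!src || src.indexOf('http') !== 0) {
  callback(new Error(`Not an http(s) URL: ${src}`));
  return;
 }
 request.head(src, (err) => {
  if (err) {
   callback(err);
   return;
  }
  request(src).pipe(fs.createWriteStream(dest)).on('close', () => {
   callback(null, dest);
  });
 });
}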
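The concurrency cap used for the downloads can be pictured as a small queue: start tasks until the limit is reached, and start the next waiting task each time one finishes. The sketch below illustrates that idea with plain JavaScript only; it is not Bagpipe's implementation, and createLimiter and fakeDownload are hypothetical names.

```javascript
// Hypothetical helper: run at most `limit` callback-style tasks at the same time.
function createLimiter(limit) {
  const queue = [];
  let active = 0;

  function next() {
    while (active < limit && queue.length > 0) {
      const { task, args, callback } = queue.shift();
      active += 1;
      task(...args, (err, data) => {
        active -= 1;
        callback(err, data);
        next(); // a slot is free again, so start the next waiting task
      });
    }
  }

  return {
    push(task, ...rest) {
      const callback = rest.pop(); // the last argument is the completion callback
      queue.push({ task, args: rest, callback });
      next();
    },
    get active() { return active; },
    get waiting() { return queue.length; },
  };
}

// Usage: fake "downloads" that finish only when their stored function is called.
const pending = [];
const fakeDownload = (url, done) => pending.push(() => done(null, url));
const finished = [];
const limiter = createLimiter(2);
['a', 'b', 'c'].forEach((url) => {
  limiter.push(fakeDownload, url, (err, data) => finished.push(data));
});

console.log(limiter.active, limiter.waiting); // 2 1
pending.shift()(); // the first download completes, so the third one starts
console.log(limiter.active, limiter.waiting, finished); // 2 0 [ 'a' ]
```

The sketch also shows why every task must call its callback on every path, including errors: a task that never calls back keeps its slot forever and eventually stalls the queue.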

Encoding

Sometimes page content loaded directly with cheerio.load turns out to be entity-encoded text (for example &#x4E2D; in place of a Chinese character) once it is written to a file. You can use

const $ = cheerio.load(buf, { decodeEntities: false });

to turn that encoding off.

P.S. I could not get the encoding library or iconv-lite to convert the UTF-8 encoded characters into Chinese, possibly because I am not familiar with their APIs. I will look into this later.
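If a file has already been written with entity-encoded text, the numeric character references can also be converted back afterwards. The helper below is a hypothetical sketch that uses only built-in string functions; it assumes the text contains well-formed numeric references and does not handle named entities such as &amp;.

```javascript
// Hypothetical helper: turn numeric character references (&#x4E2D; or &#20013;) into characters.
function decodeNumericEntities(text) {
  return text
    .replace(/&#x([0-9a-fA-F]+);/g, (match, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#([0-9]+);/g, (match, dec) => String.fromCodePoint(parseInt(dec, 10)));
}

console.log(decodeNumericEntities('&#x4E2D;&#x6587; and &#20013;&#25991;')); // 中文 and 中文
```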

Finally, here is a regular expression that matches all DOM tags:

const reg = /<[^>]+>/g;
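As a hedged illustration of how a tag-matching pattern can be applied, the sketch below uses /<[^>]+>/g, one common way to write such an expression, to list the tags in a string and to strip them out. The sample HTML is made up.

```javascript
// Assumed pattern: "<", then one or more characters that are not ">", then ">".
const tagPattern = /<[^>]+>/g;

const html = '<p>Hello <b>world</b></p>';
const tags = html.match(tagPattern);
const textOnly = html.replace(tagPattern, '');

console.log(tags); // [ '<p>', '<b>', '</b>', '</p>' ]
console.log(textOnly); // Hello world
```

A regular expression is not an HTML parser: attribute values that contain ">", comments and script content can all confuse it. For anything beyond simple snippets, reading the text through cheerio (for example with .text()) is likely the safer choice.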
