
A powerful crawler based on Node.js that can directly publish crawled articles


1. Environment configuration

1) Set up a server; any Linux distribution will do. I use CentOS 6.5;

2) Install a MySQL database, version 5.5 or 5.6. To save trouble you can install it as part of an lnmp or lamp bundle, which also lets you read the logs directly in the browser later;

3) Install a Node.js environment. I am using 0.12.7 and have not tried later versions;

4) Execute npm -g install forever to install forever so that the crawler can run in the background;

5) Get all the code onto the server, i.e. git clone the repository;

6) Execute npm install in the project directory to install the dependent libraries;

7) Create two empty folders, json and avatar, in the project directory;

8) Create an empty MySQL database and a user with full permissions on it, then execute setup.sql and startusers.sql from the code in that order to create the database structure and import the initial seed user;

9) Edit config.js. The configuration items marked (required) must be filled in or modified; the remaining items can be left unchanged for now:

exports.jsonPath = "./json/";//path where generated json files are stored
exports.avatarPath = "./avatar/";//path where avatar files are saved
exports.dbconfig = {
  host: 'localhost',//database server (required)
  user: 'dbuser',//database username (required)
  password: 'dbpassword',//database password (required)
  database: 'dbname',//database name (required)
  port: 3306,//database server port
  poolSize: 20,
  acquireTimeout: 30000
};

exports.urlpre = "http://www.jb51.net/";//site URL
exports.urlzhuanlanpre = "http://www.jb51.net/list/index_96.htm/";//column (zhuanlan) list URL

exports.WPurl = "www.xxx.com";//address of the WordPress site where articles will be published
exports.WPusername = "publishuser";//username used to publish articles
exports.WPpassword = "publishpassword";//password of the publishing user
exports.WPurlavatarpre = "http://www.xxx.com/avatar/";//URL prefix that replaces the original avatar addresses in published articles

exports.mailservice = "QQ";//mail notification service type; Gmail also works, provided you can reach Gmail (required)
exports.mailuser = "12345@qq.com";//mail account username (required)
exports.mailpass = "qqpassword";//mail account password (required)
exports.mailfrom = "12345@qq.com";//sender address (required; usually the same mailbox as the username)
exports.mailto = "12345@qq.com";//address that receives notification mails (required)

Save and proceed to the next step.

2. Crawler users

The crawler works by simulating a real Zhihu user clicking around the site and collecting data, so it needs a real Zhihu account. You can use your own account for testing, but for the long term it is better to register a dedicated account; one is enough, and the current crawler only supports one. The simulation does not have to log in from the homepage like a real user; it simply borrows the cookie value directly:

After registering, activating and logging in, go to your homepage and use any browser with developer tools or a cookie plug-in to open your Zhihu cookies. The list may be quite long, but we only need one entry, "z_c0". Copy the z_c0 part of your cookie, taking care not to drop the equal sign, quotation marks or semicolon. The final format looks roughly like this:

z_c0="LA8kJIJFdDSOA883wkUGJIRE8jVNKSOQfB9430=|1420113988|a6ea18bc1b23ea469e3b5fb2e33c2828439cb";

Insert a row into the cookies table of the MySQL database, with the field values:

  • email: the login email of the crawler account
  • password: the password of the crawler account
  • name: the crawler account's username
  • hash: the user's hash (a unique identifier that each user cannot modify; it is not actually used here and can be left blank for now)
  • cookie: the cookie you just copied

Then it can officially start running. If the cookie expires or the user is blocked, just modify the cookie field in this row of records.
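For reference, here is a minimal sketch of inserting that row with the mysql npm package; the table and column names follow the list above, while the connection settings and values are placeholders you should replace with your own:

var mysql = require('mysql');

// placeholder connection settings; use the same values as in config.js
var connection = mysql.createConnection({
  host: 'localhost',
  user: 'dbuser',
  password: 'dbpassword',
  database: 'dbname'
});

connection.query(
  'INSERT INTO cookies (email, password, name, hash, cookie) VALUES (?, ?, ?, ?, ?)',
  ['12345@qq.com', 'accountpassword', 'spideruser', '', 'z_c0="...copied value...";'],
  function (err) {
    if (err) throw err;
    console.log('cookie row inserted');
    connection.end();
  }
);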

3. Operation

It is recommended to use forever to execute, which not only facilitates background running and logging, but also automatically restarts after a crash. Example:

forever -l /var/www/log.txt index.js

The path after -l is where the log is written. If it is placed inside the web server's directory, you can check the log directly in the browser at http://www.xxx.com/log.txt. Add parameters (separated by spaces) after index.js to run different crawler instructions (a sketch of how such flags might be read follows the list):
1. -i executes immediately; without this parameter the crawler waits for the next scheduled time, e.g. 0:05 every morning;
2. -ng skips the phase of fetching new users, i.e. getnewuser;
3. -ns skips the snapshot phase, i.e. usersnapshot;
4. -nf skips the data file generation phase, i.e. saveviewfile;
5. -db prints debugging logs.
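
As an illustration only (the variable names are mine, not the project's), such flags might be read from process.argv like this:

// sketch: reading the command-line flags described above
var args = process.argv.slice(2);

var options = {
  runNow: args.indexOf('-i') !== -1,        // execute immediately instead of waiting for the scheduled time
  skipNewUser: args.indexOf('-ng') !== -1,  // skip the getnewuser stage
  skipSnapshot: args.indexOf('-ns') !== -1, // skip the usersnapshot stage
  skipViewFile: args.indexOf('-nf') !== -1, // skip the saveviewfile stage
  debug: args.indexOf('-db') !== -1         // print debugging logs
};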
The function of each stage is introduced in the next section. To make the crawler easier to operate, you can wrap this command in an sh script, for example:

#!/bin/bash
cd /usr/zhihuspider
rm -f /var/www/log.txt
forever -l /var/www/log.txt start index.js $*

Replace the paths with your own. You can then start the crawler by passing parameters to ./zhihuspider.sh: for example, ./zhihuspider.sh -i -ng -nf starts the task immediately while skipping the new-user and file-saving stages. To stop the crawler, run forever stopall (or forever stop with the process number).

4. Overview of principles

The entry file of the Zhihu crawler is index.js. It runs in a loop and executes the crawler tasks at a specified time every day. Three tasks are executed in sequence each day, namely:

1) getnewuser.js: captures new user information by comparing the followee lists of the users already in the library. Relying on this mechanism, new people on Zhihu who are worth following are automatically added to the library;

2) usersnapshot.js: loops over the users in the current library, captures their information and answer lists, and saves them as a daily snapshot;

3) saveviewfile.js: generates a user analysis list based on the latest snapshot, filters out yesterday's, recent and all-time best answers, and publishes them to the "Kanzhihu" website.

After the above three tasks are complete, the main thread refreshes the Zhihu homepage every few minutes to verify that the current cookie is still valid. If it is invalid (i.e. the request gets redirected to the logged-out page), a notification email is sent to the specified mailbox reminding you to update the cookie in time. Updating the cookie works the same way as during initialization: log in manually once and copy out the new cookie value. If you are interested in the implementation details, read the comments in the code carefully, adjust some of the configuration, or even try to rebuild the whole crawler yourself.
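A rough sketch of what such a periodic check might look like, assuming the request and nodemailer packages; the project's actual implementation may differ:

var request = require('request');
var nodemailer = require('nodemailer');
var config = require('./config');

// check whether the cookie stored in the cookies table still logs us in
function checkCookie(cookieValue) {
  request({
    url: config.urlpre,
    headers: { Cookie: cookieValue },
    followRedirect: false
  }, function (err, res) {
    // treat an error or a redirect away from the homepage as an expired cookie
    if (err || res.statusCode >= 300) {
      var transporter = nodemailer.createTransport({
        service: config.mailservice,
        auth: { user: config.mailuser, pass: config.mailpass }
      });
      transporter.sendMail({
        from: config.mailfrom,
        to: config.mailto,
        subject: 'zhihuspider: cookie seems to have expired',
        text: 'Please log in manually and update the cookie field in the cookies table.'
      });
    }
  });
}

// repeat every ten minutes (interval is illustrative)
setInterval(function () { checkCookie('z_c0="...";'); }, 10 * 60 * 1000);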

Tips

1) getnewuser works by comparing each user's followee count in the snapshots of two consecutive days and deciding what to capture from that, so at least two snapshots must exist before it can run; if it is executed earlier than that, it is skipped automatically.
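The idea, very roughly (an illustration of the principle, not the project's actual code):

// sketch: users whose followee count grew between two snapshots are the ones
// whose newly followed people should be fetched and added to the library
function findUsersToRecheck(previousSnapshot, latestSnapshot) {
  // both arguments are assumed to map a user hash to that user's followee count
  var grown = [];
  Object.keys(latestSnapshot).forEach(function (hash) {
    var before = previousSnapshot[hash];
    if (before !== undefined && latestSnapshot[hash] > before) {
      grown.push(hash);
    }
  });
  return grown;
}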

2) A half-finished snapshot can be resumed. If the program crashes because of an error, stop it with forever stop, then run it again with the parameters -i -ng to execute immediately while skipping the new-user phase, so that it continues from the half-captured snapshot.

3) Do not casually increase the number of (pseudo) threads used when taking snapshots, i.e. the maxthreadcount attribute in usersnapshots. Too many threads cause 429 errors, and the large amount of captured data may not be written to the database in time, causing memory to overflow. So unless your database sits on an SSD, do not use more than 10 threads.
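For the curious, a bounded "pseudo-thread" pool of this kind can be expressed roughly as follows; this is a generic sketch, not the project's actual usersnapshots code:

// sketch: run at most maxthreadcount fetches at the same time
function runPool(userIds, maxthreadcount, fetchOne, done) {
  var next = 0;    // index of the next user to fetch
  var active = 0;  // number of fetches currently in flight

  function launch() {
    while (active < maxthreadcount && next < userIds.length) {
      active++;
      fetchOne(userIds[next++], function () {
        active--;
        if (next >= userIds.length && active === 0) return done();
        launch(); // start another fetch as soon as a slot frees up
      });
    }
  }
  launch();
}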

4) For saveviewfile to generate analysis results, snapshots from at least the past 7 days are required. If there are fewer than 7 days of snapshots, an error is reported and the stage is skipped. Earlier analysis can be done by querying the database manually.

5) Since most people do not need to build their own copy of "Kanzhihu", the entry point of the automatic WordPress publishing function has been commented out. If you have set up WordPress, remember to enable xmlrpc, then create a user dedicated to publishing articles, configure the corresponding parameters in config.js, and uncomment the relevant code in saveviewfile.
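As an example of what publishing through WordPress's XML-RPC interface can look like, here is a minimal sketch using the xmlrpc npm package; the project's own publishing code may differ, and the title and body below are placeholders:

var xmlrpc = require('xmlrpc');
var config = require('./config');

// xmlrpc.php is the endpoint WordPress exposes once XML-RPC is enabled
var client = xmlrpc.createClient({ host: config.WPurl, port: 80, path: '/xmlrpc.php' });

client.methodCall('wp.newPost', [
  0,                    // blog id (0 for a single-site install)
  config.WPusername,
  config.WPpassword,
  {
    post_title: 'Best answers of yesterday',
    post_content: '<p>article body assembled by saveviewfile</p>',
    post_status: 'publish'
  }
], function (err, postId) {
  if (err) return console.error('publish failed:', err);
  console.log('published post', postId);
});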

6) Since Zhihu applies hotlink protection to avatars, the crawler also downloads each avatar while capturing user information and saves it locally, and published articles use the local avatar addresses. You need to point a URL path on your HTTP server to the folder where the avatars are saved, or place that folder directly inside the website directory.
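Saving an avatar locally can be sketched like this, assuming the request package; the file-naming scheme here is an assumption for illustration:

var fs = require('fs');
var path = require('path');
var request = require('request');
var config = require('./config');

// download one avatar into avatarPath and report the URL to use in published articles
function saveAvatar(avatarUrl, userHash, callback) {
  var filename = userHash + path.extname(avatarUrl); // e.g. <hash>.jpg
  request(avatarUrl)
    .on('error', callback)
    .pipe(fs.createWriteStream(path.join(config.avatarPath, filename)))
    .on('finish', function () {
      // articles reference config.WPurlavatarpre + filename instead of the Zhihu URL
      callback(null, config.WPurlavatarpre + filename);
    });
}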

7) The code may not be easy to read. Apart from the inherently tangled callback structure of node.js, part of the reason is that I had only just started working with node.js when I first wrote the program, so there were many unfamiliar areas that left the structure messy, and I never had time to fix it; another part is the many ugly judgment conditions and retry rules that accumulated through repeated patching. If they were all removed, the code volume might drop by two-thirds, but there is no way around it: they are needed to keep the system running stably.

8) The crawler's source code is released under the WTFPL license, which places no restrictions on modification or redistribution.

That is all for this article; I hope it is helpful for your study.
