Why choose to use node to write a crawler? It’s because the cheerio library is fully compatible with jQuery syntax. If you are familiar with it, it is really fun to use
Dependency selection
cheerio: Node .js version of jQuery
http: encapsulates an HTTP server and a simple HTTP client
iconv-lite: Solve the problem of garbled characters when crawling gb2312 web pages
Initial implementation
Since we want to crawl the website content, we should first take a look at the basic structure of the website
We selected Movie Paradise as the target website and want to crawl all the latest movie download links
Analysis page
The page structure is as follows:
We can see that the title of each movie is under a <a href="http://www.php.cn/wiki/164.html" target="_blank">class</a>
ulink
a
tag. Moving up, we can See that the outermost box class
is co_content8
ok, you can start work
Get a page of movie titles
First Introduce dependencies and set the url to be crawled
var cheerio = require('cheerio'); var http = require('http'); var iconv = require('iconv-lite'); var url = 'http://www.ygdy8.net/html/gndy/dyzz/index.html';
Core codeindex.js
http.get(url, function(sres) { var chunks = []; sres.on('data', function(chunk) { chunks.push(chunk); }); // chunks里面存储着网页的 html 内容,将它zhuan ma传给 cheerio.load 之后 // 就可以得到一个实现了 jQuery 接口的变量,将它命名为 `$` // 剩下就都是 jQuery 的内容了 sres.on('end', function() { var titles = []; //由于咱们发现此网页的编码格式为gb2312,所以需要对其进行转码,否则乱码 //依据:“<meta>” var html = iconv.decode(Buffer.concat(chunks), 'gb2312'); var $ = cheerio.load(html, {decodeEntities: false}); $('.co_content8 .ulink').each(function (idx, element) { var $element = $(element); titles.push({ title: $element.text() }) }) console.log(titles); }); });
Runnode index
The results are as follows
Successfully obtained the movie title. If I want to obtain the titles of multiple pages, it is impossible to change the URLs one by one. Of course there is a way to do this, please read on!
Get multi-page movie titles
We only need to encapsulate the previous code into a functionand recursivelyexecute it and we are done
Core Codeindex.js
var index = 1; //页面数控制 var url = 'http://www.ygdy8.net/html/gndy/dyzz/list_23_'; var titles = []; //用于保存title function getTitle(url, i) { console.log("正在获取第" + i + "页的内容"); http.get(url + i + '.html', function(sres) { var chunks = []; sres.on('data', function(chunk) { chunks.push(chunk); }); sres.on('end', function() { var html = iconv.decode(Buffer.concat(chunks), 'gb2312'); var $ = cheerio.load(html, {decodeEntities: false}); $('.co_content8 .ulink').each(function (idx, element) { var $element = $(element); titles.push({ title: $element.text() }) }) if(i <p>The results are as follows<br><img src="/static/imghwm/default1.png" data-src="https://img.php.cn/upload/article/000/000/013/79a1edcfe9244c5fb6f23f007f455aaf-2.png?x-oss-process=image/resize,p_40" class="lazy" style="max-width:90%" style="max-width:90%" title="How to implement a simple crawler using Node.js" alt="How to implement a simple crawler using Node.js"></p><h4 id="Get-the-movie-download-link">Get the movie download link</h4><p>If it is a manual operation, we need In one operation, you can find the download address by clicking into the movie details page <br> So how do we implement it through node </p><p> Let’s analyze the routine first <a href="http://www.php.cn/code/7955.html" target="_blank"> Page layout </a><br><img src="/static/imghwm/default1.png" data-src="https://img.php.cn/upload/article/000/000/013/45e6e69669b80c60f0e7eabd78b3a018-3.png?x-oss-process=image/resize,p_40" class="lazy" style="max-width:90%" style="max-width:90%" title="How to implement a simple crawler using Node.js" alt="How to implement a simple crawler using Node.js"></p><p> If we want to accurately locate the download link, we need to first find the p with <code>id</code> as <code>Zoom</code>, and the download link is under this <code>p</code> Within the <code>a</code> tag under ##tr<code>. </code></p>Then we will <p>define a function<a href="http://www.php.cn/code/8119.html" target="_blank"> to get the download link</a></p>getBtLink()<p></p><pre class="brush:php;toolbar:false">function getBtLink(urls, n) { //urls里面包含着所有详情页的地址 console.log("正在获取第" + n + "个url的内容"); http.get('http://www.ygdy8.net' + urls[n].title, function(sres) { var chunks = []; sres.on('data', function(chunk) { chunks.push(chunk); }); sres.on('end', function() { var html = iconv.decode(Buffer.concat(chunks), 'gb2312'); //进行转码 var $ = cheerio.load(html, {decodeEntities: false}); $('#Zoom td').children('a').each(function (idx, element) { var $element = $(element); btLink.push({ bt: $element.attr('href') }) }) if(n Run again<p>node index<code></code><br><img src="/static/imghwm/default1.png" data-src="https://img.php.cn/upload/article/000/000/013/2816c9cbd03b1466c255e54c10156e14-4.png?x-oss-process=image/resize,p_40" class="lazy" style="max-width:90%" style="max-width:90%" title="How to implement a simple crawler using Node.js" alt="How to implement a simple crawler using Node.js"><br><img src="/static/imghwm/default1.png" data-src="https://img.php.cn/upload/article/000/000/013/8eb570e10f1a4e755ebffd92bd150760-5.png?x-oss-process=image/resize,p_40" class="lazy" style="max-width:90%" style="max-width:90%" title="How to implement a simple crawler using Node.js" alt="How to implement a simple crawler using Node.js"></p>In this way we have obtained the download links for all the movies in the three pages. Isn’t it very simple? <p></p>Save data<h2></h2>Of course we need to save the data after crawling it. Here I chose <p>MongoDB<a href="http://www.php.cn/wiki/1523.html" target="_blank"> to save it</a> </p>Data saving function<p>save()<code></code></p><pre class="brush:php;toolbar:false">function save() { var MongoClient = require('mongodb').MongoClient; //导入依赖 MongoClient.connect(mongo_url, function (err, db) { if (err) { console.error(err); return; } else { console.log("成功连接数据库"); var collection = db.collection('node-reptitle'); collection.insertMany(btLink, function (err,result) { //插入数据 if (err) { console.error(err); } else { console.log("保存数据成功"); } }) db.close(); } }); }The operation here is very simple, there is no need to use mongoose
Run again
node index
The above is the detailed content of How to implement a simple crawler using Node.js. For more information, please follow other related articles on the PHP Chinese website!

Python is more suitable for data science and automation, while JavaScript is more suitable for front-end and full-stack development. 1. Python performs well in data science and machine learning, using libraries such as NumPy and Pandas for data processing and modeling. 2. Python is concise and efficient in automation and scripting. 3. JavaScript is indispensable in front-end development and is used to build dynamic web pages and single-page applications. 4. JavaScript plays a role in back-end development through Node.js and supports full-stack development.

C and C play a vital role in the JavaScript engine, mainly used to implement interpreters and JIT compilers. 1) C is used to parse JavaScript source code and generate an abstract syntax tree. 2) C is responsible for generating and executing bytecode. 3) C implements the JIT compiler, optimizes and compiles hot-spot code at runtime, and significantly improves the execution efficiency of JavaScript.

JavaScript's application in the real world includes front-end and back-end development. 1) Display front-end applications by building a TODO list application, involving DOM operations and event processing. 2) Build RESTfulAPI through Node.js and Express to demonstrate back-end applications.

The main uses of JavaScript in web development include client interaction, form verification and asynchronous communication. 1) Dynamic content update and user interaction through DOM operations; 2) Client verification is carried out before the user submits data to improve the user experience; 3) Refreshless communication with the server is achieved through AJAX technology.

Understanding how JavaScript engine works internally is important to developers because it helps write more efficient code and understand performance bottlenecks and optimization strategies. 1) The engine's workflow includes three stages: parsing, compiling and execution; 2) During the execution process, the engine will perform dynamic optimization, such as inline cache and hidden classes; 3) Best practices include avoiding global variables, optimizing loops, using const and lets, and avoiding excessive use of closures.

Python is more suitable for beginners, with a smooth learning curve and concise syntax; JavaScript is suitable for front-end development, with a steep learning curve and flexible syntax. 1. Python syntax is intuitive and suitable for data science and back-end development. 2. JavaScript is flexible and widely used in front-end and server-side programming.

Python and JavaScript have their own advantages and disadvantages in terms of community, libraries and resources. 1) The Python community is friendly and suitable for beginners, but the front-end development resources are not as rich as JavaScript. 2) Python is powerful in data science and machine learning libraries, while JavaScript is better in front-end development libraries and frameworks. 3) Both have rich learning resources, but Python is suitable for starting with official documents, while JavaScript is better with MDNWebDocs. The choice should be based on project needs and personal interests.

The shift from C/C to JavaScript requires adapting to dynamic typing, garbage collection and asynchronous programming. 1) C/C is a statically typed language that requires manual memory management, while JavaScript is dynamically typed and garbage collection is automatically processed. 2) C/C needs to be compiled into machine code, while JavaScript is an interpreted language. 3) JavaScript introduces concepts such as closures, prototype chains and Promise, which enhances flexibility and asynchronous programming capabilities.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

MinGW - Minimalist GNU for Windows
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

SublimeText3 English version
Recommended: Win version, supports code prompts!

SublimeText3 Chinese version
Chinese version, very easy to use

VSCode Windows 64-bit Download
A free and powerful IDE editor launched by Microsoft

DVWA
Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software