This article walks through a first attempt at a Node.js crawler built with superagent and cheerio. It should be a useful reference for anyone getting started.
Preface
I had heard about crawlers for a long time. Having started learning Node.js in the past few days, I wrote a crawler (https://github.com/leichangchun/node-crawlers/tree/master/superagent_cheerio_demo) that scrapes the cnblogs (blog park) homepage for article titles, usernames, read counts, recommendation counts and user avatars. This article is a brief summary of how it works.
The demo uses the following:
1. A Node core module: the file system (fs)
2. A third-party module for HTTP requests: superagent
3. A third-party module for parsing the DOM: cheerio
Please follow each module's link for detailed explanations and full APIs; the demo only uses their basic features.
Preparation work
Use npm to manage dependencies, and the dependency information will be stored in package.json
// Install the third-party modules we need
cnpm install --save superagent cheerio
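If the project does not yet have a package.json, one way to set it up (a minimal sketch; plain npm works the same way if cnpm is not installed) is:

// Create a package.json with default values
npm init -y
// Install the dependencies with plain npm instead of cnpm
npm install --save superagent cheerio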
Require the modules we will use
// Require the modules: superagent for HTTP requests, cheerio for parsing the DOM, plus the core fs module
const request = require('superagent');
const cheerio = require('cheerio');
const fs = require('fs');
Requesting and parsing the page
To crawl the content of the cnblogs homepage, we first have to request the homepage URL and get the returned HTML. superagent is used here for the HTTP request; its basic usage looks like this:
request.get(url)
  .end((error, res) => {
    // do something
  });
This sends a GET request to the specified URL. If the request fails, an error is passed to the callback (if there is no error, error is null or undefined); res is the response.
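For example, a minimal sketch (requesting the same cnblogs homepage that the demo below uses) that just inspects the response:

const request = require('superagent');

request.get('https://www.cnblogs.com/')
  .end((error, res) => {
    if (error) {
      // network error or non-2xx status
      console.log('Request failed:', error.message);
      return;
    }
    console.log(res.status);      // HTTP status code, e.g. 200
    console.log(res.text.length); // length of the returned HTML string
  });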
After getting the HTML, we use cheerio to parse the DOM and pull out the data we want. cheerio has to load the target HTML first and then parse it. Its API is very similar to jQuery's, so anyone familiar with jQuery will pick it up quickly. Here is the code:
// Target URL: the cnblogs homepage
let targetUrl = 'https://www.cnblogs.com/';
// Temporary storage for the parsed content and the image URLs
let content = '';
let imgs = [];

// Make the request
request.get(targetUrl)
  .end((error, res) => {
    if (error) {
      // The request failed: log the error and return
      console.log(error);
      return;
    }
    // cheerio has to load the HTML first
    let $ = cheerio.load(res.text);
    // Grab the data we need; each() is a cheerio method for iterating over matches
    $('#post_list .post_item').each((index, element) => {
      // Work out the DOM structure of the data we need,
      // locate the target element with a selector, then read its data
      let temp = {
        'title': $(element).find('h3 a').text(),
        'author': $(element).find('.post_item_foot > a').text(),
        'reads': +$(element).find('.article_view a').text().slice(3, -2),
        'recommendations': +$(element).find('.diggnum').text()
      };
      // Append the data
      content += JSON.stringify(temp) + '\n';
      // Get the avatar URL the same way
      if ($(element).find('img.pfs').length > 0) {
        imgs.push($(element).find('img.pfs').attr('src'));
      }
    });
    // Store the data
    mkdir('./content', saveContent);
    mkdir('./imgs', downloadImg);
  });
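As a standalone illustration of cheerio's jQuery-like API (a sketch using a made-up HTML fragment, not the real cnblogs markup), load(), find(), text() and attr() work like this:

const cheerio = require('cheerio');

// A made-up HTML fragment, purely for illustration
const html = `
  <div id="post_list">
    <div class="post_item">
      <h3><a href="/p/1">Hello Node</a></h3>
      <img class="pfs" src="//example.com/avatar.png">
    </div>
  </div>`;

const $ = cheerio.load(html);
$('#post_list .post_item').each((index, element) => {
  console.log($(element).find('h3 a').text());         // "Hello Node"
  console.log($(element).find('img.pfs').attr('src')); // "//example.com/avatar.png"
});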
Storing data
After parsing the DOM above, the required text content has been concatenated and the image URLs collected. Now we store them: write the text into a txt file in a given directory, and download the images into another directory.
First create the directories, using the Node.js core file system module:
// Create a directory
function mkdir(_path, callback) {
  if (fs.existsSync(_path)) {
    console.log(`${_path} directory already exists`);
  } else {
    fs.mkdir(_path, (error) => {
      if (error) {
        return console.log(`Failed to create ${_path} directory`);
      }
      console.log(`Created ${_path} directory`);
    });
  }
  callback(); // runs immediately; if the directory was not created, the write inside it will fail
}
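As the comment above notes, callback() fires right away rather than after fs.mkdir completes. One possible variant (a sketch, not part of the original demo) only runs the callback once the directory definitely exists:

function mkdirThenRun(_path, callback) {
  if (fs.existsSync(_path)) {
    console.log(`${_path} directory already exists`);
    return callback();
  }
  fs.mkdir(_path, (error) => {
    if (error) {
      return console.log(`Failed to create ${_path} directory`);
    }
    console.log(`Created ${_path} directory`);
    callback(); // only runs after the directory has been created
  });
}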
Once the directories exist, we can write the data. The txt content is already assembled, so we simply write it out with writeFile():
// Save the text content into a txt file
function saveContent() {
  fs.writeFile('./content/content.txt', content.toString(), (error) => {
    if (error) console.log(error);
  });
}
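To quickly verify the result (a hypothetical check, not part of the original demo), the file can be read back with fs.readFile:

fs.readFile('./content/content.txt', 'utf8', (error, data) => {
  if (error) {
    return console.log(error);
  }
  console.log(`Saved ${data.trim().split('\n').length} records`);
});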
For the images we only have their URLs, so superagent is used again to download each image and save it locally. superagent can return the response as a stream, which we pipe into a Node.js write stream to write the image straight to disk:
// Download the scraped images
function downloadImg() {
  imgs.forEach((imgUrl, index) => {
    // Get the image file name
    let imgName = imgUrl.split('/').pop();
    // Download the image into the target directory
    let stream = fs.createWriteStream(`./imgs/${imgName}`);
    let req = request.get('https:' + imgUrl); // response stream
    req.pipe(stream);
    console.log(`Downloading https:${imgUrl} --> ./imgs/${imgName}`);
  });
}
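req.pipe(stream) writes the response straight to disk. To know when each file has actually finished writing (an optional addition, not in the original demo), one could listen for the write stream's 'finish' event; downloadOne below is a hypothetical helper assuming imgUrl is a protocol-relative URL like the ones collected above:

function downloadOne(imgUrl) {
  let imgName = imgUrl.split('/').pop();
  let stream = fs.createWriteStream(`./imgs/${imgName}`);
  stream.on('finish', () => {
    console.log(`Finished writing ./imgs/${imgName}`);
  });
  request.get('https:' + imgUrl).pipe(stream);
}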
Result
Run the demo and check the result: the data is crawled as expected.
This is a very simple demo and may not be entirely rigorous, but it is a first small step with Node.
That is everything I have put together; I hope it is helpful to anyone else getting started.