


node.js basic module http, web page analysis tool cherrio implement crawler_node.js
1. Foreword
It is said that it is a preliminary exploration of crawlers. In fact, it does not use third-party libraries related to crawlers. It mainly uses the node.js basic module http and the web page analysis tool cherrio. Use http to directly obtain the web page resource corresponding to the url path, and then use cherrio to analyze it. Here I have typed the main cases I have studied to deepen my understanding. During the coding process, for the first time, I directly traversed the object obtained by jq using forEach and reported an error directly. This was because jq did not have a corresponding method and only js arrays could be called.
2. Knowledge points
①: Superagent is a tool for grabbing web pages. I haven't used it yet.
②: cherrio web analysis tool, you can understand it as jQuery on the server side, because the syntax is the same.
Rendering
1. Capture the entire webpage
2. Analyzed data, The examples provided are examples of case implementation.
Initial source code analysis of crawler
var http=require('http'); var cheerio=require('cheerio'); var url='http://www.imooc.com/learn/348'; /**************************** 打印得到的数据结构 [{ chapterTitle:'', videos:[{ title:'', id:'' }] }] ********************************/ function printCourseInfo(courseData){ courseData.forEach(function(item){ var chapterTitle=item.chapterTitle; console.log(chapterTitle+'\n'); item.videos.forEach(function(video){ console.log(' 【'+video.id+'】'+video.title+'\n'); }) }); } /************* 分析从网页里抓取到的数据 **************/ function filterChapter(html){ var courseData=[]; var $=cheerio.load(html); var chapters=$('.chapter'); chapters.each(function(item){ var chapter=$(this); var chapterTitle=chapter.find('strong').text(); //找到章节标题 var videos=chapter.find('.video').children('li'); var chapterData={ chapterTitle:chapterTitle, videos:[] }; videos.each(function(item){ var video=$(this).find('.studyvideo'); var title=video.text(); var id=video.attr('href').split('/video')[1]; chapterData.videos.push({ title:title, id:id }) }) courseData.push(chapterData); }); return courseData; } http.get(url,function(res){ var html=''; res.on('data',function(data){ html+=data; }) res.on('end',function(){ var courseData=filterChapter(html); printCourseInfo(courseData); }) }).on('error',function(){ console.log('获取课程数据出错'); })
Reference:
https://github.com/alsotang/node-lessons/tree/master/lesson3

The latest trends in JavaScript include the rise of TypeScript, the popularity of modern frameworks and libraries, and the application of WebAssembly. Future prospects cover more powerful type systems, the development of server-side JavaScript, the expansion of artificial intelligence and machine learning, and the potential of IoT and edge computing.

JavaScript is the cornerstone of modern web development, and its main functions include event-driven programming, dynamic content generation and asynchronous programming. 1) Event-driven programming allows web pages to change dynamically according to user operations. 2) Dynamic content generation allows page content to be adjusted according to conditions. 3) Asynchronous programming ensures that the user interface is not blocked. JavaScript is widely used in web interaction, single-page application and server-side development, greatly improving the flexibility of user experience and cross-platform development.

Python is more suitable for data science and machine learning, while JavaScript is more suitable for front-end and full-stack development. 1. Python is known for its concise syntax and rich library ecosystem, and is suitable for data analysis and web development. 2. JavaScript is the core of front-end development. Node.js supports server-side programming and is suitable for full-stack development.

JavaScript does not require installation because it is already built into modern browsers. You just need a text editor and a browser to get started. 1) In the browser environment, run it by embedding the HTML file through tags. 2) In the Node.js environment, after downloading and installing Node.js, run the JavaScript file through the command line.

How to send task notifications in Quartz In advance When using the Quartz timer to schedule a task, the execution time of the task is set by the cron expression. Now...

How to obtain the parameters of functions on prototype chains in JavaScript In JavaScript programming, understanding and manipulating function parameters on prototype chains is a common and important task...

Analysis of the reason why the dynamic style displacement failure of using Vue.js in the WeChat applet web-view is using Vue.js...

How to make concurrent GET requests for multiple links and judge in sequence to return results? In Tampermonkey scripts, we often need to use multiple chains...


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

mPDF
mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),

SublimeText3 Linux new version
SublimeText3 Linux latest version

MantisBT
Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.

SublimeText3 Chinese version
Chinese version, very easy to use

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.