search
HomeWeb Front-endJS TutorialExample tutorial on crawling MOOC course information
Example tutorial on crawling MOOC course informationJun 26, 2017 am 10:36 AM
javascriptnode.jsinformationreptilecourse

This is my first time learning Node.js crawler, so this is a simple crawler. The advantage of Node.js is that it can be executed concurrently.

This crawler is mainly to obtain Course information on MOOC.com and store the obtained information in a file, which uses the cheerio library, which allows us to conveniently operate HTML, just like using jQ

Before starting, remember

npm install cheerio

In order to crawl concurrently, the Promise object is used

//接受一个url爬取整个网页,返回一个Promise对象function getPageAsync(url){return new Promise((resolve,reject)=>{
        console.log(`正在爬取${url}的内容`);
        http.get(url,function(res){
            let html = '';

            res.on('data',function(data){
                html += data;
            });

            res.on('end',function(){
                resolve(html);
            });

            res.on('error',function(err){
                reject(err);
                console.log('错误信息:' + err);
            })
        });
    })
}

In MOOC, each course has an ID. We must write the ID of the course we want to get into an array in advance, and each The address of each course is the same address plus ID, so we only need to concatenate the address and ID to get the address of the course

const baseUrl = 'http://www.imooc.com/learn/';
const baseNuUrl = 'http://www.imooc.com/course/AjaxCourseMembers?ids=';//获取课程的IDconst videosId = [773,371];

In order to obtain concurrent execution when obtaining each course content, use the all method in Promise

Promise//当所有网页的内容爬取完毕    .all(courseArray)
    .then((pages)=>{//所有页面需要的内容let courseData = [];//遍历每个网页提取出所需要的内容pages.forEach((html)=>{
            let courses = filterChapter(html);
            courseData.push(courses);
        });//给每个courseMenners.number赋值for(let i=0;i<videosid.length>{return a.number > b.number;
        });//在重新将爬取内容写入文件中前,清空文件fs.writeFileSync(outputFile,'###爬取慕课网课程信息###',(err)=>{if(err){
                console.log(err)
            }
        });
        printfData(courseData);
    });</videosid.length>

In the then method, pages is the HTML page of each course. We have to extract the information we need from it. We need to use the following function

//接受一个爬取下来的网页内容,查找网页中需要的信息function filterChapter(html){
    const $ = cheerio.load(html);//所有章const chapters = $('.chapter');//课程的标题和学习人数let title = $('.hd>h2').text();
    let number = 0;//最后返回的数据//每个网页需要的内容的结构let courseData = {'title':title,'number':number,'videos':[]
    };

    chapters.each(function(item){
        let chapter = $(this);//文章标题let chapterTitle = Trim(chapter.find('strong').text(),'g');//每个章节的结构let chapterdata = {'chapterTitle':chapterTitle,'video':[]
        };//一个网页中的所有视频let videos = chapter.find('.video').children('li');
        videos.each(function(item){//视频标题let videoTitle = Trim($(this).find('a.J-media-item').text(),'g');//视频IDlet id = $(this).find('a').attr('href').split('video/')[1];
            chapterdata.video.push({'title':videoTitle,'id':id
            })
        });

        courseData.videos.push(chapterdata);

    });return courseData;
}

Note: In the above The number of students studying the course is set to 0 because the number of students studying the course is dynamically obtained using Ajax, so I wrote a method later to specifically obtain the number of students studying the course. The Trim() method used is to remove spaces in the text

The number of people who want to get the course:

//获取上课人数function getNumber(url){

    let datas = '';

    http.get(url,(res)=>{
        res.on('data',(chunk)=>{
            datas += chunk;
        });

        res.on('end',()=>{
            datas = JSON.parse(datas);
            courseMembers.push({'id':datas.data[0].id,'numbers':parseInt(datas.data[0].numbers,10)});
        });
    });
}

In this way, the number of people who want to get the course are added to the courseMembers array, at the end Assign the number of people studying the course to the corresponding course

        //给每个courseMenners.number赋值for(let i=0;i<videosid.length></videosid.length>

Once we have obtained the data, we must put it in a certain format Save to a file

//写入文件function writeFile(file,string) {
    fs.appendFileSync(file,string,(err)=>{if(err){
                console.log(err);
            }
        })
}//打印信息function printfData(coursesData){

    coursesData.forEach((courseData)=>{       // console.log(`${courseData.number}人学习过${courseData.title}\n`);       writeFile(outputFile,`\n\n${courseData.number}人学习过${courseData.title}\n\n`);

        courseData.videos.forEach(function(item){
            let chapterTitle = item.chapterTitle;// console.log(chapterTitle + '\n');            writeFile(outputFile,`\n  ${chapterTitle}\n`);

            item.video.forEach(function(item){// console.log('     【' + item.id + '】' + item.title + '\n');                writeFile(outputFile,`     【${item.id}】  ${item.title}\n`);
            })
        });

    });


}

The last data obtained:

Source code:

/**
 * Created by hp-pc on 2017/6/7 0007. */const http = require('http');
const fs = require('fs');
const cheerio = require('cheerio');
const baseUrl = 'http://www.imooc.com/learn/';
const baseNuUrl = 'http://www.imooc.com/course/AjaxCourseMembers?ids=';//获取课程的IDconst videosId = [773,371];//输出的文件const outputFile = 'test.txt';//记录学习课程的人数let courseMembers = [];//去除字符串中的空格function Trim(str,is_global)
{
    let  result;
    result = str.replace(/(^\s+)|(\s+$)/g,"");if(is_global.toLowerCase()=="g")
    {
        result = result.replace(/\s/g,"");
    }return result;
}//接受一个url爬取整个网页,返回一个Promise对象function getPageAsync(url){return new Promise((resolve,reject)=>{
        console.log(`正在爬取${url}的内容`);
        http.get(url,function(res){
            let html = '';

            res.on('data',function(data){
                html += data;
            });

            res.on('end',function(){
                resolve(html);
            });

            res.on('error',function(err){
                reject(err);
                console.log('错误信息:' + err);
            })
        });
    })
}//接受一个爬取下来的网页内容,查找网页中需要的信息function filterChapter(html){
    const $ = cheerio.load(html);//所有章const chapters = $('.chapter');//课程的标题和学习人数let title = $('.hd>h2').text();
    let number = 0;//最后返回的数据//每个网页需要的内容的结构let courseData = {'title':title,'number':number,'videos':[]
    };

    chapters.each(function(item){
        let chapter = $(this);//文章标题let chapterTitle = Trim(chapter.find('strong').text(),'g');//每个章节的结构let chapterdata = {'chapterTitle':chapterTitle,'video':[]
        };//一个网页中的所有视频let videos = chapter.find('.video').children('li');
        videos.each(function(item){//视频标题let videoTitle = Trim($(this).find('a.J-media-item').text(),'g');//视频IDlet id = $(this).find('a').attr('href').split('video/')[1];
            chapterdata.video.push({'title':videoTitle,'id':id
            })
        });

        courseData.videos.push(chapterdata);

    });return courseData;
}//获取上课人数function getNumber(url){

    let datas = '';

    http.get(url,(res)=>{
        res.on('data',(chunk)=>{
            datas += chunk;
        });

        res.on('end',()=>{
            datas = JSON.parse(datas);
            courseMembers.push({'id':datas.data[0].id,'numbers':parseInt(datas.data[0].numbers,10)});
        });
    });
}//写入文件function writeFile(file,string) {
    fs.appendFileSync(file,string,(err)=>{if(err){
                console.log(err);
            }
        })
}//打印信息function printfData(coursesData){

    coursesData.forEach((courseData)=>{       // console.log(`${courseData.number}人学习过${courseData.title}\n`);       writeFile(outputFile,`\n\n${courseData.number}人学习过${courseData.title}\n\n`);

        courseData.videos.forEach(function(item){
            let chapterTitle = item.chapterTitle;// console.log(chapterTitle + '\n');            writeFile(outputFile,`\n  ${chapterTitle}\n`);

            item.video.forEach(function(item){// console.log('     【' + item.id + '】' + item.title + '\n');                writeFile(outputFile,`     【${item.id}】  ${item.title}\n`);
            })
        });

    });


}//所有页面爬取完后返回的Promise数组let courseArray = [];//循环所有的videosId,和baseUrl进行字符串拼接,爬取网页内容videosId.forEach((id)=>{//将爬取网页完毕后返回的Promise对象加入数组courseArray.push(getPageAsync(baseUrl + id));//获取学习的人数getNumber(baseNuUrl + id);
});

Promise//当所有网页的内容爬取完毕    .all(courseArray)
    .then((pages)=>{//所有页面需要的内容let courseData = [];//遍历每个网页提取出所需要的内容pages.forEach((html)=>{
            let courses = filterChapter(html);
            courseData.push(courses);
        });//给每个courseMenners.number赋值for(let i=0;i<videosid.length>{return a.number > b.number;
        });//在重新将爬取内容写入文件中前,清空文件fs.writeFileSync(outputFile,'###爬取慕课网课程信息###',(err)=>{if(err){
                console.log(err)
            }
        });
        printfData(courseData);
    });</videosid.length>

The above is the detailed content of Example tutorial on crawling MOOC course information. For more information, please follow other related articles on the PHP Chinese website!

Statement
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
es6数组怎么去掉重复并且重新排序es6数组怎么去掉重复并且重新排序May 05, 2022 pm 07:08 PM

去掉重复并排序的方法:1、使用“Array.from(new Set(arr))”或者“[…new Set(arr)]”语句,去掉数组中的重复元素,返回去重后的新数组;2、利用sort()对去重数组进行排序,语法“去重数组.sort()”。

JavaScript的Symbol类型、隐藏属性及全局注册表详解JavaScript的Symbol类型、隐藏属性及全局注册表详解Jun 02, 2022 am 11:50 AM

本篇文章给大家带来了关于JavaScript的相关知识,其中主要介绍了关于Symbol类型、隐藏属性及全局注册表的相关问题,包括了Symbol类型的描述、Symbol不会隐式转字符串等问题,下面一起来看一下,希望对大家有帮助。

原来利用纯CSS也能实现文字轮播与图片轮播!原来利用纯CSS也能实现文字轮播与图片轮播!Jun 10, 2022 pm 01:00 PM

怎么制作文字轮播与图片轮播?大家第一想到的是不是利用js,其实利用纯CSS也能实现文字轮播与图片轮播,下面来看看实现方法,希望对大家有所帮助!

JavaScript对象的构造函数和new操作符(实例详解)JavaScript对象的构造函数和new操作符(实例详解)May 10, 2022 pm 06:16 PM

本篇文章给大家带来了关于JavaScript的相关知识,其中主要介绍了关于对象的构造函数和new操作符,构造函数是所有对象的成员方法中,最早被调用的那个,下面一起来看一下吧,希望对大家有帮助。

javascript怎么移除元素点击事件javascript怎么移除元素点击事件Apr 11, 2022 pm 04:51 PM

方法:1、利用“点击元素对象.unbind("click");”方法,该方法可以移除被选元素的事件处理程序;2、利用“点击元素对象.off("click");”方法,该方法可以移除通过on()方法添加的事件处理程序。

JavaScript面向对象详细解析之属性描述符JavaScript面向对象详细解析之属性描述符May 27, 2022 pm 05:29 PM

本篇文章给大家带来了关于JavaScript的相关知识,其中主要介绍了关于面向对象的相关问题,包括了属性描述符、数据描述符、存取描述符等等内容,下面一起来看一下,希望对大家有帮助。

foreach是es6里的吗foreach是es6里的吗May 05, 2022 pm 05:59 PM

foreach不是es6的方法。foreach是es3中一个遍历数组的方法,可以调用数组的每个元素,并将元素传给回调函数进行处理,语法“array.forEach(function(当前元素,索引,数组){...})”;该方法不处理空数组。

整理总结JavaScript常见的BOM操作整理总结JavaScript常见的BOM操作Jun 01, 2022 am 11:43 AM

本篇文章给大家带来了关于JavaScript的相关知识,其中主要介绍了关于BOM操作的相关问题,包括了window对象的常见事件、JavaScript执行机制等等相关内容,下面一起来看一下,希望对大家有帮助。

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
2 weeks agoBy尊渡假赌尊渡假赌尊渡假赌

Hot Tools

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

ZendStudio 13.5.1 Mac

ZendStudio 13.5.1 Mac

Powerful PHP integrated development environment

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

Dreamweaver Mac version

Dreamweaver Mac version

Visual web development tools

VSCode Windows 64-bit Download

VSCode Windows 64-bit Download

A free and powerful IDE editor launched by Microsoft