This article mainly introduces the node crawling Lagou.com data and exporting it as an excel file. It has a certain reference value. Now I share it with everyone. Friends in need can refer to it
Preface
I have been learning node.js intermittently before. Today I will practice using Lagou.com and learn about the recent recruitment market through data! I am relatively new to node, and I hope I can learn and make progress together with everyone.
1. Summary
We first need to clarify the specific needs:
can be crawled through
node index city position
Related informationYou can also enter node index start to directly crawl our predefined city and position arrays, and loop to crawl different job information in different cities
Store the final crawling results in the local
./data
directory-
Generate the corresponding excel file and store it locally
2. Related modules used by crawlers
fs: used to read and write system files and directories
async: process control
superagent: client request proxy module
node-xlsx: Export files in a certain format as excel
3. The main steps of the crawler:
Initialize the project
New project directory
Create the project in the appropriate disk directory Directory node-crwl-lagou
Initialization project
Enter the node-crwl-lagou folder
Execute npm init and initialize the package.json file
Install dependent packages
npm install async
npm install superagent
npm install node-xlsx
Processing of command line input
For the content entered on the command line, you can use process.argv
to obtain it. It will return an array, and each item in the array is the content entered by the user.
Distinguish between node index regional position
and node index start
two inputs, the simplest is to determine the length of process.argv, if the length is four, directly call the crawler main program To crawl the data, if the length is three, we need to piece together the URL through the predefined city and position arrays, and then use async.mapSeries to call the main program in a loop. The homepage code for command analysis is as follows:
if (process.argv.length === 4) { let args = process.argv console.log('准备开始请求' + args[2] + '的' + args[3] + '职位数据'); requsetCrwl.controlRequest(args[2], args[3]) } else if (process.argv.length === 3 && process.argv[2] === 'start') { let arr = [] for (let i = 0; i <p>The predefined city and position arrays are as follows: </p><pre class="brush:php;toolbar:false">{ "city": ["北京","上海","广州","深圳","杭州","南京","成都","西安","武汉","重庆"], "position": ["前端","java","php","ios","android","c++","python",".NET"] }
The next step is the analysis of the main program part of the crawler.
Analyze the page and find the request address
First we open the homepage of Lagou.com, enter the query information (such as node), and then check the console to find the relevant request, as shown in the figure:
This post requesthttps://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false
is what we need, through three requests Parameters to obtain different data, a simple analysis can tell: the parameter first
is to mark whether the current page is the first page, true is yes, false is no; the parameter pn
is the current The page number; the parameter kd
is the query input content.
Requesting data through superagent
First of all, it needs to be clear that the entire program is asynchronous, and we need to use async.series to call it in sequence.
View the response returned by the analysis:
You can see that content.positionResult.totalCount is the total number of pages we need
We use superagent to directly call the post request , the console will prompt the following information:
{'success': False, 'msg': '您操作太频繁,请稍后再访问', 'clientIp': '122.xxx.xxx.xxx'}
This is actually one of the anti-crawler strategies. We only need to add a request header to it. The method of obtaining the request header is very simple, as follows:
Then use superagent to call the post request. The main code is as follows:
// 先获取总页数 (cb) => { superagent .post(`https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&city=${city}&kd=${position}&pn=1`) .send({ 'pn': 1, 'kd': position, 'first': true }) .set(options.options) .end((err, res) => { if (err) throw err // console.log(res.text) let resObj = JSON.parse(res.text) if (resObj.success === true) { totalPage = resObj.content.positionResult.totalCount; cb(null, totalPage); } else { console.log(`获取数据失败:${res.text}}`) } }) },
After getting the total number of pages, we can passTotal number of pages/15
Get the pn parameter, loop to generate all URLs and store them in urls:
(cb) => { for (let i=0;Math.ceil(i<totalpage><p>With all the URLs, it is not difficult to crawl to all the data. Continue to use the superagent's post method to loop requests For all URLs, every time the data is obtained, a json file is created in the data directory and the returned data is written. This seems simple, but there are two points to note: </p> <ol class=" list-paddingleft-2"> <li><p>In order to prevent too many concurrent requests from causing the IP to be blocked: when looping URLs, you need to use the async.mapLimit method to control the concurrency to 3. After each request, it will take two seconds before sending the next request</p></li> <li><p>在async.mapLimit的第四个参数中,需要通过判断调用主函数的第三个参数是否存在来区分一下是那种命令输入,因为对于<code>node index start</code>这个命令,我们使用得是async.mapSeries,每次调用主函数都传递了<code>(city, position, callback)</code>,所以如果是<code>node index start</code>的话,需要在每次获取数据完后将null传递回去,否则无法进行下一次循环</p></li> </ol> <p>主要代码如下:</p> <pre class="brush:php;toolbar:false">// 控制并发为3 (cb) => { async.mapLimit(urls, 3, (url, callback) => { num++; let page = url.split('&')[3].split('=')[1]; superagent .post(url) .send({ 'pn': totalPage, 'kd': position, 'first': false }) .set(options.options) .end((err, res) => { if (err) throw err let resObj = JSON.parse(res.text) if (resObj.success === true) { console.log(`正在抓取第${page}页,当前并发数量:${num}`); if (!fs.existsSync('./data')) { fs.mkdirSync('./data'); } // 将数据以.json格式储存在data文件夹下 fs.writeFile(`./data/${city}_${position}_${page}.json`, res.text, (err) => { if (err) throw err; // 写入数据完成后,两秒后再发送下一次请求 setTimeout(() => { num--; console.log(`第${page}页写入成功`); callback(null, 'success'); }, 2000); }); } }) }, (err, result) => { if (err) throw err; // 这个arguments是调用controlRequest函数的参数,可以区分是那种爬取(循环还是单个) if (arguments[2]) { ok = 1; } cb(null, ok) }) }, () => { if (ok) { setTimeout(function () { console.log(`${city}的${position}数据请求完成`); indexCallback(null); }, 5000); } else { console.log(`${city}的${position}数据请求完成`); } // exportExcel.exportExcel() // 导出为excel }
导出的json文件如下:
json文件导出为excel
将json文件导出为excel有多种方式,我使用的是node-xlsx
这个node包,这个包需要将数据按照固定的格式传入,然后导出即可,所以我们首先做的就是先拼出其所需的数据格式:
function exportExcel() { let list = fs.readdirSync('./data') let dataArr = [] list.forEach((item, index) => { let path = `./data/${item}` let obj = fs.readFileSync(path, 'utf-8') let content = JSON.parse(obj).content.positionResult.result let arr = [['companyFullName', 'createTime', 'workYear', 'education', 'city', 'positionName', 'positionAdvantage', 'companyLabelList', 'salary']] content.forEach((contentItem) => { arr.push([contentItem.companyFullName, contentItem.phone, contentItem.workYear, contentItem.education, contentItem.city, contentItem.positionName, contentItem.positionAdvantage, contentItem.companyLabelList.join(','), contentItem.salary]) }) dataArr[index] = { data: arr, name: path.split('./data/')[1] // 名字不能包含 \ / ? * [ ] } }) // 数据格式 // var data = [ // { // name : 'sheet1', // data : [ // [ // 'ID', // 'Name', // 'Score' // ], // [ // '1', // 'Michael', // '99' // // ], // [ // '2', // 'Jordan', // '98' // ] // ] // }, // { // name : 'sheet2', // data : [ // [ // 'AA', // 'BB' // ], // [ // '23', // '24' // ] // ] // } // ] // 写xlsx var buffer = xlsx.build(dataArr) fs.writeFile('./result.xlsx', buffer, function (err) { if (err) throw err; console.log('Write to xls has finished'); // 读xlsx // var obj = xlsx.parse("./" + "resut.xls"); // console.log(JSON.stringify(obj)); } ); }
导出的excel文件如下,每一页的数据都是一个sheet,比较清晰明了:
我们可以很清楚的从中看出目前西安.net的招聘情况,之后也可以考虑用更形象的图表方式展示爬到的数据,应该会更加直观!
总结
其实整个爬虫过程并不复杂,注意就是注意的小点很多,比如async的各个方法的使用以及导出设置header等,总之,也是收获满满哒!
源码
gitbug地址: https://github.com/fighting12...
以上就是本文的全部内容,希望对大家的学习有所帮助,更多相关内容请关注PHP中文网!
相关推荐:
The above is the detailed content of Node crawls Lagou.com data and exports it to an excel file. For more information, please follow other related articles on the PHP Chinese website!

The main difference between Python and JavaScript is the type system and application scenarios. 1. Python uses dynamic types, suitable for scientific computing and data analysis. 2. JavaScript adopts weak types and is widely used in front-end and full-stack development. The two have their own advantages in asynchronous programming and performance optimization, and should be decided according to project requirements when choosing.

Whether to choose Python or JavaScript depends on the project type: 1) Choose Python for data science and automation tasks; 2) Choose JavaScript for front-end and full-stack development. Python is favored for its powerful library in data processing and automation, while JavaScript is indispensable for its advantages in web interaction and full-stack development.

Python and JavaScript each have their own advantages, and the choice depends on project needs and personal preferences. 1. Python is easy to learn, with concise syntax, suitable for data science and back-end development, but has a slow execution speed. 2. JavaScript is everywhere in front-end development and has strong asynchronous programming capabilities. Node.js makes it suitable for full-stack development, but the syntax may be complex and error-prone.

JavaScriptisnotbuiltonCorC ;it'saninterpretedlanguagethatrunsonenginesoftenwritteninC .1)JavaScriptwasdesignedasalightweight,interpretedlanguageforwebbrowsers.2)EnginesevolvedfromsimpleinterpreterstoJITcompilers,typicallyinC ,improvingperformance.

JavaScript can be used for front-end and back-end development. The front-end enhances the user experience through DOM operations, and the back-end handles server tasks through Node.js. 1. Front-end example: Change the content of the web page text. 2. Backend example: Create a Node.js server.

Choosing Python or JavaScript should be based on career development, learning curve and ecosystem: 1) Career development: Python is suitable for data science and back-end development, while JavaScript is suitable for front-end and full-stack development. 2) Learning curve: Python syntax is concise and suitable for beginners; JavaScript syntax is flexible. 3) Ecosystem: Python has rich scientific computing libraries, and JavaScript has a powerful front-end framework.

The power of the JavaScript framework lies in simplifying development, improving user experience and application performance. When choosing a framework, consider: 1. Project size and complexity, 2. Team experience, 3. Ecosystem and community support.

Introduction I know you may find it strange, what exactly does JavaScript, C and browser have to do? They seem to be unrelated, but in fact, they play a very important role in modern web development. Today we will discuss the close connection between these three. Through this article, you will learn how JavaScript runs in the browser, the role of C in the browser engine, and how they work together to drive rendering and interaction of web pages. We all know the relationship between JavaScript and browser. JavaScript is the core language of front-end development. It runs directly in the browser, making web pages vivid and interesting. Have you ever wondered why JavaScr


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

SublimeText3 Linux new version
SublimeText3 Linux latest version

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

VSCode Windows 64-bit Download
A free and powerful IDE editor launched by Microsoft

PhpStorm Mac version
The latest (2018.2.1) professional PHP integrated development tool

MantisBT
Mantis is an easy-to-deploy web-based defect tracking tool designed to aid in product defect tracking. It requires PHP, MySQL and a web server. Check out our demo and hosting services.
