Node crawls Lagou.com data and exports it to an excel file-JS Tutorial-php.cn

Home

Web Front-end

JS Tutorial

Node crawls Lagou.com data and exports it to an excel file

不言

Jul 07, 2018 pm 05:55 PM

javascriptnode.js

This article mainly introduces the node crawling Lagou.com data and exporting it as an excel file. It has a certain reference value. Now I share it with everyone. Friends in need can refer to it

Preface

I have been learning node.js intermittently before. Today I will practice using Lagou.com and learn about the recent recruitment market through data! I am relatively new to node, and I hope I can learn and make progress together with everyone.

1. Summary

We first need to clarify the specific needs:

can be crawled through node index city position Related information
You can also enter node index start to directly crawl our predefined city and position arrays, and loop to crawl different job information in different cities
Store the final crawling results in the local ./data directory
Generate the corresponding excel file and store it locally

fs: used to read and write system files and directories
async: process control
superagent: client request proxy module
node-xlsx: Export files in a certain format as excel

3. The main steps of the crawler:

Initialize the project

New project directory

Create the project in the appropriate disk directory Directory node-crwl-lagou

Initialization project

Enter the node-crwl-lagou folder

Execute npm init and initialize the package.json file

Install dependent packages

npm install async

npm install superagent

npm install node-xlsx

Processing of command line input

For the content entered on the command line, you can use process.argv to obtain it. It will return an array, and each item in the array is the content entered by the user.
Distinguish between node index regional position and node index start two inputs, the simplest is to determine the length of process.argv, if the length is four, directly call the crawler main program To crawl the data, if the length is three, we need to piece together the URL through the predefined city and position arrays, and then use async.mapSeries to call the main program in a loop. The homepage code for command analysis is as follows:

if (process.argv.length === 4) {
  let args = process.argv
  console.log('准备开始请求' + args[2] + '的' + args[3] + '职位数据');
  requsetCrwl.controlRequest(args[2], args[3])
} else if (process.argv.length === 3 && process.argv[2] === 'start') {
  let arr = []
  for (let i = 0; i <p>The predefined city and position arrays are as follows: </p><pre class="brush:php;toolbar:false">{
    "city": ["北京","上海","广州","深圳","杭州","南京","成都","西安","武汉","重庆"],
    "position": ["前端","java","php","ios","android","c++","python",".NET"]
}

The next step is the analysis of the main program part of the crawler.

Analyze the page and find the request address

First we open the homepage of Lagou.com, enter the query information (such as node), and then check the console to find the relevant request, as shown in the figure:

Node crawls Lagou.com data and exports it to an excel file

This post requesthttps://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false is what we need, through three requests Parameters to obtain different data, a simple analysis can tell: the parameter first is to mark whether the current page is the first page, true is yes, false is no; the parameter pn is the current The page number; the parameter kd is the query input content.

Requesting data through superagent

First of all, it needs to be clear that the entire program is asynchronous, and we need to use async.series to call it in sequence.
View the response returned by the analysis:

Node crawls Lagou.com data and exports it to an excel file

You can see that content.positionResult.totalCount is the total number of pages we need
We use superagent to directly call the post request , the console will prompt the following information:

{'success': False, 'msg': '您操作太频繁,请稍后再访问', 'clientIp': '122.xxx.xxx.xxx'}

This is actually one of the anti-crawler strategies. We only need to add a request header to it. The method of obtaining the request header is very simple, as follows:

Node crawls Lagou.com data and exports it to an excel file

Then use superagent to call the post request. The main code is as follows:

// 先获取总页数
    (cb) => {
      superagent
        .post(`https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&city=${city}&kd=${position}&pn=1`)
        .send({
          'pn': 1,
          'kd': position,
          'first': true
        })
        .set(options.options)
        .end((err, res) => {
          if (err) throw err
          // console.log(res.text)
          let resObj = JSON.parse(res.text)
          if (resObj.success === true) {
            totalPage = resObj.content.positionResult.totalCount;
            cb(null, totalPage);
          } else {
            console.log(`获取数据失败:${res.text}}`)
          }
        })
    },

After getting the total number of pages, we can passTotal number of pages/15Get the pn parameter, loop to generate all URLs and store them in urls:

(cb) => {
      for (let i=0;Math.ceil(i<totalpage><p>With all the URLs, it is not difficult to crawl to all the data. Continue to use the superagent's post method to loop requests For all URLs, every time the data is obtained, a json file is created in the data directory and the returned data is written. This seems simple, but there are two points to note: </p>
<ol class=" list-paddingleft-2">
<li><p>In order to prevent too many concurrent requests from causing the IP to be blocked: when looping URLs, you need to use the async.mapLimit method to control the concurrency to 3. After each request, it will take two seconds before sending the next request</p></li>
<li><p>在async.mapLimit的第四个参数中，需要通过判断调用主函数的第三个参数是否存在来区分一下是那种命令输入，因为对于<code>node index start</code>这个命令，我们使用得是async.mapSeries，每次调用主函数都传递了<code>(city, position, callback)</code>，所以如果是<code>node index start</code>的话，需要在每次获取数据完后将null传递回去，否则无法进行下一次循环</p></li>
</ol>
<p>主要代码如下：</p>
<pre class="brush:php;toolbar:false">// 控制并发为3
    (cb) => {
      async.mapLimit(urls, 3, (url, callback) => {
        num++;
        let page = url.split('&')[3].split('=')[1];
        superagent
          .post(url)
          .send({
            'pn': totalPage,
            'kd': position,
            'first': false
          })
          .set(options.options)
          .end((err, res) => {
            if (err) throw err
            let resObj = JSON.parse(res.text)
            if (resObj.success === true) {
              console.log(`正在抓取第${page}页，当前并发数量：${num}`);
              if (!fs.existsSync('./data')) {
                fs.mkdirSync('./data');
              }
              // 将数据以.json格式储存在data文件夹下
              fs.writeFile(`./data/${city}_${position}_${page}.json`, res.text, (err) => {
                if (err) throw err;
                // 写入数据完成后，两秒后再发送下一次请求
                setTimeout(() => {
                  num--;
                  console.log(`第${page}页写入成功`);
                  callback(null, 'success');
                }, 2000);
              });
            }
          })
      }, (err, result) => {
        if (err) throw err;
        // 这个arguments是调用controlRequest函数的参数，可以区分是那种爬取（循环还是单个）
        if (arguments[2]) {
          ok = 1;
        }
        cb(null, ok)
      })
    },
    () => {
      if (ok) {
        setTimeout(function () {
          console.log(`${city}的${position}数据请求完成`);
          indexCallback(null);
        }, 5000);
      } else {
        console.log(`${city}的${position}数据请求完成`);
      }
      // exportExcel.exportExcel() // 导出为excel
    }

导出的json文件如下：
Node crawls Lagou.com data and exports it to an excel file

json文件导出为excel

将json文件导出为excel有多种方式，我使用的是node-xlsx这个node包，这个包需要将数据按照固定的格式传入，然后导出即可，所以我们首先做的就是先拼出其所需的数据格式：

function exportExcel() {
  let list = fs.readdirSync('./data')
  let dataArr = []
  list.forEach((item, index) => {
    let path = `./data/${item}`
    let obj = fs.readFileSync(path, 'utf-8')
    let content = JSON.parse(obj).content.positionResult.result
    let arr = [['companyFullName', 'createTime', 'workYear', 'education', 'city', 'positionName', 'positionAdvantage', 'companyLabelList', 'salary']]
    content.forEach((contentItem) => {
      arr.push([contentItem.companyFullName, contentItem.phone, contentItem.workYear, contentItem.education, contentItem.city, contentItem.positionName, contentItem.positionAdvantage, contentItem.companyLabelList.join(','), contentItem.salary])
    })
    dataArr[index] = {
      data: arr,
      name: path.split('./data/')[1] // 名字不能包含 \ / ? * [ ]
    }
  })

// 数据格式
// var data = [
//   {
//     name : 'sheet1',
//     data : [
//       [
//         'ID',
//         'Name',
//         'Score'
//       ],
//       [
//         '1',
//         'Michael',
//         '99'
//
//       ],
//       [
//         '2',
//         'Jordan',
//         '98'
//       ]
//     ]
//   },
//   {
//     name : 'sheet2',
//     data : [
//       [
//         'AA',
//         'BB'
//       ],
//       [
//         '23',
//         '24'
//       ]
//     ]
//   }
// ]

// 写xlsx
  var buffer = xlsx.build(dataArr)
  fs.writeFile('./result.xlsx', buffer, function (err)
    {
      if (err)
        throw err;
      console.log('Write to xls has finished');

// 读xlsx
//     var obj = xlsx.parse("./" + "resut.xls");
//     console.log(JSON.stringify(obj));
    }
  );
}

导出的excel文件如下，每一页的数据都是一个sheet，比较清晰明了：
Node crawls Lagou.com data and exports it to an excel file

我们可以很清楚的从中看出目前西安.net的招聘情况，之后也可以考虑用更形象的图表方式展示爬到的数据，应该会更加直观！

总结

其实整个爬虫过程并不复杂，注意就是注意的小点很多，比如async的各个方法的使用以及导出设置header等，总之，也是收获满满哒！

源码

gitbug地址： https://github.com/fighting12...

以上就是本文的全部内容，希望对大家的学习有所帮助，更多相关内容请关注PHP中文网！

相关推荐：

react 官网动画库（react-transition-group）的新写法

原生JS基于window.scrollTo()封装垂直滚动动画工具函数

NodeList 和 HTMLCollection 和 Array的解析

The above is the detailed content of Node crawls Lagou.com data and exports it to an excel file. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Python vs. JavaScript: A Comparative Analysis for DevelopersMay 09, 2025 am 12:22 AM

The main difference between Python and JavaScript is the type system and application scenarios. 1. Python uses dynamic types, suitable for scientific computing and data analysis. 2. JavaScript adopts weak types and is widely used in front-end and full-stack development. The two have their own advantages in asynchronous programming and performance optimization, and should be decided according to project requirements when choosing.

Python vs. JavaScript: Choosing the Right Tool for the JobMay 08, 2025 am 12:10 AM

Whether to choose Python or JavaScript depends on the project type: 1) Choose Python for data science and automation tasks; 2) Choose JavaScript for front-end and full-stack development. Python is favored for its powerful library in data processing and automation, while JavaScript is indispensable for its advantages in web interaction and full-stack development.

Python and JavaScript: Understanding the Strengths of EachMay 06, 2025 am 12:15 AM

Python and JavaScript each have their own advantages, and the choice depends on project needs and personal preferences. 1. Python is easy to learn, with concise syntax, suitable for data science and back-end development, but has a slow execution speed. 2. JavaScript is everywhere in front-end development and has strong asynchronous programming capabilities. Node.js makes it suitable for full-stack development, but the syntax may be complex and error-prone.

JavaScript's Core: Is It Built on C or C ?May 05, 2025 am 12:07 AM

JavaScriptisnotbuiltonCorC ;it'saninterpretedlanguagethatrunsonenginesoftenwritteninC .1)JavaScriptwasdesignedasalightweight,interpretedlanguageforwebbrowsers.2)EnginesevolvedfromsimpleinterpreterstoJITcompilers,typicallyinC ,improvingperformance.

JavaScript Applications: From Front-End to Back-EndMay 04, 2025 am 12:12 AM

JavaScript can be used for front-end and back-end development. The front-end enhances the user experience through DOM operations, and the back-end handles server tasks through Node.js. 1. Front-end example: Change the content of the web page text. 2. Backend example: Create a Node.js server.

Python vs. JavaScript: Which Language Should You Learn?May 03, 2025 am 12:10 AM

Choosing Python or JavaScript should be based on career development, learning curve and ecosystem: 1) Career development: Python is suitable for data science and back-end development, while JavaScript is suitable for front-end and full-stack development. 2) Learning curve: Python syntax is concise and suitable for beginners; JavaScript syntax is flexible. 3) Ecosystem: Python has rich scientific computing libraries, and JavaScript has a powerful front-end framework.

JavaScript Frameworks: Powering Modern Web DevelopmentMay 02, 2025 am 12:04 AM

The power of the JavaScript framework lies in simplifying development, improving user experience and application performance. When choosing a framework, consider: 1. Project size and complexity, 2. Team experience, 3. Ecosystem and community support.

The Relationship Between JavaScript, C , and BrowsersMay 01, 2025 am 12:06 AM

Introduction I know you may find it strange, what exactly does JavaScript, C and browser have to do? They seem to be unrelated, but in fact, they play a very important role in modern web development. Today we will discuss the close connection between these three. Through this article, you will learn how JavaScript runs in the browser, the role of C in the browser engine, and how they work together to drive rendering and interaction of web pages. We all know the relationship between JavaScript and browser. JavaScript is the core language of front-end development. It runs directly in the browser, making web pages vivid and interesting. Have you ever wondered why JavaScr

See all articles