How to use a Node.js crawler to implement web page requests

亚连

Jun 12, 2018 pm 02:54 PM

node.js

This article introduces nodegrass, a web request module for Node.js crawlers, and is shared here for reference. The details are as follows:

Note: If you download the latest version of nodegrass, some methods have since been updated, so the examples in this article may no longer apply. Please see the examples at the open source address for details.

1. Why should I write such a module?

I wanted to write a crawler in Node.js. The official Node.js API for requesting remote resources is quite simple; see http://nodejs.org/api/http.html. It provides two methods for HTTP requests: http.get(options, callback) and http.request(options, callback).

As the names suggest, get is for GET requests, while request accepts more parameters, such as other request methods and the port of the target host. HTTPS requests work much the same way as HTTP. The simplest example:

var https = require('https');
https.get('https://encrypted.google.com/', function(res) {
 console.log("statusCode: ", res.statusCode);
 console.log("headers: ", res.headers);

 res.on('data', function(d) {
  process.stdout.write(d);
 });

}).on('error', function(e) {
 console.error(e);
});

With the code above, all we want is to request the remote host and get the response: the status, the headers, and the body. The second argument to get is a callback, so the response arrives asynchronously. Inside that callback, the res object listens for the data event, and the second argument to on is yet another callback. Once you have d (a chunk of the response body), processing it will very likely introduce more callbacks, layer upon layer, until your head spins. Asynchronous programming is confusing for people who are used to writing code in a synchronous style. Of course, there are excellent libraries, both in China and abroad, that let you write asynchronous code in a synchronous style, such as Lao Zhao's Wind.js... but that is getting off topic.

What we ultimately want from a call to get is the response itself; we do not care about the listening process behind res.on. I am too lazy to write res.on('data', func) every time, and so nodegrass, the module I want to introduce today, was born.
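The idea of hiding the res.on listeners behind a single callback can be sketched with the standard modules alone. This is a minimal illustration under assumptions, not nodegrass's actual source; the names collectBody and simpleGet are hypothetical.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: gather every 'data' chunk and hand the complete
// body to one callback when the response ends.
function collectBody(res, callback) {
  const chunks = [];
  res.on('data', function (chunk) {
    chunks.push(Buffer.from(chunk));
  });
  res.on('end', function () {
    // Join first, decode once, so multi-byte characters split across
    // chunks are not corrupted.
    callback(null, Buffer.concat(chunks).toString('utf8'));
  });
  res.on('error', callback);
}

// Hypothetical wrapper: one call, one callback(err, body, status, headers).
function simpleGet(url, callback) {
  const client = url.startsWith('https:') ? https : http;
  client.get(url, function (res) {
    collectBody(res, function (err, body) {
      callback(err, body, res.statusCode, res.headers);
    });
  }).on('error', callback);
}
```

A call such as simpleGet('https://encrypted.google.com/', function (err, body, status, headers) { ... }) would then receive the whole body at once, with no res.on in the calling code.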

2. Requesting resources with nodegrass, like jQuery's $.get(url, func)

The simplest example:

var nodegrass = require('nodegrass');
nodegrass.get("http://www.baidu.com",function(data,status,headers){
  console.log(status);
  console.log(headers);
  console.log(data);
},'gbk').on('error', function(e) {
  console.log("Got error: " + e.message);
});

At first glance this looks no different from the official get, and indeed it is almost the same =.=! It just removes one layer of event-listening callbacks, res.on('data', func). Believe it or not, it feels much more comfortable to me. The second argument is again a callback: data is the response body, status is the response status, and headers holds the response headers. Once we have the response, we can extract whatever interests us from it; this example simply prints it to the console.

The third argument is the character encoding. Node.js does not natively support gbk, so nodegrass uses iconv-lite internally to handle it. If the page you request is encoded in gbk, such as Baidu, just add this argument.
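The role of the encoding argument can be illustrated without nodegrass. The sketch below uses the built-in TextDecoder rather than iconv-lite, a substitution made here only so the example has no dependencies; the function name decodeBody is hypothetical.

```javascript
// Hypothetical helper: join raw response chunks, then decode them once
// with the page's character encoding.
function decodeBody(chunks, encoding) {
  const bytes = Buffer.concat(chunks);
  // TextDecoder throws a RangeError if the encoding label is unsupported.
  return new TextDecoder(encoding || 'utf-8').decode(bytes);
}

// The three UTF-8 bytes of one character (U+4E2D), split across two chunks.
const chunks = [Buffer.from([0xe4]), Buffer.from([0xb8, 0xad])];
console.log(decodeBody(chunks, 'utf-8')); // prints 中
```

For a gbk page the call would be decodeBody(chunks, 'gbk'), assuming a Node.js build with full ICU support. The chunks are concatenated before decoding because decoding each chunk on its own could split a multi-byte character in half.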

What about https requests? With the official API you have to require the https module, but its get method works much like the http one, so nodegrass integrates the two. For example:

var nodegrass = require('nodegrass');
nodegrass.get("https://github.com",function(data,status,headers){
  console.log(status);
  console.log(headers);
  console.log(data);
},'utf8').on('error', function(e) {
  console.log("Got error: " + e.message);
});

nodegrass automatically determines whether to use http or https from the URL. The URL must therefore include the scheme: write http://www.baidu.com/ rather than just www.baidu.com/.
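One way such scheme detection could work is sketched below, using only the standard URL class. It is an illustration of the idea, not a copy of nodegrass's internals, and the name pickTransport is hypothetical.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: choose the client module from the URL's scheme.
function pickTransport(urlString) {
  // new URL() throws a TypeError when the scheme is missing,
  // e.g. for 'www.baidu.com/'.
  const protocol = new URL(urlString).protocol;
  if (protocol === 'https:') return https;
  if (protocol === 'http:') return http;
  throw new Error('Unsupported protocol: ' + protocol);
}

console.log(pickTransport('https://github.com') === https); // prints true
```

This also shows why a bare www.baidu.com/ cannot work: without a scheme there is nothing to tell the two protocols apart.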

For post requests, nodegrass provides the post method. See example:

var ng = require('nodegrass');
// callback, headers and options come from the surrounding code (not shown here)
ng.post("https://api.weibo.com/oauth2/access_token", function(data, status, headers) {
  var accessToken = JSON.parse(data);
  var err = null;
  if (accessToken.error) {
     err = accessToken;
  }
  callback(err, accessToken);
}, headers, options, 'utf8');

The above is part of the Sina Weibo OAuth 2.0 flow for requesting an access token; it uses nodegrass's post method to call the access_token API.

Compared with get, the post method takes two additional arguments: headers, the request headers, and options, the POST data. Both are object literals:

var headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    // data is not defined in this excerpt; presumably the serialized POST body
    'Content-Length': data.length
  };

var options = {
     client_id : 'id',
     client_secret : 'cs',
     grant_type : 'authorization_code',
     redirect_uri : 'your callback url',
     // acode is not defined in this excerpt; presumably the authorization code
     code: acode
  };
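For comparison, the same kind of form POST can be assembled with the standard https module. The sketch below is an assumption-laden illustration, not nodegrass's API: buildFormPost and sendFormPost are hypothetical names, and the field values are placeholders. Note that Content-Length counts bytes, so Buffer.byteLength is used instead of the string's length.

```javascript
const https = require('https');

// Hypothetical helper: serialize form fields and build matching headers.
function buildFormPost(urlString, fields) {
  const target = new URL(urlString);
  // URLSearchParams encodes keys and values as form data.
  const body = new URLSearchParams(fields).toString();
  const options = {
    method: 'POST',
    hostname: target.hostname,
    port: target.port || 443,
    path: target.pathname + target.search,
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      // Content-Length counts bytes, not characters.
      'Content-Length': Buffer.byteLength(body)
    }
  };
  return { options: options, body: body };
}

// Hypothetical sender (defined but not called here, to avoid network access).
function sendFormPost(post, onResponse) {
  const req = https.request(post.options, onResponse);
  req.on('error', console.error);
  req.end(post.body);
  return req;
}

// Placeholder values, as in the article's options object.
const post = buildFormPost('https://api.weibo.com/oauth2/access_token', {
  client_id: 'id',
  client_secret: 'cs',
  grant_type: 'authorization_code'
});
console.log(post.body);
// client_id=id&client_secret=cs&grant_type=authorization_code
```

Keeping the request description separate from the code that sends it makes the headers and body easy to inspect before anything goes over the wire.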

3. Using nodegrass as a proxy server?

Look at the example:

var ng = require('nodegrass'),
   http = require('http'),
   url = require('url');

   http.createServer(function(req, res) {
    var pathname = url.parse(req.url).pathname;

    if (pathname === '/') {
      // Fetch the remote page and relay it to the client
      ng.get('http://www.cnblogs.com/', function(data) {
        res.writeHead(200, {'Content-Type': 'text/html;charset=utf-8'});
        res.write(data + "\n");
        res.end();
        }, 'utf8');
      } else {
        // Answer other paths so the request does not hang
        res.writeHead(404);
        res.end();
      }
   }).listen(8088);
   console.log('server listening 8088...');

It's that simple. A real proxy server is of course far more complicated, and this hardly counts as one, but if you visit local port 8088 you will see the cnblogs (Blog Park) home page.
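A variant of the same relay that streams the upstream response instead of buffering it can be sketched with the standard http module. This is only an illustration under assumptions: createRelayHandler is a hypothetical name, redirects are passed through to the client rather than followed, and nothing here is part of nodegrass.

```javascript
const http = require('http');

// Hypothetical factory: returns a request handler that relays one
// fixed upstream page to every client.
function createRelayHandler(targetUrl) {
  return function (req, res) {
    http.get(targetUrl, function (upstream) {
      // Pass the upstream status and headers through unchanged,
      // then stream the body instead of buffering it.
      res.writeHead(upstream.statusCode, upstream.headers);
      upstream.pipe(res);
    }).on('error', function (err) {
      if (res.headersSent) {
        // Too late to change the status; just close the response.
        res.end();
        return;
      }
      // The upstream host could not be reached.
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('Bad gateway: ' + err.message);
    });
  };
}

// Usage (not started here):
//   http.createServer(createRelayHandler('http://www.cnblogs.com/')).listen(8088);
```

Streaming with pipe keeps memory use flat for large pages, at the cost of not being able to inspect or rewrite the body on the way through.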

The open source address of nodegrass: https://github.com/scottkiss/nodegrass

The above is what I compiled for everyone. I hope it will be helpful to everyone in the future.
