How to control the number of concurrencies using async and enterproxy
I believe concurrency is familiar to everyone. This article mainly introduces to you the relevant information about using async and enterproxy to control the number of concurrencies. The article introduces it in detail through sample code, which is of certain significance to everyone's study or work. The reference learning value, friends who need it, please study together below.
Let’s talk about concurrency and parallelism
Concurrency, in the operating system, means that several programs are started in a period of time Between running and completion, these programs are all running on the same processor, but only one program is running on the processor at any point in time.
Concurrency is something we often mention. Whether it is a web server or app, concurrency is everywhere. In the operating system, it refers to several programs in a period of time between starting and running, and these programs They all run on the same processor, and only one program is running on the processor at any point in time. Many websites have limits on the number of concurrent connections, so when requests are sent too quickly, the return value will be empty or an error will be reported. What's more, some websites may block your IP because you send too many concurrent connections and think you are making malicious requests.
Compared with concurrency, parallelism may be a lot unfamiliar. Parallelism refers to a group of programs executing at an independent and asynchronous speed, which is not equal to overlap in time (occurring at the same moment). Multiple CPU cores can be implemented by increasing the number of CPU cores. Programs (tasks) are carried out simultaneously. That's right, parallel tasks can be performed simultaneously
Use enterproxy to control the number of concurrencies
enterproxy is a tool that Pu Lingda made a major contribution to , bringing about a change in the thinking of event-based programming, using the event mechanism to decouple complex business logic, solving the criticism of the coupling of callback functions, turning serial waiting into parallel waiting, and improving execution efficiency in multi-asynchronous collaboration scenarios
How do we use enterproxy to control the number of concurrencies? Usually if we don't use enterproxy and homemade counters, if we grab three sources:
This deeply nested, serial way
var render = function (template, data) { _.template(template, data); }; $.get("template", function (template) { // something $.get("data", function (data) { // something $.get("l10n", function (l10n) { // something render(template, data, l10n); }); }); });
Remove this deep nesting in the past Method, our conventional way of writing is to maintain a counter by ourselves
(function(){ var count = 0; var result = {}; $.get('template',function(data){ result.data1 = data; count++; handle(); }) $.get('data',function(data){ result.data2 = data; count++; handle(); }) $.get('l10n',function(data){ result.data3 = data; count++; handle(); }) function handle(){ if(count === 3){ var html = fuck(result.data1,result.data2,result.data3); render(html); } } })();
Here, enterproxy can play the role of this counter. It helps you manage whether these asynchronous operations are completed. After completion, it will automatically call you to provide processing function, and pass the captured data as parameters
var ep = new enterproxy(); ep.all('data_event1','data_event2','data_event3',function(data1,data2,data3){ var html = fuck(data1,data2,data3); render(html); }) $.get('http:example1',function(data){ ep.emit('data_event1',data); }) $.get('http:example2',function(data){ ep.emit('data_event2',data); }) $.get('http:example3',function(data){ ep.emit('data_event3',data); })
enterproxy also provides APIs required for many other scenarios. You can learn this API by yourself enterproxy
Use async to control the number of concurrency
If we have 40 requests to make, many websites may think you are making a malicious request because you send too many concurrent connections. Block your IP.
So we always need to control the number of concurrency, and then slowly crawl these 40 links.
Use mapLimit in async to control the number of concurrency at one time to 5, and only capture 5 links at a time.
async.mapLimit(arr, 5, function (url, callback) { // something }, function (error, result) { console.log("result: ") console.log(result); })
We should first know what concurrency is, why we need to limit the number of concurrencies, and what solutions are available. Then you can go to the documentation to see how to use the API. The async documentation is a great way to learn these syntaxes.
Simulate a set of data, the data returned here is false, and the return delay is random.
var concurreyCount = 0; var fetchUrl = function(url,callback){ // delay 的值在 2000 以内,是个随机的整数 模拟延时 var delay = parseInt((Math.random()* 10000000) % 2000,10); concurreyCount++; console.log('现在并发数是 ' , concurreyCount , ' 正在抓取的是' , url , ' 耗时' + delay + '毫秒'); setTimeout(function(){ concurreyCount--; callback(null,url + ' html content'); },delay); } var urls = []; for(var i = 0;i<30;i++){ urls.push('http://datasource_' + i) }
Then we use async.mapLimit to crawl concurrently and obtain the results.
async.mapLimit(urls,5,function(url,callback){ fetchUrl(url,callbcak); },function(err,result){ console.log('result: '); console.log(result); })
The simulation is excerpted from alsotang
The following results are obtained after running the output
We found that the number of concurrency starts from 1, but increases to At 5 o'clock, it is no longer increasing. When there are tasks, continue to crawl, and the number of concurrent connections is always controlled at 5.
Complete the node simple crawler system
Because the concurrency controlled by eventproxy is used in the tutorial example of "node teaching is not guaranteed" by senior alsotang Quantity, let’s complete a simple node crawler that uses async to control the number of concurrencies.
The crawling target is the homepage of this site (manual face protection)
The first step, first we need to use the following modules:
url: used for url parsing, here url.resolve() is used to generate a legal domain name
async: a practical module that provides powerful functions and asynchronous JavaScript work
cheerio: A fast, flexible, jQuery core implementation specially customized for the server
superagent: A very convenient client request in nodejs Agent module
Install the dependent module through npm
The second step is to introduce the dependent module through require and determine the crawling object URL:
var url = require("url"); var async = require("async"); var cheerio = require("cheerio"); var superagent = require("superagent"); var baseUrl = 'http://www.chenqaq.com';
Step 3: Use superagent to request the target URL, and use cheerio to process baseUrl to get the target content URL, and save it in the array arr
superagent.get(baseUrl) .end(function (err, res) { if (err) { return console.error(err); } var arr = []; var $ = cheerio.load(res.text); // 下面和jQuery操作是一样一样的.. $(".post-list .post-title-link").each(function (idx, element) { $element = $(element); var _url = url.resolve(baseUrl, $element.attr("href")); arr.push(_url); }); // 验证得到的所有文章链接集合 output(arr); // 第四步:接下来遍历arr,解析每一个页面需要的信息 })
We need a function to verify the captured url object , it’s very simple. We only need a function to traverse arr and print it out:
function output(arr){ for(var i = 0;i<arr.length;i++){ console.log(arr[i]); } }
第四步:我们需要遍历得到的URL对象,解析每一个页面需要的信息。
这里就需要用到async控制并发数量,如果你上一步获取了一个庞大的arr数组,有多个url需要请求,如果同时发出多个请求,一些网站就可能会把你的行为当做恶意请求而封掉你的ip
async.mapLimit(arr,3,function(url,callback){ superagent.get(url) .end(function(err,mes){ if(err){ console.error(err); console.log('message info ' + JSON.stringify(mes)); } console.log('「fetch」' + url + ' successful!'); var $ = cheerio.load(mes.text); var jsonData = { title:$('.post-card-title').text().trim(), href: url, }; callback(null,jsonData); },function(error,results){ console.log('results '); console.log(results); }) })
得到上一步保存url地址的数组arr,限制最大并发数量为3,然后用一个回调函数处理 「该回调函数比较特殊,在iteratee方法中一定要调用该回调函数,有三种方式」
callback(null)
调用成功callback(null,data)
调用成功,并且返回数据data追加到resultscallback(data)
调用失败,不会再继续循环,直接到最后的callback
好了,到这里我们的node简易的小爬虫就完成了,来看看效果吧
嗨呀,首页数据好少,但是成功了呢。
参考资料
Node.js 包教不包会 - alsotang
enterproxy
async
async Documentation
上面是我整理给大家的,希望今后会对大家有帮助。
相关文章:
The above is the detailed content of How to control the number of concurrencies using async and enterproxy. For more information, please follow other related articles on the PHP Chinese website!

Introduction I know you may find it strange, what exactly does JavaScript, C and browser have to do? They seem to be unrelated, but in fact, they play a very important role in modern web development. Today we will discuss the close connection between these three. Through this article, you will learn how JavaScript runs in the browser, the role of C in the browser engine, and how they work together to drive rendering and interaction of web pages. We all know the relationship between JavaScript and browser. JavaScript is the core language of front-end development. It runs directly in the browser, making web pages vivid and interesting. Have you ever wondered why JavaScr

Node.js excels at efficient I/O, largely thanks to streams. Streams process data incrementally, avoiding memory overload—ideal for large files, network tasks, and real-time applications. Combining streams with TypeScript's type safety creates a powe

The differences in performance and efficiency between Python and JavaScript are mainly reflected in: 1) As an interpreted language, Python runs slowly but has high development efficiency and is suitable for rapid prototype development; 2) JavaScript is limited to single thread in the browser, but multi-threading and asynchronous I/O can be used to improve performance in Node.js, and both have advantages in actual projects.

JavaScript originated in 1995 and was created by Brandon Ike, and realized the language into C. 1.C language provides high performance and system-level programming capabilities for JavaScript. 2. JavaScript's memory management and performance optimization rely on C language. 3. The cross-platform feature of C language helps JavaScript run efficiently on different operating systems.

JavaScript runs in browsers and Node.js environments and relies on the JavaScript engine to parse and execute code. 1) Generate abstract syntax tree (AST) in the parsing stage; 2) convert AST into bytecode or machine code in the compilation stage; 3) execute the compiled code in the execution stage.

The future trends of Python and JavaScript include: 1. Python will consolidate its position in the fields of scientific computing and AI, 2. JavaScript will promote the development of web technology, 3. Cross-platform development will become a hot topic, and 4. Performance optimization will be the focus. Both will continue to expand application scenarios in their respective fields and make more breakthroughs in performance.

Both Python and JavaScript's choices in development environments are important. 1) Python's development environment includes PyCharm, JupyterNotebook and Anaconda, which are suitable for data science and rapid prototyping. 2) The development environment of JavaScript includes Node.js, VSCode and Webpack, which are suitable for front-end and back-end development. Choosing the right tools according to project needs can improve development efficiency and project success rate.

Yes, the engine core of JavaScript is written in C. 1) The C language provides efficient performance and underlying control, which is suitable for the development of JavaScript engine. 2) Taking the V8 engine as an example, its core is written in C, combining the efficiency and object-oriented characteristics of C. 3) The working principle of the JavaScript engine includes parsing, compiling and execution, and the C language plays a key role in these processes.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

WebStorm Mac version
Useful JavaScript development tools

Dreamweaver Mac version
Visual web development tools

ZendStudio 13.5.1 Mac
Powerful PHP integrated development environment

PhpStorm Mac version
The latest (2018.2.1) professional PHP integrated development tool

EditPlus Chinese cracked version
Small size, syntax highlighting, does not support code prompt function
