
node.js - With asyncio and aiohttp available in Python, are multithreading/multiprocessing still necessary for IO-bound tasks such as crawlers?








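import asyncio
import time

import aiohttp

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'

async def fetch_async(a):
    # One GET per number; httpbin echoes the query string back under 'args'.
    async with aiohttp.request('GET', URL.format(a)) as r:
        data = await r.json()
    return data['args']['a']

async def main():
    # gather() runs all requests concurrently on a single event loop.
    return await asyncio.gather(*(fetch_async(num) for num in NUMBERS))

start = time.time()
results = asyncio.run(main())

for num, result in zip(NUMBERS, results):
    print('fetch({}) = {}'.format(num, result))

print('Use asyncio+aiohttp cost: {}'.format(time.time() - start))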


import asyncio
import math
import time
from concurrent.futures import ThreadPoolExecutor

import aiohttp

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'

async def fetch_async(a):
    async with aiohttp.request('GET', URL.format(a)) as r:
        data = await r.json()
    return a, data['args']['a']

async def fetch_all(numbers):
    return await asyncio.gather(*(fetch_async(num) for num in numbers))

def sub_loop(numbers):
    # Runs in a worker thread; asyncio.run() gives that thread its own event loop.
    for num, result in asyncio.run(fetch_all(numbers)):
        print('fetch({}) = {}'.format(num, result))

async def run(executor, numbers):
    await asyncio.get_running_loop().run_in_executor(executor, sub_loop, numbers)

def chunks(l, size):
    # Split l into at most `size` contiguous chunks.
    n = math.ceil(len(l) / size)
    for i in range(0, len(l), n):
        yield l[i:i + n]

async def main():
    # max_workers=3 is an assumption: the original snippet never defines executor.
    with ThreadPoolExecutor(max_workers=3) as executor:
        await asyncio.gather(*(run(executor, chunked) for chunked in chunks(NUMBERS, 3)))

start = time.time()
asyncio.run(main())

print('Use asyncio+aiohttp+ThreadPoolExecutor cost: {}'.format(time.time() - start))

The traditional requests + ThreadPoolExecutor version is 3 times slower than the code above:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

NUMBERS = range(12)
URL = 'http://httpbin.org/get?a={}'

def fetch(a):
    r = requests.get(URL.format(a))
    return r.json()['args']['a']

start = time.time()
with ThreadPoolExecutor(max_workers=3) as executor:
    for num, result in zip(NUMBERS, executor.map(fetch, NUMBERS)):
        print('fetch({}) = {}'.format(num, result))

print('Use requests+ThreadPoolExecutor cost: {}'.format(time.time() - start))


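If Python cannot get rid of the GIL, I think the ideal model going forward is multiprocessing + coroutines (asyncio+aiohttp). uvloop, sanic, and a crawler project in 500lines have already started doing this. Setting compatibility issues aside, is this view correct, and in which scenarios can coroutines not replace multithreading?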

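There are many approaches to asynchrony; twisted, tornado, and others each have their own solution. This question is specifically about coroutine-based async with asyncio+aiohttp.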
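A minimal sketch of the multiprocessing + coroutine model described in the question, under stated assumptions: `fetch` is a hypothetical placeholder that uses `asyncio.sleep` instead of a real aiohttp request so the example runs without network access, and the names `crawl_chunk`, `chunks`, and `main` are illustrative only. Each worker process runs its own event loop over one slice of the input.

```python
import asyncio
import math
from concurrent.futures import ProcessPoolExecutor


async def fetch(num):
    # Hypothetical placeholder for an aiohttp request: wait, then echo the input.
    await asyncio.sleep(0.1)
    return num, str(num)


async def fetch_many(numbers):
    # All fetches for one chunk run concurrently on one event loop.
    return await asyncio.gather(*(fetch(n) for n in numbers))


def crawl_chunk(numbers):
    # Entry point of each worker process: one private event loop per process.
    return asyncio.run(fetch_many(numbers))


def chunks(items, parts):
    # Split items into at most `parts` contiguous slices.
    size = max(1, math.ceil(len(items) / parts))
    return [items[i:i + size] for i in range(0, len(items), size)]


def main(numbers, processes=3):
    results = []
    with ProcessPoolExecutor(max_workers=processes) as pool:
        # pool.map preserves chunk order, so results stay in input order.
        for chunk_result in pool.map(crawl_chunk, chunks(numbers, processes)):
            results.extend(chunk_result)
    return results


if __name__ == "__main__":
    for num, result in main(list(range(12))):
        print('fetch({}) = {}'.format(num, result))
```

The `if __name__ == "__main__":` guard matters here because worker processes may re-import the module on platforms that spawn rather than fork. Processes add parallelism across cores; the coroutines inside each process handle the waiting on IO.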


阿神阿神2714 days ago1081

All replies (6)

  • 伊谢尔伦

    伊谢尔伦 2017-04-18 10:18:50

    I don't know much about Python crawlers, but Scrapy is the usual choice for writing them, and it is built on the Twisted asynchronous framework.

    Multiple processes can make full use of multiple cores, so for now the ideal combination is multiprocessing + coroutines.

    requests is still synchronous, so every call blocks its thread, and combining it with async code gains nothing. Think of it as calling time.sleep instead of asyncio.sleep inside asyncio code.
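The time.sleep versus asyncio.sleep comparison in the reply above can be made concrete with a small timing sketch. The helper names (`blocking_worker`, `cooperative_worker`, `timed`, `compare`) and the delay values are assumptions chosen for illustration; the sleeps stand in for a blocking call such as requests.get and a non-blocking one such as an aiohttp request.

```python
import asyncio
import time


async def blocking_worker(delay):
    # time.sleep() blocks the thread, so the event loop cannot switch tasks.
    time.sleep(delay)


async def cooperative_worker(delay):
    # asyncio.sleep() yields to the event loop while waiting.
    await asyncio.sleep(delay)


async def timed(worker, count=3, delay=0.2):
    """Run `count` workers with gather() and return the elapsed wall time."""
    start = time.perf_counter()
    await asyncio.gather(*(worker(delay) for _ in range(count)))
    return time.perf_counter() - start


def compare():
    blocking = asyncio.run(timed(blocking_worker))
    cooperative = asyncio.run(timed(cooperative_worker))
    return blocking, cooperative


if __name__ == "__main__":
    blocking, cooperative = compare()
    print('time.sleep:    {:.2f}s'.format(blocking))
    print('asyncio.sleep: {:.2f}s'.format(cooperative))
```

The blocking workers run one after another, so their total time is roughly the sum of the delays, while the cooperative workers overlap and finish in roughly one delay.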

  • 伊谢尔伦

    伊谢尔伦 2017-04-18 10:18:50

    Check out this article: http://aosabook.org/en/500L/a...

  • PHP中文网

    PHP中文网 2017-04-18 10:18:50

    asyncio is built on coroutines: it handles many asynchronous tasks within a single thread. Asynchronous tasks include timers, asynchronous IO, and so on.

    But what if a task does not support async?

    For example, reading or writing through a blocking IO API, or doing a long, heavy computation. In those cases the coroutine blocks the whole event loop, and this is where multiple processes and threads show their advantage.

    The two have different use cases. Pick the approach that fits the scenario.
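One common way to combine the two, sketched here under assumptions: `blocking_io` is a hypothetical stand-in for any synchronous call, and `heartbeat` is an illustrative coroutine that only makes progress while the event loop stays responsive. The blocking calls are handed to a thread pool through `loop.run_in_executor`, so the loop keeps running other coroutines in the meantime.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_io(delay):
    # Hypothetical stand-in for a synchronous call such as requests.get().
    time.sleep(delay)
    return delay


async def heartbeat(ticks, interval=0.05):
    # Only advances while the event loop is free to schedule it.
    for _ in range(ticks):
        await asyncio.sleep(interval)
    return ticks


async def main():
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Both blocking calls run in worker threads, in parallel with heartbeat().
        results = await asyncio.gather(
            loop.run_in_executor(pool, blocking_io, 0.3),
            loop.run_in_executor(pool, blocking_io, 0.3),
            heartbeat(3),
        )
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print('results = {}, elapsed = {:.2f}s'.format(results, elapsed))
```

For heavy computation rather than blocking IO, a `ProcessPoolExecutor` can be passed to `run_in_executor` in the same way, since threads alone do not get around the GIL.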

  • PHP中文网

    PHP中文网 2017-04-18 10:18:50

    asyncio depends on support from third-party libraries, so essentially every existing library has to be rewritten for it: serial ports, network protocols, including requests and HTTP. That said, in the time since the last two versions, many commonly used libraries have gained asynchronous counterparts, requests included.

  • PHPz

    PHPz 2017-04-18 10:18:50

    asyncio needs asynchronous APIs underneath it. A synchronous non-blocking API would also work, but Python has nothing like setInterval, so you may need to hack it yourself.

    With a synchronous blocking API, one stuck callback prevents every other callback from running. Take a look: most of the IO APIs you have seen so far are blocking.

  • 黄舟

    黄舟 2017-04-18 10:18:50

    Because of the GIL, Python multithreading is of limited practical use, since threads cannot execute Python bytecode in parallel, but multiprocessing is still very useful.
