Home  >  Article  >  Backend Development  >  Detailed explanation of how Python crawler DNS resolves cache

Detailed explanation of how Python crawler DNS resolves cache

黄舟
黄舟Original
2017-06-04 10:10:482081browse

This article mainly introduces the Pythoncrawler DNS parsingcachingmethod, combined with specific examples to analyze the related operating skills and notes of Python using the socket module to parse the DNS cache. Matter , friends in need can refer to

This article describes the Python crawler DNS resolution caching method with examples. Share it with everyone for your reference, the details are as follows:

Foreword:

This is the core code in the DNS resolution cache module in the Python crawler, It's last year's code, and it's now available for those who are interested to take a look.

Generally, the DNS resolution time of a domain name is between 10 and 60 milliseconds. This may seem insignificant, but for larger crawlers, this cannot be ignored. For example, if we want to crawl Sina Weibo, there are 10 million requests under the same domain name (which is not too much), so it takes between 100,000 and 600,000 seconds, which is only 86,400 seconds per day. In other words, DNS resolution alone takes several days. At this time, adding DNS resolution caching, the effect is obvious.

Put the code directly below, and the instructions are at the back.

Code:

# encoding=utf-8
# ---------------------------------------
#  版本:0.1
#  日期:2016-04-26
#  作者:九茶<bone_ace@163.com>
#  开发环境:Win64 + Python 2.7
# ---------------------------------------
import socket
# from gevent import socket
_dnscache = {}
def _setDNSCache():
  """ DNS缓存 """
  def _getaddrinfo(*args, **kwargs):
    if args in _dnscache:
      # print str(args) + " in cache"
      return _dnscache[args]
    else:
      # print str(args) + " not in cache"
      _dnscache[args] = socket._getaddrinfo(*args, **kwargs)
      return _dnscache[args]
  if not hasattr(socket, &#39;_getaddrinfo&#39;):
    socket._getaddrinfo = socket.getaddrinfo
    socket.getaddrinfo = _getaddrinfo

Description:

It’s actually nothing The difficulty is to save the cache in the socket to avoid repeated acquisition.
You can put the above code in a dns_cache.py file, and just call this _setDNSCache() method in the crawler framework .

It should be noted that if you use the gevent coroutine and use mon<a href="http://www.php.cn/wiki/1051.html" target="_blank">key</a>.patch_<a href="http://www.php.cn/wiki/1483.html" target="_blank">all</a>(), It should be noted that the crawler has switched to the socket in gevent at this time, and the DNS resolution cache module should also use the gevent socket.

The above is the detailed content of Detailed explanation of how Python crawler DNS resolves cache. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn