How can a Python crawler scrape jokes from Qiushibaike (糗事百科) in bulk?

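I have just started learning Python and don't know the scrapy framework. I only want a simple crawler that scrapes the jokes on the first 10 pages (the first N pages). Is there some simpler code that can do this without scrapy? I tried adding a for loop around page, but it still only scraped a single page, and I don't know how to fix it.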

import urllib
import urllib2
import re

page = 1
url = 'http://www.qiushibaike.com/8hr/page/' + str(page)
user_agent = 'Mozilla/5.0 ( Windows NT 6.1)'
headers = { 'User-Agent' : user_agent }

try:
    request = urllib2.Request(url,headers = headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    pattern = re.compile('<p.*?class="content">.*?<span>(.*?)</span>.*?</p>.*?',re.S)
    items = re.findall(pattern,content)
    for item in items:
        print item

except urllib2.URLError, e:
    if hasattr(e,"code"):
        print e.code
    if hasattr(e,"reason"):
        print e.reason
伊谢尔伦 · 2781 days ago · 777

All replies (1)

  • 天蓬老师

    天蓬老师 · 2017-04-18 10:22:18

    I ran your code and got the first two pages, but after that the server returned an error code. I think this is because you take no precautions against anti-crawling measures: your results came back within one second, and 10 consecutive visits within one second is clearly not something a human could do.

    Many websites can tell when they are being hit by code rather than by a person. Some sites dislike this and put anti-crawling measures in place, such as blocking your IP outright so that you can no longer access the site. They do this because too many requests in a short period could bring the site down.

    My suggestion is to wait 1 second after crawling each page. Here is your code with that change:

    import urllib2
    import re
    import time

    # Send a browser-like User-Agent header with every request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1)'}
    # Compile the pattern once instead of on every page
    pattern = re.compile('<p.*?class="content">.*?<span>(.*?)</span>.*?</p>.*?', re.S)

    for page in range(1, 11):  # pages 1 to 10
        print 'at page %s' % page
        url = 'http://www.qiushibaike.com/8hr/page/' + str(page)

        try:
            request = urllib2.Request(url, headers=headers)
            response = urllib2.urlopen(request)
            content = response.read().decode('utf-8')
            for item in re.findall(pattern, content):
                print item

        except urllib2.URLError as e:
            # HTTP errors carry a status code; other URL errors carry a reason
            if hasattr(e, "code"):
                print e.code
            if hasattr(e, "reason"):
                print e.reason

        # Pause between pages so the site does not see a burst of requests
        time.sleep(1)

    This gets results for me. I would also recommend a third-party library called requests. Since you already know urllib, it will not be hard to pick up, and it is more user-friendly. It works well with the BeautifulSoup library, which is very convenient for parsing and processing HTML text. You can search online to find out more.
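
To give a rough idea of what the requests and BeautifulSoup combination could look like, here is a minimal sketch. It assumes Python 3 with the `requests` and `beautifulsoup4` packages installed, and the function names `parse_jokes` and `crawl` are made up for illustration. The CSS selector simply mirrors the regex from the question, so it rests on the same assumption about the page markup and may need adjusting.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'http://www.qiushibaike.com/8hr/page/'  # URL taken from the question
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1)'}


def parse_jokes(html):
    """Return the text of each <span> found inside an element with class "content"."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: jokes sit in <span> tags under class="content", as the regex implies.
    return [span.get_text(strip=True) for span in soup.select('.content span')]


def crawl(pages=10, delay=1.0):
    """Fetch pages 1..pages, pausing `delay` seconds after each one."""
    jokes = []
    for page in range(1, pages + 1):
        response = requests.get(BASE_URL + str(page), headers=HEADERS, timeout=10)
        response.raise_for_status()  # raise an exception on 4xx/5xx responses
        jokes.extend(parse_jokes(response.text))
        time.sleep(delay)
    return jokes
```

Calling `crawl()` would then return a list of joke strings that can be printed in a loop. Keeping the parsing in its own function makes it easy to try the selector on a saved HTML file before hitting the site.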

    Also, whenever you write a crawler in the future, keep anti-crawling measures in mind!
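
One possible way to be gentler on a site, beyond a fixed pause, is to wait longer after each failed request before trying again. The sketch below uses only the Python 3 standard library. The names `fetch` and `fetch_with_backoff`, the retry count, and the delays are illustrative assumptions, not something the site is known to require.

```python
import time
import urllib.error
import urllib.request

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1)'}


def fetch(url, timeout=10):
    """Download one page and return its body decoded as UTF-8."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode('utf-8')


def fetch_with_backoff(url, fetcher=fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetcher(url) up to `retries` times, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return fetcher(url)
        except urllib.error.URLError:  # also covers HTTPError, its subclass
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

Inside the page loop, `fetch_with_backoff(url)` could stand in for the plain `urlopen` call. The `fetcher` and `sleep` parameters exist only so the retry logic can be exercised without network access or real waiting.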
