Web crawler - How to crawl the pictures in the Blog Park blog using python?

I wrote a small script to crawl the pictures in a Blog Park (cnblogs) blog post. It works for some links, but for others it raises an error as soon as it starts crawling. What is the reason?

#coding=utf-8

import urllib
import re
from lxml import etree

# Fetch the page HTML for a URL
def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

# Set the URL, fetch the page, and build the element tree
url = "http://www.cnblogs.com/fnng/archive/2013/05/20/3089816.html"
html = getHtml(url)
html = html.decode("utf-8")
tree = etree.HTML(html)

# Save the images locally
reg = r'src="(.*?)" alt'
imgre = re.compile(reg)
imglist = re.findall(imgre, html)
x = 0
for imgurl in imglist:
    urllib.urlretrieve(imgurl, '%s.jpg' % x)
    x += 1

As shown in the figure, the images are crawled correctly with this URL.

If you change the url to

url = "http://www.cnblogs.com/baronzhang/p/6861258.html"

then an error is reported immediately.

Please solve it, thank you!

Asked by 某草草, 2754 days ago


  • 我想大声告诉你, 2017-05-18 10:47:39

    The error message already tells you what is wrong. If you look at the source of the web page, the first image the regex matches is a GIF, and its src is a relative path, so it cannot be downloaded and urllib raises IOError. Even if it had been downloaded, you save every file with a .jpg extension, so the GIF would not open. So all you need to do is check each URL and filter:

    for imgurl in imglist:
        if "gif" not in imgurl:
            urllib.urlretrieve(imgurl, '%s.jpg' % x)
            x += 1
    

    Look at the check I added. Of course, this is only the simplest possible test, but it ensures that your second URL no longer raises an error, and it should give you an idea of how to go further!
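
Going one step beyond the "gif" filter, the two problems named in the answer (a relative path, and a hard-coded .jpg extension) can be handled directly instead of skipping the image. The following is only a minimal sketch, not the answerer's code. It assumes Python 3, where the relevant functions live in urllib.request and urllib.parse rather than in the Python 2 urllib module used in the question. The helper names extract_image_urls, guess_extension, and download_images are made up for illustration. The regex is the one from the question, so it still only matches tags where alt directly follows src.

```python
import os
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve

# Same pattern as in the question: capture src when it is followed by alt.
IMG_SRC_RE = re.compile(r'src="(.*?)" alt')


def extract_image_urls(html, page_url):
    """Return absolute image URLs, resolving relative paths against page_url."""
    return [urljoin(page_url, src) for src in IMG_SRC_RE.findall(html)]


def guess_extension(img_url, default=".jpg"):
    """Take the file extension from the URL path, falling back to a default."""
    ext = os.path.splitext(urlparse(img_url).path)[1]
    return ext.lower() or default


def download_images(html, page_url, out_dir="."):
    """Download every matched image; skip failures instead of aborting."""
    saved = []
    for index, img_url in enumerate(extract_image_urls(html, page_url)):
        name = "%d%s" % (index, guess_extension(img_url))
        target = os.path.join(out_dir, name)
        try:
            urlretrieve(img_url, target)
        except (OSError, ValueError) as exc:
            # URLError and HTTPError are subclasses of OSError in Python 3.
            print("skipped %s: %s" % (img_url, exc))
            continue
        saved.append(target)
    return saved
```

Calling download_images(html, url) with the decoded page text and the page URL would then save a relative GIF under its own extension rather than raising on it. Whether you actually want such images (they are often placeholders or icons) is a separate decision, and the "gif" filter from the answer can still be applied on top.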
