import re
import urllib.request
req = urllib.request.urlopen('http://search.jd.com/Search?k...')
req
Out[3]: <http.client.HTTPResponse at 0x52bf6d8>
buf = req.read()
buf = buf.decode('utf-8')
urllist = re.findall(r'//img. .png',buf)
This works as expected and returns the image URLs ending in .png.
urllist = re.findall(r'//img. .jpg ',buf)
This also works as expected for URLs ending in .jpg.
urllist = re.findall(r'//img. .(png|jpg)',buf)
But this returns only the file extensions, not the URLs, like this:
'.jpg',
'.jpg',
'.png',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
Why is this?
阿神 2017-06-22 11:53:19
This comes from how re.findall treats groups. When the pattern contains no (), re.findall returns every full match. When the pattern contains a capturing group (), it returns only the text captured by that group, which is why you see a list of jpg/png extensions. To get the full links back, wrap the whole URL pattern in () so that the link itself is what gets captured. The png-or-jpg alternation only needs to match, not capture, so that part uses the non-capturing group syntax (?:png|jpg).
# Modified code
urllist = re.findall(r'(//img.+?\.(?:png|jpg))',buf)
For more about capturing and non-capturing groups, see the documentation for Python's re module.
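The difference between the three forms can be checked on a small string without fetching the page. This is only a sketch: the sample text and the example.com URLs are made up for illustration and are not taken from the JD search page.

```python
import re

# Hypothetical stand-in for the downloaded HTML (one URL per line).
sample = '//img10.example.com/a/1.jpg\n//img11.example.com/b/2.png'

# Capturing group: findall returns only what the group captured.
captured = re.findall(r'//img.+?\.(png|jpg)', sample)

# Non-capturing group: no groups in the pattern, so findall returns full matches.
full = re.findall(r'//img.+?\.(?:png|jpg)', sample)

# One outer capturing group around everything: findall returns that group,
# which here is the same text as the full match.
wrapped = re.findall(r'(//img.+?\.(?:png|jpg))', sample)

print(captured)  # ['jpg', 'png']
print(full)      # ['//img10.example.com/a/1.jpg', '//img11.example.com/b/2.png']
print(wrapped == full)  # True
```

Since the non-capturing form already returns the full match, the outer () in the modified code is optional; it gives the same result either way.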
代言 2017-06-22 11:53:19
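(png|jpg) is a capturing group, so re.findall returns only what the group captured. [png|jpg] is not a substitute for it: square brackets define a character class, which matches a single character (p, n, g, | or j). Use the non-capturing group (?:png|jpg) instead:
import re
import requests
r = requests.get('http://search.jd.com/Search?keyword=%E6%96%87%E8%83%B8&enc=utf-8&wq=%E6%96%87%E8%83%B8&pvid=4anf50si.fbrh68')
print(re.findall(r'//img.+?\.(?:png|jpg)', r.text))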
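To see how a character class in square brackets differs from an alternation group, here is a small sketch on a made-up string. The example.com URLs are hypothetical and only serve to show the matching behaviour.

```python
import re

# Hypothetical sample: one .jpg URL and one .gif URL, one per line.
sample = '//img10.example.com/a/1.jpg\n//img11.example.com/b/2.gif'

# [png|jpg] matches exactly ONE character out of p, n, g, |, j,
# so the match stops after the first letter of the extension
# and also accepts ".g" from ".gif".
char_class = re.findall(r'//img.+\.[png|jpg]', sample)

# (?:png|jpg) matches the whole word "png" or "jpg" without capturing it.
non_capturing = re.findall(r'//img.+\.(?:png|jpg)', sample)

print(char_class)     # ['//img10.example.com/a/1.j', '//img11.example.com/b/2.g']
print(non_capturing)  # ['//img10.example.com/a/1.jpg']
```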