Website: https://www.nvshens.com/g/22377/. If I open the page in a browser, I can right-click an image and save it fine. But when my crawler requests the image URL directly, it gets blocked. I then changed the request headers and set up an IP proxy, but it still didn't work. Yet according to the packet capture, the images are not dynamically loaded data! Please help.
过去多啦不再A梦2017-06-12 09:29:51
The girl is quite pretty.
The image can indeed be opened by right-clicking, but after refreshing it becomes a hotlink-protected placeholder. To prevent hotlinking, the server typically checks the Referer field in the request header, which is why you no longer get the original image after refreshing (the Referer has changed after the refresh).
import requests

img_url = "https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg"
r = requests.get(img_url, headers={'Referer': "https://www.nvshens.com/g/22377/"}).content
with open("00.jpg", 'wb') as f:
    f.write(r)
欧阳克2017-06-12 09:29:51
Did your packet capture miss any parameters when fetching the picture?
我想大声告诉你2017-06-12 09:29:51
I got distracted browsing the site's content and almost forgot the actual question.
Capture the request in your browser, replicate all of its fields in your crawler, and then try again.
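The advice above can be sketched like this. Note the header values below are illustrative placeholders, not an actual captured request; replace them with whatever your own packet capture shows:

```python
# Sketch: replay ALL headers from a captured browser request, not just the
# User-Agent. The header values here are placeholders for illustration.
import requests

captured_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36",
    "Referer": "https://www.nvshens.com/g/22377/",
    "Accept": "image/webp,image/*,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8",
}

# A Session keeps cookies between requests, just like a real browser would.
session = requests.Session()
session.headers.update(captured_headers)
# session.get("https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg")
```

Using a Session also means any cookies the site sets on the gallery page are sent back when you request the images.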
女神的闺蜜爱上我2017-06-12 09:29:51
Given how this site is designed, sending a matching Referer for each page mimics human browsing better than reusing a single fixed Referer.
Below is complete, runnable code that grabs all the pictures across the 18 pages:
# Putting it all together
import os
import shutil
import requests
# extract image URLs with XPath
from lxml import etree

headers = {
    "Connection": "close",  # one way to cover tracks
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36",
}

def url_guess_src_large(u):
    # drop the "/s/" thumbnail segment and go through the img.html viewer page
    return "https://www.nvshens.com/img.html?img=" + '/'.join(u.split('/s/'))

# download helper
def get_img_using_requests(url, fn):
    headers['Referer'] = url_guess_src_large(url)  # e.g. "https://www.nvshens.com/g/22377/"
    print(headers)
    response = requests.get(url, headers=headers, stream=True)
    with open(fn, 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response

url_ = 'https://www.nvshens.com/g/22377/{p}.html'
for p in range(1, 18 + 1):
    url = url_.format(p=p)
    html = requests.get(url, headers=headers).content.decode('utf-8')
    selector = etree.HTML(html)
    xpath_expr = '//*[@id="hgallery"]/img/@src'
    content = selector.xpath(xpath_expr)
    urls_2get = [url_guess_src_large(x) for x in content]
    filenames = [os.path.split(x)[0].split('/gallery/')[1].replace("/", "_") + "_" + os.path.split(x)[1]
                 for x in urls_2get]
    for i, x in enumerate(content):
        get_img_using_requests(content[i], filenames[i])
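As a quick offline sanity check of the URL rewrite used above, url_guess_src_large should strip the "/s/" thumbnail segment and prefix the img.html viewer URL. Using the same thumbnail URL as in the earlier snippet:

```python
# Offline check of the thumbnail-to-large URL rewrite used in the code above.
def url_guess_src_large(u):
    return "https://www.nvshens.com/img.html?img=" + '/'.join(u.split('/s/'))

thumb = "https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg"
print(url_guess_src_large(thumb))
# https://www.nvshens.com/img.html?img=https://t1.onvshen.com:85/gallery/21501/22377/003.jpg
```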