
python - How to check what kind of anti-crawler is used on a URL

Website: https://www.nvshens.com/g/22377/. If I open the site directly in a browser, I can right-click an image and save it. But when my crawler requests the image URL directly, it gets blocked. I changed the headers and set up an IP proxy, but it still didn't work. And according to the packet capture, the data is not dynamically loaded! Please help.

ringa_lee · 2666 days ago · 1206

All replies (4)

  • 过去多啦不再A梦 2017-06-12 09:29:51

    The girl is quite pretty.
    The image can indeed be opened by right-clicking, but after a refresh it turns into the hotlink-protection placeholder. To prevent hotlinking, the server typically checks the Referer field in the request header, which is why you no longer get the original image after refreshing (the Referer changes on refresh).

    import requests

    img_url = "https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg"
    # Send the gallery page as the Referer so the server treats this
    # as an in-site request rather than a hotlink
    r = requests.get(img_url, headers={'Referer': "https://www.nvshens.com/g/22377/"}).content
    with open("00.jpg", 'wb') as f:
        f.write(r)
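
    To check whether a site is doing this kind of Referer filtering in the first place, one quick diagnostic is to request the same image with and without the header and compare the responses. A minimal sketch, using the URLs from this thread:

    import requests

    img_url = "https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg"
    page_url = "https://www.nvshens.com/g/22377/"

    # The same request, once without and once with a Referer
    bare = requests.get(img_url)
    referred = requests.get(img_url, headers={'Referer': page_url})

    # If the status codes or body sizes differ, the server is almost
    # certainly checking the Referer (i.e. hotlink protection)
    print(bare.status_code, len(bare.content))
    print(referred.status_code, len(referred.content))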
    

  • 欧阳克 2017-06-12 09:29:51

    When you captured the packet for the image request, did you miss any of its parameters?

  • 我想大声告诉你 2017-06-12 09:29:51

    I was busy looking at the site's content and almost forgot the actual question.
    Include everything from the request you captured (all the headers the browser sends), then try it; see the sketch below.
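
    For example, a minimal sketch of replicating captured headers (the values below are typical browser headers, not the exact ones from your capture; substitute your own):

    import requests

    # Illustrative headers; copy the real values from your packet capture
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36",
        "Accept": "image/webp,image/*,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Referer": "https://www.nvshens.com/g/22377/",
    }

    r = requests.get("https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg",
                     headers=headers)
    with open("003.jpg", "wb") as f:
        f.write(r.content)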

  • 女神的闺蜜爱上我 2017-06-12 09:29:51

    Given how this site is designed, the Referer should vary with each page to better mimic a human visitor, rather than using one fixed Referer for everything.
    Below is complete, runnable code that downloads all the pictures across the 18 pages.

    # Putting it all together
    import os
    import shutil
    import requests
    from lxml import etree  # extract the image URLs with XPath

    headers = {
        "Connection": "close",  # one way to cover tracks
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36"
    }

    # Map a thumbnail URL (".../s/003.jpg") to the site's large-image
    # viewer page, which also serves as a plausible per-image Referer
    def url_guess_src_large(u):
        return "https://www.nvshens.com/img.html?img=" + '/'.join(u.split('/s/'))

    # Download one image, streaming it to disk with a per-image Referer
    def get_img_using_requests(url, fn):
        headers['Referer'] = url_guess_src_large(url)  # not a single fixed Referer
        print(headers)
        response = requests.get(url, headers=headers, stream=True)
        with open(fn, 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)
        del response

    url_ = 'https://www.nvshens.com/g/22377/{p}.html'

    for i in range(1, 18 + 1):
        url = url_.format(p=i)
        html = requests.get(url, headers=headers).content.decode('utf-8')
        selector = etree.HTML(html)
        xpaths = '//*[@id="hgallery"]/img/@src'
        content = selector.xpath(xpaths)
        urls_2get = [url_guess_src_large(x) for x in content]
        filenames = [os.path.split(x)[0].split('/gallery/')[1].replace("/", "_")
                     + "_" + os.path.split(x)[1] for x in urls_2get]
        for j in range(len(content)):
            get_img_using_requests(content[j], filenames[j])
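
    As a side note on the design, a requests.Session could carry the shared headers (and any cookies the site sets) across all of these requests instead of passing a dict each time. A minimal sketch of that variation (my suggestion, not part of the code above):

    import requests

    # A Session keeps cookies between requests and applies shared headers
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36"
    })

    # Per-request headers such as Referer can still be passed per call
    r = session.get("https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg",
                    headers={"Referer": "https://www.nvshens.com/g/22377/"})
    print(r.status_code, len(r.content))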
