
python - How to check what kind of anti-crawler is used on a URL

Website: https://www.nvshens.com/g/22377/. If I open the site directly in a browser, I can right-click an image and save it. But when my crawler requests the image URL directly, it gets blocked. I changed the headers and set up an IP proxy, but it still didn't work. And according to the packet capture, the data is not dynamically loaded! Please help.

ringa_lee · 2666 days ago · 1206

All replies (4)

  • 过去多啦不再A梦 2017-06-12 09:29:51

    The girl is quite pretty.
    The image can indeed be opened by right-clicking, but after a refresh it turns into the hotlink-protection placeholder. To prevent hotlinking, the server typically checks the Referer field in the request header, which is why you no longer get the original image after refreshing (the Referer changes on refresh).

    import requests

    img_url = "https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg"
    # Send the gallery page as the Referer so the server treats this
    # as an in-site request rather than a hotlink
    r = requests.get(img_url, headers={'Referer': "https://www.nvshens.com/g/22377/"}).content
    with open("00.jpg", 'wb') as f:
        f.write(r)
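
    To check whether a site is doing this kind of Referer filtering in the first place, one quick diagnostic is to request the same image with and without the header and compare the responses. A minimal sketch, using the URLs from this thread:

    import requests

    img_url = "https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg"
    page_url = "https://www.nvshens.com/g/22377/"

    # The same request, once without and once with a Referer
    bare = requests.get(img_url)
    referred = requests.get(img_url, headers={'Referer': page_url})

    # If the status codes or body sizes differ, the server is almost
    # certainly checking the Referer (i.e. hotlink protection)
    print(bare.status_code, len(bare.content))
    print(referred.status_code, len(referred.content))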
    

  • 欧阳克 2017-06-12 09:29:51

    When you captured the packet for the image request, did you miss any of its parameters?

  • 我想大声告诉你 2017-06-12 09:29:51

    I was busy looking at the site's content and almost forgot the actual question.
    Include everything from the request you captured (all the headers the browser sends), then try it; see the sketch below.
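
    For example, a minimal sketch of replicating captured headers (the values below are typical browser headers, not the exact ones from your capture; substitute your own):

    import requests

    # Illustrative headers; copy the real values from your packet capture
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36",
        "Accept": "image/webp,image/*,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Referer": "https://www.nvshens.com/g/22377/",
    }

    r = requests.get("https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg",
                     headers=headers)
    with open("003.jpg", "wb") as f:
        f.write(r.content)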

  • 女神的闺蜜爱上我 2017-06-12 09:29:51

    Given how this site is designed, the Referer should vary with each page to better mimic a human visitor, rather than using one fixed Referer for everything.
    Below is complete, runnable code that downloads all the pictures across the 18 pages.

    # Putting it all together
    import os
    import shutil
    import requests
    from lxml import etree  # extract the image URLs with XPath

    headers = {
        "Connection": "close",  # one way to cover tracks
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36"
    }

    # Map a thumbnail URL (".../s/003.jpg") to the site's large-image
    # viewer page, which also serves as a plausible per-image Referer
    def url_guess_src_large(u):
        return "https://www.nvshens.com/img.html?img=" + '/'.join(u.split('/s/'))

    # Download one image, streaming it to disk with a per-image Referer
    def get_img_using_requests(url, fn):
        headers['Referer'] = url_guess_src_large(url)  # not a single fixed Referer
        print(headers)
        response = requests.get(url, headers=headers, stream=True)
        with open(fn, 'wb') as out_file:
            shutil.copyfileobj(response.raw, out_file)
        del response

    url_ = 'https://www.nvshens.com/g/22377/{p}.html'

    for i in range(1, 18 + 1):
        url = url_.format(p=i)
        html = requests.get(url, headers=headers).content.decode('utf-8')
        selector = etree.HTML(html)
        xpaths = '//*[@id="hgallery"]/img/@src'
        content = selector.xpath(xpaths)
        urls_2get = [url_guess_src_large(x) for x in content]
        filenames = [os.path.split(x)[0].split('/gallery/')[1].replace("/", "_")
                     + "_" + os.path.split(x)[1] for x in urls_2get]
        for j in range(len(content)):
            get_img_using_requests(content[j], filenames[j])
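
    As a side note on the design, a requests.Session could carry the shared headers (and any cookies the site sets) across all of these requests instead of passing a dict each time. A minimal sketch of that variation (my suggestion, not part of the code above):

    import requests

    # A Session keeps cookies between requests and applies shared headers
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36"
    })

    # Per-request headers such as Referer can still be passed per call
    r = session.get("https://t1.onvshen.com:85/gallery/21501/22377/s/003.jpg",
                    headers={"Referer": "https://www.nvshens.com/g/22377/"})
    print(r.status_code, len(r.content))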
