
python - How do I stop crawlers from crawling my website?

How can I stop my website from being crawled? What methods are there?

大家讲道理 · asked 2743 days ago

13 replies

  • 迷茫 (2017-04-17 17:35:35)

    Add a robots.txt file to your site root with this content:

    User-agent: *
    Disallow: /
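    Those two lines tell every crawler that the whole site is off-limits. A well-behaved crawler checks them before fetching anything; a minimal sketch using Python's standard-library `urllib.robotparser` (the user-agent name and URL are illustrative):

```python
import urllib.robotparser

# Parse the robots.txt rules from the answer above.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# With "Disallow: /", every path is disallowed for every user agent.
print(rp.can_fetch("MyBot", "https://example.com/page.html"))  # False
```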

  • 怪我咯 (2017-04-17 17:35:35)

    Adding robots.txt tells crawlers not to crawl your website, but it does not forcibly block them. It is just a convention that both parties need to abide by.

  • 巴扎黑 (2017-04-17 17:35:35)

    It depends on whether you mean search-engine crawlers like Baidu's, or crawlers that people write themselves.

    For Baidu's crawler, the robots.txt method above is enough. There are many ways to deter hand-written crawlers, such as dynamically generating all class and id attributes, because crawlers usually locate what they want in the HTML by class or id.
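    A minimal sketch of that idea in Python. The helper `obfuscated_class` and the hashing scheme are illustrative assumptions, not part of any framework; the point is that your templates and CSS are both rendered with class names derived from a per-session token, so selectors like `.price` change on every visit:

```python
import hashlib
import secrets

def obfuscated_class(base: str, session_token: str) -> str:
    # Derive a session-specific class name from the logical name.
    # Hypothetical helper: both the HTML templates and the generated
    # CSS must use the same mapping for the page to render correctly.
    digest = hashlib.sha256((base + session_token).encode()).hexdigest()[:8]
    return f"c-{digest}"

token = secrets.token_hex(8)  # a new token per session or per deploy
name = obfuscated_class("price", token)
print(name)  # stable within a session, different across sessions
```

    A crawler hard-coded to look for `class="price"` then finds nothing, while the same code with the same token keeps producing consistent names for your own templates.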

  • 大家讲道理 (2017-04-17 17:35:35)

    It also depends on what kind of crawler it is:
    a gentleman, or a villain?
    If the crawler abides by the robots.txt convention, then you're fine.
    But that is only a gentleman's agreement;
    against a villain it won't help.

  • 迷茫 (2017-04-17 17:35:35)

    1) You can try serving your JS gzip-compressed; many crawlers will not fetch gzip-compressed JS.
    2) Analyze your web server's logs. If someone is maliciously accessing your key resources from a fixed IP, you can try banning that IP.
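    Point 2 can be sketched in a few lines of Python. The combined-log format and the request threshold are assumptions; a real setup would read the server's actual access log and feed the flagged IPs into a firewall or web-server deny list:

```python
from collections import Counter

def suspicious_ips(log_lines, threshold=1000):
    # Count requests per client IP (first field of a combined-format
    # access log line) and flag IPs at or above the threshold as
    # candidates for banning.
    hits = Counter(line.split(" ", 1)[0] for line in log_lines if line.strip())
    return [ip for ip, n in hits.items() if n >= threshold]

sample = [
    '1.2.3.4 - - [17/Apr/2017:17:35:35 +0800] "GET /data HTTP/1.1" 200 512',
    '1.2.3.4 - - [17/Apr/2017:17:35:36 +0800] "GET /data HTTP/1.1" 200 512',
    '5.6.7.8 - - [17/Apr/2017:17:35:37 +0800] "GET / HTTP/1.1" 200 1024',
]
print(suspicious_ips(sample, threshold=2))  # ['1.2.3.4']
```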

  • 黄舟 (2017-04-17 17:35:35)

    To be honest, it is impossible to block crawlers completely.

  • 天蓬老师 (2017-04-17 17:35:35)

    It's useless. If your website is open to people, it is naturally open to crawlers too, unless you move it onto an internal network. Rather than pouring effort into blocking crawlers, you might as well improve the quality of the site. Nowadays classified-information websites all crawl from each other, yet the user experience barely improves.

  • 迷茫 (2017-04-17 17:35:35)

    Pfft, you can scramble the class and id names so there is no pattern and even regular expressions won't match.

  • 阿神 (2017-04-17 17:35:35)

    I wonder whether you could dynamically generate all page content with JS.

  • 巴扎黑 (2017-04-17 17:35:35)

    First of all, it is difficult to block crawlers 100%, unless, as mentioned above, the site is on an internal network.

    But you can take measures to stop low-effort crawlers from scraping your website.

    For specific measures, there is an article on Zhihu you can read.

    Hope it helps you.
