
python - How do I stop crawlers from crawling my website?

How can I stop my website from being crawled? What methods are there?

大家讲道理 · asked 2743 days ago

13 replies

  • 迷茫 (2017-04-17 17:35:35)

    Add a robots.txt file to your site root with this content:

    User-agent: *
    Disallow: /
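    Those two lines tell every crawler that the whole site is off-limits. A well-behaved crawler checks them before fetching anything; a minimal sketch using Python's standard-library `urllib.robotparser` (the user-agent name and URL are illustrative):

```python
import urllib.robotparser

# Parse the robots.txt rules from the answer above.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# With "Disallow: /", every path is disallowed for every user agent.
print(rp.can_fetch("MyBot", "https://example.com/page.html"))  # False
```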

  • 怪我咯 (2017-04-17 17:35:35)

    Adding robots.txt tells crawlers not to crawl your website, but it does not forcibly block them. It is just a convention that both parties need to abide by.

  • 巴扎黑 (2017-04-17 17:35:35)

    It depends on whether you mean search-engine crawlers like Baidu's, or crawlers that people write themselves.

    For Baidu's crawler, the robots.txt method above is enough. There are many ways to deter hand-written crawlers, such as dynamically generating all class and id attributes, because crawlers usually locate what they want in the HTML by class or id.
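    A minimal sketch of that idea in Python. The helper `obfuscated_class` and the hashing scheme are illustrative assumptions, not part of any framework; the point is that your templates and CSS are both rendered with class names derived from a per-session token, so selectors like `.price` change on every visit:

```python
import hashlib
import secrets

def obfuscated_class(base: str, session_token: str) -> str:
    # Derive a session-specific class name from the logical name.
    # Hypothetical helper: both the HTML templates and the generated
    # CSS must use the same mapping for the page to render correctly.
    digest = hashlib.sha256((base + session_token).encode()).hexdigest()[:8]
    return f"c-{digest}"

token = secrets.token_hex(8)  # a new token per session or per deploy
name = obfuscated_class("price", token)
print(name)  # stable within a session, different across sessions
```

    A crawler hard-coded to look for `class="price"` then finds nothing, while the same code with the same token keeps producing consistent names for your own templates.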

  • 大家讲道理 (2017-04-17 17:35:35)

    It also depends on what kind of crawler it is:
    a gentleman, or a villain?
    If the crawler abides by the robots.txt convention, then you're fine.
    But that is only a gentleman's agreement;
    against a villain it won't help.

  • 迷茫 (2017-04-17 17:35:35)

    1) You can try serving your JS gzip-compressed; many crawlers will not fetch gzip-compressed JS.
    2) Analyze your web server's logs. If someone is maliciously accessing your key resources from a fixed IP, you can try banning that IP.
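    Point 2 can be sketched in a few lines of Python. The combined-log format and the request threshold are assumptions; a real setup would read the server's actual access log and feed the flagged IPs into a firewall or web-server deny list:

```python
from collections import Counter

def suspicious_ips(log_lines, threshold=1000):
    # Count requests per client IP (first field of a combined-format
    # access log line) and flag IPs at or above the threshold as
    # candidates for banning.
    hits = Counter(line.split(" ", 1)[0] for line in log_lines if line.strip())
    return [ip for ip, n in hits.items() if n >= threshold]

sample = [
    '1.2.3.4 - - [17/Apr/2017:17:35:35 +0800] "GET /data HTTP/1.1" 200 512',
    '1.2.3.4 - - [17/Apr/2017:17:35:36 +0800] "GET /data HTTP/1.1" 200 512',
    '5.6.7.8 - - [17/Apr/2017:17:35:37 +0800] "GET / HTTP/1.1" 200 1024',
]
print(suspicious_ips(sample, threshold=2))  # ['1.2.3.4']
```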

  • 黄舟 (2017-04-17 17:35:35)

    To be honest, it is impossible to block crawlers completely.

  • 天蓬老师 (2017-04-17 17:35:35)

    It's useless. If your website is open to people, it is naturally open to crawlers too, unless you move it onto an internal network. Rather than pouring effort into blocking crawlers, you might as well improve the quality of the site. Nowadays classified-information websites all crawl from each other, yet the user experience barely improves.

  • 迷茫 (2017-04-17 17:35:35)

    Pfft, you can scramble the class and id names so there is no pattern and even regular expressions won't match.

  • 阿神 (2017-04-17 17:35:35)

    I wonder whether you could dynamically generate all page content with JS.

  • 巴扎黑 (2017-04-17 17:35:35)

    First of all, it is difficult to block crawlers 100%, unless, as mentioned above, the site is on an internal network.

    But you can take measures to stop low-effort crawlers from scraping your website.

    For specific measures, there is an article on Zhihu you can read.

    Hope it helps you.
