
How to set php to prohibit crawling websites

藏色散人 · Original · 2020-07-24 09:35:54 · 2779 views

How to prohibit crawling in PHP: first obtain the UA information via $_SERVER['HTTP_USER_AGENT']; then store the malicious USER_AGENT strings in an array; finally block requests with an empty USER_AGENT, which covers most mainstream scraping programs.



We all know there are many crawlers on the Internet. Some are useful for site indexing, such as Baidu Spider, but there are also useless crawlers that ignore robots rules, put pressure on the server, and bring no traffic to the site, such as Yisou Spider (update: Yisou Spider has since been acquired by UC Shenma Search, so the ban on it has been removed from this article! ==>Related articles). Recently, Zhang Ge noticed a large number of crawl records from Yisou and other junk spiders in his nginx logs, so he collected the various methods circulating online for blocking junk spiders. While configuring his own site, he also offers them here as a reference for other webmasters.

1. Apache

①. By modifying the .htaccess file
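The Apache code did not survive in this copy of the article; a minimal sketch of an equivalent .htaccess rule set (assuming mod_rewrite is enabled, with a UA list mirroring the nginx rules in the next section) would be:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Block requests whose User-Agent matches the deny list, or is empty (^$).
# [NC] makes the match case-insensitive, [F] returns 403 Forbidden.
RewriteCond %{HTTP_USER_AGENT} (^$|FeedDemon|Indy\ Library|AhrefsBot|CrawlDaddy|Java|Feedly|UniversalFeedParser|ApacheBench|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|YYSpider|MJ12bot|EasouSpider|HttpClient) [NC]
RewriteRule ^(.*)$ - [F]
</IfModule>
```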

2. Nginx code

Enter the conf directory under the nginx installation directory, and save the following code as agent_deny.conf:
cd /usr/local/nginx/conf
vim agent_deny.conf

#Block crawling by tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}
#Block the specified UAs, as well as requests with an empty UA
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) {
return 403;
}
#Block request methods other than GET|HEAD|POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}

Then, in the site's configuration, insert the following line after location / {:
include agent_deny.conf;
For example, the configuration of Zhang Ge's blog:
[marsge@Mars_Server ~]$ cat /usr/local/nginx/conf/zhangge.conf

location / {
try_files $uri $uri/ /index.php?$args;
#Add this one line here:
include agent_deny.conf;
rewrite ^/sitemap_360_sp.txt$ /sitemap_360_sp.php last;
rewrite ^/sitemap_baidu_sp.xml$ /sitemap_baidu_sp.php last;
rewrite ^/sitemap_m.xml$ /sitemap_m.php last;
}

After saving, run the following command to gracefully reload nginx:
/usr/local/nginx/sbin/nginx -s reload

3. PHP code

Place the following code close to the website entry point (for example right after the first <?php in the site's index.php):

//Get the UA information (may be unset for some clients)
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
//Store the malicious USER_AGENT strings in an array
$now_ua = array('FeedDemon','BOT/0.1 (BOT for JCE)','CrawlDaddy','Java','Feedly','UniversalFeedParser','ApacheBench','Swiftbot','ZmEu','Indy Library','oBot','jaunty','YandexBot','AhrefsBot','MJ12bot','WinHttp','EasouSpider','HttpClient','Microsoft URL Control','YYSpider','Python-urllib','lightDeckReports Bot');

//Block an empty USER_AGENT: dedecms and other mainstream scraping programs send an empty USER_AGENT, and so do some SQL injection tools

if(!$ua) {
header("Content-type: text/html; charset=utf-8");
die('Please do not scrape this site!');
}else{
foreach($now_ua as $value) {
//Check whether the UA matches an entry in the array
//(eregi() was removed in PHP 7; stripos() performs a case-insensitive substring match)
if(stripos($ua, $value) !== false) {
header("Content-type: text/html; charset=utf-8");
die('Please do not scrape this site!');
}
}
}
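The blocking logic above can also be factored into a small, testable helper. The following is a sketch; the function name is_banned_ua and the shortened deny list are illustrative, not from the original article:

```php
<?php
// Return true if the given User-Agent should be blocked.
// An empty UA is treated as banned: mainstream scraping programs
// and some SQL injection tools send no USER_AGENT at all.
function is_banned_ua(string $ua): bool
{
    // Shortened deny list for illustration (matched case-insensitively)
    $banned = array('FeedDemon', 'CrawlDaddy', 'Feedly', 'ApacheBench',
        'ZmEu', 'Indy Library', 'Python-urllib', 'EasouSpider');
    if ($ua === '') {
        return true;
    }
    foreach ($banned as $needle) {
        // stripos() does a case-insensitive substring match
        if (stripos($ua, $needle) !== false) {
            return true;
        }
    }
    return false;
}
```

The entry-point code can then simply call is_banned_ua($_SERVER['HTTP_USER_AGENT'] ?? '') and die() when it returns true.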

4. Test effect

If you are on a VPS, testing is simple: use curl -A to simulate a crawl, for example:
Simulate Yisou Spider crawling:
curl -I -A 'YisouSpider' zhang.ge
Simulate crawling with an empty UA:
curl -I -A '' zhang.ge
Simulate the crawling of Baidu Spider:
curl -I -A 'Baiduspider' zhang.ge

The screenshots of the three crawl results are as follows:

[Screenshot: server anti-crawler guide – Apache/Nginx/PHP blocking certain User Agents from crawling]

As you can see, both Yisou Spider and the empty-UA request get a 403 Forbidden response, while Baidu Spider successfully returns 200, which shows the rules are effective!

Supplement: screenshots of the nginx logs from the next day show the effect:

①. Junk scraping requests with an empty UA are intercepted:

[Screenshot: server anti-crawler guide – empty-UA requests returning 403 in the nginx log]

②. Banned UAs are intercepted:

[Screenshot: server anti-crawler guide – banned UAs returning 403 in the nginx log]

Therefore, to deal with junk spiders, we can analyze the site's access logs, identify unfamiliar spider names, and, once confirmed, add them to the deny list in the code above to block them from crawling.
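As a quick way to do that analysis, the user agents in an nginx access log (default "combined" format) can be counted like this. A sketch: the log path and the sample entries are fabricated for illustration.

```shell
# Create a few sample log lines in nginx's default "combined" format
# (fabricated entries, for illustration only)
cat > /tmp/access.sample.log <<'EOF'
1.2.3.4 - - [24/Jul/2020:09:35:54 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"
5.6.7.8 - - [24/Jul/2020:09:36:01 +0800] "GET / HTTP/1.1" 403 162 "-" "YisouSpider"
5.6.7.8 - - [24/Jul/2020:09:36:02 +0800] "GET / HTTP/1.1" 403 162 "-" "YisouSpider"
EOF
# In the combined format the user agent is the 6th double-quoted field;
# count occurrences and sort the most frequent UAs to the top
awk -F'"' '{print $6}' /tmp/access.sample.log | sort | uniq -c | sort -rn
```

Run against the real log (e.g. /var/log/nginx/access.log), the most frequent unfamiliar names at the top of the output are the candidates to add to the deny list.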

5. Appendix: UA Collection

The following is a list of spam UAs commonly seen on the Internet, for reference only; additions are welcome.

FeedDemon              content scraping
BOT/0.1 (BOT for JCE)  SQL injection
CrawlDaddy             SQL injection
Java                   content scraping
Jullo                  content scraping
Feedly                 content scraping
UniversalFeedParser    content scraping
ApacheBench            CC (HTTP flood) attack tool
Swiftbot               useless crawler
YandexBot              useless crawler
AhrefsBot              useless crawler
YisouSpider            useless crawler (acquired by UC Shenma Search; this spider can now be unblocked!)
MJ12bot                useless crawler
ZmEu                   phpMyAdmin vulnerability scanning
WinHttp                scraping / CC attacks
EasouSpider            useless crawler
HttpClient             TCP attacks
Microsoft URL Control  scanning
YYSpider               useless crawler
jaunty                 WordPress brute-force scanner
oBot                   useless crawler
Python-urllib          content scraping
Indy Library           scanning
FlightDeckReports Bot  useless crawler
Linguee Bot            useless crawler

The above is the detailed content of How to set php to prohibit crawling websites. For more information, please follow other related articles on the PHP Chinese website!
