
How to set php to prohibit crawling websites

藏色散人 · Original · 2020-07-24 09:35:54 · 2779 views

How to prohibit crawling in PHP: first obtain the UA information via $_SERVER['HTTP_USER_AGENT']; then store the malicious USER_AGENT strings in an array; finally block requests with an empty USER_AGENT, which covers most mainstream scraping programs.



We all know there are many crawlers on the Internet. Some are useful for site indexing, such as Baidu Spider, but there are also useless crawlers that ignore robots rules, put pressure on the server, and bring no traffic to the site, such as Yisou Spider (update: Yisou Spider has since been acquired by UC Shenma Search, so the ban on it has been removed from this article! ==>Related articles). Recently, Zhang Ge noticed a large number of crawl records from Yisou and other junk spiders in his nginx logs, so he collected the various methods circulating online for blocking junk spiders. While configuring his own site, he also offers them here as a reference for other webmasters.

1. Apache

①. By modifying the .htaccess file
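The Apache code did not survive in this copy of the article; a minimal sketch of an equivalent .htaccess rule set (assuming mod_rewrite is enabled, with a UA list mirroring the nginx rules in the next section) would be:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Block requests whose User-Agent matches the deny list, or is empty (^$).
# [NC] makes the match case-insensitive, [F] returns 403 Forbidden.
RewriteCond %{HTTP_USER_AGENT} (^$|FeedDemon|Indy\ Library|AhrefsBot|CrawlDaddy|Java|Feedly|UniversalFeedParser|ApacheBench|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|YYSpider|MJ12bot|EasouSpider|HttpClient) [NC]
RewriteRule ^(.*)$ - [F]
</IfModule>
```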

2. Nginx code

Enter the conf directory under the nginx installation directory, and save the following code as agent_deny.conf:
cd /usr/local/nginx/conf
vim agent_deny.conf

#Block crawling by tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}
#Block the specified UAs, as well as requests with an empty UA
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) {
return 403;
}
#Block request methods other than GET|HEAD|POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}

Then, in the site's configuration, insert the following line after location / {:
include agent_deny.conf;
For example, the configuration of Zhang Ge's blog:
[marsge@Mars_Server ~]$ cat /usr/local/nginx/conf/zhangge.conf

location / {
try_files $uri $uri/ /index.php?$args;
#Add this one line here:
include agent_deny.conf;
rewrite ^/sitemap_360_sp.txt$ /sitemap_360_sp.php last;
rewrite ^/sitemap_baidu_sp.xml$ /sitemap_baidu_sp.php last;
rewrite ^/sitemap_m.xml$ /sitemap_m.php last;
}

After saving, run the following command to gracefully reload nginx:
/usr/local/nginx/sbin/nginx -s reload

3. PHP code

Place the following code close to the website entry point (for example right after the first <?php in the site's index.php):

//Get the UA information (may be unset for some clients)
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
//Store the malicious USER_AGENT strings in an array
$now_ua = array('FeedDemon','BOT/0.1 (BOT for JCE)','CrawlDaddy','Java','Feedly','UniversalFeedParser','ApacheBench','Swiftbot','ZmEu','Indy Library','oBot','jaunty','YandexBot','AhrefsBot','MJ12bot','WinHttp','EasouSpider','HttpClient','Microsoft URL Control','YYSpider','Python-urllib','lightDeckReports Bot');

//Block an empty USER_AGENT: dedecms and other mainstream scraping programs send an empty USER_AGENT, and so do some SQL injection tools

if(!$ua) {
header("Content-type: text/html; charset=utf-8");
die('Please do not scrape this site!');
}else{
foreach($now_ua as $value) {
//Check whether the UA matches an entry in the array
//(eregi() was removed in PHP 7; stripos() performs a case-insensitive substring match)
if(stripos($ua, $value) !== false) {
header("Content-type: text/html; charset=utf-8");
die('Please do not scrape this site!');
}
}
}
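The blocking logic above can also be factored into a small, testable helper. The following is a sketch; the function name is_banned_ua and the shortened deny list are illustrative, not from the original article:

```php
<?php
// Return true if the given User-Agent should be blocked.
// An empty UA is treated as banned: mainstream scraping programs
// and some SQL injection tools send no USER_AGENT at all.
function is_banned_ua(string $ua): bool
{
    // Shortened deny list for illustration (matched case-insensitively)
    $banned = array('FeedDemon', 'CrawlDaddy', 'Feedly', 'ApacheBench',
        'ZmEu', 'Indy Library', 'Python-urllib', 'EasouSpider');
    if ($ua === '') {
        return true;
    }
    foreach ($banned as $needle) {
        // stripos() does a case-insensitive substring match
        if (stripos($ua, $needle) !== false) {
            return true;
        }
    }
    return false;
}
```

The entry-point code can then simply call is_banned_ua($_SERVER['HTTP_USER_AGENT'] ?? '') and die() when it returns true.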

4. Test effect

If you are on a VPS, testing is simple: use curl -A to simulate a crawl, for example:
Simulate Yisou Spider crawling:
curl -I -A 'YisouSpider' zhang.ge
Simulate crawling with an empty UA:
curl -I -A '' zhang.ge
Simulate the crawling of Baidu Spider:
curl -I -A 'Baiduspider' zhang.ge

The screenshots of the three crawl results are as follows:

[Screenshot: server anti-crawler guide – Apache/Nginx/PHP blocking certain User Agents from crawling]

As you can see, both Yisou Spider and the empty-UA request get a 403 Forbidden response, while Baidu Spider successfully returns 200, which shows the rules are effective!

Supplement: screenshots of the nginx logs from the next day show the effect:

①. Junk scraping requests with an empty UA are intercepted:

[Screenshot: server anti-crawler guide – empty-UA requests returning 403 in the nginx log]

②. Banned UAs are intercepted:

[Screenshot: server anti-crawler guide – banned UAs returning 403 in the nginx log]

Therefore, to deal with junk spiders, we can analyze the site's access logs, identify unfamiliar spider names, and, once confirmed, add them to the deny list in the code above to block them from crawling.
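As a quick way to do that analysis, the user agents in an nginx access log (default "combined" format) can be counted like this. A sketch: the log path and the sample entries are fabricated for illustration.

```shell
# Create a few sample log lines in nginx's default "combined" format
# (fabricated entries, for illustration only)
cat > /tmp/access.sample.log <<'EOF'
1.2.3.4 - - [24/Jul/2020:09:35:54 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"
5.6.7.8 - - [24/Jul/2020:09:36:01 +0800] "GET / HTTP/1.1" 403 162 "-" "YisouSpider"
5.6.7.8 - - [24/Jul/2020:09:36:02 +0800] "GET / HTTP/1.1" 403 162 "-" "YisouSpider"
EOF
# In the combined format the user agent is the 6th double-quoted field;
# count occurrences and sort the most frequent UAs to the top
awk -F'"' '{print $6}' /tmp/access.sample.log | sort | uniq -c | sort -rn
```

Run against the real log (e.g. /var/log/nginx/access.log), the most frequent unfamiliar names at the top of the output are the candidates to add to the deny list.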

5. Appendix: UA Collection

The following is a list of spam UAs commonly seen on the Internet, for reference only; additions are welcome.

FeedDemon              content scraping
BOT/0.1 (BOT for JCE)  SQL injection
CrawlDaddy             SQL injection
Java                   content scraping
Jullo                  content scraping
Feedly                 content scraping
UniversalFeedParser    content scraping
ApacheBench            CC (HTTP flood) attack tool
Swiftbot               useless crawler
YandexBot              useless crawler
AhrefsBot              useless crawler
YisouSpider            useless crawler (acquired by UC Shenma Search; this spider can now be unblocked!)
MJ12bot                useless crawler
ZmEu                   phpMyAdmin vulnerability scanning
WinHttp                scraping / CC attacks
EasouSpider            useless crawler
HttpClient             TCP attacks
Microsoft URL Control  scanning
YYSpider               useless crawler
jaunty                 WordPress brute-force scanner
oBot                   useless crawler
Python-urllib          content scraping
Indy Library           scanning
FlightDeckReports Bot  useless crawler
Linguee Bot            useless crawler

The above is the detailed content of How to set php to prohibit crawling websites. For more information, please follow other related articles on the PHP Chinese website!
