
How PHP implements anti-crawler technology and protects website content

WBOY
2023-06-27 08:36:07

As websites grow richer in content, they attract more visitors, and also more malicious crawlers that scrape and steal that content. Protecting content against crawlers has therefore become a problem that webmasters have to deal with. PHP is a popular open-source scripting language that is easy to learn and well suited to the task. This article walks through three ways to implement anti-crawler measures in PHP.

1. Set HTTP request header

When a normal browser requests a web page, it sends request headers such as User-Agent and Referer. Simple crawlers often omit these headers or send their tool's defaults, so a server can use them as a first signal for spotting automated clients. To see what such headers look like from the client side, PHP's cURL extension provides curl_setopt(), which sets options, including request headers, on an outgoing request:

// Build an outgoing request that carries browser-like headers.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64...)");
curl_setopt($ch, CURLOPT_REFERER, "http://www.example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$data = curl_exec($ch);
curl_close($ch);

The request now carries User-Agent, Referer, and similar headers, which identify the browser type and the referring address. A request that lacks them is likely to be treated as a crawler and blocked. Because any client can set these headers, header checks only filter out the most naive crawlers.
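
On the server side, the same headers can be inspected when a request arrives. The following is a minimal sketch, assuming the check runs at the top of a page script. looksLikeBrowser() and its list of tool signatures are hypothetical, chosen for illustration, and are not part of PHP. The function reads the headers from an array shaped like $_SERVER.

```php
<?php
/**
 * Returns true when the request carries the headers a typical browser sends.
 * Hypothetical helper: $server is expected to look like $_SERVER.
 */
function looksLikeBrowser(array $server): bool
{
    $userAgent = trim((string) ($server['HTTP_USER_AGENT'] ?? ''));
    $accept    = trim((string) ($server['HTTP_ACCEPT'] ?? ''));

    // Browsers always send both headers; many simple scripts send neither.
    if ($userAgent === '' || $accept === '') {
        return false;
    }

    // Illustrative deny-list of common tool signatures (an assumption; extend as needed).
    foreach (['curl', 'wget', 'python-requests', 'scrapy'] as $marker) {
        if (stripos($userAgent, $marker) !== false) {
            return false;
        }
    }

    return true;
}

// Usage in a page script:
// if (!looksLikeBrowser($_SERVER)) {
//     http_response_code(403);
//     exit('Forbidden');
// }
```

A check like this is cheap to run but easy to bypass, so it works best as a first filter in front of stronger measures.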

2. Verification code verification

A verification code (CAPTCHA) is an effective anti-crawler measure: requiring one makes it harder for a program to crawl the site automatically. In PHP, a verification code can be implemented with the GD library and sessions. The code is as follows:

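<?php
session_start();
$width = 90;
$height = 40;
// Characters used in the code; the letters "o" and "O" are left out.
$str = "abcdefghijklmnpqrstuvwxyzABCDEFGHIJKLMNPQRSTUVWXYZ0123456789";
$code = '';
for ($i = 0; $i < 4; $i++) {
    // random_int() is harder to predict than mt_rand().
    $code .= $str[random_int(0, strlen($str) - 1)];
}
// Store the expected answer in the session for later comparison.
$_SESSION['code'] = $code;

// Create a white canvas.
$img = imagecreatetruecolor($width, $height);
$bg_color = imagecolorallocate($img, 255, 255, 255);
imagefill($img, 0, 0, $bg_color);
// Assumes arial.ttf sits next to this script; an absolute path avoids font lookup problems.
$font_file = __DIR__ . "/arial.ttf";
// Draw each character with a random size, dark color, angle, and vertical offset.
for ($i = 0; $i < 4; $i++) {
    $font_size = mt_rand(14, 18);
    $font_color = imagecolorallocate($img, mt_rand(0, 100), mt_rand(0, 100), mt_rand(0, 100));
    $angle = mt_rand(-30, 30);
    $x = intdiv($width, 6) * $i + 6;
    $y = mt_rand(20, $height - 10);
    imagettftext($img, $font_size, $angle, $x, $y, $font_color, $font_file, $code[$i]);
}

// Send the image to the browser as PNG.
header("Content-type: image/png");
imagepng($img);
imagedestroy($img);
?>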

This code uses GD functions to draw a random verification code and saves the code in the session. Embed the image in any page that needs protection, then compare the text the user submits with the value stored in the session: if they match, verification passes; otherwise it fails.
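
The comparison step might look like the following sketch. verifyCaptcha() is a hypothetical helper, and the form field name in the usage comment is an assumption. The session array is passed in by reference so the stored code can be discarded after a single attempt.

```php
<?php
/**
 * Compares the submitted text with the code stored by the image script.
 * Hypothetical helper: $session stands in for $_SESSION.
 */
function verifyCaptcha(array &$session, string $input): bool
{
    $expected = $session['code'] ?? null;

    // One-time use: discard the code whether or not the answer is correct,
    // so a bot cannot keep guessing against the same image.
    unset($session['code']);

    if (!is_string($expected) || $expected === '') {
        return false;
    }

    // Case-insensitive comparison is a design choice made for this sketch.
    return hash_equals(strtolower($expected), strtolower(trim($input)));
}

// Usage in the form handler (the field name "captcha" is an assumption):
// session_start();
// if (!verifyCaptcha($_SESSION, (string) ($_POST['captcha'] ?? ''))) {
//     exit('Verification failed');
// }
```

Discarding the code after each attempt means the page has to request a fresh image whenever verification fails.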

3. Limit access frequency

Some crawlers request pages in a tight loop, which quickly consumes server resources and can bring the site down. To counter this, limit how often each IP address may access the site. In PHP, an in-memory store such as Redis can be used to track access frequency. The code is as follows:

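<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$ip = $_SERVER["REMOTE_ADDR"];
$key = "visit:".$ip;
$count = $redis->get($key);
if (!$count) {
    // First request in this window: start the counter at 1 with a 3-second expiry.
    // setex() takes its arguments in the order (key, ttl, value).
    $redis->setex($key, 3, 1);
} elseif ($count < 10) {
    $redis->incr($key); // incr() leaves the remaining expiry untouched
} else {
    http_response_code(429); // Too Many Requests
    die("Your requests are too frequent, please try again later");
}
?>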

This code uses Redis's incr() function to count the visits from each IP address within the time window, and die() to terminate the request once the count reaches the limit, telling the user to try again later.
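
One weakness of calling get() and then incr() separately is that the key can expire between the two calls, which leaves a counter with no expiry. A possible variation, sketched here under stated assumptions, increments first and sets the expiry only when the counter is new. CounterStore, ArrayCounterStore, RedisCounterStore, and allowRequest() are hypothetical names. The Redis adapter relies only on the phpredis incr() and expire() methods, and the in-memory store exists just so the logic can be tried without a Redis server.

```php
<?php
/** Hypothetical storage abstraction for a fixed-window counter. */
interface CounterStore
{
    /** Increments the counter for $key and returns the new value; a new counter expires after $ttl seconds. */
    public function increment(string $key, int $ttl): int;
}

/** In-memory store for demonstration only; its state is lost when the script ends. */
final class ArrayCounterStore implements CounterStore
{
    private array $counters = []; // key => [count, expiresAt]
    private $clock;

    public function __construct(?callable $clock = null)
    {
        $this->clock = $clock ?? 'time';
    }

    public function increment(string $key, int $ttl): int
    {
        $now = ($this->clock)();
        // Start a fresh window when the key is new or its window has ended.
        if (!isset($this->counters[$key]) || $this->counters[$key][1] <= $now) {
            $this->counters[$key] = [0, $now + $ttl];
        }
        return ++$this->counters[$key][0];
    }
}

/** Adapter for a connected phpredis client. */
final class RedisCounterStore implements CounterStore
{
    private $redis; // expected to be a connected \Redis instance

    public function __construct($redis)
    {
        $this->redis = $redis;
    }

    public function increment(string $key, int $ttl): int
    {
        $count = $this->redis->incr($key); // creates the key at 1 if it is missing
        if ($count === 1) {
            $this->redis->expire($key, $ttl); // start the window on the first hit
        }
        return $count;
    }
}

/** Returns true while the IP is within $limit requests per $window seconds. */
function allowRequest(CounterStore $store, string $ip, int $limit = 10, int $window = 3): bool
{
    return $store->increment('visit:' . $ip, $window) <= $limit;
}

// Usage with Redis:
// $redis = new Redis();
// $redis->connect('127.0.0.1', 6379);
// if (!allowRequest(new RedisCounterStore($redis), $_SERVER['REMOTE_ADDR'])) {
//     http_response_code(429);
//     exit('Your requests are too frequent, please try again later');
// }
```

The default limit of 10 requests per 3-second window mirrors the values in the example above; both are parameters and would need tuning for real traffic.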

In summary, PHP supports several practical anti-crawler techniques. Using HTTP request headers, requiring a verification code, and limiting access frequency together make it harder for malicious crawlers to scrape a site and help protect its content. Webmasters can combine these techniques to improve the security and stability of their sites.
