
PHP and phpSpider: How to deal with the JS challenge of website anti-crawling?

WBOY | Original | 2023-07-21 14:13:10


As Internet technology has evolved, websites' defenses against crawlers have grown increasingly sophisticated. A common anti-crawling technique is to rely on JavaScript: because JavaScript can generate page content dynamically in the browser, a simple crawler that only fetches raw HTML cannot obtain the complete data. This article introduces how to use PHP and phpSpider to deal with JavaScript-based anti-crawling challenges.

phpSpider is a lightweight crawler framework written in PHP. It provides a simple API and a rich feature set suitable for a wide range of web crawling tasks. Its key advantage here is that it can simulate browser behavior, including executing JavaScript code, which allows us to bypass a website's JS-based anti-crawler mechanisms.

First, we need to install phpSpider. You can install it through Composer by running the following command in your project directory (the package name may differ depending on which phpSpider distribution you use):

composer require dungsit/php-spider

After the installation is complete, we can use phpSpider to write crawler scripts in the project.

First, we create a new phpSpider instance and configure the target URLs, HTTP headers, and other options. Here is an example:

<?php
require 'vendor/autoload.php';

use phpspider\core\phpspider;

$configs = array(
    'name' => 'example',
    'log_show' => true,
    'domains' => array(
        'example.com',
    ),
    'scan_urls' => array(
        'http://www.example.com'
    ),
    'list_url_regexes' => array(
        "http://www.example.com/\w+",
    ),
    'content_url_regexes' => array(
        "http://www.example.com/[a-z]+/\d+",
    ),
    'fields' => array(
        array(
            'name' => 'title',
            'selector' => '//h1',
            'required' => true,
        ),
        array(
            'name' => 'content',
            'selector' => '//div[@class="content"]',
            'required' => true,
        ),
    ),
);

$spider = new phpspider($configs);

$spider->start();

In the example above, the scan_urls field specifies the entry URLs where crawling starts, the list_url_regexes field gives the URL pattern that identifies list pages, and the content_url_regexes field gives the URL pattern that identifies content pages. In the fields array, each entry defines a field to extract: its name, its selector (an XPath expression), and whether the field is required.
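To make the role of those two regex fields concrete, here is a small stand-alone sketch (not part of phpSpider itself) showing how the list and content URL patterns from the configuration above partition discovered links:

```php
<?php
// Hypothetical illustration: how phpSpider's list_url_regexes and
// content_url_regexes classify discovered URLs. The delimiters and
// anchors are added here for a self-contained preg_match demo.
$listRegex    = '#^http://www\.example\.com/\w+$#';
$contentRegex = '#^http://www\.example\.com/[a-z]+/\d+$#';

$urls = array(
    'http://www.example.com/news',        // a list page
    'http://www.example.com/news/12345',  // a content page
);

foreach ($urls as $url) {
    if (preg_match($contentRegex, $url)) {
        echo "$url => content page\n";
    } elseif (preg_match($listRegex, $url)) {
        echo "$url => list page\n";
    }
}
```

Note that the content pattern is checked first, since a content URL such as `/news/12345` should not be mistaken for a list page.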

Since our goal is to bypass the website's JS anti-crawler mechanism, we need a phpSpider plugin that can execute JavaScript code. This article uses the ExecuteJsPlugin plugin for that purpose. Note that plain PHP HTTP clients (such as Goutte) do not run JavaScript by themselves, so a plugin like this typically relies on a headless browser engine to render the page. Here is an example of how to use the ExecuteJsPlugin plugin in phpSpider:

<?php
require 'vendor/autoload.php';

use phpspider\core\phpspider;
use phpspider\core\requests;
use phpspider\core\selector;
use phpspider\plugins\execute_js\ExecuteJsPlugin;

// Set the target site's domain and User-Agent (UA)
requests::set_global('domain', 'example.com');
requests::set_global('user_agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3');

$configs = array(
    'name' => 'example',
    'log_show' => true,
    'domains' => array(
        'example.com',
    ),
    'scan_urls' => array(
        'http://www.example.com'
    ),
    'list_url_regexes' => array(
        "http://www.example.com/\w+",
    ),
    'content_url_regexes' => array(
        "http://www.example.com/[a-z]+/\d+",
    ),
    'fields' => array(
        array(
            'name' => 'title',
            'selector' => '//h1',
            'required' => true,
        ),
        array(
            'name' => 'content',
            'selector' => '//div[@class="content"]',
            'required' => true,
        ),
    ),
    'plugins' => array(
        new ExecuteJsPlugin(),
    ),
);

$spider = new phpspider($configs);

$spider->start();

In the above example, we first import the ExecuteJsPlugin plugin from the execute_js plugin package. Then we set the domain name and user agent (UA) of the target website so that phpSpider's requests look like those of a real browser. Finally, we add an ExecuteJsPlugin instance to the plugins field.

With this plugin enabled, content generated by the page's JavaScript becomes available to the field selectors, which are still ordinary XPath expressions. For example, setting the selector to '//div[@class="content"]/q' selects the &lt;q&gt; child elements of the div whose class attribute is "content". Because the plugin executes the page's JavaScript before extraction, such dynamically generated elements can be matched just like static markup.
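To see how such an XPath selector behaves once the rendered HTML is available, here is a stand-alone sketch using only PHP's built-in DOM extension (no phpSpider required); the HTML fragment and the quoted text are made up for illustration:

```php
<?php
// Hypothetical illustration of the XPath selector used above, applied
// to an already-rendered HTML fragment.
$html = '<html><body><div class="content"><q>quoted text</q></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Same selector shape as in the 'fields' configuration: the <q> child
// of the <div> whose class attribute is "content".
$nodes = $xpath->query('//div[@class="content"]/q');

echo $nodes->item(0)->textContent; // prints "quoted text"
```

This is exactly what phpSpider does internally after the plugin has rendered the page: the selector runs against the final DOM, not against the raw HTML the server first returned.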

To sum up, the phpSpider framework combined with the ExecuteJsPlugin plugin lets us handle JavaScript-based anti-crawling challenges. By simulating browser behavior, including JavaScript execution, we can bypass a website's JS anti-crawler mechanism and extract the required data. I hope this article is helpful for your crawler development.

Code sample source: https://github.com/nmred/phpspider

