Use PHP crawler for tourism data analysis

As living standards rise, travel has become a growing focus of everyday life. The busy National Day holiday has just ended, and while the topic is still fresh I figured many people would like to know where everyone usually goes. So I spent about 10 minutes writing a small program that collects travel notes from Mafengwo. It came together that quickly only because it relies entirely on the well-known PHP crawler framework phpspider.

As usual, let's start with how the code is written, by way of introduction ^_^

Mafengwo differs from ordinary websites: its traffic is highly concurrent and some data, such as view counts and like counts, must be real-time, so the site uses Ajax in many places. Ajax is a fairly big obstacle for ordinary collectors.

After studying the Mafengwo site, I settled on this collection route:

Get popular cities -> get the list of travel notes for each city -> get each travel note's content -> extract the title, city, departure time, and other fields from the content.

We will implement this in three steps.

1. Get popular cities


http://www.mafengwo.cn/mdd/citylist/21536.html

First we need to collect these popular cities

When we click a page number, we find that the data is loaded by Ajax using the POST method, and that the last page is 297.

The request submits two parameters: mddid (21536 in this example) and page.

Clearly, page is the page number. There is a problem here, though: phpspider has a URL deduplication mechanism, and every one of these POST requests goes to the same URL, so all but the first would be treated as duplicates. Since the query string does not affect the POST data, we can append ?page=1, ?page=2, ?page=3, ... to make each URL unique. The code can then be written like this:

Set the list page rules:

'list_url_regexes' => array(
    "http://www.mafengwo.cn/mdd/base/list/pagedata_citylist\?page=\d+",
)

Queue all of the city list pages in the entry-page callback function:

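$spider->on_scan_page = function($page, $content, $phpspider) 
{
    // The last page of the Ajax pagination above is 297
    for ($i = 0; $i <= 297; $i++) 
    {
        // The query string only makes each URL unique for deduplication
        $url = "http://www.mafengwo.cn/mdd/base/list/pagedata_citylist?page={$i}";
        $options = array(
            'url_type' => $url,
            'method' => 'post',
            'fields' => array(
                'mddid'=>21536,
                'page'=>$i,
            )
        );
        // Add the popular-city list page URL to the queue
        $phpspider->add_url($url, $options);
    }
};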
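To see why the page number in the query string matters, here is a minimal sketch of a queue that deduplicates by URL alone. The UrlQueue class is a hypothetical stand-in written for this illustration; it is not phpspider's actual implementation, which may work differently internally.

```php
<?php
// Hypothetical stand-in for a crawler queue that deduplicates by URL only.
// This is NOT phpspider's real code, just an illustration of the idea.
final class UrlQueue
{
    /** @var array<string, bool> URLs that have already been queued */
    private array $seen = [];
    /** @var array<int, array> queued requests (url + POST fields) */
    private array $items = [];

    // Returns true if the request was queued, false if its URL was a duplicate.
    public function add(string $url, array $fields = []): bool
    {
        if (isset($this->seen[$url])) {
            return false; // POST fields are ignored when checking for duplicates
        }
        $this->seen[$url] = true;
        $this->items[] = ['url' => $url, 'fields' => $fields];
        return true;
    }

    public function count(): int
    {
        return count($this->items);
    }
}

$base = 'http://www.mafengwo.cn/mdd/base/list/pagedata_citylist';

// Same URL, different POST data: only the first request survives.
$plain = new UrlQueue();
for ($i = 1; $i <= 3; $i++) {
    $plain->add($base, ['mddid' => 21536, 'page' => $i]);
}

// A page number in the query string makes every URL unique.
$tagged = new UrlQueue();
for ($i = 1; $i <= 3; $i++) {
    $tagged->add("{$base}?page={$i}", ['mddid' => 21536, 'page' => $i]);
}

echo $plain->count(), ' vs ', $tagged->count(), PHP_EOL;
```

With URL-only deduplication, the first queue keeps a single request while the second keeps all three, which is the effect the ?page= suffix is meant to achieve.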

2. Get the travel notes list under popular cities

After clicking into a city, we can see its list of travel notes further down the page.

As with the city list, it is loaded by Ajax. Open Chrome's developer tools, click Network, and then click a page number to capture the Ajax URL.

Like the city list, this request is a POST. Its parameters, which also appear in the code further down, are mddid, pageid, sort, cost, days, month, tagid, and page.

Clearly, page is the page number. However, if we POST to the Ajax address directly:

http://www.mafengwo.cn/gonglve/ajax.php?act=get_t …

it returns an error, because the request needs a referring page (a Referer header). Based on the above, our code can be written like this:

First, add the Referer in the on_start callback function:

$spider->on_start = function($phpspider)
{
    $phpspider->add_header('Referer','http://www.mafengwo.cn/mdd/citylist/21536.html');
};
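For readers who want to see what such a request looks like outside the framework, the sketch below builds a POST with a Referer header using PHP's built-in stream context. This is plain PHP for illustration only; build_post_options is a hypothetical helper, and phpspider sends its requests through its own code rather than this one.

```php
<?php
// Build stream-context options for a POST request that carries a Referer header.
// Hypothetical helper for illustration; not part of phpspider.
function build_post_options(array $fields, string $referer): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . "Referer: {$referer}\r\n",
            'content' => http_build_query($fields),
        ],
    ];
}

$options = build_post_options(
    ['mddid' => 21536, 'page' => 1],
    'http://www.mafengwo.cn/mdd/citylist/21536.html'
);
$context = stream_context_create($options);
// The request itself would be sent with file_get_contents($ajaxUrl, false, $context),
// where $ajaxUrl is the Ajax address captured in the developer tools.
```

Nothing is sent over the network here; the example only shows how the Referer and the POST fields end up in the request.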

As with the city list above, set the list-page matching rules:

'list_url_regexes' => array(
    "http://www.mafengwo.cn/gonglve/ajax.php\?act=get_travellist&mddid=\d+", 
)

Then, in the on_list_page callback, check whether the current page is the first page; if it is, read the total number of pages and queue every page in a loop:

// $data_page holds the pagination HTML taken from the Ajax response
preg_match('#<span class="count">共<span>(.*?)</span>页#', $data_page, $out);
for ($i = 0; $i < $out[1]; $i++) 
{
    $v = $page['request']['fields']['mddid'];
    $url = "http://www.mafengwo.cn/gonglve/ajax.php?act=get_travellist&mddid={$v}&page={$i}";
    $options = array(
        'url_type' => $url,
        'method' => 'post',
        'fields' => array(
            'mddid'=>$v,
            'pageid'=>'mdd_index',
            'sort'=>1,
            'cost'=>0,
            'days'=>0,
            'month'=>0,
            'tagid'=>0,
            'page'=>$i,
        )
    );
    // Add the travel-note list page URL to the queue
    $phpspider->add_url($url, $options);
}
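The page-count extraction can be tried on its own. The sketch below wraps it in a hypothetical total_pages helper and runs it against a made-up pagination fragment shaped like the pattern above; it assumes the count is always a plain number, so it uses \d+ instead of the looser .*? and returns 0 when nothing matches.

```php
<?php
// Extract the total page count from the pagination markup; returns 0 if absent.
// Hypothetical helper; assumes the count is a plain integer.
function total_pages(string $paginationHtml): int
{
    if (preg_match('#<span class="count">共<span>(\d+)</span>页#', $paginationHtml, $m)) {
        return (int) $m[1];
    }
    return 0;
}

// Made-up pagination fragment, shaped like the pattern used above.
$sample = '<span class="count">共<span>25</span>页 / <span>370</span>条</span>';
echo total_pages($sample), PHP_EOL;
```

Returning 0 for a missing match means a loop over the page count simply does nothing, instead of reading an undefined array index.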

With the two steps above, the travel-note list pages of every popular city are now in the queue. Next comes the third step: obtain the content-page URLs from these lists, then extract the content.

3. Get the travel note content

In the on_list_page callback we receive the content of each list page, from which we can extract the content-page URLs:

// Collect the content-page links
preg_match_all('#<a href="/i/(.*?)\.html" target="_blank">#', $html, $out);
if (!empty($out[1])) 
{
    foreach ($out[1] as $v) 
    {
        $url = "http://www.mafengwo.cn/i/{$v}.html";
        // Add the content page URL to the queue
        $phpspider->add_url($url);
    }
}
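The link extraction can be checked in isolation as well. In this sketch, extract_note_urls is a hypothetical helper, the HTML fragment and its ids are made up, and the pattern assumes that travel-note ids are numeric. It also removes repeated links before building the URLs.

```php
<?php
// Collect unique content-page URLs from a list page's HTML.
// Hypothetical helper; assumes travel-note ids are numeric.
function extract_note_urls(string $html): array
{
    preg_match_all('#<a href="/i/(\d+)\.html" target="_blank">#', $html, $out);
    $urls = [];
    foreach (array_unique($out[1]) as $id) {
        $urls[] = "http://www.mafengwo.cn/i/{$id}.html";
    }
    return $urls;
}

// Made-up list fragment; the ids are invented for this example.
$html = '<a href="/i/1001.html" target="_blank">A</a>'
      . '<a href="/i/1002.html" target="_blank">B</a>'
      . '<a href="/i/1001.html" target="_blank">A again</a>';
print_r(extract_note_urls($html));
```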

Next, configure the fields to extract from each content page:

'fields' => array(
    // Title
    array(
        'name' => "name",
        'selector' => "//h1[contains(@class,'headtext')]",
        'required' => true,
    ),
    // City
    array(
        'name' => "city",
        'selector' => "//div[contains(@class,'relation_mdd')]//a",
        'required' => true,
    ),
    // Departure time
    array(
        'name' => "date",
        'selector' => "//li[contains(@class,'time')]",
        'required' => true,
    ),
)
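To get a feel for what these XPath selectors match, the sketch below applies them with PHP's DOMDocument and DOMXPath to a small made-up page. extract_fields is a hypothetical helper and the sample markup is invented to fit the selectors; this is not how phpspider evaluates the configuration internally, and the real Mafengwo markup may differ.

```php
<?php
// Apply a list of field configs (name + XPath selector) to an HTML page.
// Hypothetical helper for illustration; requires PHP's DOM extension.
function extract_fields(string $html, array $fields): array
{
    libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $result = [];
    foreach ($fields as $field) {
        $nodes = $xpath->query($field['selector']);
        // Take the text of the first matching node, or null if nothing matched
        $result[$field['name']] = ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->textContent)
            : null;
    }
    return $result;
}

$fields = [
    ['name' => 'name', 'selector' => "//h1[contains(@class,'headtext')]"],
    ['name' => 'city', 'selector' => "//div[contains(@class,'relation_mdd')]//a"],
    ['name' => 'date', 'selector' => "//li[contains(@class,'time')]"],
];

// Made-up page fragment; class names follow the selectors above.
$html = '<html><body>'
      . '<h1 class="headtext lh80">Five days in Dali</h1>'
      . '<div class="relation_mdd"><a href="#">Dali</a></div>'
      . '<ul><li class="time">2016-10-01</li></ul>'
      . '</body></html>';

print_r(extract_fields($html, $fields));
```

Because the selectors use contains(@class, ...), they still match when an element carries extra classes, as the h1 in the sample does.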

Then design a data table to store these fields (name, city, and date).

Of course, we could also collect each travel note's view count, favorites, shares, upvotes, trip spending, and so on. There are too many to list, and the method is the same for all of them.

That completes the program, which comes to fewer than 200 lines of code. Thanks to phpspider's multi-process collection, the data was gathered quickly: more than 70,000 records in total.

With this data in hand, what can we do?

Top 10 tourist cities

The ranking shows that Yunnan is a good place to visit; it is also the place I daydream about day and night...
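A ranking like this can be computed by counting the travel notes per city. The sketch below assumes each collected record is an array with the name, city, and date fields configured earlier; top_cities is a hypothetical helper and the sample rows are invented, not taken from the collected data.

```php
<?php
// Count travel notes per city and return the $n most frequent, highest first.
// Hypothetical helper; expects rows with a 'city' key.
function top_cities(array $notes, int $n = 10): array
{
    $counts = array_count_values(array_column($notes, 'city'));
    arsort($counts); // sort by count, descending, keeping city names as keys
    return array_slice($counts, 0, $n, true);
}

// Invented sample rows, for illustration only.
$notes = [
    ['name' => 'Note A', 'city' => 'Lijiang', 'date' => '2016-05-01'],
    ['name' => 'Note B', 'city' => 'Dali',    'date' => '2016-07-12'],
    ['name' => 'Note C', 'city' => 'Lijiang', 'date' => '2016-10-02'],
    ['name' => 'Note D', 'city' => 'Beijing', 'date' => '2016-08-20'],
    ['name' => 'Note E', 'city' => 'Lijiang', 'date' => '2016-10-03'],
    ['name' => 'Note F', 'city' => 'Dali',    'date' => '2016-10-01'],
];
print_r(top_cities($notes, 2));
```

With the data in a database table, the same ranking would more naturally be a GROUP BY query; the in-memory version is just easier to show in a few lines.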

Proportion of tourist cities during May Day and National Day

Everyone seems to like going to Tibet for May Day, while Qingdao is more popular for National Day. I have never been to either place, which stings a little ~_~!

Next, let’s take a look at the peak tourist seasons in Beijing and Hangzhou this year

More people go to Beijing in July and August, when the city is at its most pleasant, neither too hot nor too cold. I once went to Beijing in August, and it was very comfortable ^_^

Let’s take a look at Hangzhou again

Late March to mid-April is a good season for visiting Hangzhou: spring is warm, the flowers are blooming, and the weather is good. I hear that Prince Bay Park has cherry blossoms and tulips every year, and that they are very beautiful. Oh dear, I have the travel bug again ~_~!
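Seasonal curves like the ones for Beijing and Hangzhou come down to counting one city's travel notes per departure month. The sketch below assumes the date field holds a string that strtotime can parse, such as 2016-07-15, which may not match the raw text on the real pages; monthly_counts is a hypothetical helper and the rows are invented.

```php
<?php
// Count one city's travel notes per calendar month (keys 1 to 12).
// Hypothetical helper; assumes 'date' is parseable by strtotime().
function monthly_counts(array $notes, string $city): array
{
    $months = array_fill(1, 12, 0);
    foreach ($notes as $note) {
        if ($note['city'] !== $city) {
            continue;
        }
        $ts = strtotime($note['date']);
        if ($ts === false) {
            continue; // skip dates that cannot be parsed
        }
        $months[(int) date('n', $ts)]++;
    }
    return $months;
}

// Invented sample rows, for illustration only.
$notes = [
    ['city' => 'Beijing',  'date' => '2016-07-15'],
    ['city' => 'Beijing',  'date' => '2016-08-03'],
    ['city' => 'Beijing',  'date' => '2016-08-21'],
    ['city' => 'Hangzhou', 'date' => '2016-04-02'],
];
$beijing = monthly_counts($notes, 'Beijing');
echo 'Beijing notes in August: ', $beijing[8], PHP_EOL;
```

Pre-filling all twelve months with zero keeps empty months in the result, which makes the counts easy to plot as a continuous curve.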

Okay, that is the end of the article. I would actually like to analyze more, for example by collecting popular routes, popular attractions, popular photo albums, and the prices of travel routes, and eventually build a travel app. If you have good ideas, tell me and I will collect the data for your reference ^_^

