
Use PHP crawler for tourism data analysis

大家讲道理 (Original)
2016-11-11 15:15:55

As living standards improve, travel has become a focus for more and more people. The busy National Day holiday has just passed, and while the topic is still warm, I suspect many people would like to know where everyone goes on holiday. So I spent 10 minutes writing a small program that collects travel notes from Mafengwo. It was that quick only because it relies entirely on the well-known PHP crawler framework phpspider.

As usual, let's look at how the code is written first, as an introduction ^_^

Mafengwo differs from ordinary websites: its traffic is high and some data must be real-time, such as view counts and like counts, so the site uses Ajax in many places. Ajax is a relatively big problem for ordinary collectors.

After studying the Mafengwo site, I settled on this collection route:

Get popular cities -> get the travel notes list for each city -> get each travel note's content -> extract the title, city, departure time, and so on from that content. We implement it in three steps.

1. Get popular cities


http://www.mafengwo.cn/mdd/citylist/21536.html

First we need to collect the popular cities listed on this page.

Clicking a page number shows that the data is loaded via Ajax using the POST method, and that the last page is 297.

The submitted parameters are mddid and page, where page is the page number. This raises a problem: phpspider has a URL deduplication mechanism, and a POST endpoint has only one URL. Because the query string does not affect the POST data, we can append ?page=1, ?page=2, ?page=3, ... to the URL so that every request is unique. The code can then be written like this:

Set the list page rules:

'list_url_regexes' => array(
    "http://www.mafengwo.cn/mdd/base/list/pagedata_citylist\?page=\d+",
)
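The rule is a regular expression, so the literal `?` and the digit class both need a backslash. The sketch below shows the difference. It assumes the pattern is wrapped in `#` delimiters before matching; how phpspider applies `list_url_regexes` internally is not shown in this article, and `matches_list_url` is a hypothetical helper.

```php
<?php
// Hypothetical helper: test a URL against a list-page pattern.
// Assumption: the pattern is wrapped in '#' delimiters before matching.
function matches_list_url(string $pattern, string $url): bool
{
    return preg_match('#' . $pattern . '#', $url) === 1;
}

$escaped   = 'http://www.mafengwo.cn/mdd/base/list/pagedata_citylist\?page=\d+';
$unescaped = 'http://www.mafengwo.cn/mdd/base/list/pagedata_citylist?page=d+';
$url       = 'http://www.mafengwo.cn/mdd/base/list/pagedata_citylist?page=12';

// "\?" matches a literal question mark and "\d+" matches the page digits.
$escaped_hit = matches_list_url($escaped, $url);

// Without the backslashes, "t?" makes the "t" optional and "d+" expects the letter "d".
$unescaped_hit = matches_list_url($unescaped, $url);
```

Only the escaped pattern matches the paginated URL.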

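Queue every city list page from the on_scan_page callback of the entry page:

$spider->on_scan_page = function($page, $content, $phpspider) 
{
    // The last page of the Ajax pagination above is 297
    for ($i = 1; $i <= 297; $i++) 
    {
        // Append the page number to the query string so each POST has a unique URL
        $url = "http://www.mafengwo.cn/mdd/base/list/pagedata_citylist?page={$i}";
        $options = array(
            'url_type' => $url,
            'method' => 'post',
            'fields' => array(
                'mddid'=>21536,
                'page'=>$i,
            )
        );
        // Queue the popular-city list page URL
        $phpspider->add_url($url, $options);
    }
};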
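To illustrate why the page number is mirrored into the query string, here is a toy queue that deduplicates by URL only. It is not phpspider's implementation, just a sketch of the idea; `UrlQueue` is a hypothetical class, and the typed property assumes PHP 7.4 or later.

```php
<?php
// A toy queue that deduplicates by URL only, to illustrate the problem.
// This is NOT phpspider's implementation, just a sketch of the idea.
final class UrlQueue
{
    /** @var array<string, array> queued request options keyed by URL */
    private array $seen = [];

    // Returns true if the request was queued, false if the URL was already seen.
    public function add(string $url, array $options = []): bool
    {
        if (isset($this->seen[$url])) {
            return false;
        }
        $this->seen[$url] = $options;
        return true;
    }

    public function count(): int
    {
        return count($this->seen);
    }
}

$base = 'http://www.mafengwo.cn/mdd/base/list/pagedata_citylist';

// Same URL, different POST bodies: only the first request survives.
$plain = new UrlQueue();
for ($i = 1; $i <= 3; $i++) {
    $plain->add($base, ['method' => 'post', 'fields' => ['mddid' => 21536, 'page' => $i]]);
}

// Page number mirrored into the query string: every request is unique.
$tagged = new UrlQueue();
for ($i = 1; $i <= 3; $i++) {
    $tagged->add("{$base}?page={$i}", ['method' => 'post', 'fields' => ['mddid' => 21536, 'page' => $i]]);
}
```

The first queue ends up with a single entry, while the second keeps all three pages.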

2. Get the travel notes list under popular cities

After clicking into a city, the travel notes list appears further down the page.

Like the city list, it is loaded via Ajax. Open Chrome's developer tools, click Network, and then click a page number to capture the Ajax URL.

As with the city list, the request is a POST. Its parameters are mddid, pageid, sort, cost, days, month, tagid, and page.

Here page is again the page number. However, a direct POST to the Ajax address:

http://www.mafengwo.cn/gonglve/ajax.php?act=get_t …

returns an error, because the request needs a Referer. Based on that, the code can be written as follows.

First, add the Referer header in the on_start callback:

$spider->on_start = function($phpspider)
{
    $phpspider->add_header('Referer','http://www.mafengwo.cn/mdd/citylist/21536.html');
};
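Outside of phpspider, the same idea (a form-encoded POST that carries a Referer) can be sketched with PHP's built-in HTTP stream context. `build_post_options` is a hypothetical helper, and the block only builds the options without sending anything.

```php
<?php
// Build stream-context options for a form-encoded POST that carries a Referer.
// build_post_options() is a hypothetical helper, not part of phpspider.
function build_post_options(array $fields, string $referer): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . "Referer: {$referer}\r\n",
            'content' => http_build_query($fields),
            'timeout' => 10,
        ],
    ];
}

$options = build_post_options(
    ['mddid' => 21536, 'page' => 1],
    'http://www.mafengwo.cn/mdd/citylist/21536.html'
);

// To actually send it (network access required), with $ajax_url set to the Ajax address:
// $response = file_get_contents($ajax_url, false, stream_context_create($options));
```

Whether the server accepts the request with only this header is not something the article confirms beyond the Referer requirement.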

As with the city list above, set the list matching rule:

'list_url_regexes' => array(
    "http://www.mafengwo.cn/gonglve/ajax.php\?act=get_travellist&mddid=\d+", 
)

Then, in the on_list_page callback, check whether this is the first page; if so, read the total page count and queue every page in a loop:

// $data_page is assumed to hold the pagination HTML from the Ajax response (set earlier in the callback, not shown)
preg_match('#<span class="count">共<span>(.*?)</span>页#', $data_page, $out);
// Pages are 1-based, so iterate up to and including the last page
for ($i = 1; $i <= (int) $out[1]; $i++) 
{
    $v = $page['request']['fields']['mddid'];
    $url = "http://www.mafengwo.cn/gonglve/ajax.php?act=get_travellist&mddid={$v}&page={$i}";
    $options = array(
        'url_type' => $url,
        'method' => 'post',
        'fields' => array(
            'mddid'=>$v,
            'pageid'=>'mdd_index',
            'sort'=>1,
            'cost'=>0,
            'days'=>0,
            'month'=>0,
            'tagid'=>0,
            'page'=>$i,
        )
    );
    // Queue the travel notes list page URL
    $phpspider->add_url($url, $options);
}
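The page-count extraction can be tried in isolation. This is a minimal sketch: `total_pages` is a hypothetical helper, and the sample markup is made up to fit the regular expression, since the real Ajax response is not shown here.

```php
<?php
// Extract the total page count from pagination markup.
// Returns 0 when the marker is missing so callers can skip the loop safely.
function total_pages(string $pagination_html): int
{
    if (preg_match('#<span class="count">共<span>(\d+)</span>页#u', $pagination_html, $m)) {
        return (int) $m[1];
    }
    return 0;
}

// Hypothetical sample; the real Ajax response is not shown in the article.
$sample = '<span class="count">共<span>5</span>页 / <span>72</span>条</span>';

$pages = total_pages($sample);

// Pages 2..N still need to be queued once page 1 has been fetched.
$remaining = $pages > 1 ? range(2, $pages) : [];
```

Returning 0 for a missing marker avoids reading an undefined `$out[1]` when a response has no pagination block.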

With the two steps above, the travel notes list pages of all popular cities are in the queue. The third step is to extract the content page URLs from these lists and then extract the content.

3. Get the travel note content

The on_list_page callback receives the content of each list page, from which we can extract the content page URLs:

// Extract the content page links
preg_match_all('#<a href="/i/(.*?)\.html" target="_blank">#', $html, $out);
if (!empty($out[1])) 
{
    foreach ($out[1] as $v) 
    {
        $url = "http://www.mafengwo.cn/i/{$v}.html";
        // Queue the content page URL
        $phpspider->add_url($url);
    }
}
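The link extraction can also be checked on its own. In this sketch `extract_note_urls` is a hypothetical helper, the sample HTML is invented, and the note IDs are assumed to be numeric.

```php
<?php
// Collect unique travel-note URLs from a fragment of list HTML.
// Assumption: note IDs in the /i/<id>.html links are numeric.
function extract_note_urls(string $html): array
{
    preg_match_all('#<a href="/i/(\d+)\.html" target="_blank">#', $html, $out);
    $urls = [];
    foreach (array_unique($out[1]) as $id) {
        $urls[] = "http://www.mafengwo.cn/i/{$id}.html";
    }
    return $urls;
}

// Hypothetical list fragment with one repeated link.
$sample = '<a href="/i/1001.html" target="_blank">A</a>'
        . '<a href="/i/1002.html" target="_blank">B</a>'
        . '<a href="/i/1001.html" target="_blank">A again</a>';

$urls = extract_note_urls($sample);
```

Deduplicating before queuing is optional when the crawler already deduplicates URLs, but it keeps the queue calls to a minimum.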

Next, configure the fields to extract from the content page:

'fields' => array(
    // Title
    array(
        'name' => "name",
        'selector' => "//h1[contains(@class,'headtext')]",
        'required' => true,
    ),
    // Category (city)
    array(
        'name' => "city",
        'selector' => "//div[contains(@class,'relation_mdd')]//a",
        'required' => true,
    ),
    // Departure time
    array(
        'name' => "date",
        'selector' => "//li[contains(@class,'time')]",
        'required' => true,
    ),
)
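The selectors are XPath expressions. They can be tried outside the framework with PHP's DOM extension, as in this sketch. `select_text` is a hypothetical helper and the sample page is invented; the real Mafengwo markup may differ.

```php
<?php
// Evaluate an XPath selector against HTML and return the trimmed text of the first match.
function select_text(string $html, string $xpath): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them for imperfect HTML.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix hints that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Hypothetical content page, shaped to fit the selectors above.
$sample = '<html><body>'
        . '<h1 class="headtext lh80">Ten days in Yunnan</h1>'
        . '<div class="relation_mdd"><a href="#">Yunnan</a></div>'
        . '<ul><li class="time">2016-10-01</li></ul>'
        . '</body></html>';

$name = select_text($sample, "//h1[contains(@class,'headtext')]");
$city = select_text($sample, "//div[contains(@class,'relation_mdd')]//a");
$date = select_text($sample, "//li[contains(@class,'time')]");
```

A `null` result corresponds to a missing field, which is what a `required` field would need to detect.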

Next, design a data table with columns for the extracted fields, such as name, city, and date.

Of course, we could also collect each travel note's views, favorites, shares, upvotes, and so on. There are too many to list, and the method is the same.

That completes the program, which comes to fewer than 200 lines of code. Thanks to phpspider's multi-process collection, the crawl finished quickly and gathered more than 70,000 records.


What can we do with this data?

The top 10 tourist cities are:

[Chart: top 10 tourist cities]

Yunnan is clearly a good place, and it is also one I miss day and night...
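A ranking like this can be computed from the collected records with a few array functions. This is a sketch under assumptions: `top_cities` is a hypothetical helper, each record is an array with a `city` key, and the sample rows are made up rather than taken from the crawl.

```php
<?php
// Count travel notes per city and return the top $n, most popular first.
function top_cities(array $records, int $n = 10): array
{
    $counts = array_count_values(array_column($records, 'city'));
    arsort($counts);
    return array_slice($counts, 0, $n, true);
}

// Made-up sample rows, not real crawl results.
$records = [
    ['city' => 'Yunnan'],
    ['city' => 'Beijing'],
    ['city' => 'Yunnan'],
    ['city' => 'Hangzhou'],
    ['city' => 'Beijing'],
    ['city' => 'Yunnan'],
];

$top2 = top_cities($records, 2);
```

With the records in a database, the same result would normally come from a GROUP BY query instead.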

Share of destinations during May Day and National Day:

[Charts: share of destinations during May Day and during National Day]

Tibet is the favorite during May Day, while Qingdao is more popular during National Day. I have never been to either place, which hurts ~_~!

Next, let's look at this year's peak travel seasons for Beijing and Hangzhou.

[Chart: Beijing travel notes by month]

More people visit Beijing in July and August. Beijing is at its most pleasant then, neither hot nor cold. I once went to Beijing in August and it was very comfortable ^_^

Now let's look at Hangzhou.

[Chart: Hangzhou travel notes by month]

The end of March to mid-April is the best season for visiting Hangzhou, when spring is warm, the flowers are blooming, and the weather is good. I hear that Prince Bay Park has cherry blossoms and tulips every year, and that they are very beautiful. Oh my, the travel bug has bitten again ~_~!
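The monthly view for a single city can be sketched the same way. `notes_per_month` is a hypothetical helper; it assumes each record has `city` and `date` keys and that `date` starts with YYYY-MM, which may not match the raw text the crawler extracts. The sample rows are invented.

```php
<?php
// Count a city's travel notes per departure month (1-12).
// Assumes 'date' is a string that starts with YYYY-MM, such as "2016-08-15".
function notes_per_month(array $records, string $city): array
{
    $months = array_fill(1, 12, 0);
    foreach ($records as $r) {
        if ($r['city'] !== $city) {
            continue;
        }
        if (preg_match('/^\d{4}-(\d{2})/', $r['date'], $m)) {
            $month = (int) $m[1];
            if ($month >= 1 && $month <= 12) {
                $months[$month]++;
            }
        }
    }
    return $months;
}

// Made-up sample rows, not real crawl results.
$records = [
    ['city' => 'Beijing',  'date' => '2016-07-20'],
    ['city' => 'Beijing',  'date' => '2016-08-01'],
    ['city' => 'Beijing',  'date' => '2016-08-15'],
    ['city' => 'Hangzhou', 'date' => '2016-04-02'],
];

$beijing = notes_per_month($records, 'Beijing');

// The busiest month is the key with the highest count.
$peak = array_keys($beijing, max($beijing))[0];
```

Records whose date does not fit the assumed format are skipped rather than guessed at.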

Okay, that's the end of the article. I would actually like to analyze more, such as popular routes, popular attractions, popular photo albums, and the prices of travel routes, and eventually build a travel app. If you have good ideas, tell me and I will collect the data for your reference ^_^

