
Quickly Building a PHP Movie Crawler

WBOY (original)
2016-06-14 00:02:15

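Today we will build a small movie crawler in PHP.
We will collect the data with simple_html_dom, a PHP library that is easy to pick up.
simple_html_dom makes it straightforward to parse HTML documents in PHP. The class wraps the parsing and lets you query and manipulate the HTML elements in a document (requires PHP 5 or later).
Download: https://github.com/samacs/simple_html_dom
As an example, we take the list page http://paopaotv.com/tv-type-id-5-pg-1.html on http://www.paopaotv.com, which presents its titles in an A-Z letter layout, and scrape both the list entries and the information on each entry's detail page.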

<?php
include_once 'simple_html_dom.php';

// Fetch the list page and parse it into a DOM object
$html = file_get_html('http://paopaotv.com/tv-type-id-5-pg-1.html');
// Each A-Z entry is a dl with class "letter-focus-item" inside the div with id "letter-focus"
$listData = $html->find('#letter-focus .letter-focus-item'); // array of element objects
$cate = [];
foreach ($listData as $key => $eachRowData) {
    $filmName = $eachRowData->find('dd span', 0)->plaintext; // title of the film or show
    $filmUrl = $eachRowData->find('dd a', 0)->href; // address of the detail page, taken from the link under dd
    // Fetch the detail page for this title
    $filmInfo = file_get_html('http://paopaotv.com' . $filmUrl);
    $filmDetail = $filmInfo->find('.info dl');
    foreach ($filmDetail as $film) {
        $info = $film->find('dd');
        $row = []; // start from an empty array so join() always receives an array
        foreach ($info as $childInfo) {
            $row[] = $childInfo->plaintext;
        }
        $cate[$key][] = join(',', $row); // store the detail fields for this title
    }
}
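
The loop above assumes that every request succeeds and that every selector matches. A more defensive variant might look like the sketch below. The helper names `absoluteUrl` and `scrapeList` are hypothetical and not part of simple_html_dom; the sketch also assumes that `file_get_html` returns a falsy value on failure (this depends on the library version) and that `find()` with an index returns `null` when nothing matches.

```php
<?php
// Assumes simple_html_dom.php sits next to this script, as in the article.
include_once 'simple_html_dom.php';

/** Hypothetical helper: turn an href from the page into an absolute URL. */
function absoluteUrl(string $base, string $href): string
{
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

/** Hypothetical helper: scrape one list page into an array of name/url/details records. */
function scrapeList(string $base, string $listPath): array
{
    $html = file_get_html(absoluteUrl($base, $listPath));
    if (!$html) {
        return []; // the request or the parse failed
    }
    $films = [];
    foreach ($html->find('#letter-focus .letter-focus-item') as $item) {
        $nameNode = $item->find('dd span', 0);
        $linkNode = $item->find('dd a', 0);
        if ($nameNode === null || $linkNode === null) {
            continue; // entry does not match the expected markup
        }
        $url = absoluteUrl($base, $linkNode->href);
        $details = [];
        $page = file_get_html($url);
        if ($page) {
            foreach ($page->find('.info dl') as $dl) {
                $row = [];
                foreach ($dl->find('dd') as $dd) {
                    $row[] = trim($dd->plaintext);
                }
                $details[] = join(',', $row);
            }
            $page->clear(); // release the parsed tree before the next request
        }
        $films[] = ['name' => trim($nameNode->plaintext), 'url' => $url, 'details' => $details];
    }
    $html->clear();
    return $films;
}
```

Calling `scrapeList('http://paopaotv.com', '/tv-type-id-5-pg-1.html')` would then return one record per list entry, skipping entries whose markup differs from what the selectors expect. Calling `clear()` on each parsed page is worthwhile in a crawl loop because the parsed trees can otherwise accumulate in memory.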

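With simple_html_dom we have now scraped the entries in the paopaotv.com list along with the details of each title. From here you can go on to scrape the video URLs from each detail page and store everything about a title in a database.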
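
As a rough illustration of the storage step, the sketch below writes a `$cate`-shaped array into a database with PDO. Everything specific here is an assumption made for the example: the in-memory SQLite database (which needs the pdo_sqlite extension), the table name `film_detail`, its two columns, and the sample rows, which stand in for real scraped data.

```php
<?php
// Hypothetical sample shaped like the $cate array built by the crawler:
// list index => list of comma-joined detail strings.
$cate = [
    0 => ['Director A,Actor B', '2016,Drama'],
    1 => ['Director C,Actor D'],
];

// In-memory SQLite keeps the sketch self-contained; swap the DSN for a real database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE film_detail (film_index INTEGER NOT NULL, detail TEXT NOT NULL)');

// A prepared statement avoids building SQL from scraped strings.
$stmt = $pdo->prepare('INSERT INTO film_detail (film_index, detail) VALUES (:film_index, :detail)');

$pdo->beginTransaction();
foreach ($cate as $filmIndex => $details) {
    foreach ($details as $detail) {
        $stmt->execute([':film_index' => $filmIndex, ':detail' => $detail]);
    }
}
$pdo->commit();

$rowCount = (int) $pdo->query('SELECT COUNT(*) FROM film_detail')->fetchColumn();
echo $rowCount, " rows stored\n"; // prints "3 rows stored" for the sample data
```

Wrapping the inserts in one transaction keeps a partially scraped page from leaving half of its rows behind if an insert fails.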
Below are the commonly used properties and methods of simple_html_dom:

$html = file_get_html('http://paopaotv.com/tv-type-id-5-pg-1.html');
$e = $html->find('div', 0);
// Tag name
$e->tag;
// Outer text: the element's HTML including its own tag
$e->outertext;
// Inner text: the HTML inside the element
$e->innertext;
// Plain text: the contents with tags stripped
$e->plaintext;
// Child elements: children() returns all of them, children($index) returns one
$e->children(0);
// Parent element
$e->parent();
// First child element
$e->first_child();
// Last child element
$e->last_child();
// Next sibling element
$e->next_sibling();
// Previous sibling element
$e->prev_sibling();
// Array of all a elements
$ret = $html->find('a');
// The first a element
$ret = $html->find('a', 0);

For more usage, see the official manual.
Simple, isn't it? If you have questions, feel free to raise them for discussion.
