How to use the phpspider crawler framework

PHPz (Original)
2016-06-06 20:52:43

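Over the past few days I scraped some data with a PHP crawler framework and found it quite convenient. First, the framework's documentation: phpspider documentation (https://doc.phpspider.org/)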

Usage is explained clearly in the documentation, and the demo also contains working examples, so I am just putting my own code here as a note.

<?php
include "./autoloader.php";
use phpspider\core\phpspider;
/* Do NOT delete this comment */
$configs = array(
    // Name of the crawl task (here, the name of the target site)
    'name' => '中国保温网',
    // Only URLs under these domains are crawled
    'domains' => array(
        'www.cnbaowen.net',
        'cnbaowen.net'
    ),
    // Entry point of the crawl
    'scan_urls' => array(
        'http://www.cnbaowen.net/news/list-3720-1.html'
    ),
    // Write the extracted fields to a database table
    'export' => array(
        'type' => 'db',
        'table' => 'articles_mc',
    ),
    'db_config' => array(
        'host'  => '127.0.0.1',
        'port'  => 3306,
        'user'  => 'root',
        'pass'  => '123456',
        'name'  => 'spider',
    ),
    // Article pages; dots are escaped so they match a literal "."
    'content_url_regexes' => array(
        'http://www\.cnbaowen\.net/news/show-\d+\.html'
    ),
    // Paginated list pages of category 3720
    'list_url_regexes' => array(
        'http://www\.cnbaowen\.net/news/list-3720-\d+\.html'
    ),
    'fields' => array(
        array(
            // Extract the article title from the content page
            'name' => "title",
            'selector' => "//h1[@id='title']",
            'required' => true
        ),
        array(
            // Extract the article body from the content page
            'name' => "content",
            'selector' => "//div[@id='content']",
            'required' => true
        ),
        array(
            // Not scraped from the page; the value is set in on_extract_field
            'name' => "type"
        ),
        array(
            // Not scraped from the page; the value is set in on_extract_field
            'name' => "site_id"
        ),
    ),
);
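$spider = new phpspider($configs);
// On every list page, queue list pages 2 through 23 of category 3720
$spider->on_list_page = function($page, $content, $spider){
    for ($i = 2; $i < 24; $i++)
    {
        $url = "http://www.cnbaowen.net/news/list-3720-{$i}.html";
        $spider->add_url($url);
    }
};
// Post-process each extracted field before it is exported
$spider->on_extract_field = function($fieldname, $data, $page){
    if($fieldname === "type"){
        // Fixed article type for this crawl (see the column comment in the SQL)
        return 2;
    }elseif($fieldname === "content"){
        // Remove the right-floated <div> block
        $s = preg_replace("/<div style=\"float:right[\s\S]*?div>/","",$data);
        // Point every link at "#" instead of its original target
        $s = preg_replace('/<a .*?href="(.*?)".*?>/is',"<a href='#'>",$s);
        // Remove all images
        $data = preg_replace('/<img.*?>/is',"",$s);
        return $data;
    }elseif($fieldname === "site_id"){
        // Fixed site id for this crawl
        return 1;
    }else{
        return $data;
    }
};
$spider->start();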
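
To check the URL patterns from the configuration above without starting a crawl, they can be tried against a few sample URLs. The following is only a sketch: `classify_url()` is a hypothetical helper that is not part of phpspider, the `#` delimiters and the `^`/`$` anchors are my own assumptions about how the patterns are applied, the dots are written escaped so that they match a literal period, and the article id `12345` is made up.

```php
<?php
// Hypothetical helper, not part of phpspider: reports which pattern a URL matches.
// Assumption: patterns are wrapped in '#' delimiters; the anchors are added here for strictness.
function classify_url(string $url): string
{
    $content_pattern = 'http://www\.cnbaowen\.net/news/show-\d+\.html';
    $list_pattern    = 'http://www\.cnbaowen\.net/news/list-3720-\d+\.html';

    if (preg_match('#^' . $content_pattern . '$#i', $url) === 1) {
        return 'content';
    }
    if (preg_match('#^' . $list_pattern . '$#i', $url) === 1) {
        return 'list';
    }
    return 'ignored';
}

$samples = array(
    'http://www.cnbaowen.net/news/show-12345.html',  // made-up article id
    'http://www.cnbaowen.net/news/list-3720-2.html', // second list page
    'http://www.cnbaowen.net/news/list-9999-2.html', // a different category
);

foreach ($samples as $url) {
    echo classify_url($url), "\t", $url, "\n";
}
```

If a URL that should be crawled comes back as `ignored`, the pattern is the first thing to check.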

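Note: when scraping the pages I only need the title and the body, but the database row needs two more columns. That is why the field definitions also include `type` and `site_id`. These two fields are not extracted from the page; their values are assigned in the `on_extract_field` callback.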
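
The field handling described in the note can be tried outside the crawler. The sketch below repeats the same logic in a plain function so it can be run on a sample string; `fill_field()` and the sample HTML are hypothetical and exist only for illustration, while the constants and regular expressions are the ones from the callback.

```php
<?php
// Hypothetical stand-alone version of the on_extract_field logic (not a phpspider API).
function fill_field(string $fieldname, $data)
{
    switch ($fieldname) {
        case 'type':
            return 2; // fixed value; nothing is read from the page
        case 'site_id':
            return 1; // fixed value; nothing is read from the page
        case 'content':
            // Remove the right-floated <div>, neutralise link targets, drop images
            $s = preg_replace('/<div style="float:right[\s\S]*?div>/', '', $data);
            $s = preg_replace('/<a .*?href="(.*?)".*?>/is', "<a href='#'>", $s);
            return preg_replace('/<img.*?>/is', '', $s);
        default:
            return $data; // e.g. the title is passed through unchanged
    }
}

// Made-up sample markup
$html = '<div style="float:right">share</div>'
      . '<p>Text <a class="x" href="http://example.com/a">link</a><img src="a.png"></p>';

echo fill_field('content', $html), "\n"; // <p>Text <a href='#'>link</a></p>
echo fill_field('type', null), "\n";     // 2
echo fill_field('site_id', null), "\n";  // 1
```

Because `type` and `site_id` have no selector, the incoming `$data` for them is ignored and the returned constant is what ends up in the row.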

The accompanying SQL statement:

CREATE TABLE `articles_mc` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(200) DEFAULT NULL,
  `content` text,
  `type` int(5) DEFAULT '0' COMMENT 'Article type: 1 = industry news, 2 = technical material',
  `site_id` int(5) DEFAULT NULL COMMENT 'Site ID',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4887 DEFAULT CHARSET=utf8mb4;
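
The field names in the crawler configuration line up with the columns of this table. As a rough illustration of that mapping, and not phpspider's own export code, the sketch below builds a parameterised INSERT from one extracted row. `build_insert()` and the sample row are hypothetical; the commented-out PDO lines assume the credentials from `db_config` and a reachable MySQL server.

```php
<?php
// Hypothetical helper for illustration; this is not how phpspider is implemented internally.
// Turns an associative row into an INSERT statement with named placeholders.
function build_insert(string $table, array $row): array
{
    $columns      = array_keys($row);
    $quoted       = array_map(function ($c) { return '`' . $c . '`'; }, $columns);
    $placeholders = array_map(function ($c) { return ':' . $c; }, $columns);

    $sql = sprintf(
        'INSERT INTO `%s` (%s) VALUES (%s)',
        $table,
        implode(', ', $quoted),
        implode(', ', $placeholders)
    );
    return array($sql, $row);
}

// Made-up row with the four fields defined in the crawler configuration
$row = array(
    'title'   => 'Sample title',
    'content' => '<p>Sample body</p>',
    'type'    => 2,
    'site_id' => 1,
);

list($sql, $params) = build_insert('articles_mc', $row);
echo $sql, "\n";

// With a live database this could be executed as follows (assumed credentials):
// $pdo = new PDO('mysql:host=127.0.0.1;port=3306;dbname=spider;charset=utf8mb4', 'root', '123456');
// $pdo->prepare($sql)->execute($params);
```

The `id` column is left out on purpose, since it is filled by `AUTO_INCREMENT`.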
