PHP Book Website Scraping Tutorial with Examples

WBOY

Jun 21, 2016, 08:57 AM

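I have seen many simple scraping tutorials online, a good number of them aimed at book websites, but few come with a working example. After reading an analysis of scraping 八路中文网 (86zw.com), I decided to write a simple scraping tutorial for that site, with example code. To save effort, much of the analysis here comes from 《利用PHP制作简单的内容采集器》 ("Building a Simple Content Scraper with PHP"); I have only streamlined its workflow and written the example code.
   A scraper is not hard to build: work out the flow of pages, then use suitable regular expressions to extract the content you want. Without further ado, the tutorial begins:
   1. Analyze the entry point:
   After opening several books, you can see that each book's index page URL follows the pattern http://www.86zw.com/Book/{book ID}/Index.aspx. This gives:

Code:
$BookId='1888';
$index="http://www.86zw.com/Book/".$BookId."/Index.aspx";// build the URL of the book's index page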
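The URL construction above can be wrapped in a small function. This is only a sketch: the name `buildIndexUrl` is hypothetical, and the check that the book ID is purely numeric is an assumption based on the sample ID `1888`.

```php
<?php
// Hypothetical helper (not part of the original tutorial): builds the index
// page URL for a book ID and rejects anything that is not purely numeric.
function buildIndexUrl(string $bookId): string
{
    // Assumption: book IDs on the site consist of digits only.
    if (!ctype_digit($bookId)) {
        throw new InvalidArgumentException("Book ID must be numeric: {$bookId}");
    }
    return "http://www.86zw.com/Book/" . $bookId . "/Index.aspx";
}

echo buildIndexUrl('1888'), "\n"; // http://www.86zw.com/Book/1888/Index.aspx
```

Validating the ID before concatenating it keeps a stray value from producing a URL outside the /Book/ path.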
2. Open the page:

Code:
$contents=file_get_contents($index);
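`file_get_contents()` returns `false` when the request fails, so it may be worth checking the result before parsing it. A minimal sketch follows; the function name `fetchPage` and the 10-second timeout are illustrative choices, not something the tutorial specifies.

```php
<?php
// Hypothetical wrapper around file_get_contents(); the name and the default
// timeout are assumptions made for this sketch.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    // The 'http' options only take effect for http:// and https:// URLs.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);
    // @ suppresses the PHP warning so the failure can be reported as an
    // exception instead.
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Could not read: {$url}");
    }
    return $body;
}
```

With this in place, step 2 would read `$contents = fetchPage($index);`.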
3. Scrape the book information page:

Code:
// Extract the book's details
preg_match_all("/

(.*)/is",$contents,$Arraytitle);
preg_match_all("/【点击阅读】/is",$contents,$Arraylist);
unset($contents);
$title=$Arraytitle[1][0];// book title
$list="http://www.86zw.com".trim($Arraylist[1][0]);// URL of the chapter list page
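The extraction step can be illustrated on a small sample string. Everything about the markup below is invented for the sketch (the real page's HTML and the tutorial's exact patterns are not reproduced here), and `firstMatch` is a hypothetical helper name.

```php
<?php
// Sample markup invented for this sketch; the tag names and link text are
// assumptions, not the real site's HTML.
$contents = '<html><head><title>Sample Book</title></head>'
          . '<body><a href="/Book/1888/Index.shtm">Read</a></body></html>';

// Returns the first capture group of $pattern in $html, or null if the
// pattern does not match.
function firstMatch(string $pattern, string $html): ?string
{
    if (preg_match($pattern, $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

$title = firstMatch('/<title>(.*?)<\/title>/is', $contents);
$path  = firstMatch('/<a href="([^"]+)">Read<\/a>/is', $contents);

if ($title === null || $path === null) {
    exit("Pattern did not match; the page layout may have changed.\n");
}
$list = "http://www.86zw.com" . $path;
echo $title, "\n"; // Sample Book
echo $list, "\n";  // http://www.86zw.com/Book/1888/Index.shtm
```

Checking for a failed match before indexing into the result avoids undefined-offset notices when the site changes its layout.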
4. Create the save directory and file:

Code:
// Build the text file name
$txt_name=$title.".txt";
Creatdir($BookId);// create the image folder
writeStatistic($title."\r\n",$txt_name);// write the book title to the text file
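The text shown here calls `Creatdir()` and `writeStatistic()` without defining them. The bodies below are guesses at what such helpers might look like, inferred only from how they are called (a directory name, and a string plus a file name); they are not the author's implementations.

```php
<?php
// Guessed implementations: the bodies are assumptions inferred from the
// call sites, not code from the tutorial.

// Creates the directory (and any missing parents) if it does not exist yet.
function Creatdir(string $dir): void
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("Could not create directory: {$dir}");
    }
}

// Appends $text to the end of the file $fileName, creating it if needed.
function writeStatistic(string $text, string $fileName): void
{
    if (file_put_contents($fileName, $text, FILE_APPEND | LOCK_EX) === false) {
        throw new RuntimeException("Could not write to: {$fileName}");
    }
}
```

Because the file name is built from the scraped title, a title containing characters such as `/` or `:` may need to be sanitized before it is used as `$txt_name`.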
5. Open the list page:

Code:
// Open the list page
$list_contents=file_get_contents($list);
6. Scrape the chapters on the list page:

Code:
// Split the list page into one block per chapter
preg_match_all("|

(.*) 【分卷阅读】(.*)

|Uis",$list_contents,$Block);
// Count the total number of chapters
$regcount=count($Block[0]);
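The `U` modifier in the pattern above makes the quantifiers lazy by default, which is what lets one `preg_match_all()` call split the page into separate blocks. Here is a sketch on invented markup; the `div`/`h2`/`ul` structure is an assumption, not the real list page.

```php
<?php
// Invented list-page markup; the real page structure is an assumption.
$list_contents = <<<HTML
<div class="vol"><h2>Volume One</h2><ul>
<li><a href="1.shtm">Chapter A</a></li>
<li><a href="2.shtm">Chapter B</a></li>
</ul></div>
<div class="vol"><h2>Volume Two</h2><ul>
<li><a href="3.shtm">Chapter C</a></li>
</ul></div>
HTML;

// U makes (.*) lazy, so each match stops at the first closing </ul></div>
// instead of running on to the last one; s lets . match newlines.
preg_match_all(
    '|<div class="vol"><h2>(.*)</h2><ul>(.*)</ul></div>|Uis',
    $list_contents,
    $Block
);

$regcount = count($Block[0]); // number of blocks found
echo $regcount, "\n";         // 2
echo $Block[1][1], "\n";      // Volume Two
```

Without `U` (or an explicit `.*?`), the first `(.*)` would be greedy and the whole page would collapse into a single match.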
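7. Scrape chapter by chapter:

Code:
// Enter each chapter
// (loop condition reconstructed from $regcount; the original was lost from the page)
for($pageBookNum=0;$pageBookNum<$regcount;$pageBookNum++){
    unset($Zhang);
    unset($list_url);
    $Zhang=$Block[1][$pageBookNum];// chapter title
    writeStatistic('Chapter: '.($pageBookNum+1).' '.$Zhang."\r\n",$txt_name);// write the chapter title to the text file
    preg_match_all("|

(.*)|Uis",$Block[3][$pageBookNum],$list_url);
    // Enter each page
    // (loop condition is an assumption: one iteration per matched link)
    for($ListNum=0;$ListNum<count($list_url[1]);$ListNum++){
        unset($Book_url);
        unset($Book);
        unset($Book_contents);
        unset($Book_time);
        unset($Book_title);
        $Book_time=$list_url[2][$ListNum];// update info for the sub-chapter
        $Book_title=$list_url[3][$ListNum];// sub-chapter title
        $Book_url=preg_replace("'Index.shtm'si",$list_url[1][$ListNum],$list);// URL of the sub-chapter
        writeStatistic(($ListNum+1).'.'.$Book_title.'-'.$Book_time."\r\n",$txt_name);// write the sub-chapter title to the text file
        $Book=file_get_contents($Book_url);
        // Extract the book content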
        preg_match_all("/

(.*)
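The sub-chapter URL in step 7 is derived by swapping the `Index.shtm` part of the list-page URL for the chapter's own file name. The sketch below isolates that substitution; the function name `chapterUrl` and the sample file names are hypothetical.

```php
<?php
// Hypothetical helper isolating the URL substitution from step 7.
function chapterUrl(string $listUrl, string $chapterFile): string
{
    // preg_quote() escapes the dot so it only matches a literal ".";
    // the i flag keeps the match case-insensitive, as in the tutorial.
    $pattern = "'" . preg_quote("Index.shtm", "'") . "'i";
    return preg_replace($pattern, $chapterFile, $listUrl);
}

// Made-up values for illustration.
$list = "http://www.86zw.com/Book/1888/Index.shtm";
foreach (["100.shtm", "101.shtm"] as $file) {
    echo chapterUrl($list, $file), "\n";
}
// http://www.86zw.com/Book/1888/100.shtm
// http://www.86zw.com/Book/1888/101.shtm
```

In the tutorial's pattern the dot is unescaped, so it would also match any single character in that position; escaping it is a small tightening rather than a change in behavior for this URL.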
