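A scraper is not actually hard to write: once you have worked out the flow, you only need suitable regular expressions to pull out the content you want. Without further ado, here is the tutorial: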
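1. Analyze the entry point:
After opening several books, you can see that a book's index URL follows the pattern http://www.86zw.com/Book/{book ID}/Index.aspx. That gives:
Code:
$BookId='1888';
$index="http://www.86zw.com/Book/".$BookId."/Index.aspx";//build the URL of the book's index page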
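The URL construction above can be wrapped in a small function that also rejects non-numeric IDs. This is only a sketch: `buildIndexUrl` is a hypothetical helper name, and the assumption that book IDs are purely numeric comes from the single example ID shown above.

```php
<?php
// Hypothetical helper (not part of the original tutorial): build the index
// URL for a book ID, rejecting anything that is not purely numeric.
function buildIndexUrl(string $bookId): string
{
    // \A and \z anchor the whole string, so "1888\n" is rejected as well.
    if (preg_match('/\A[0-9]+\z/', $bookId) !== 1) {
        throw new InvalidArgumentException("Book ID must be numeric: " . $bookId);
    }
    return "http://www.86zw.com/Book/" . $bookId . "/Index.aspx";
}

echo buildIndexUrl('1888'), "\n";
```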
2. Open the page:
Code:
$contents=file_get_contents($index);
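A bare `file_get_contents()` call returns `false` and raises a warning when the request fails, and it can block for a long time on a slow server. One possible wrapper is sketched below; the function name `fetchPage`, the 10-second timeout and the User-Agent string are all assumptions chosen for illustration.

```php
<?php
// Hypothetical wrapper around file_get_contents(): adds a timeout and a
// User-Agent header, and returns null instead of false on failure.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            // Assumed value; any descriptive User-Agent string would do.
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)',
        ],
    ]);
    // @ suppresses the warning raised on failure; the return value is checked instead.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}
```

With this in place, `$contents=fetchPage($index);` could stand in for the direct call, followed by a check for `null` before any pattern matching is attempted.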
3. Scrape the book information page:
Code:
//extract the book's details
preg_match_all("/
preg_match_all("/【点击阅读】/is",$contents,$Arraylist);
unset($contents);
$title=$Arraytitle[1][0];//book title
$list="http://www.86zw.com".trim($Arraylist[1][0]);//list page URL
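To show how `preg_match_all()` fills `$Arraytitle[1][0]` and `$Arraylist[1][0]`, here is a minimal sketch run against made-up markup. The `<h1 class="title">` tag, the `/Html/Book/1/1888/Index.shtm` path and both patterns are assumptions for illustration, not the site's real structure.

```php
<?php
// Hypothetical markup: the real page structure is not shown in the tutorial.
$contents = '<h1 class="title">Sample Book</h1>'
          . '<a href="/Html/Book/1/1888/Index.shtm">【点击阅读】</a>';

// Capture group 1 holds the text between the (assumed) title tags.
preg_match_all('/<h1 class="title">(.*?)<\/h1>/is', $contents, $Arraytitle);
// Capture group 1 holds the href of the link whose text is 【点击阅读】.
preg_match_all('/<a href="([^"]+)">【点击阅读】<\/a>/is', $contents, $Arraylist);

// Index [1] is the first capture group, [0] the first match on the page.
// The ?? fallback avoids an undefined-index notice when nothing matched.
$title = $Arraytitle[1][0] ?? '';
$path  = $Arraylist[1][0] ?? '';
$list  = "http://www.86zw.com" . trim($path);

echo $title, "\n", $list, "\n";
```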
4. Create the output directory and file:
Code:
//build the text file name
$txt_name=$title.".txt";
Creatdir($BookId);//create the image folder
writeStatistic($title."\r\n",$txt_name);//write the book title to the text file
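`Creatdir()` and `writeStatistic()` are not built-in PHP functions, and their definitions are not shown here. A plausible sketch follows, assuming that `Creatdir` creates a directory when it is missing and that `writeStatistic` appends text to a file; the real helpers may behave differently.

```php
<?php
// Assumed behaviour: create the directory (and any parents) if it is missing.
function Creatdir(string $dir): bool
{
    if (is_dir($dir)) {
        return true;
    }
    return mkdir($dir, 0777, true);
}

// Assumed behaviour: append $text to $filename, creating the file if needed.
// LOCK_EX guards against interleaved writes from concurrent runs.
function writeStatistic(string $text, string $filename): bool
{
    return file_put_contents($filename, $text, FILE_APPEND | LOCK_EX) !== false;
}
```

Appending rather than overwriting matters because the later steps call `writeStatistic()` once per chapter and once per section on the same file.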
5. Open the list page:
Code:
//open the list page
$list_contents=file_get_contents($list);
6. Scrape the chapters on the list page:
Code:
//open the list page
//split the page into chapter blocks
preg_match_all("|
//count the chapters
$regcount=count($Block[0]);
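The block-splitting idea can be sketched as follows: one pattern captures each chapter's title and the markup that holds its links, and `count($Block[0])` gives the number of chapters. The `<div class="chapter">` markup and the pattern are invented for this example.

```php
<?php
// Hypothetical list-page markup; the real structure is not shown in the tutorial.
$list_contents = '<div class="chapter"><h2>Volume One</h2><ul><li>a</li></ul></div>'
               . '<div class="chapter"><h2>Volume Two</h2><ul><li>b</li></ul></div>';

// Group 1: chapter title. Group 2: the block body holding that chapter's links.
// The non-greedy (.*?) stops at the first closing tag, so blocks stay separate.
preg_match_all('|<div class="chapter"><h2>(.*?)</h2>(.*?)</div>|is', $list_contents, $Block);

$regcount = count($Block[0]); // $Block[0] holds one full match per chapter block

echo $regcount, "\n";
echo $Block[1][0], "\n";
```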
7. Scrape chapter by chapter:
Code:
//enter each chapter
for($pageBookNum=0;$pageBookNum<$regcount;$pageBookNum++){
unset($Zhang);
unset($list_url);
$Zhang=$Block[1][$pageBookNum];//chapter title
writeStatistic('Chapter '.($pageBookNum+1).' '.$Zhang."\r\n",$txt_name);//write the chapter title to the text file
preg_match_all("|
//enter each page
for($ListNum=0;$ListNum
unset($Book);
unset($Book_contents);
unset($Book_time);
unset($Book_title);
$Book_time=$list_url[2][$ListNum];//section update info
$Book_title=$list_url[3][$ListNum];//section title
$Book_url=preg_replace("'Index.shtm'si",$list_url[1][$ListNum],$list);//section URL
writeStatistic(($ListNum+1).'.'.$Book_title.'-'.$Book_time."\r\n",$txt_name);//write the section title to the text file
$Book=file_get_contents($Book_url);
//scrape the book content
preg_match_all("/
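The inner loop of this step turns each section link into a full URL by swapping the trailing `Index.shtm` of the list URL for the section's own file name. A minimal sketch of that part is given below. The link markup, the list URL path and the loop bound `count($list_url[1])` are assumptions; only the group layout (1 = link, 2 = update info, 3 = title) follows the code above.

```php
<?php
// Hypothetical values; real markup and paths are not shown in the tutorial.
$list  = 'http://www.86zw.com/Html/Book/1/1888/Index.shtm';
$block = '<li><a href="1001.shtm" title="Updated: 2008-01-01">Section A</a></li>'
       . '<li><a href="1002.shtm" title="Updated: 2008-01-02">Section B</a></li>';

// Group 1: page file name, group 2: update info, group 3: section title.
preg_match_all('|<a href="([^"]+)" title="([^"]*)">(.*?)</a>|is', $block, $list_url);

$pages = [];
for ($ListNum = 0; $ListNum < count($list_url[1]); $ListNum++) {
    // Swap the trailing Index.shtm of the list URL for the section's file name.
    // The dot is escaped so that it matches a literal "." only.
    $Book_url = preg_replace("'Index\.shtm'si", $list_url[1][$ListNum], $list);
    $pages[] = [$Book_url, $list_url[3][$ListNum], $list_url[2][$ListNum]];
}

print_r($pages);
```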
