Do programmers still read novels with advertisements?
Many of us read novels online and dip into a few chapters now and then. The sites mostly turn up through a Baidu search, and nearly all of them carry very annoying advertisements. Some wrap the whole page div in a link, so an accidental tap jumps to another website or even into an endless redirect loop. Some mobile apps are full of ads too. Having some spare time, I wrote a small program to get rid of the ads.
This article uses PHP cURL to fetch the page and simple_html_dom to parse it, so that the ads are actually removed.
Pick a book on any novel website. The site used here is particularly troublesome on mobile phones because of the problems described above.
This novel will serve as the test subject. (Disclaimer: this is not a promotion; the content will be removed on request if it infringes.)
1. Understand the GET method of cURL
curl is a command-line tool that uploads or downloads data through a specified URL and displays it. The c in curl stands for client, and URL is the address of the resource.
In PHP, cURL can send both GET and POST requests.
Simply crawling a novel only requires the GET method.
The sample code below fetches the HTML of the novel's first chapter with a GET request. To fetch a different page, you only need to change the URL.
The steps are: initialize, set options (including certificate verification), execute, and close.
<?php
header("Content-Type:text/html;charset=utf-8");
$url = "https://www.7kzw.com/85/85445/27248636.html";

// 1. Initialize
$ch = curl_init($url);

// 2. Set options
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response as a string instead of printing it (required)
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // timeout in seconds (required)
curl_setopt($ch, CURLOPT_HEADER, 0); // 1 includes the response headers in the output, 0 omits them
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // do not verify the SSL certificate

// 3. Execute
$res = curl_exec($ch);

// 4. Close
curl_close($ch);

print_r($res);
?>
The comments spell out each step of sending a cURL GET request. For a POST request, you additionally set the POST option and pass the parameters. The last line prints the fetched page, which appears without any CSS styling.
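The POST variant mentioned above can be sketched as follows. This is only an illustration: the helper names build_post_options and curl_post, and the example form fields, are assumptions made for this sketch and are not part of the original code. The demo at the bottom only builds the option array, so nothing is sent over the network.

```php
<?php
/**
 * Build the cURL options for a simple form POST (hypothetical helper).
 */
function build_post_options(array $fields): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,  // return the response as a string
        CURLOPT_TIMEOUT        => 10,    // timeout in seconds
        CURLOPT_POST           => true,  // switch the request method to POST
        CURLOPT_POSTFIELDS     => http_build_query($fields), // URL-encoded body
    ];
}

/**
 * Send the POST request (hypothetical helper, not called in this demo).
 */
function curl_post(string $url, array $fields)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, build_post_options($fields));
    $res = curl_exec($ch); // string on success, false on failure
    curl_close($ch);
    return $res;
}

// Example fields are placeholders for illustration only.
$options = build_post_options(['keyword' => 'novel', 'page' => 2]);
echo $options[CURLOPT_POSTFIELDS], PHP_EOL; // keyword=novel&page=2
```

Compared with the GET example, the only additions are CURLOPT_POST and CURLOPT_POSTFIELDS.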
2. Parse the page
The fetched page contains a lot of content we do not need. To extract what we want, such as the chapter title and the chapter text, we have to parse the page.
There are many ways to parse a page; simple_html_dom is used here. Download simple_html_dom.php, include it, create an instance, and call its methods. For the details of each method, see the official website or other documentation on the PHP Chinese website.
First inspect the source of the novel page and find the elements that hold the chapter title and the chapter text.
The title is in the h1 inside the element with the class bookname.
The text is in the div with the id content.
simple_html_dom provides a find method that locates elements with jQuery-like selectors. For example:
find('.bookname h1'); // find the h1 title element under the class bookname
find('#content'); // find the chapter text in the element with the id content
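For readers who want to try the two lookups without downloading simple_html_dom.php, here is a rough equivalent using PHP's bundled DOM extension (DOMDocument and DOMXPath). This is a swapped-in technique, not the library the article uses. The sample markup and the function name find_title_and_content are made up for illustration.

```php
<?php
// Hypothetical sample markup that mimics the structure described above.
$sample = '<html><body>'
    . '<div class="bookname"><h1>Chapter 1</h1></div>'
    . '<div id="content">First line.<br>Second line.</div>'
    . '</body></html>';

/**
 * Return the chapter title and the inner HTML of the content element.
 */
function find_title_and_content(string $pageHtml): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real pages are rarely valid HTML
    // Note: real pages may need an encoding hint; this sample is plain ASCII.
    $doc->loadHTML($pageHtml);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Rough equivalent of the '.bookname h1' selector
    $h1 = $xpath->query(
        '//*[contains(concat(" ", normalize-space(@class), " "), " bookname ")]//h1'
    )->item(0);
    // Equivalent of the '#content' selector
    $div = $xpath->query('//*[@id="content"]')->item(0);

    // Rebuild the inner HTML by serializing each child node.
    $content = '';
    if ($div !== null) {
        foreach ($div->childNodes as $child) {
            $content .= $doc->saveHTML($child);
        }
    }

    return [
        'title'   => $h1 !== null ? trim($h1->textContent) : '',
        'content' => $content,
    ];
}

$result = find_title_and_content($sample);
echo $result['title'], PHP_EOL;
echo $result['content'], PHP_EOL;
```

The XPath expressions are longer than the CSS-style selectors, which is one reason a helper library such as simple_html_dom is convenient.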
Add the following to the cURL code from section 1:
include "simple_html_dom.php";

$html = new simple_html_dom();
@$html->load($res);

// Defaults, so that a missing element does not leave a variable undefined
$artic = ['title' => '', 'content' => ''];
$content = '';

// Find the chapter title
$h1 = $html->find('.bookname h1');
foreach ($h1 as $k => $v) {
    $artic['title'] = $v->innertext;
}

// Find the chapter text
$divs = $html->find('#content');
foreach ($divs as $k => $v) {
    $content = $v->innertext;
}

// Strip the unwanted parts with a regular expression
$pattern = "/(<p>.*?<\/p>)|(<div .*?>.*?<\/div>)/";
$artic['content'] = preg_replace($pattern, '', $content);

echo $artic['title'].'<br>';
echo $artic['content'];
The find method returns an array of matches, so foreach is used to read them. A regular expression replacement then removes the text advertisements embedded in the chapter, and the title and chapter text are stored in an array. That completes the simplest version.
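To see what the regular expression actually removes, here is a small standalone sketch. The pattern is the one from the code above; the sample strings are invented for illustration. It also shows one limitation: without the s modifier, a dot does not match a newline, so an ad block that spans several lines is left in place.

```php
<?php
// The pattern used in the article.
$pattern = "/(<p>.*?<\/p>)|(<div .*?>.*?<\/div>)/";

// Hypothetical chapter markup with two injected ad blocks.
$content = 'Line one.<br><p>Visit our sponsor</p>Line two.<div class="ad">banner</div>';
$clean = preg_replace($pattern, '', $content);
echo $clean, PHP_EOL; // Line one.<br>Line two.

// An ad block that contains a line break is NOT matched by the pattern as written.
$multiline = "Line one.<p>ad\ntext</p>Line two.";
$unchanged = preg_replace($pattern, '', $multiline);

// Appending the "s" modifier lets "." match newlines as well.
$cleanMultiline = preg_replace($pattern . 's', '', $multiline);
echo $cleanMultiline, PHP_EOL; // Line one.Line two.
```

Whether the s modifier is needed depends on how the target site formats its ad markup, so treat it as an option to test rather than a required change.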
Of course, code written this way is not pleasant to read, so you can wrap it in functions and a class yourself. The following is an example I wrote. It certainly has shortcomings, but it can serve as a starting point to extend.
<?php
include "simple_html_dom.php";
include "mySpClass.php";
header("Content-Type:text/html;charset=utf-8");

// Cast to int so that non-numeric input cannot break the arithmetic; default to chapter 1
$get_html = get_html((int) ($_GET['n'] ?? 1));
$artic = getContent($get_html);
echo $artic['title'].'<br>';
echo $artic['content'];

/**
 * Fetch the page HTML of one chapter from www.7kzw.com
 * @param int $num chapter number, starting from 1
 * @return string the page HTML
 */
function get_html($num){
    $start = 27248636;
    $real_num = $num + $start - 1;
    $url = 'https://www.7kzw.com/85/85445/'.$real_num.'.html';
    $header = [
        'User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
    ];
    return mySpClass()->getCurl($url, $header);
}

/**
 * Extract the chapter title and text from a www.7kzw.com page
 * @param string $get_html page HTML of one chapter
 * @return array $artic, ['title'=>'','content'=>'']
 */
function getContent($get_html){
    $artic = ['title' => '', 'content' => ''];
    $content = '';
    $html = new simple_html_dom();
    @$html->load($get_html);

    // Find the chapter title
    $h1 = $html->find('.bookname h1');
    foreach ($h1 as $k => $v) {
        $artic['title'] = $v->innertext;
    }

    // Find the chapter text
    $divs = $html->find('#content');
    foreach ($divs as $k => $v) {
        $content = $v->innertext;
    }

    // Strip the unwanted parts with a regular expression
    $pattern = "/(<p>.*?<\/p>)|(<div .*?>.*?<\/div>)/";
    $artic['content'] = preg_replace($pattern, '', $content);
    return $artic;
}
?>
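<?php
class mySpClass{
    // Singleton instance
    private static $ins = null;

    /**
     * Return the singleton instance
     */
    public static function exec()
    {
        if (self::$ins) {
            return self::$ins;
        }
        return self::$ins = new self();
    }

    /**
     * Forbid cloning the object
     */
    public function __clone()
    {
        // The original threw an undefined curlException class; the built-in Exception is used instead
        throw new Exception('Error: this object cannot be cloned');
    }

    // Send a plain GET request to the server
    public static function getCurl($url, $header){
        // 1. Initialize
        $ch = curl_init($url); // the address to request

        // 2. Set options
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response as a string instead of printing it (required)
        curl_setopt($ch, CURLOPT_TIMEOUT, 10); // timeout in seconds (required)
        curl_setopt($ch, CURLOPT_HEADER, 0); // 1 includes the response headers in the output, 0 omits them
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // do not verify the SSL certificate
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false); // do not verify the host name in the certificate
        if (!empty($header)) {
            curl_setopt($ch, CURLOPT_HTTPHEADER, $header); // set the request headers
        }

        // 3. Execute
        $res = curl_exec($ch);

        // 4. Close
        curl_close($ch);
        return $res;
    }
}

// Define the mySpClass() helper function if it does not exist yet
if (!function_exists('mySpClass')) {
    function mySpClass()
    {
        return mySpClass::exec();
    }
}
?>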
Final result of the example code above: the chapter number is passed in through $_GET['n'], for example by appending ?n=1 to the page URL.
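The mapping from chapter number to page URL can be tried in isolation with the sketch below. The function names chapter_url and parse_chapter_param are hypothetical; the starting ID 27248636 and the URL prefix come from the code above. Like the original, it assumes the site numbers its chapter pages consecutively, which may not hold for every book.

```php
<?php
/**
 * Build the chapter URL from a 1-based chapter number (hypothetical helper).
 * Assumes consecutive page IDs, as the article's get_html() does.
 */
function chapter_url(int $chapter): string
{
    if ($chapter < 1) {
        throw new InvalidArgumentException('Chapter number must be 1 or greater.');
    }
    $firstChapterId = 27248636; // ID of chapter 1, taken from the article
    return 'https://www.7kzw.com/85/85445/' . ($firstChapterId + $chapter - 1) . '.html';
}

/**
 * Validate a raw query-string value such as $_GET['n'] (hypothetical helper).
 * Returns the chapter number, or null when the input is not a positive integer.
 */
function parse_chapter_param($raw): ?int
{
    $n = filter_var($raw, FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);
    return $n === false ? null : $n;
}

echo chapter_url(1), PHP_EOL; // https://www.7kzw.com/85/85445/27248636.html
echo chapter_url(3), PHP_EOL; // https://www.7kzw.com/85/85445/27248638.html
var_dump(parse_chapter_param('abc')); // NULL
```

Validating the parameter before building the URL keeps malformed input from reaching the cURL call.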
Summary:
Knowledge points: cURL (tip: with the cURL extension, PHP can fetch any web page), regular expressions, and the simple_html_dom parsing tool.
Although the code has been tidied up a little, it is best to deploy it on your own server for the best results. Otherwise you can only read on a computer, which is not very convenient, and you may well prefer to put up with the advertisements.
That covers using PHP cURL to fetch pages and simple_html_dom to parse them. For more information, see the other related articles on the PHP Chinese website.