
Example of how to use PHP to crawl Baidu Reading

黄舟
2017-02-23 09:27:21

Preface

This article shows how to use PHP to crawl the text of a book on Baidu Reading. Without further ado, let's take a look.

The crawling method is as follows

First, open the reading page in a browser and view the page source. The content of the novel is not written directly into the page, which means it is loaded asynchronously.

So I switched Chrome's developer tools to the Network tab and refreshed the reading page, focusing mainly on the XHR and Script categories.

After some digging, I found a jsonp request under the Script category that looked like it carried the content of the novel. The requested address was
http://www.php.cn/
The response was a jsonp string. I then found that if you remove callback=wenku7 from the address, a plain json string is returned instead, which is much easier to parse and can be converted directly to an array in PHP.
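
If only the jsonp form of a response is available, the wrapper can also be stripped in PHP before decoding. The following is a minimal sketch and not part of the original code: the function name jsonpToArray and the sample payload are made up for illustration.

```php
<?php
/**
 * Strip a JSONP wrapper such as wenku7({...}) and decode the JSON inside.
 * jsonpToArray is a hypothetical helper, not part of the BaiduYuedu class.
 */
function jsonpToArray($jsonp)
{
    $jsonp = trim($jsonp);
    // Match "callbackName( ... )" with an optional trailing semicolon.
    if (preg_match('/^[\w$.]+\s*\((.*)\)\s*;?$/s', $jsonp, $m)) {
        $jsonp = $m[1];
    }
    // A plain JSON string does not match the pattern and is decoded as-is.
    return json_decode($jsonp, true);
}

// Made-up sample payload, for illustration only.
$node = jsonpToArray('wenku7({"t":"p","c":"Hello"})');
echo $node['t'], ': ', $node['c'], "\n";
```

Dropping the callback parameter from the request, as described above, remains the simpler route whenever the server supports it.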

Next, let's analyze the structure of the returned data. The returned json string decodes to a tree-like structure. Each node has a t attribute and a c attribute. The t attribute gives the tag of the node, such as h2 or p. The c attribute is the content, which takes one of two forms: a string, or an array in which each element is itself a node.
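
To make that shape concrete, here is a small hand-written tree as it might look after json_decode($json, true), together with a helper that prints the nesting. The sample data (including the div root) and the function name dumpTree are invented for illustration. Real responses are larger and may use other tags and attributes.

```php
<?php
// Hand-written sample in the t/c shape described above (not real Baidu data).
$tree = array(
    't' => 'div',
    'c' => array(
        array('t' => 'h2', 'c' => 'Chapter 1'),
        array('t' => 'p', 'c' => array(
            array('t' => 'span', 'c' => 'First '),
            array('t' => 'span', 'c' => 'sentence.'),
        )),
    ),
);

/**
 * Return one line per node, indented by depth, to show the nesting.
 * dumpTree is a hypothetical helper used only for this illustration.
 */
function dumpTree(array $node, $depth = 0)
{
    $c = isset($node['c']) ? $node['c'] : null;
    $line = str_repeat('  ', $depth) . $node['t'];
    if (is_string($c)) {
        // Leaf: c holds the text itself.
        return $line . ': ' . $c . "\n";
    }
    $out = $line . "\n";
    if (is_array($c)) {
        // Branch: every element of c is itself a node.
        foreach ($c as $child) {
            $out .= dumpTree($child, $depth + 1);
        }
    }
    return $out;
}

echo dumpTree($tree);
```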

This kind of structure is easy to parse: a single recursive function handles it.

The final code is as follows:

<?php
class BaiduYuedu {
 protected $bookId;
 protected $bookToken;
 protected $cookie;
 protected $result;
 public function __construct($bookId, $bookToken, $cookie){
  $this->bookId = $bookId;
  $this->bookToken = $bookToken;
  $this->cookie = $cookie;
 }
 // Recursively convert a node and all of its children to plain text.
 public static function parseNode($node){
  $str = '';
  // Guard against nodes that carry no c attribute.
  $c = isset($node['c']) ? $node['c'] : null;
  if(is_string($c)){
   $str .= $c;
  }else if(is_array($c)){
   foreach($c as $d){
    $str .= self::parseNode($d);
   }
  }
  switch($node['t']){
   case 'h2':
    $str .= "\n\n";
    break;
   case 'br':
   case 'p':
    $str .= "\n";
    break;
   case 'img':
   case 'span':
    break;
   case 'obj':
    // Wrap the text of the first data node in parentheses, newlines removed.
    $tmp = '(' . self::parseNode($node['data'][0]) . ')';
    $str .= str_replace("\n", '', $tmp);
    break;
   default:
    trigger_error('Unknown type:'.$node['t'], E_USER_WARNING);
    break;
  }
  return $str;
 }
 // Fetch one page, then recurse to the next page until an empty response comes back.
 public function get($page = 1){
  echo "getting page {$page}...\n";
  $ch = curl_init();
  // Fixed: the property is $bookToken ($this->token was never defined).
  $url = sprintf('http://wenku.baidu.com/content/%s/?m=%s&type=json&cn=%d', $this->bookId, $this->bookToken, $page);
  curl_setopt_array($ch, array(
   CURLOPT_URL => $url,
   CURLOPT_RETURNTRANSFER => 1,
   CURLOPT_HEADER => 0,
   CURLOPT_HTTPHEADER => array('Cookie: '. $this->cookie)
  ));
  $ret = json_decode(curl_exec($ch), true);
  curl_close($ch);
  $str = '';
  if(!empty($ret)){
   $str .= self::parseNode($ret);
   $str .= $this->get($page + 1);
  }
  return $str;
 }
 public function start(){
  $this->result = $this->get();
 }
 public function getResult(){
  return $this->result;
 }
 public function saveTo($path){
  if(empty($this->result)){
   trigger_error('Result is empty', E_USER_ERROR);
   return;
  }
  file_put_contents($path, $this->result);
  echo "save to {$path}\n";
 }
}
// Usage example
$yuedu = new BaiduYuedu('49422a3769eae009581becba', '8ed1dedb240b11bf0731336eff95093f', 'your Baidu domain cookie');
$yuedu->start();
$yuedu->saveTo('result.txt');



The first two parameters of this class can be obtained from the novel's introduction page. The first parameter, bookId, is the string that follows ebook in the URL. The second parameter, bookToken, is found by searching the page source for bdjsonUrl and taking the string after the m parameter.
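
As a rough sketch of that lookup, the two values could be pulled out with regular expressions. The URL shape (a path ending in /ebook/ followed by the ID) and the way bdjsonUrl appears in the page source are assumptions based on the description above, and the sample strings are fabricated placeholders, so the patterns may need adjusting against a real page. Both function names are hypothetical.

```php
<?php
/**
 * Take the path segment that follows "ebook/" in the introduction-page URL.
 * Hypothetical helper; returns null when the URL has no such segment.
 */
function extractBookId($url)
{
    return preg_match('#/ebook/([0-9a-zA-Z]+)#', $url, $m) ? $m[1] : null;
}

/**
 * Find bdjsonUrl in the page source and return the value of its m parameter.
 * Hypothetical helper; assumes m=... appears within 300 characters of the name.
 */
function extractBookToken($html)
{
    return preg_match('/bdjsonUrl.{0,300}?[?&]m=([0-9a-zA-Z]+)/s', $html, $m) ? $m[1] : null;
}

// Placeholder inputs for illustration, not copied from a real page.
$url  = 'http://yuedu.baidu.com/ebook/49422a3769eae009581becba';
$html = '<script>var bdjsonUrl = "http://wenku.baidu.com/content/49422a3769eae009581becba/?m=8ed1dedb240b11bf0731336eff95093f&type=json";</script>';

echo extractBookId($url), "\n";
echo extractBookToken($html), "\n";
```

Both helpers return null when nothing matches, so a caller can stop early instead of sending a request with an empty ID or token.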

Note: If no Baidu cookie is passed in, or the cookie is invalid, only the free preview can be crawled. To crawl the complete content, make sure the cookie is valid.

Summary

The above is an example of how to use PHP to crawl Baidu Reading. For more related content, see the PHP Chinese website (www.php.cn).
