This crawl collected data on 1.1 million users. The data analysis results are as follows:

Preparation before development

  • Install a Linux system (Ubuntu 14.04); I ran Ubuntu in a VMware virtual machine;
  • Install PHP 5.6 or above;
  • Install MySQL 5.5 or above;
  • Install the curl and pcntl extensions.

Use PHP’s curl extension to capture page data

PHP’s curl extension is a library supported by PHP that lets you connect to and communicate with many kinds of servers over many types of protocols.

This program crawls Zhihu user data. A user's profile page can only be viewed when you are logged in. When you click a user's avatar in the browser and land on that user's profile page, you can see the information because the browser sends your local cookies along with the request for the new page. So before requesting a profile page, you need to obtain a logged-in user's cookies and send them with every curl request. I used my own cookies. You can see your cookie information on the page:

Copy them one by one to form a cookie string in the form of "__utma=?;__utmb=?;". This cookie string can then be used to send requests.
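As a minimal sketch of that step, the cookie string can be assembled from name/value pairs instead of by hand. The helper name buildCookieString and the sample values are assumptions for illustration; real values come from your own logged-in browser session.

```php
<?php
// Hypothetical helper: join cookie name/value pairs into one string
// in the "__utma=?;__utmb=?;" form described above.
function buildCookieString(array $cookies)
{
    if (empty($cookies)) {
        return '';
    }
    $pairs = array();
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode(';', $pairs) . ';';
}

// Placeholder values, not real cookies.
$user_cookie = buildCookieString(array('__utma' => 'aaa', '__utmb' => 'bbb'));
echo $user_cookie . "\n";
```

The resulting string is what gets passed to CURLOPT_COOKIE.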

Initial example:

$url = 'http://www.zhihu.com/people/mora-hu/about'; // mora-hu is the user ID here
$ch = curl_init($url); // initialize the session
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_COOKIE, $this->config_arr['user_cookie']); // set the request cookie
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response from curl_exec() as a string instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
return $result; // the fetched page

Running the code above fetches the profile page of the user mora-hu. Processing that page with regular expressions then yields the name, gender, and other fields you want to collect.
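A rough sketch of the regular-expression step follows. The markup and class names in the patterns are assumptions made up for illustration, not Zhihu's actual HTML, and extractProfile is a hypothetical helper.

```php
<?php
// Hypothetical markup: the class names below are illustrative assumptions.
function extractProfile($html)
{
    $profile = array('name' => '', 'gender' => '');
    // Capture the text inside <span class="name">...</span>.
    if (preg_match('#<span class="name">([^<]+)</span>#u', $html, $m)) {
        $profile['name'] = trim($m[1]);
    }
    // Read the gender from an icon class such as icon-profile-female.
    if (preg_match('#class="icon icon-profile-(male|female)"#', $html, $m)) {
        $profile['gender'] = $m[1];
    }
    return $profile;
}

$sample = '<div><span class="name">mora-hu</span><i class="icon icon-profile-female"></i></div>';
print_r(extractProfile($sample));
```

Fields that do not match stay as empty strings, so a changed page layout shows up as blank data rather than an error.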

1. Image hotlink protection

When I printed the extracted profile information on a page, the user's avatar would not load. After some research I found that Zhihu protects its images against hotlinking. The solution is to forge a Referer in the request header when requesting an image.

After extracting the image link with a regular expression, send a second request for the image. This time include a Referer header so that the request appears to come from the Zhihu website. A concrete example:

function getImg($url, $u_id)
{
  if (file_exists('./images/' . $u_id . ".jpg")) {
    return "images/$u_id" . '.jpg'; // already downloaded
  }
  if (empty($url)) {
    return '';
  }
  $context_options = array(
    'http' => array(
      'header' => "Referer: http://www.zhihu.com" // send the Referer header
    )
  );
  $context = stream_context_create($context_options);
  $img = file_get_contents('http:' . $url, FALSE, $context);
  if ($img === false) {
    return ''; // download failed, do not write an empty file
  }
  file_put_contents('./images/' . $u_id . ".jpg", $img);
  return "images/$u_id" . '.jpg';
}

2. Crawl more users

After fetching one user's profile, you need to visit that user's following and follower lists to find more users, and then continue layer by layer. The profile page contains two links, as follows:

One link leads to the users this person follows and the other to their followers. Take the "following" link as an example. Match the link with a regular expression, then send another curl request to that URL with the cookie. Crawling the user's following list gives a page like this:

Analyze the HTML structure of the page. Since only the user information is needed, it is enough to extract the div that wraps each entry; the username is inside it. The URL of a user's following page looks like this:

This URL is almost identical for different users; only the username differs. Use a regular expression to get the list of usernames, build the URLs one by one, and send the requests one at a time (this is slow; a faster approach is discussed later). On each new user's page, repeat the steps above, and keep looping until you have as much data as you want.
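The loop described above could be sketched roughly as follows. The link pattern /people/<username> inside href attributes is an assumption about the page markup, and the helper names extractUsernames, buildProfileUrl and enqueueNewUsers are hypothetical. The $seen array is an addition that keeps the crawl from revisiting the same user.

```php
<?php
// Assumption: profile links appear as href="/people/<username>".
function extractUsernames($html)
{
    preg_match_all('#href="/people/([A-Za-z0-9_-]+)"#', $html, $matches);
    return array_values(array_unique($matches[1]));
}

// Build the profile URL in the same form used elsewhere in this article.
function buildProfileUrl($username)
{
    return 'http://www.zhihu.com/people/' . $username . '/about';
}

// $queue holds users still to fetch; $seen prevents fetching a user twice.
function enqueueNewUsers(array $usernames, array &$queue, array &$seen)
{
    foreach ($usernames as $name) {
        if (!isset($seen[$name])) {
            $seen[$name] = true;
            $queue[] = $name;
        }
    }
}

$html = '<a href="/people/alice">A</a><a href="/people/bob">B</a><a href="/people/alice">A</a>';
$queue = array();
$seen = array('bob' => true); // pretend bob was crawled earlier
enqueueNewUsers(extractUsernames($html), $queue, $seen);
foreach ($queue as $name) {
    echo buildProfileUrl($name) . "\n";
}
```

In a real run, each URL taken from $queue would be fetched with curl and its list page fed back into extractUsernames.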

3. Counting files on Linux

After the script has been running for a while, you will want to check how many images have been downloaded. With a large amount of data, opening the folder to count the images is slow. The script runs on Linux, so a Linux command can count the files:

ls -l | grep "^-" | wc -l

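Here, ls -l prints a long listing of the entries in the directory (these can be directories, links, device files, and so on); grep "^-" filters the long listing so that only regular files remain (use "^d" to keep only directories); wc -l counts the lines of output. Below is a sample run: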





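3) Add a unique index and insert with INSERT IGNORE INTO...

4) Add a unique index and insert with REPLACE INTO...

The first option is the simplest but also the least efficient, so it was not used. Options 2 and 4 produce the same result. The difference is that when a duplicate row is encountered, INSERT INTO … ON DUPLICATE KEY UPDATE updates the row in place, while REPLACE INTO first deletes the old row and then inserts the new one, which also forces the indexes to be maintained again and is therefore slower. Between options 2 and 4, option 2 wins. As for option 3, INSERT IGNORE ignores errors raised while executing the INSERT statement: it does not ignore syntax problems, but it does ignore the case where the key already exists. On that basis INSERT IGNORE would be the better choice. In the end, because the number of duplicate rows needs to be recorded in the database, the program uses option 2.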
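To make option 2 concrete, here is a small sketch that builds such a statement with named placeholders. The table name user, the unique index on username, and the dup_count column are hypothetical; the original schema is not shown, so treat them as placeholders for your own.

```php
<?php
// Hypothetical schema: table `user` with a unique index on `username`
// and a `dup_count` column that counts how often a row was seen again.
function buildUpsertSql(array $columns)
{
    $names = array();
    $marks = array();
    foreach ($columns as $col) {
        $names[] = '`' . $col . '`';
        $marks[] = ':' . $col; // named placeholder for a prepared statement
    }
    return 'INSERT INTO `user` (' . implode(', ', $names) . ') VALUES ('
        . implode(', ', $marks)
        . ') ON DUPLICATE KEY UPDATE `dup_count` = `dup_count` + 1';
}

echo buildUpsertSql(array('username', 'gender')) . "\n";
```

The string would typically be passed to PDO::prepare and executed with an array of matching named parameters. When the unique key already exists, MySQL runs the UPDATE part and increments the counter instead of inserting a second row.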



  $mh = curl_multi_init(); // create a new cURL multi handle
  for ($i = 0; $i < $max_size; $i++) {
    $ch = curl_init(); // initialize a single cURL session
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_URL, 'http://www.zhihu.com/people/' . $user_list[$i] . '/about');
    curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $requestMap[$i] = $ch;
    curl_multi_add_handle($mh, $ch); // add the single handle to the multi handle
  }

  $user_arr = array();
  do {
    // run the sub-connections of the multi handle
    while (($cme = curl_multi_exec($mh, $active)) == CURLM_CALL_MULTI_PERFORM);

    if ($cme != CURLM_OK) { break; }
    while ($done = curl_multi_info_read($mh)) {
      $info = curl_getinfo($done['handle']);
      $tmp_result = curl_multi_getcontent($done['handle']);
      $error = curl_error($done['handle']);

      $user_arr[] = array_values(getUserInfo($tmp_result));

      // keep the pool full: start the next pending user, if there is one
      if ($i < count($user_list) && isset($user_list[$i])) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_URL, 'http://www.zhihu.com/people/' . $user_list[$i] . '/about');
        curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        $requestMap[$i] = $ch;
        curl_multi_add_handle($mh, $ch);
        $i++; // move on to the next pending user
        $active = 1; // the new handle is not yet counted by curl_multi_exec
      }

      curl_multi_remove_handle($mh, $done['handle']);
    }

    if ($active) {
      curl_multi_select($mh, 10);
    }
  } while ($active);

  return $user_arr;

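6. HTTP 429 Too Many Requests

The curl_multi functions can send many requests at once, but when I sent 200 requests at the same time, many of them never returned; requests were being dropped. To dig further, I used curl_getinfo to print the information for each request handle. This function returns an associative array of HTTP response information, and one of its fields, http_code, is the HTTP status code of the response. Many requests had an http_code of 429, which means too many requests were sent. I guessed that Zhihu had anti-crawler protection, so I tested against other websites and found that sending 200 requests at once worked fine there. That confirmed the guess: Zhihu limits the number of simultaneous requests. I kept lowering the number of requests and found that at 5 no requests were dropped. So this program can send at most 5 requests at a time. That is not many, but it is still a small improvement.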
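A small sketch of how that limit might be applied: split the pending users into batches of 5 and pick out the requests that came back with 429 so they can be retried later. The helper names makeBatches and rateLimited are hypothetical, and the status codes in the example are made-up sample data.

```php
<?php
// Split the pending users into batches no larger than the limit that worked (5).
function makeBatches(array $user_list, $batch_size = 5)
{
    return array_chunk($user_list, $batch_size);
}

// Given an array of request index => http_code (as read from curl_getinfo),
// return the indexes that were rate limited and should be requeued.
function rateLimited(array $http_codes)
{
    return array_keys($http_codes, 429);
}

$batches = makeBatches(range(1, 12));
echo count($batches) . " batches\n";

// Made-up status codes for one batch.
$codes = array(0 => 200, 1 => 429, 2 => 200, 3 => 429);
print_r(rateLimited($codes));
```

Each batch would be handed to the curl_multi code, and any index returned by rateLimited would go back into the queue for a later batch.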




  $redis = new Redis();
  $redis->connect('', '6379');
  $redis->set('tmp', 'value');
  if ($redis->exists('tmp'))
    echo $redis->get('tmp') . "\n";



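for ($i = 0; $i < 10; $i++) {
  $pid = pcntl_fork();
  if ($pid == -1) {
    echo "Could not fork!\n";
    exit(1);
  }
  if (!$pid) {
    // child process: do the work, then exit so it does not continue the loop
    echo "child process $i running\n";
    exit($i);
  }
}

// parent process: wait until every child has exited
while (pcntl_waitpid(0, $status) != -1) {
  $status = pcntl_wexitstatus($status);
  echo "Child $status completed\n";
}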



cat /proc/cpuinfo


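Here, model name is the CPU model and cpu cores is the number of CPU cores. The core count here is 1 because the script runs in a virtual machine that was allocated few CPU cores, so only 2 processes could be started. The end result: data on 1.1 million users was crawled over one weekend.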
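If the process count should follow the machine rather than be hard-coded, the same file can be read from PHP. This is only a sketch: countProcessors is a hypothetical helper, and it counts logical processors (the "processor" lines), which is not necessarily the same number as the cpu cores field.

```php
<?php
// Count logical processors by counting "processor : N" lines in cpuinfo text.
function countProcessors($cpuinfo)
{
    return preg_match_all('/^processor\s*:/m', $cpuinfo);
}

// /proc/cpuinfo exists on Linux; fall back to an empty string elsewhere.
$path = '/proc/cpuinfo';
$text = is_readable($path) ? file_get_contents($path) : '';
echo countProcessors($text) . "\n";
```

The result could serve as a starting point for the number of forked workers, adjusted by hand as was done here.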


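With multiple processes, after the program had run for a while, data could no longer be inserted into the database and MySQL reported a "too many connections" error. The same happened with Redis.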


   for ($i = 0; $i < 10; $i++) {
     $pid = pcntl_fork();
     if ($pid == -1) {
        echo "Could not fork!\n";
        exit(1);
     }
     if (!$pid) {
        $redis = PRedis::getInstance();
        // do something
        exit;
     }
   }


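Solution: the program cannot fully guarantee that the parent process has not created a Redis connection instance before forking. The fix therefore has to come from the child process itself. If the instance obtained in a child process belongs only to that process, the problem disappears. So the solution is to slightly modify the static instantiation of the Redis class so that each instance is bound to the current process ID.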


   public static function getInstance() {
     static $instances = array();
     $key = getmypid(); // get the current process ID
     if (empty($instances[$key])) {
        $instances[$key] = new self();
     }

     return $instances[$key];
   }



function microtime_float()
{
   list($u_sec, $sec) = explode(' ', microtime());
   return (floatval($u_sec) + floatval($sec));
}

$start_time = microtime_float();

//do something

$end_time = microtime_float();
$total_time = $end_time - $start_time;

$time_cost = sprintf("%.10f", $total_time);

echo "program cost total " . $time_cost . "s\n";



