Use PHP to implement a program to capture Zhihu questions and answers

王林 (Original)
2023-06-13 23:21:21

Zhihu is a popular knowledge-sharing community whose users have contributed a large number of high-quality questions and answers. For study and work, this content helps solve problems and broaden horizons. To organize and reuse it, you need a scraper to collect the relevant data. This article shows how to write a PHP program that crawls Zhihu questions and answers.

  1. Introduction
    Zhihu hosts a wide range of content, including questions, answers, columns, topics, and users. Crawling this data makes it possible to explore its value further. This article focuses on using PHP to crawl questions and answers.
  2. Question crawling
    First, we need to define what to crawl. For a question on Zhihu, we want the following information:

Question title
Question description
The number of followers, views, and answers to the question
Question tags
Related questions

Every question on Zhihu has a unique URL, so we can retrieve a question's information by constructing that URL and sending an HTTP request.
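
The URL construction and request step can be sketched as below. This is a minimal sketch, not a definitive fetcher: the helper names buildQuestionUrl and fetchHtml are hypothetical, and the User-Agent string and the 10-second timeout are illustrative assumptions rather than values Zhihu requires.

```php
<?php
// Hypothetical helper: builds the page URL for a numeric question ID.
function buildQuestionUrl(int $questionId): string
{
    return 'https://www.zhihu.com/question/' . $questionId;
}

// Hypothetical helper: fetches a page and returns null on failure.
// The User-Agent value and the timeout are illustrative assumptions.
function fetchHtml(string $url): ?string
{
    // The "http" context options also apply to https:// URLs.
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "User-Agent: Mozilla/5.0 (compatible; ExampleFetcher/1.0)\r\n",
            'timeout' => 10,
        ],
    ]);

    // "@" suppresses the warning so failure is reported via the return value.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}

echo buildQuestionUrl(36189228), PHP_EOL; // https://www.zhihu.com/question/36189228
```

Returning null instead of false lets the caller check the result with a strict comparison before any parsing begins.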

The following PHP code fetches a question page and extracts these fields:

<?php
$url = 'https://www.zhihu.com/question/36189228';
$html = file_get_contents($url);
if ($html === false) {
    exit("Failed to fetch {$url}\n");
}

// Return the first capture group, or null when the pattern does not match.
function first_match($pattern, $html)
{
    return preg_match($pattern, $html, $match) ? $match[1] : null;
}

// "#" is the regex delimiter, so the "/" in closing tags needs no escaping.
// The "s" modifier lets "." match line breaks inside a multi-line block.
$data = array();
$data['title'] = first_match('#<title>(.*?)</title>#', $html);
$data['description'] = first_match('#<div class="QuestionHeader-detail">(.*?)</div>#s', $html);
$data['followers'] = first_match('#<div class="NumberBoard-value">(.*?)</div><span class="NumberBoard-label">关注者</span>#', $html);
$data['views'] = first_match('#<div class="NumberBoard-value">(.*?)</div><span class="NumberBoard-label">浏览</span>#', $html);
$data['answers'] = first_match('#<div class="NumberBoard-value">(.*?)</div><div class="NumberBoard-label">回答</div>#', $html);

// Topic names become a comma-separated tag list.
preg_match_all('#<a href="/topic/(.*?)">(.*?)</a>#', $html, $matches);
$data['tags'] = implode(',', $matches[2]);

// Map each related question's URL to its title.
preg_match_all('#<a class="RelatedQuestionItem-title" href="(.*?)" target="_blank">(.*?)</a>#', $html, $matches);
$data['related_questions'] = array_combine($matches[1], $matches[2]);

echo json_encode($data, JSON_UNESCAPED_UNICODE);

The code uses PHP regular expressions to match the required information in the HTML text. Because the patterns are tied to the page's current HTML structure, they have to be adjusted whenever that markup changes. As long as the structure matches, this short script retrieves all of the question fields listed above.
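
Since regular expressions are sensitive to small markup changes, one alternative is to parse the page with PHP's DOM extension and query it with XPath. The sketch below swaps in that technique; it is not the method used in the listing above. The function name extractQuestionFields is hypothetical, and the sample markup is a simplified stand-in rather than a copy of a real Zhihu page.

```php
<?php
// Simplified sample markup (an assumption, not real Zhihu output).
$html = <<<'HTML'
<html><head><title>Example question - 知乎</title></head>
<body>
<div class="NumberBoard-value">1,234</div>
<div class="NumberBoard-value">56,789</div>
</body></html>
HTML;

// Hypothetical helper: extracts the title and the NumberBoard values.
function extractQuestionFields(string $html): array
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    // The XML encoding hint asks libxml to read the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // item(0) is null when the page has no <title> element.
    $titleNode = $xpath->query('//title')->item(0);
    $title = $titleNode !== null ? trim($titleNode->textContent) : null;

    // Collect every NumberBoard value as an integer, in document order.
    $values = [];
    foreach ($xpath->query('//div[contains(@class, "NumberBoard-value")]') as $node) {
        $values[] = (int) str_replace(',', '', trim($node->textContent));
    }

    return ['title' => $title, 'number_board' => $values];
}

echo json_encode(extractQuestionFields($html), JSON_UNESCAPED_UNICODE), PHP_EOL;
```

Matching on a class name with contains() tolerates extra classes on the element, which a literal regular expression over the full tag does not.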

  3. Answer crawling
    For an answer on Zhihu, we want the following information:

The author of the answer
The content of the answer
The number of likes and comments on the answer

As with questions, we can obtain an answer's information by constructing its URL and sending an HTTP request.

The following PHP code fetches an answer page and extracts these fields:

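<?php
$url = 'https://www.zhihu.com/question/36189228/answer/243147352';
$html = file_get_contents($url);
if ($html === false) {
    exit("Failed to fetch {$url}\n");
}

// Return the first capture group, or null when the pattern does not match.
function first_match($pattern, $html)
{
    return preg_match($pattern, $html, $match) ? $match[1] : null;
}

// "#" is the regex delimiter, so the "/" in closing tags needs no escaping.
// The "s" modifier lets "." match line breaks inside the answer body.
$data = array();
$data['author'] = first_match('#<meta itemprop="name" content="(.*?)">#', $html);
$data['content'] = first_match('#<div class="RichText ztext">(.*?)</div>#s', $html);

// Both counts are taken from the buttons' aria-label text.
$data['upvotes'] = first_match('#<button class="Button VoteButton VoteButton--up" aria-pressed="false" tabindex="0" aria-label="(.*?)">#', $html);
$data['comments'] = first_match('#<button class="Button CommentButton" tabindex="0" aria-label="(.*?)">#', $html);

echo json_encode($data, JSON_UNESCAPED_UNICODE);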

As before, PHP regular expressions match the required information in the HTML text. Note that the answer content is matched through the ztext class rather than the AnswerItem-content class, because Zhihu changed the relevant CSS class names in an update. The upvotes and comments fields hold the full aria-label text of each button, not a bare number.
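
The captured values usually need some cleanup before they can be analyzed. The sketch below shows one way to do it, under stated assumptions: the helper names extractCount and htmlToPlainText are hypothetical, and the label formats in the usage lines (such as "赞同 1,234") are guesses at what an aria-label might contain, not confirmed Zhihu output. Abbreviated counts such as "1.2 万" are not handled.

```php
<?php
// Hypothetical helper: returns the first run of digits in a label,
// or null when the label contains no digits.
function extractCount(string $label): ?int
{
    // Drop thousands separators so "1,234" is read as 1234.
    $normalized = str_replace(',', '', $label);

    if (preg_match('/\d+/', $normalized, $m) === 1) {
        return (int) $m[0];
    }

    return null;
}

// Hypothetical helper: reduces an HTML fragment to plain text.
function htmlToPlainText(string $fragment): string
{
    // Remove tags first, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($fragment), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse whitespace runs; "u" keeps multibyte text intact.
    // preg_replace returns null on invalid UTF-8, so fall back to $text.
    $collapsed = preg_replace('/\s+/u', ' ', $text);

    return trim($collapsed ?? $text);
}

var_dump(extractCount('赞同 1,234')); // int(1234)
var_dump(extractCount('添加评论'));   // NULL
echo htmlToPlainText('<p>Hello &amp; <b>world</b></p>'), PHP_EOL; // Hello & world
```

Returning null for a label without digits keeps "no count found" distinct from a genuine count of zero.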

  4. Summary
    This article showed how to write a PHP program that crawls Zhihu questions and answers. You can extract whichever fields you need and then analyze and reuse the content. For PHP developers, this is a practical skill that applies to data analysis, search engine optimization, and similar work.
