
Python crawler, beta version: crawling a single Zhihu page

高洛峰 | Original | 2016-12-02 16:51:45 | 1909 views

I previously wrote a crawler in Python to help our operations staff collect product brands and categories from JD.com, so I am using Python again this time. This is a simple single-page version; I will extend it later.

# -*- coding: UTF-8 -*-
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# ---------- Zhihu answer collector ----------

# Fetch a page and return the contents of its <body>
def get_content(url):
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
    }

    req = requests.get(url, headers=header)
    req.encoding = 'utf-8'
    bs = BeautifulSoup(req.text, "html.parser")  # build the BeautifulSoup object
    body = bs.body  # keep only the <body> part
    return body

# Get the question title
def get_title(html_text):
    data = html_text.find('span', {'class': 'zm-editable-content'})
    return data.get_text()

# Print the question description
def get_question_content(html_text):
    data = html_text.find('div', {'class': 'zm-editable-content'})
    # Joining .strings also covers the case where the tag holds a single string
    print('Content:\n' + ''.join(data.strings))

# Print the upvote count
def get_answer_agree(body):
    agree = body.find('span', {'class': 'count'})
    print('Upvotes: ' + agree.get_text() + '\n')

# Print every answer linked from the question page
def get_response(html_text):
    response = html_text.find_all('div', {'class': 'zh-summary summary clearfix'})
    for summary in response:
        # Link that expands to the full answer
        answerhref = summary.find('a', {'class': 'toggle-expand'})
        href = answerhref.get('href') if answerhref else None
        if not href or href.startswith('javascript'):
            continue
        url = urljoin('http://www.zhihu.com/', href)
        print(url)
        body = get_content(url)
        get_answer_agree(body)
        answer = body.find('div', {'class': 'zm-editable-content clearfix'})
        if answer.string is None:
            # Several text fragments: print each one on its own line
            print(''.join('\n' + s for s in answer.strings))
        else:
            print(answer.string)


html_text = get_content('https://www.zhihu.com/question/43879769')
title = get_title(html_text)
print("Title:\n" + title + '\n')
get_question_content(html_text)
print('\n')
get_response(html_text)

Output:

(Screenshot: output of the Python crawler beta crawling a single Zhihu page)
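
The listing relies on two parsing details that are easy to miss: it checks `.string` against `None` before falling back to `.strings`, and it skips expand links whose `href` starts with `javascript`. The sketch below tries both offline, without touching the network. The markup in `SAMPLE_HTML` and the helper names `extract_text` and `answer_urls` are invented for illustration; they are not real Zhihu HTML or part of the original script.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup, made up for this sketch; it is not real Zhihu HTML.
SAMPLE_HTML = """
<body>
  <div class="zm-editable-content">First line<br/>Second <b>bold</b> part</div>
  <a class="toggle-expand" href="/question/1/answer/2">more</a>
  <a class="toggle-expand" href="javascript:;">more</a>
</body>
"""


def extract_text(tag):
    """Return all text inside a tag, whether or not it has nested children."""
    if tag.string is not None:
        # Exactly one text node: .string returns it directly
        return tag.string
    # Several children: .string is None, so join every text fragment
    return ''.join(tag.strings)


def answer_urls(body, base='https://www.zhihu.com/'):
    """Collect absolute URLs from expand links, skipping javascript: links."""
    urls = []
    for link in body.find_all('a', {'class': 'toggle-expand'}):
        href = link.get('href', '')
        if href and not href.startswith('javascript'):
            # urljoin avoids the double slash that plain concatenation gives
            urls.append(urljoin(base, href))
    return urls


body = BeautifulSoup(SAMPLE_HTML, 'html.parser').body
content = body.find('div', {'class': 'zm-editable-content'})
print(content.string)         # None, because the div has several children
print(extract_text(content))
print(answer_urls(body))
```

Running the same helpers against a saved copy of a page is a convenient way to check selectors before sending real requests.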
