python爬取文章實例教程-Python教學-PHP中文網

首頁

後端開發

Python教學

python爬取文章實例教程

巴扎黑

Aug 07, 2017 pm 05:37 PM

python教學文章

這篇文章主要跟大家介紹了利用python爬取散文網文章的相關資料，文中介紹的非常詳細，對大家具有一定的參考學習價值，需要的朋友們下面來一起看看吧。

本文主要介紹給大家的是python爬取散文網文章的相關內容，分享出來供大家參考學習，下面一起來看看詳細的介紹：

效果圖如下：

#設定python 2.7

# #

 bs4

 requests

安裝用pip進行安裝

sudo pip install bs4

sudo pip install requests

簡單說明一下bs4的使用因為是爬取網頁所以就介紹find 跟find_all

find跟find_all的不同在於返回的東西不同find返回的是匹配到的第一個標籤及標籤裡的內容

find_all返回的是一個列表

例如我們寫一個test.html 用來測試find跟find_all的差別。

內容是：

<html>
<head>
</head>
<body>
<p id="one"><a></a></p>
<p id="two"><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >abc</a></p>
<p id="three"><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >three a</a><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >three a</a><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >three a</a></p>
<p id="four"><a href="#" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >four<p>four p</p><p>four p</p><p>four p</p> a</a></p>
</body>
</html>

然後test.py的程式碼為：

from bs4 import BeautifulSoup
import lxml

if __name__==&#39;__main__&#39;:
 s = BeautifulSoup(open(&#39;test.html&#39;),&#39;lxml&#39;)
 print s.prettify()
 print "------------------------------"
 print s.find(&#39;p&#39;)
 print s.find_all(&#39;p&#39;)
 print "------------------------------"
 print s.find(&#39;p&#39;,id=&#39;one&#39;)
 print s.find_all(&#39;p&#39;,id=&#39;one&#39;)
 print "------------------------------"
 print s.find(&#39;p&#39;,id="two")
 print s.find_all(&#39;p&#39;,id="two")
 print "------------------------------"
 print s.find(&#39;p&#39;,id="three")
 print s.find_all(&#39;p&#39;,id="three")
 print "------------------------------"
 print s.find(&#39;p&#39;,id="four")
 print s.find_all(&#39;p&#39;,id="four")
 print "------------------------------"

運行以後我們可以看到結果當取得指定標籤時候兩者差異不大當取得一組標籤的時候兩者的差異就會顯示出來

所以我們在使用時候要注意到底要的是什麼，否則會出現報錯

#接下來就是透過requests 取得網頁資訊了，我不太懂別人為什麼要寫heard跟其他的東西

我直接進行網頁訪問，通過get方式獲取散文網幾個分類的二級網頁然後通過一個組的測試，把所有的網頁爬取一遍

def get_html():
 url = "https://www.sanwen.net/"
 two_html = [&#39;sanwen&#39;,&#39;shige&#39;,&#39;zawen&#39;,&#39;suibi&#39;,&#39;rizhi&#39;,&#39;novel&#39;]
 for doc in two_html:
 i=1
  if doc==&#39;sanwen&#39;:
  print "running sanwen -----------------------------"
  if doc==&#39;shige&#39;:
  print "running shige ------------------------------"
  if doc==&#39;zawen&#39;:
  print &#39;running zawen -------------------------------&#39;
  if doc==&#39;suibi&#39;:
  print &#39;running suibi -------------------------------&#39;
  if doc==&#39;rizhi&#39;:
  print &#39;running ruzhi -------------------------------&#39;
  if doc==&#39;nove&#39;:
  print &#39;running xiaoxiaoshuo -------------------------&#39;
 while(i<10):
 par = {&#39;p&#39;:i}
 res = requests.get(url+doc+&#39;/&#39;,params=par)
 if res.status_code==200:
  soup(res.text)
  i+=i

這部分的程式碼中我沒有對

res.status_code不是200的進行處理，導致的問題是會不顯示錯誤，爬取的內容會有丟失。然後分析散文網的網頁，發現是www.sanwen.net/rizhi/&p=1

p最大值是10這個不太懂，上次爬盤多多是100頁，算了算了以後再分析。然後就透過get方法取得每頁的內容。

#取得每頁內容以後就是分析作者跟題目了代碼是這樣的

def soup(html_text):
 s = BeautifulSoup(html_text,&#39;lxml&#39;)
 link = s.find(&#39;p&#39;,class_=&#39;categorylist&#39;).find_all(&#39;li&#39;)
 for i in link:
 if i!=s.find(&#39;li&#39;,class_=&#39;page&#39;):
 title = i.find_all(&#39;a&#39;)[1]
 author = i.find_all(&#39;a&#39;)[2].text
 url = title.attrs[&#39;href&#39;]
 sign = re.compile(r&#39;(//)|/&#39;)
 match = sign.search(title.text)
 file_name = title.text
 if match:
 file_name = sign.sub(&#39;a&#39;,str(title.text))

取得標題的時候出現坑爹的事，請問大佬們寫散文你標題加斜杠幹嘛，不光加一個還有加兩個的，這個問題直接導致我後面寫入文件的時候文件名出現錯誤，於是寫正則表達式，我給你改行了吧。

最後就是取得散文內容了，透過每頁的分析，取得文章地址，然後直接取得內容，本來還想直接透過改網頁位址一個一個的取得呢，這樣也省事了。

def get_content(url):
 res = requests.get(&#39;https://www.sanwen.net&#39;+url)
 if res.status_code==200:
 soup = BeautifulSoup(res.text,&#39;lxml&#39;)
 contents = soup.find(&#39;p&#39;,class_=&#39;content&#39;).find_all(&#39;p&#39;)
 content = &#39;&#39;
 for i in contents:
 content+=i.text+&#39;\n&#39;
 return content

最後就是寫入檔案保存ok

 f = open(file_name+&#39;.txt&#39;,&#39;w&#39;)

 print &#39;running w txt&#39;+file_name+&#39;.txt&#39;
 f.write(title.text+&#39;\n&#39;)
 f.write(author+&#39;\n&#39;)
 content=get_content(url) 
 f.write(content)
 f.close()

三個函數取得散文網的散文，不過有問題，問題在於不知道為什麼有些散文丟失了我只能獲取到大概400多篇文章，這跟散文網的文章是差很多很多的，但是確實是一頁一頁的獲取來的，這個問題希望大佬幫忙看看。可能應該要做網頁無法存取的處理，當然我覺得跟我宿捨這個破網有關係

 f = open(file_name+&#39;.txt&#39;,&#39;w&#39;)
 print &#39;running w txt&#39;+file_name+&#39;.txt&#39;
 f.write(title.text+&#39;\n&#39;)
 f.write(author+&#39;\n&#39;)
 content=get_content(url) 
 f.write(content)
 f.close()

差點忘了效果圖

能會出現timeout現象吧，只能說上大學一定要選網好的啊！

以上是python爬取文章實例教程的詳細內容。更多資訊請關注PHP中文網其他相關文章！

陳述

本文內容由網友自願投稿，版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容，請聯絡admin@php.cn

Python和時間：充分利用您的學習時間Apr 14, 2025 am 12:02 AM

要在有限的時間內最大化學習Python的效率，可以使用Python的datetime、time和schedule模塊。 1.datetime模塊用於記錄和規劃學習時間。 2.time模塊幫助設置學習和休息時間。 3.schedule模塊自動化安排每週學習任務。

Python：遊戲，Guis等Apr 13, 2025 am 12:14 AM

Python在遊戲和GUI開發中表現出色。 1)遊戲開發使用Pygame，提供繪圖、音頻等功能，適合創建2D遊戲。 2)GUI開發可選擇Tkinter或PyQt，Tkinter簡單易用，PyQt功能豐富，適合專業開發。

Python vs.C：申請和用例Apr 12, 2025 am 12:01 AM

Python适合数据科学、Web开发和自动化任务，而C 适用于系统编程、游戏开发和嵌入式系统。Python以简洁和强大的生态系统著称，C 则以高性能和底层控制能力闻名。

2小時的Python計劃：一種現實的方法Apr 11, 2025 am 12:04 AM

2小時內可以學會Python的基本編程概念和技能。 1.學習變量和數據類型，2.掌握控制流（條件語句和循環），3.理解函數的定義和使用，4.通過簡單示例和代碼片段快速上手Python編程。

Python：探索其主要應用程序Apr 10, 2025 am 09:41 AM

Python在web開發、數據科學、機器學習、自動化和腳本編寫等領域有廣泛應用。 1)在web開發中，Django和Flask框架簡化了開發過程。 2)數據科學和機器學習領域，NumPy、Pandas、Scikit-learn和TensorFlow庫提供了強大支持。 3)自動化和腳本編寫方面，Python適用於自動化測試和系統管理等任務。