A Brief Look at Handling Encodings When Scraping Web Pages with Python

高洛峰

Feb 22, 2017 am 11:13 AM

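Background

Over the Mid-Autumn Festival, a friend emailed me to say that when he scraped Lianjia, the page source that came back was all garbled, and asked me to take a look (working overtime over Mid-Autumn, that's dedication = =!). I had actually run into this long ago. I looked at it briefly while scraping novels but never took it seriously. It all comes down to not properly understanding encodings.

The Problem

It is a very ordinary piece of scraper code: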

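# coding=utf-8
# Python 2 script: fetch the listing page and print the decoded body.
import requests
import sys

# Python 2 only: override the interpreter-wide default encoding.
reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'
res = requests.get(url)
print res.text  # the Chinese text in this output is garbled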

The goal is simple: scrape some content from Lianjia. But when this runs, everything in the result that involves Chinese comes back garbled, like this:

<script type="text/template" id="newAddHouseTpl">
 <p class="newAddHouse">
  è‡ªä»Žæ‚¨ä¸Šæ¬¡æµè§ˆï¼ˆ<%=time%>ï¼‰ä¹‹åŽï¼Œè¯¥æœç´¢æ¡ä»¶ä¸‹æ–°å¢žåŠ äº†<%=count%>å¥—æˆ¿æº
  <a href="<%=url%>" class="LOGNEWERSHOUFANGSHOW" <%=logText%>><%=linkText%></a>
  <span class="newHouseRightClose">x</span>
 </p>
</script>

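Data like this is pretty much useless.

Analysis

The problem is obvious: the text is being decoded with the wrong encoding, which produces the garbage.

Checking the page's encoding

Judging from the head of the target page, the page is encoded in utf-8.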

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

So in the end we have to treat the text as utf-8 as well. In other words, the final step must decode the bytes with utf-8: decode('utf-8')
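As a rough illustration of reading the charset from the page itself, here is a minimal sketch written for Python 3 (the listings in this article are Python 2). The helper name `sniff_meta_charset`, the regular expression, and the 2048-byte window are my own assumptions for illustration and are not part of requests or any other library.

```python
import re

# Hypothetical pattern: finds "charset=..." inside a <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(raw_html, default=None):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # Only the start of the document is scanned; the window size is arbitrary.
    match = META_CHARSET.search(raw_html[:2048])
    if match is None:
        return default
    return match.group(1).decode('ascii').lower()


# Stand-in for the raw bytes of a downloaded page.
raw = ('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
       '<p>房源</p>').encode('utf-8')

charset = sniff_meta_charset(raw)
text = raw.decode(charset)  # decode the bytes with the declared charset
print(charset, text)
```

A regular expression is a blunt tool for HTML, so treat this as a quick check rather than a parser.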

Encoding and decoding text

Encoding and decoding in Python work like this: source ===> encode(encoding) ===> decode(encoding). For the most part, I do not recommend using

import sys
reload(sys)
sys.setdefaultencoding('utf8')

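to brute-force the text encoding. When it does no harm, cutting this corner is not a big deal, but the better approach is to take the raw source and handle the text explicitly with encode and decode.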
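To make the encode and decode steps concrete, here is a small round trip, written for Python 3 (where `str` is text and `bytes` is raw data) rather than the Python 2 used in the listings. The sample string is just an illustration.

```python
text = '房源'                    # text: a sequence of Unicode characters
data = text.encode('utf-8')      # encode: text -> bytes
restored = data.decode('utf-8')  # decode: bytes -> text, using the same codec

print(type(data).__name__, len(data))  # each of these characters takes 3 bytes in UTF-8
print(restored == text)

# Decoding with a codec that cannot represent the bytes fails loudly.
try:
    data.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print('ascii decode failed:', ascii_failed)
```

The point is that the codec used to decode has to match the one used to encode. A mismatch either raises an error, as with ASCII here, or silently produces garbage.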

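Back to the problem

The biggest question now is the encoding of the source. When we use requests normally, it guesses the encoding of the source automatically and decodes it to Unicode. But it is a program after all, so it can guess wrong, and when it does we have to specify the encoding by hand. The official documentation puts it this way:

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

So we need to check which encoding requests actually reports: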

# coding=utf-8
# Python 2 script: show which encoding Requests chose for the response.
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'

res = requests.get(url)
print res.encoding  # the encoding Requests will use for res.text

This prints:

ISO-8859-1

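In other words, Requests is treating the source as ISO-8859-1, even though the page itself is utf-8. ISO-8859-1 is what Requests falls back to when the Content-Type response header has a text type but no charset parameter. A quick search for ISO-8859-1 turns up:

ISO 8859-1, commonly called Latin-1. Latin-1 includes the additional characters that are indispensable for writing all Western European languages.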
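The header-based guess can be approximated in a few lines. This is a simplified sketch in Python 3, not the actual code inside Requests. The function name `encoding_from_content_type` is hypothetical, and the ISO-8859-1 fallback for text types mirrors the behavior described above.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Guess an encoding from a Content-Type header value (simplified sketch)."""
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset()   # lowercased charset parameter, or None
    if charset:
        return charset
    if msg.get_content_maintype() == 'text':
        return 'ISO-8859-1'               # historical HTTP default for text/* types
    return None                           # nothing to go on


for header in ('text/html; charset=UTF-8', 'text/html', 'image/png'):
    print(header, '->', encoding_from_content_type(header))
```

With a bare `text/html` header the guess comes out as ISO-8859-1, which would explain the value printed above for a server that does not declare a charset.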

The Fix

Once this turned up, the problem was easy to solve: just specify the encoding, and the Chinese prints correctly. The code:

# coding=utf-8
# Python 2 script: tell Requests the real encoding before reading res.text.
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'

res = requests.get(url)
res.encoding = 'utf-8'  # override the ISO-8859-1 guess

print res.text

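The output speaks for itself: all the Chinese is displayed correctly.

Another way is to undo the wrong decoding ourselves: encode the text back to bytes with ISO-8859-1, then decode those bytes as utf-8. The code: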

# coding=utf-8
# Python 2 script: reverse the wrong ISO-8859-1 decoding, then decode as UTF-8.
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf8')

url = 'http://jb51.net/ershoufang/rs%E6%8B%9B%E5%95%86%E6%9E%9C%E5%B2%AD/'

res = requests.get(url)
# res.encoding = 'utf-8'  # not set here, so res.text is decoded as ISO-8859-1

print res.text.encode('ISO-8859-1').decode('utf-8')

Note: ISO-8859-1 is also known as latin1, and using latin1 as the codec name gives the same correct result.
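The round trip can be checked without touching the network. The sketch below is Python 3 and uses a short sample string as a stand-in for the page body: it reproduces the garbling by decoding UTF-8 bytes as ISO-8859-1, then reverses it using the latin1 alias.

```python
import codecs

original = '房源'
wire_bytes = original.encode('utf-8')        # what a utf-8 page sends over the wire

# Wrong guess: every byte becomes one Latin-1 character, and no error is raised.
garbled = wire_bytes.decode('ISO-8859-1')

# Repair: map the characters back to the same bytes, then decode correctly.
repaired = garbled.encode('latin1').decode('utf-8')

# Both names resolve to the same codec.
same_codec = codecs.lookup('latin1').name == codecs.lookup('ISO-8859-1').name

print(len(garbled), repaired == original, same_codec)
```

This works because ISO-8859-1 maps every possible byte value to exactly one character, so the wrong decoding loses no information and can be reversed exactly.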

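There is a lot more to say about character encodings. If you want to dig deeper, this classic article by Joel Spolsky is a good reference:

• "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"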
