Python網路爬蟲之怎麼取得網路數據-Python教學-PHP中文網

首頁

後端開發

Python教學

Python網路爬蟲之怎麼取得網路數據

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

May 12, 2023 am 11:04 AM

python

使用 Python 取得網路資料

使用 Python 語言從網路上取得資料是一項非常常見的任務。 Python 有一個名為 requests 的函式庫，它是一個 Python 的 HTTP 用戶端函式庫，用於向 Web 伺服器發起 HTTP 請求。

我們可以透過以下程式碼使用 requests 函式庫向指定的 URL 發起 HTTP 請求：

import requests
response = requests.get(&#39;<http://www.example.com>&#39;)

其中，response 物件將包含伺服器傳回的回應。使用 response.text 可以取得回應的文字內容。

此外，我們還可以使用以下程式碼來取得二進位資源：

import requests
response = requests.get(&#39;<http://www.example.com/image.png>&#39;)
with open(&#39;image.png&#39;, &#39;wb&#39;) as f:
    f.write(response.content)

使用 response.content 可以取得伺服器傳回的二進位資料。

編寫爬蟲程式碼

爬蟲是一種自動化程序，可以透過網路爬取網頁數據，並將其儲存在資料庫或檔案中。爬蟲在資料收集、資訊監控、內容分析等領域有廣泛的應用。 Python 語言是爬蟲編寫的常用語言，因為它具有簡單易學、程式碼量少、函式庫豐富等優點。

我們以「豆瓣電影」為例，介紹如何使用 Python 寫爬蟲程式碼。首先，我們使用 requests 庫來取得網頁的 HTML 程式碼，然後將整個程式碼看成一個長字串，使用正規表示式的捕獲組從字串中提取所需的內容。

豆瓣電影Top250 頁面的網址是https://movie.douban.com/top250?start=0，其中start 參數表示從第幾個電影開始獲取。每頁共展示了25 部電影，如果要獲取Top250 數據，我們共需要訪問10 個頁面，對應的地址是https://movie.douban.com/top250?start=xxx，這裡的xxx 如果為0 就是第一頁，如果xxx 的值是100，那麼我們可以存取到第五頁。

我們以取得影片的標題和評分為例，程式碼如下所示：

import re
import requests
import time
import random
for page in range(1, 11):
    resp = requests.get(
        url=f&#39;<https://movie.douban.com/top250?start=>{(page - 1) * 25}&#39;,
        headers={&#39;User-Agent&#39;: &#39;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36&#39;}
    )
    # 通过正则表达式获取class属性为title且标签体不以&开头的span标签并用捕获组提取标签内容
    pattern1 = re.compile(r&#39;<span class="title">([^&]*?)</span>&#39;)
    titles = pattern1.findall(resp.text)
    # 通过正则表达式获取class属性为rating_num的span标签并用捕获组提取标签内容
    pattern2 = re.compile(r&#39;<span class="rating_num".*?>(.*?)</span>&#39;)
    ranks = pattern2.findall(resp.text)
    # 使用zip压缩两个列表，循环遍历所有的电影标题和评分
    for title, rank in zip(titles, ranks):
        print(title, rank)
    # 随机休眠1-5秒，避免爬取页面过于频繁
    time.sleep(random.random() * 4 + 1)

在上述程式碼中，我們透過正規表示式取得標籤體為標題和評分的span 標籤，並用捕獲組提取標籤內容。使用 zip 壓縮兩個列表，循環遍歷所有電影標題和評分。

使用 IP 代理程式

許多網站對爬蟲程式比較反感，因為爬蟲程式會耗費掉它們很多的網路頻寬，並製造很多無效的流量。為了隱匿身份，通常需要使用 IP 代理來訪問網站。商業IP 代理程式（如蘑菇代理、芝麻代理、快代理等）是一個很好的選擇，使用商業IP 代理可以讓被爬取的網站無法獲取爬蟲程序來源的真實IP 地址，從而無法簡單的通過IP 地址對爬蟲程序進行封鎖。

以蘑菇代理為例，我們可以在該網站註冊一個帳號，然後購買相應的套餐來獲得商業 IP 代理。蘑菇代理提供了兩種存取代理的方式，分別是 API 私密代理和 HTTP 隧道代理，前者是透過請求蘑菇代理的 API 介面來取得代理伺服器位址，後者是直接使用統一的代理伺服器 IP 和連接埠。

使用IP 代理程式的程式碼如下所示：

import requests
proxies = {
    &#39;http&#39;: &#39;<http://username:password@ip>:port&#39;,
    &#39;https&#39;: &#39;<https://username:password@ip>:port&#39;
}
response = requests.get(&#39;<http://www.example.com>&#39;, proxies=proxies)

其中，username 和password 分別是蘑菇代理帳號的使用者名稱和密碼，ip 和port 分別是代理伺服器的IP 位址和連接埠號碼。請注意，不同的代理提供者的存取方式可能不同，需要根據實際情況進行相應的修改。

以上是Python網路爬蟲之怎麼取得網路數據的詳細內容。更多資訊請關注PHP中文網其他相關文章！

陳述

本文轉載於：亿速云。如有侵權，請聯絡admin@php.cn刪除

在Python陣列上可以執行哪些常見操作？Apr 26, 2025 am 12:22 AM

Pythonarrayssupportvariousoperations:1)Slicingextractssubsets,2)Appending/Extendingaddselements,3)Insertingplaceselementsatspecificpositions,4)Removingdeleteselements,5)Sorting/Reversingchangesorder,and6)Listcomprehensionscreatenewlistsbasedonexistin

在哪些類型的應用程序中，Numpy數組常用？Apr 26, 2025 am 12:13 AM

NumPyarraysareessentialforapplicationsrequiringefficientnumericalcomputationsanddatamanipulation.Theyarecrucialindatascience,machinelearning,physics,engineering,andfinanceduetotheirabilitytohandlelarge-scaledataefficiently.Forexample,infinancialanaly

您什麼時候選擇在Python中的列表上使用數組？Apr 26, 2025 am 12:12 AM

useanArray.ArarayoveralistinpythonwhendeAlingwithHomoGeneData，performance-Caliticalcode，orinterfacingwithccode.1）同質性data：arraysSaveMemorywithTypedElements.2）績效code-performance-calitialcode-calliginal-clitical-clitical-calligation-Critical-Code：Arraysofferferbetterperbetterperperformanceformanceformancefornallancefornalumericalical.3）

所有列表操作是否由數組支持，反之亦然？為什麼或為什麼不呢？Apr 26, 2025 am 12:05 AM

不，notalllistoperationsareSupportedByArrays，andviceversa.1）arraysdonotsupportdynamicoperationslikeappendorinsertwithoutresizing，wheremactsperformance.2）listssdonotguaranteeconecontanttanttanttanttanttanttanttanttanttimecomplecomecomplecomecomecomecomecomecomplecomectacccesslectaccesslecrectaccesslerikearraysodo。

您如何在python列表中訪問元素？Apr 26, 2025 am 12:03 AM

toAccesselementsInapythonlist，useIndIndexing，負索引，切片，口頭化。 1）indexingStartSat0.2）否定indexingAccessesessessessesfomtheend.3）slicingextractsportions.4）iterationerationUsistorationUsisturessoreTionsforloopsoreNumeratorseforeporloopsorenumerate.alwaysCheckListListListListlentePtotoVoidToavoIndexIndexIndexIndexIndexIndExerror。

Python的科學計算中如何使用陣列？Apr 25, 2025 am 12:28 AM

Arraysinpython，尤其是Vianumpy，ArecrucialInsCientificComputingfortheireftheireffertheireffertheirefferthe.1）Heasuedfornumerericalicerationalation，dataAnalysis和Machinelearning.2）Numpy'Simpy'Simpy'simplementIncressionSressirestrionsfasteroperoperoperationspasterationspasterationspasterationspasterationspasterationsthanpythonlists.3）inthanypythonlists.3）andAreseNableAblequick

您如何處理同一系統上的不同Python版本？Apr 25, 2025 am 12:24 AM

你可以通過使用pyenv、venv和Anaconda來管理不同的Python版本。 1）使用pyenv管理多個Python版本：安裝pyenv，設置全局和本地版本。 2）使用venv創建虛擬環境以隔離項目依賴。 3）使用Anaconda管理數據科學項目中的Python版本。 4）保留系統Python用於系統級任務。通過這些工具和策略，你可以有效地管理不同版本的Python，確保項目順利運行。

與標準Python陣列相比，使用Numpy數組的一些優點是什麼？Apr 25, 2025 am 12:21 AM

numpyarrayshaveseveraladagesoverandastardandpythonarrays：1）基於基於duetoc的iMplation，2）2）他們的aremoremoremorymorymoremorymoremorymoremorymoremoremory，尤其是WithlargedAtasets和3）效率化，效率化，矢量化函數函數函數函數構成和穩定性構成和穩定性的操作，製造

See all articles