Detailed Tutorial: Crawling GitHub Repository Folders Without the API

Barbara Streisand

Dec 16, 2024, 06:28 AM

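This detailed tutorial, written by Shpetim Haxhiu, walks you through crawling GitHub repository folders programmatically without relying on the GitHub API. It covers everything from understanding the page structure to a robust recursive implementation with enhancements.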

1. Setup and Installation

Before you begin, make sure you have:

Python: version 3.7 or later installed.
Libraries: install requests and BeautifulSoup.

   pip install requests beautifulsoup4

Editor: any IDE with Python support, such as VS Code or PyCharm.
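
One quick way to confirm the setup is to import both packages and print their versions. This is only a sketch: the helper name check_environment is invented for this illustration and is not part of the original tutorial.

```python
import sys

import bs4
import requests


def check_environment(min_python=(3, 7)):
    """Return a dict describing the interpreter and library versions."""
    if sys.version_info < min_python:
        raise RuntimeError(f"Python {min_python[0]}.{min_python[1]}+ is required")
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "requests": requests.__version__,
        "beautifulsoup4": bs4.__version__,
    }


if __name__ == "__main__":
    for name, version in check_environment().items():
        print(f"{name}: {version}")
```

If either import fails, rerunning the pip command above is the first thing to try.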

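2. Analyzing GitHub's HTML Structure

To scrape GitHub folders, you need to understand the HTML structure of a repository page. On a GitHub repository page:

Folders are linked through paths containing /tree/.
Files are linked through paths containing /blob/.

Each item (folder or file) sits inside a <div> with the attribute role="rowheader" and contains an <a> tag. For example: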

<div role="rowheader">
  <a href="/owner/repo/tree/main/folder-name">folder-name</a>
</div>
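
To see how this markup maps to folders and files, the snippet can be parsed offline. The sketch below assumes the structure described above; the README.md row and the helper name classify_links are additions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the snippet above; the file row is an added illustration.
SAMPLE_HTML = """
<div role="rowheader">
  <a href="/owner/repo/tree/main/folder-name">folder-name</a>
</div>
<div role="rowheader">
  <a href="/owner/repo/blob/main/README.md">README.md</a>
</div>
"""


def classify_links(html):
    """Return (kind, name, href) tuples for every row-header link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.select('div[role="rowheader"] a'):
        href = link.get("href", "")
        if "/tree/" in href:
            kind = "folder"
        elif "/blob/" in href:
            kind = "file"
        else:
            kind = "other"
        rows.append((kind, link.get_text(strip=True), href))
    return rows


if __name__ == "__main__":
    for kind, name, href in classify_links(SAMPLE_HTML):
        print(f"{kind}: {name} -> {href}")
```

Trying the selector on a saved copy of a real page first is a cheap way to check that the markup still matches before crawling.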

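3. Implementing the Scraper

3.1. Recursive Crawling Function

The script recursively crawls folders and prints their structure. A depth parameter limits the recursion depth and avoids unnecessary load.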

import requests
from bs4 import BeautifulSoup
import time

def crawl_github_folder(url, depth=0, max_depth=3):
    """
    Recursively crawls a GitHub repository folder structure.

    Parameters:
    - url (str): URL of the GitHub folder to scrape.
    - depth (int): Current recursion depth.
    - max_depth (int): Maximum depth to recurse.
    """
    if depth > max_depth:
        return

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to access {url} (Status code: {response.status_code})")
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract folder and file links
    items = soup.select('div[role="rowheader"] a')

    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"

        if '/tree/' in item_url:
            print(f"{'  ' * depth}Folder: {item_name}")
            crawl_github_folder(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            print(f"{'  ' * depth}File: {item_name}")

# Example usage
if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
    crawl_github_folder(repo_url)

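4. Features Explained

Request headers: a User-Agent string mimics a browser to reduce the chance of being blocked.
Recursive crawling:
- Detects folders (/tree/) and recursively enters them.
- Lists files (/blob/) without descending further.
Indentation: reflects the folder hierarchy in the output.
Depth limiting: prevents excessive recursion by setting a maximum depth (max_depth).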
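
The indentation and depth-limit behaviour can be tried without any network access. The sketch below replaces the HTTP calls with a hypothetical in-memory tree (FAKE_REPO), so the recursion pattern used by crawl_github_folder is visible in isolation. The function name walk is also made up for this example.

```python
# Hypothetical in-memory stand-in for a repository: dicts are folders, None marks a file.
FAKE_REPO = {
    "src": {
        "utils": {
            "deep": {"too_deep.py": None},
            "helpers.py": None,
        },
        "main.py": None,
    },
    "README.md": None,
}


def walk(tree, depth=0, max_depth=3):
    """Return indented lines for tree, stopping once depth exceeds max_depth."""
    if depth > max_depth:
        return []
    lines = []
    for name, child in tree.items():
        indent = "  " * depth
        if isinstance(child, dict):
            lines.append(f"{indent}Folder: {name}")
            lines.extend(walk(child, depth + 1, max_depth))
        else:
            lines.append(f"{indent}File: {name}")
    return lines


if __name__ == "__main__":
    print("\n".join(walk(FAKE_REPO, max_depth=1)))
```

With max_depth=1 the folder utils is still listed, but its contents are not, which mirrors how the crawler names a folder at the limit without descending into it.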

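5. Enhancements

These enhancements improve the crawler's functionality and reliability. They address common challenges such as exporting results, handling errors, and avoiding rate limits, so that the tool stays efficient and user-friendly.

5.1. Exporting Results

Save the output to a structured JSON file for easier use. The crawl_to_json function listed in section 7 builds a nested dictionary and writes it to output.json.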


5.2. Error Handling

Add robust error handling for network errors and unexpected HTML changes.
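
A minimal sketch of what that could look like follows. It is an assumption, not the author's code: the helper names fetch_html and extract_links and the 10-second timeout are chosen for illustration. It relies on requests.exceptions.RequestException, the base class for the errors that requests raises.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, timeout=10):
    """Return the page body, or None if the request fails for any reason."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as exc:
        print(f"Error fetching {url}: {exc}")
        return None
    return response.text


def extract_links(html):
    """Return row-header links, warning when the expected markup is absent."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select('div[role="rowheader"] a')
    if not links:
        print("Warning: no rowheader links found; the page layout may have changed")
    # Skip anchors without an href instead of raising KeyError later on.
    return [link for link in links if link.has_attr("href")]
```

Returning None or an empty list lets the recursive caller skip a failed folder and continue with the rest of the tree.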


5.3. Rate Limiting

To avoid being rate-limited by GitHub, introduce a delay between requests.
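
The simplest form is a fixed time.sleep call before each request. The sketch below is one possible refinement that only sleeps for whatever part of the interval has not already elapsed. The Throttle class and its 2-second default are assumptions for illustration; the source does not state a specific delay.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


if __name__ == "__main__":
    throttle = Throttle(min_interval=0.5)
    start = time.monotonic()
    for page in ("first", "second", "third"):
        throttle.wait()  # call this right before each requests.get(...)
        print(f"{page} request allowed at {time.monotonic() - start:.1f}s")
```

In the crawler, a single shared Throttle instance would be called immediately before requests.get so that recursive calls are spaced out as well.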


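6. Ethical Considerations

Written by Shpetim Haxhiu, a specialist in software automation and ethical programming, this section covers the best practices to follow when using the GitHub crawler.

Compliance: adhere to GitHub's Terms of Service.
Minimize load: respect GitHub's servers by limiting requests and adding delays.
Permissions: obtain permission before extensively crawling private repositories.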

7. Complete Code

The script below crawls a folder recursively and saves the resulting structure to output.json:

import json

import requests
from bs4 import BeautifulSoup

def crawl_to_json(url, depth=0, max_depth=3):
    """Crawls and saves results as JSON."""
    result = {}

    if depth > max_depth:
        return result

    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to access {url}")
        return result

    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.select('div[role="rowheader"] a')

    for item in items:
        item_name = item.text.strip()
        item_url = f"https://github.com{item['href']}"

        if '/tree/' in item_url:
            result[item_name] = crawl_to_json(item_url, depth + 1, max_depth)
        elif '/blob/' in item_url:
            result[item_name] = "file"

    return result

if __name__ == "__main__":
    repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
    structure = crawl_to_json(repo_url)

    with open("output.json", "w") as file:
        json.dump(structure, file, indent=2)

    print("Repository structure saved to output.json")
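
Once the structure is saved, the nested dictionary can be flattened into plain paths for searching or diffing. The sketch below assumes the shape that crawl_to_json produces (folders as nested dicts, files as the string "file"). The helper list_paths and the sample data are hypothetical.

```python
import json


def list_paths(structure, prefix=""):
    """Yield slash-separated paths for every entry in the nested structure."""
    for name, value in structure.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            yield path + "/"
            yield from list_paths(value, prefix=path + "/")
        else:
            yield path


if __name__ == "__main__":
    # Hypothetical sample in the same shape that crawl_to_json returns.
    sample = {"docs": {"guide.md": "file"}, "README.md": "file"}
    restored = json.loads(json.dumps(sample, indent=2))  # round-trip like output.json
    for path in list_paths(restored):
        print(path)
```

With a real run, the sample would be replaced by json.load on the opened output.json file.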

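By following this detailed guide, you can build a robust GitHub folder crawler. The tool can be adapted to various needs while staying ethically compliant.

Feel free to leave questions in the comments section! Also, don't forget to get in touch with me:

Email: shpetim.h@gmail.com
LinkedIn: linkedin.com/in/shpetimhaxhiu
GitHub: github.com/shpetimhaxhiu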
