tag. For example:
<div role="rowheader">
<a href="/owner/repo/tree/main/folder-name">folder-name</a>
</div>
3. Implementing the Scraper
3.1. Recursive Crawling Function
The script will recursively scrape folders and print their structure. To limit the recursion depth and avoid unnecessary load, we’ll use a depth parameter.
import requests
from bs4 import BeautifulSoup
import time
def crawl_github_folder(url, depth=0, max_depth=3):
"""
Recursively crawls a GitHub repository folder structure.
Parameters:
- url (str): URL of the GitHub folder to scrape.
- depth (int): Current recursion depth.
- max_depth (int): Maximum depth to recurse.
"""
if depth > max_depth:
return
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
if response.status_code != 200:
print(f"Failed to access {url} (Status code: {response.status_code})")
return
soup = BeautifulSoup(response.text, 'html.parser')
# Extract folder and file links
items = soup.select('div[role="rowheader"] a')
for item in items:
item_name = item.text.strip()
item_url = f"https://github.com{item['href']}"
if '/tree/' in item_url:
print(f"{' ' * depth}Folder: {item_name}")
crawl_github_folder(item_url, depth + 1, max_depth)
elif '/blob/' in item_url:
print(f"{' ' * depth}File: {item_name}")
# Example usage
if __name__ == "__main__":
repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
crawl_github_folder(repo_url)
4. Features Explained
-
Headers for Request: Using a User-Agent string to mimic a browser and avoid blocking.
-
Recursive Crawling:
- Detects folders (/tree/) and recursively enters them.
- Lists files (/blob/) without entering further.
-
Indentation: Reflects folder hierarchy in the output.
-
Depth Limitation: Prevents excessive recursion by setting a maximum depth (max_depth).
5. Enhancements
These enhancements are designed to improve the functionality and reliability of the crawler. They address common challenges like exporting results, handling errors, and avoiding rate limits, ensuring the tool is efficient and user-friendly.
5.1. Exporting Results
Save the output to a structured JSON file for easier usage.
pip install requests beautifulsoup4
5.2. Error Handling
Add robust error handling for network errors and unexpected HTML changes:
<div role="rowheader">
<a href="/owner/repo/tree/main/folder-name">folder-name</a>
</div>
5.3. Rate Limiting
To avoid being rate-limited by GitHub, introduce delays:
import requests
from bs4 import BeautifulSoup
import time
def crawl_github_folder(url, depth=0, max_depth=3):
"""
Recursively crawls a GitHub repository folder structure.
Parameters:
- url (str): URL of the GitHub folder to scrape.
- depth (int): Current recursion depth.
- max_depth (int): Maximum depth to recurse.
"""
if depth > max_depth:
return
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
if response.status_code != 200:
print(f"Failed to access {url} (Status code: {response.status_code})")
return
soup = BeautifulSoup(response.text, 'html.parser')
# Extract folder and file links
items = soup.select('div[role="rowheader"] a')
for item in items:
item_name = item.text.strip()
item_url = f"https://github.com{item['href']}"
if '/tree/' in item_url:
print(f"{' ' * depth}Folder: {item_name}")
crawl_github_folder(item_url, depth + 1, max_depth)
elif '/blob/' in item_url:
print(f"{' ' * depth}File: {item_name}")
# Example usage
if __name__ == "__main__":
repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
crawl_github_folder(repo_url)
6. Ethical Considerations
Authored by Shpetim Haxhiu, an expert in software automation and ethical programming, this section ensures adherence to best practices while using the GitHub crawler.
-
Compliance: Adhere to GitHub’s Terms of Service.
-
Minimize Load: Respect GitHub’s servers by limiting requests and adding delays.
-
Permission: Obtain permission for extensive crawling of private repositories.
7. Complete Code
Here’s the consolidated script with all features included:
import json
def crawl_to_json(url, depth=0, max_depth=3):
"""Crawls and saves results as JSON."""
result = {}
if depth > max_depth:
return result
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
if response.status_code != 200:
print(f"Failed to access {url}")
return result
soup = BeautifulSoup(response.text, 'html.parser')
items = soup.select('div[role="rowheader"] a')
for item in items:
item_name = item.text.strip()
item_url = f"https://github.com{item['href']}"
if '/tree/' in item_url:
result[item_name] = crawl_to_json(item_url, depth + 1, max_depth)
elif '/blob/' in item_url:
result[item_name] = "file"
return result
if __name__ == "__main__":
repo_url = "https://github.com/<owner>/<repo>/tree/<branch>/<folder>"
structure = crawl_to_json(repo_url)
with open("output.json", "w") as file:
json.dump(structure, file, indent=2)
print("Repository structure saved to output.json")
By following this detailed guide, you can build a robust GitHub folder crawler. This tool can be adapted for various needs while ensuring ethical compliance.
Feel free to leave questions in the comments section! Also, don’t forget to connect with me:
-
Email: shpetim.h@gmail.com
-
LinkedIn: linkedin.com/in/shpetimhaxhiu
-
GitHub: github.com/shpetimhaxhiu
The above is the detailed content of Detailed Tutorial: Crawling GitHub Repository Folders Without API. For more information, please follow other related articles on the PHP Chinese website!