I am currently collecting data on dioceses from around the world.
My approach uses bs4 and pandas. I am currently working on the scraping logic.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.catholic-hierarchy.org/"

# Send a GET request to the website
response = requests.get(url)

# Parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Find the relevant elements containing diocese information
diocese_elements = soup.find_all("div", class_="diocesan")

# Initialize empty lists to store data
dioceses = []
addresses = []

# Extract data from each diocese element
for diocese_element in diocese_elements:
    # Example: extracting the diocese name
    diocese_name = diocese_element.find("a").text.strip()
    dioceses.append(diocese_name)

    # Example: extracting the address
    address = diocese_element.find("div", class_="address").text.strip()
    addresses.append(address)

# To save the whole data set, create a DataFrame using pandas
data = {'Diocese': dioceses, 'Address': addresses}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
At the moment I am getting some odd results in PyCharm. I am trying to find a way to collect the full data set with pandas.
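This example can help you get started - it parses all the diocese listing pages to get each diocese's name + URL and stores them in a pandas DataFrame.
You can then iterate over these URLs and fetch whatever additional information you need.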
import pandas as pd
import requests
from bs4 import BeautifulSoup

chars = "abcdefghijklmnopqrstuvwxyz"
url = "http://www.catholic-hierarchy.org/diocese/la{char}.html"

all_data = []
for char in chars:
    u = url.format(char=char)
    # Follow the "next page" links until the listing for this letter runs out
    while True:
        print(f"Parsing {u}")
        soup = BeautifulSoup(requests.get(u).content, "html.parser")
        # Diocese links are the <li> anchors whose href starts with "d"
        for a in soup.select("li a[href^=d]"):
            all_data.append(
                {
                    "Name": a.text,
                    "URL": "http://www.catholic-hierarchy.org/diocese/" + a["href"],
                }
            )
        # The trailing "i" makes the alt-text match case-insensitive
        next_page = soup.select_one('a:has(img[alt="[next page]" i])')
        if not next_page:
            break
        u = "http://www.catholic-hierarchy.org/diocese/" + next_page["href"]

df = pd.DataFrame(all_data).drop_duplicates()
print(df.head(10))
Prints:
...
Parsing http://www.catholic-hierarchy.org/diocese/lax.html
Parsing http://www.catholic-hierarchy.org/diocese/lay.html
Parsing http://www.catholic-hierarchy.org/diocese/laz.html
               Name                                                   URL
0          Holy See  http://www.catholic-hierarchy.org/diocese/droma.html
1   Diocese of Rome  http://www.catholic-hierarchy.org/diocese/droma.html
2            Aachen  http://www.catholic-hierarchy.org/diocese/da549.html
3            Aachen  http://www.catholic-hierarchy.org/diocese/daach.html
4    Aarhus (Århus)  http://www.catholic-hierarchy.org/diocese/da566.html
5               Aba  http://www.catholic-hierarchy.org/diocese/dabaa.html
6        Abaetetuba  http://www.catholic-hierarchy.org/diocese/dabae.html
8         Abakaliki  http://www.catholic-hierarchy.org/diocese/dabak.html
9           Abancay  http://www.catholic-hierarchy.org/diocese/daban.html
10        Abaradira  http://www.catholic-hierarchy.org/diocese/d2a01.html
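For the follow-up step of iterating over the collected URLs, something along these lines might work. This is only a rough sketch: the helper names `extract_details` and `fetch_details` are made up for illustration, and the fields it reads (the page `<title>` and the first `<h1>`) are assumptions about what a diocese page contains, so the selectors will need adjusting after inspecting a real page. It also assumes the DataFrame has a `URL` column, as in the printed output.

```python
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


def extract_details(html):
    """Pull a few illustrative fields out of one page (hypothetical helper).

    Assumption: the page has a <title> and possibly an <h1>; replace these
    lookups with selectors that match the real diocese pages.
    """
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    return {
        "Title": soup.title.get_text(strip=True) if soup.title else None,
        "Heading": heading.get_text(strip=True) if heading else None,
    }


def fetch_details(df, url_column="URL", delay=1.0):
    """Visit each unique URL in df and merge the extracted fields back in."""
    rows = []
    with requests.Session() as session:
        for url in df[url_column].drop_duplicates():
            response = session.get(url, timeout=30)
            response.raise_for_status()
            rows.append({url_column: url, **extract_details(response.content)})
            # Pause between requests to avoid hammering the site
            time.sleep(delay)
    details = pd.DataFrame(rows, columns=[url_column, "Title", "Heading"])
    return df.merge(details, on=url_column, how="left")
```

Calling `fetch_details(df.head(5))` first is a cheap way to check the selectors on a handful of pages before running over the whole list.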