首页 >后端开发 >Python教程 >美丽的汤解析许多条目的列表并保存在数据框中

美丽的汤解析许多条目的列表并保存在数据框中

WBOY
WBOY转载
2024-02-10 08:48:03823浏览

美丽的汤解析许多条目的列表并保存在数据框中

问题内容

目前我将从世界各地的教区收集数据。

我的方法适用于 bs4 和 pandas。我目前正在研究抓取逻辑。

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.catholic-hierarchy.org/"

# Send a GET request to the website
response = requests.get(url)

#my approach  to parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Find the relevant elements containing diocese information
diocese_elements = soup.find_all("div", class_="diocesan")

# Initialize empty lists to store data
dioceses = []
addresses = []

# Extract now data from each diocese element
for diocese_element in diocese_elements:
    # Example: Extracting diocese name
    diocese_name = diocese_element.find("a").text.strip()
    dioceses.append(diocese_name)

    # Example: Extracting address
    address = diocese_element.find("div", class_="address").text.strip()
    addresses.append(address)

#  to save the whole data we create a DataFrame using pandas
data = {'Diocese': dioceses, 'Address': addresses}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

目前我的 pycharm 上发现了一些奇怪的东西。 我尝试找到一种使用pandas 方法收集全部数据的方法。


正确答案


这个示例可以帮助您入门 - 它将解析所有教区页面以获取教区名称 + url,并将其存储到 panda 的 dataframe 中。

然后您可以迭代这些 url 并获取所需的更多信息。

import pandas as pd
import requests
from bs4 import beautifulsoup

chars = "abcdefghijklmnopqrstuvwxyz"
url = "http://www.catholic-hierarchy.org/diocese/la{char}.html"

all_data = []
for char in chars:
    u = url.format(char=char)

    while true:
        print(f"parsing {u}")
        soup = beautifulsoup(requests.get(u).content, "html.parser")
        for a in soup.select("li a[href^=d]"):
            all_data.append(
                {
                    "name": a.text,
                    "url": "http://www.catholic-hierarchy.org/diocese/" + a["href"],
                }
            )

        next_page = soup.select_one('a:has(img[alt="[next page]"])')
        if not next_page:
            break

        u = "http://www.catholic-hierarchy.org/diocese/" + next_page["href"]

df = pd.dataframe(all_data).drop_duplicates()
print(df.head(10))

打印:

...
Parsing http://www.catholic-hierarchy.org/diocese/lax.html
Parsing http://www.catholic-hierarchy.org/diocese/lay.html
Parsing http://www.catholic-hierarchy.org/diocese/laz.html

               Name                                                   URL
0          Holy See  http://www.catholic-hierarchy.org/diocese/droma.html
1   Diocese of Rome  http://www.catholic-hierarchy.org/diocese/droma.html
2            Aachen  http://www.catholic-hierarchy.org/diocese/da549.html
3            Aachen  http://www.catholic-hierarchy.org/diocese/daach.html
4    Aarhus (Århus)  http://www.catholic-hierarchy.org/diocese/da566.html
5               Aba  http://www.catholic-hierarchy.org/diocese/dabaa.html
6        Abaetetuba  http://www.catholic-hierarchy.org/diocese/dabae.html
8         Abakaliki  http://www.catholic-hierarchy.org/diocese/dabak.html
9           Abancay  http://www.catholic-hierarchy.org/diocese/daban.html
10        Abaradira  http://www.catholic-hierarchy.org/diocese/d2a01.html

以上是美丽的汤解析许多条目的列表并保存在数据框中的详细内容。更多信息请关注PHP中文网其他相关文章!

声明:
本文转载于:stackoverflow.com。如有侵权,请联系admin@php.cn删除