Use Beautiful Soup to extract data when no div exists

Question

I am trying to extract table data from thousands of HTML files or site data, but the tables have no div to make this easy, and I am very new to BeautifulSoup. Right now I manually edit all of the HTML, convert it to CSV, and load it into my database to create the tables, but I would rather work from what I already have.

P粉463291248 · Answer

BeautifulSoup lets you search for elements other than div.
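As a small illustration of that point, the sketch below looks up a table directly by tag name and class, and again through a CSS selector, with no div involved. The markup in `SAMPLE_HTML` is a made-up sample (an assumption, not the asker's real file); only the `racetable` class name is taken from this answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a table with a class but no wrapping div.
SAMPLE_HTML = """
<html><body>
<table class="racetable">
  <tr><td>Place</td><td>Name</td></tr>
  <tr><td>1</td><td>Runner 1</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Look the table up by tag name and class attribute.
by_class = soup.find("table", class_="racetable")

# Reach the same table through a CSS selector.
by_selector = soup.select_one("table.racetable")

# Collect the stripped text of every cell, row by row.
cells = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in by_class.find_all("tr")
]
print(cells)
```

Either lookup style works; which one reads better is mostly a matter of taste.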

Assuming that, from the HTML you showed, you want to retrieve something that looks like a list of runners, you could do something like this.

from bs4 import BeautifulSoup

file_path = 'scrap.html'

# Simulate the response of an HTTP request by reading a local .html file
with open(file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table', {"class": "racetable"})  # Find the table with the 'racetable' class
rows_table = table.find_all('tr')[1:]  # All rows in the table except the first one

# Collect the column names from the header row, stripping surrounding whitespace
columns_name = [
    cell.get_text().strip() for cell in rows_table[0].find_all('td')
]

runners = []
for row in rows_table[1:]:  # Iterate over every row after the header row
    data = [
        elem.get_text().strip() for elem in row.find_all('td')
    ]
    runner = {
        "place": data[columns_name.index("Place")],
        "name": data[columns_name.index("Name")],
        "city": data[columns_name.index("City")],
        "bib_no": data[columns_name.index("Bib No")],
        "age": data[columns_name.index("Age")],
        "gender": data[columns_name.index("Gender")],
        "age_group": data[columns_name.index("Age Group")],
        "total_time": data[columns_name.index("Total Time")],
        "pace": data[columns_name.index("Pace")]
    }
    print(runner)
    runners.append(runner)

The printed output looks like this:

{'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN  PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
{'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN  PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
{'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN  PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
{'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN  PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
{'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN  PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
{'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN  PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
{'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN  PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}
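Since the question mentions thousands of files and a CSV step, the per-file logic could be wrapped in a function and run over a whole folder. This is only a minimal sketch under several assumptions: every file contains a table with the `racetable` class, the header row is the second `<tr>` (as in the code above), and the names `parse_race_table`, `export_folder`, and the extra `source_file` column are hypothetical helpers introduced here for illustration.

```python
import csv
from pathlib import Path

from bs4 import BeautifulSoup


def parse_race_table(html, header_row=1):
    """Return one dict per data row, keyed by the header cell text.

    header_row=1 mirrors the assumption that the first <tr> is not the header.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="racetable")
    if table is None:  # file without the expected table
        return []
    rows = table.find_all("tr")[header_row:]
    if not rows:
        return []
    headers = [td.get_text(strip=True) for td in rows[0].find_all("td")]
    records = []
    for row in rows[1:]:
        values = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(values) == len(headers):  # skip spacer or malformed rows
            records.append(dict(zip(headers, values)))
    return records


def export_folder(folder, out_csv):
    """Parse every .html file in `folder` and write all rows to one CSV."""
    records = []
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8")
        for record in parse_race_table(html):
            record["source_file"] = path.name  # remember where the row came from
            records.append(record)
    if not records:
        return 0
    # Build the column list as an ordered union, in case files differ slightly.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

Calling something like `export_folder("races", "runners.csv")` (both paths are placeholders) would then produce a single CSV that can be loaded into the database in one step, instead of editing each file by hand.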
