Use Beautiful soup to extract data when the div does not exist

Question

I'm trying to extract table data from thousands of html files or site data, but the tables don't have divs to make this easy, and I'm very new to beautifulsoup. Right now I'm manually editing all the converted html to csv and putting them into my database to create the tables, but I'd rather just grab what I already have. <

P粉463291248 · Answer

BeautifulSoup allows you to search for content outside of divs.

Assuming the html you are displaying wants to retrieve something that looks like a runner, you could do something like this.

from bs4 import BeautifulSoup

file_path = 'scrap.html'

with open(file_path, 'r',
          encoding='utf-8') as file: # We simulate a return from an html request by just opening an .html file
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table', {"class": "racetable"}) # We are looking for the table with the 'racetable' class
rows_table = table.find_all('tr')[1:] # All lines in the table without the first one

columns_name = [
    row.get_text() for row in rows_table[0].find_all('td')
] # We get the name of each column in a list

runners = []
for row in rows_table[1:]: # We repeat on all the lines except the first one which is the one with the name of the columns
    data = [
        elem.get_text().strip() for elem in row.find_all('td')
    ]
    runner = {
        "place": data[columns_name.index("Place")],
        "name": data[columns_name.index("Name")],
        "city": data[columns_name.index("City")],
        "bib_no": data[columns_name.index("Bib No")],
        "age": data[columns_name.index("Age")],
        "gender": data[columns_name.index("Gender")],
        "age_group": data[columns_name.index("Age Group")],
        "total_time": data[columns_name.index("Total Time")],
        "pace": data[columns_name.index("Pace")]
    }
    print(runner)
    runners.append(runner)

The printed result looks like this

{'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
{'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN PA', 'bib_no': '380', 'age': '33', 'gender': 'M' , 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
{'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN PA', 'bib_no': '389', 'age': '65', 'gender': 'F' , 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
{'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN PA', 'bib_no': '381', 'age': '18', 'gender': 'F' , 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
{'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN PA', 'bib_no': '382', 'age': '41', 'gender': 'F' , 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
{'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN PA', 'bib_no': '384', 'age': '14', 'gender': 'M' , 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
{'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN PA', 'bib_no': '385', 'age': '72', 'gender': 'F' , 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}

Use Beautiful soup to extract data when the div does not exist

reply all(1)I'll reply