I'm trying to extract table data from a few thousand HTML files of site data, but the tables aren't wrapped in divs that would make this easy, and I'm new to Beautiful Soup. Right now I'm manually converting every HTML file to CSV, editing it, and loading it into my database to create the tables, but I'd rather just grab the data from the HTML I already have.
<body style="margin-top:140px;">
<div id="container">
  <!-- Left div -->
  <div> </div>
  <!-- Center div -->
  <div>
    <!-- Image Link -->
    <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a>
  </div>
  <!-- Right div -->
  <div> </div>
</div>
<A Name = "Top"></A>
<H1>5k Run</H1>
<H1>Overall Finish List</H1>
<H2>September 24, 2022</H2>
<HR noshade>
<B><I> </I></B>
<HR noshade>
<table border=0 cellpadding=0 cellspacing=0 class="racetable">
  <tr> <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td> </tr>
  <tr> <td class=h11>Place</td> <td class=h12>Name</td> <td class=h12>City</td> <td class=h11>Bib No</td> <td class=h11>Age</td> <td class=h11>Gender</td> <td class=h11>Age Group</td> <td class=h11>Total Time</td> <td class=h11>Pace</td> </tr>
  <tr> <td class=d01>1</td> <td class=d02>Runner 1</td> <td class=d02>ANYTOWN PA</td> <td class=d01>390</td> <td class=d01>52</td> <td class=d01>M</td> <td class=d01>1:Overall</td> <td class=d01> 18:43.93</td> <td class=d01>6:03/M</td> </tr>
  <tr> <td class=d01>2</td> <td class=d02>Runner 2</td> <td class=d02>ANYTOWN PA</td> <td class=d01>380</td> <td class=d01>33</td> <td class=d01>M</td> <td class=d01>1:19-39</td> <td class=d01> 19:31.27</td> <td class=d01>6:18/M</td> </tr>
  <tr> <td class=d01>3</td> <td class=d02>Runner 3</td> <td class=d02>ANYTOWN PA</td> <td class=d01>389</td> <td class=d01>65</td> <td class=d01>F</td> <td class=d01>1:Overall</td> <td class=d01> 45:45.20</td> <td class=d01>14:46/M</td> </tr>
  <tr> <td class=d01>4</td> <td class=d02>Runner 4</td> <td class=d02>ANYTOWN PA</td> <td class=d01>381</td> <td class=d01>18</td> <td class=d01>F</td> <td class=d01>1: 1-18</td> <td class=d01> 53:28.84</td> <td class=d01>17:15/M</td> </tr>
  <tr> <td class=d01>5</td> <td class=d02>Runner 5</td> <td class=d02>ANYTOWN PA</td> <td class=d01>382</td> <td class=d01>41</td> <td class=d01>F</td> <td class=d01>1:40-59</td> <td class=d01> 53:30.48</td> <td class=d01>17:16/M</td> </tr>
  <tr> <td class=d01>6</td> <td class=d02>Runner 6</td> <td class=d02>ANYTOWN PA</td> <td class=d01>384</td> <td class=d01>14</td> <td class=d01>M</td> <td class=d01>1: 1-18</td> <td class=d01> 57:38.66</td> <td class=d01>18:36/M</td> </tr>
  <tr> <td class=d01>7</td> <td class=d02>Runner 7</td> <td class=d02>ANYTOWN PA</td> <td class=d01>385</td> <td class=d01>72</td> <td class=d01>F</td> <td class=d01>1:60-99</td> <td class=d01> 57:40.11</td> <td class=d01>18:36/M</td> </tr>
</table>
<HR noshade>
<p>
<!-- 0c17 22.0 2e9 -->
</BODY>
</HTML>
I've tried adding divs without much success.
P粉463291248  2024-02-27 00:58:21
BeautifulSoup can search for any element by tag name, class, or other attributes, so the table does not need to sit inside a div.
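As a minimal sketch of that point, the snippet below looks a table up by its tag name and class alone. The HTML string is a made-up, shortened fragment modeled on the markup in the question, not the full page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened fragment modeled on the question's markup.
html = """
<H1>5k Run</H1>
<table class="racetable">
<tr><td class=h11>Place</td><td class=h12>Name</td></tr>
<tr><td class=d01>1</td><td class=d02>Runner 1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Look the table up by tag name and class; no wrapper div is involved.
table = soup.find("table", class_="racetable")

# The same lookup written as a CSS selector.
same_table = soup.select_one("table.racetable")

# Collect the stripped text of every cell, row by row.
cells = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
]
print(cells)
```

Either lookup style should work here; which one to use is mostly a matter of taste.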
Assuming you want to turn each row of the table in your HTML into a runner record, you could do something like this:
from bs4 import BeautifulSoup

file_path = 'scrap.html'
# Simulate the response of an HTTP request by reading a local .html file
with open(file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
# Find the table with the 'racetable' class
table = soup.find('table', {"class": "racetable"})
# All rows of the table except the first one (the title row)
rows_table = table.find_all('tr')[1:]
# Column names, read from the header row
columns_name = [
    cell.get_text().strip() for cell in rows_table[0].find_all('td')
]

runners = []
# Loop over every row after the header row that holds the column names
for row in rows_table[1:]:
    data = [
        elem.get_text().strip() for elem in row.find_all('td')
    ]
    # Look each value up by its column name, so the column order does not matter
    runner = {
        "place": data[columns_name.index("Place")],
        "name": data[columns_name.index("Name")],
        "city": data[columns_name.index("City")],
        "bib_no": data[columns_name.index("Bib No")],
        "age": data[columns_name.index("Age")],
        "gender": data[columns_name.index("Gender")],
        "age_group": data[columns_name.index("Age Group")],
        "total_time": data[columns_name.index("Total Time")],
        "pace": data[columns_name.index("Pace")]
    }
    print(runner)
    runners.append(runner)
The printed result looks like this:
{'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
{'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
{'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
{'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
{'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
{'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
{'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}
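Since the question mentions a few thousand files that end up in a database, the same idea could be wrapped in a function and run over a whole folder, writing one combined CSV. This is only a sketch under assumptions: the function names `parse_race_table` and `export_folder`, the added `source_file` column, and the `*.html` file pattern are hypothetical and not from the original post, and it assumes every page uses a `racetable` table with a title row followed by a header row, as in the sample.

```python
import csv
from pathlib import Path

from bs4 import BeautifulSoup


def parse_race_table(html_content):
    """Return one dict per runner from the table with class 'racetable'."""
    soup = BeautifulSoup(html_content, "html.parser")
    table = soup.find("table", class_="racetable")
    if table is None:  # page without a results table
        return []
    rows = table.find_all("tr")[1:]  # skip the title row
    if not rows:
        return []
    # The first remaining row holds the column names.
    headers = [td.get_text(strip=True) for td in rows[0].find_all("td")]
    runners = []
    for row in rows[1:]:
        values = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(values) == len(headers):  # ignore rows that do not match the header
            runners.append(dict(zip(headers, values)))
    return runners


def export_folder(input_dir, output_csv):
    """Parse every .html file in input_dir and write all runners to one CSV.

    Returns the number of runner rows written.
    """
    all_rows = []
    for path in sorted(Path(input_dir).glob("*.html")):
        html_content = path.read_text(encoding="utf-8", errors="replace")
        for runner in parse_race_table(html_content):
            runner["source_file"] = path.name  # remember where the row came from
            all_rows.append(runner)
    if not all_rows:
        return 0
    # Union of all column names, in first-seen order, in case pages differ.
    fieldnames = []
    for row in all_rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(all_rows)
    return len(all_rows)
```

A call such as export_folder("race_pages", "runners.csv") (both names are placeholders) would then produce a single file that can be imported into the database. Building the dicts from the header row with zip keeps the column names exactly as they appear in each page, so nothing has to be hard-coded if some pages have different columns.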