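I am trying to extract table data from a few thousand HTML files (or website data), but the tables have no divs to make this easy, and I am new to Beautiful Soup. Right now I manually edit all the converted HTML into CSV and load it into my database to build tables, but I would rather just scrape what I already have.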
<body style="margin-top:140px;">
<div id="container">
  <!-- Left div -->
  <div> </div>
  <!-- Center div -->
  <div>
    <!-- Image Link -->
    <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a>
  </div>
  <!-- Right div -->
  <div> </div>
</div>
<A Name = "Top"></A>
<H1>5k Run</H1>
<H1>Overall Finish List</H1>
<H2>September 24, 2022</H2>
<HR noshade>
<B><I> </I></B>
<HR noshade>
<table border=0 cellpadding=0 cellspacing=0 class="racetable">
<tr> <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td> </tr>
<tr> <td class=h11>Place</td> <td class=h12>Name</td> <td class=h12>City</td> <td class=h11>Bib No</td> <td class=h11>Age</td> <td class=h11>Gender</td> <td class=h11>Age Group</td> <td class=h11>Total Time</td> <td class=h11>Pace</td> </tr>
<tr> <td class=d01>1</td> <td class=d02>Runner 1</td> <td class=d02>ANYTOWN PA</td> <td class=d01>390</td> <td class=d01>52</td> <td class=d01>M</td> <td class=d01>1:Overall</td> <td class=d01> 18:43.93</td> <td class=d01>6:03/M</td> </tr>
<tr> <td class=d01>2</td> <td class=d02>Runner 2</td> <td class=d02>ANYTOWN PA</td> <td class=d01>380</td> <td class=d01>33</td> <td class=d01>M</td> <td class=d01>1:19-39</td> <td class=d01> 19:31.27</td> <td class=d01>6:18/M</td> </tr>
<tr> <td class=d01>3</td> <td class=d02>Runner 3</td> <td class=d02>ANYTOWN PA</td> <td class=d01>389</td> <td class=d01>65</td> <td class=d01>F</td> <td class=d01>1:Overall</td> <td class=d01> 45:45.20</td> <td class=d01>14:46/M</td> </tr>
<tr> <td class=d01>4</td> <td class=d02>Runner 4</td> <td class=d02>ANYTOWN PA</td> <td class=d01>381</td> <td class=d01>18</td> <td class=d01>F</td> <td class=d01>1: 1-18</td> <td class=d01> 53:28.84</td> <td class=d01>17:15/M</td> </tr>
<tr> <td class=d01>5</td> <td class=d02>Runner 5</td> <td class=d02>ANYTOWN PA</td> <td class=d01>382</td> <td class=d01>41</td> <td class=d01>F</td> <td class=d01>1:40-59</td> <td class=d01> 53:30.48</td> <td class=d01>17:16/M</td> </tr>
<tr> <td class=d01>6</td> <td class=d02>Runner 6</td> <td class=d02>ANYTOWN PA</td> <td class=d01>384</td> <td class=d01>14</td> <td class=d01>M</td> <td class=d01>1: 1-18</td> <td class=d01> 57:38.66</td> <td class=d01>18:36/M</td> </tr>
<tr> <td class=d01>7</td> <td class=d02>Runner 7</td> <td class=d02>ANYTOWN PA</td> <td class=d01>385</td> <td class=d01>72</td> <td class=d01>F</td> <td class=d01>1:60-99</td> <td class=d01> 57:40.11</td> <td class=d01>18:36/M</td> </tr>
</table>
<HR noshade>
<p>
<!-- 0c17 22.0 2e9 -->
</BODY>
</HTML>
I have tried adding divs, but without much success.
P粉463291248 2024-02-27 00:58:21
BeautifulSoup lets you search for things other than divs.
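To make that concrete, here is a minimal sketch of three ways to locate elements without relying on a div: by tag name, by tag name plus class, and by CSS selector. The HTML string is a made-up, cut-down stand-in for your page, used only for illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up snippet (not your real page), only for illustration.
html = """
<h1>5k Run</h1>
<table class="racetable">
  <tr><td class="h11">Place</td><td class="h12">Name</td></tr>
  <tr><td class="d01">1</td><td class="d02">Runner 1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Search by tag name only.
title = soup.find("h1").get_text(strip=True)

# 2. Search by tag name plus class attribute.
table = soup.find("table", class_="racetable")

# 3. Search with a CSS selector: every <td class="d02"> inside that table.
names = [td.get_text(strip=True) for td in soup.select("table.racetable td.d02")]

print(title, names)
```

Note that html.parser lowercases tag names, so the uppercase `<H1>` and `<H2>` tags in your file can still be found with lowercase names.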
Assuming that, from the HTML you showed, you want to retrieve the rows that look like runners, you can do something like this.
from bs4 import BeautifulSoup

file_path = 'scrap.html'
# We simulate the response of an HTML request by just opening an .html file
with open(file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')

# We are looking for the table with the 'racetable' class
table = soup.find('table', {"class": "racetable"})

# All rows in the table except the first one (the race title row)
rows_table = table.find_all('tr')[1:]

# We get the name of each column in a list
columns_name = [
    cell.get_text().strip() for cell in rows_table[0].find_all('td')
]

runners = []
# We loop over all the rows except the first one, which holds the column names
for row in rows_table[1:]:
    data = [
        elem.get_text().strip() for elem in row.find_all('td')
    ]
    runner = {
        "place": data[columns_name.index("Place")],
        "name": data[columns_name.index("Name")],
        "city": data[columns_name.index("City")],
        "bib_no": data[columns_name.index("Bib No")],
        "age": data[columns_name.index("Age")],
        "gender": data[columns_name.index("Gender")],
        "age_group": data[columns_name.index("Age Group")],
        "total_time": data[columns_name.index("Total Time")],
        "pace": data[columns_name.index("Pace")]
    }
    print(runner)
    runners.append(runner)
The printed result looks like this:
{'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
{'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
{'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
{'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
{'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
{'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
{'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}
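Since you mentioned having a few thousand files that end up as CSV, here is a rough sketch of how the same idea could be applied to a whole folder. It rests on assumptions I cannot verify from one sample: that every file has a table with the `racetable` class, that the first row is always the race title, and that the second row holds the column names. The function names `parse_race_table` and `html_folder_to_csv`, and the extra `source_file` column, are made up for this example.

```python
import csv
from pathlib import Path

from bs4 import BeautifulSoup


def parse_race_table(html):
    """Return one dict per runner, keyed by the table's own column headers."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="racetable")
    if table is None:  # page without a results table
        return []
    rows = table.find_all("tr")[1:]  # assumption: the first row is the race title
    if not rows:
        return []
    headers = [td.get_text(strip=True) for td in rows[0].find_all("td")]
    runners = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == len(headers):  # skip rows that do not match the header
            runners.append(dict(zip(headers, cells)))
    return runners


def html_folder_to_csv(folder, csv_path):
    """Parse every .html file in `folder`, write all runners to one CSV file,
    and return the number of rows written."""
    all_rows = []
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        for runner in parse_race_table(html):
            runner["source_file"] = path.name  # remember where the row came from
            all_rows.append(runner)
    if not all_rows:
        return 0
    # Collect column names in first-seen order, in case files differ slightly.
    fieldnames = []
    for row in all_rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(all_rows)
    return len(all_rows)
```

You would call it as something like `html_folder_to_csv("race_pages", "results.csv")`, where both names are placeholders for your own paths. Taking the keys from the header row instead of hardcoding them means a file with an extra or missing column still parses, although you should spot-check a few files before trusting the output.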