Using Beautiful Soup to extract data when there is no div

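I'm trying to extract table data from a few thousand HTML files (or website data), but the tables aren't wrapped in divs that would make this easy, and I'm still new to Beautiful Soup. Right now I'm manually converting all the HTML to CSV, editing it, and loading it into my database to build the tables, but I'd rather just scrape what I already have.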

<body style="margin-top:140px;">    
<div id="container">
 <!-- Left div -->
 <div>
  &nbsp;
 </div>
 <!-- Center div -->
 <div>
  <!-- Image Link -->
  <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a>
 </div>
 <!-- Right div -->
 <div>
  &nbsp;
 </div>
</div>
<A Name = "Top"></A>
<H1>5k Run</H1>
<H1>Overall Finish List</H1>
<H2>September 24, 2022</H2>
<HR noshade>
<B><I> </I></B>
<HR noshade>
<table border=0 cellpadding=0 cellspacing=0 class="racetable">
  <tr>
    <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td>
  </tr>
  <tr>
    <td class=h11>Place</td>
    <td class=h12>Name</td>
    <td class=h12>City</td>
    <td class=h11>Bib No</td>
    <td class=h11>Age</td>
    <td class=h11>Gender</td>
    <td class=h11>Age Group</td>
    <td class=h11>Total Time</td>
    <td class=h11>Pace</td>
  </tr>
  <tr>
    <td class=d01>1</td>
    <td class=d02>Runner 1</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>390</td>
    <td class=d01>52</td>
    <td class=d01>M</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   18:43.93</td>
    <td class=d01>6:03/M</td>
  </tr>
  <tr>
    <td class=d01>2</td>
    <td class=d02>Runner 2</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>380</td>
    <td class=d01>33</td>
    <td class=d01>M</td>
    <td class=d01>1:19-39</td>
    <td class=d01>   19:31.27</td>
    <td class=d01>6:18/M</td>
  </tr>
  <tr>
    <td class=d01>3</td>
    <td class=d02>Runner 3</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>389</td>
    <td class=d01>65</td>
    <td class=d01>F</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   45:45.20</td>
    <td class=d01>14:46/M</td>
  </tr>
  <tr>
    <td class=d01>4</td>
    <td class=d02>Runner 4</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>381</td>
    <td class=d01>18</td>
    <td class=d01>F</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   53:28.84</td>
    <td class=d01>17:15/M</td>
  </tr>
  <tr>
    <td class=d01>5</td>
    <td class=d02>Runner 5</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>382</td>
    <td class=d01>41</td>
    <td class=d01>F</td>
    <td class=d01>1:40-59</td>
    <td class=d01>   53:30.48</td>
    <td class=d01>17:16/M</td>
  </tr>
  <tr>
    <td class=d01>6</td>
    <td class=d02>Runner 6</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>384</td>
    <td class=d01>14</td>
    <td class=d01>M</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   57:38.66</td>
    <td class=d01>18:36/M</td>
  </tr>
  <tr>
    <td class=d01>7</td>
    <td class=d02>Runner 7</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>385</td>
    <td class=d01>72</td>
    <td class=d01>F</td>
    <td class=d01>1:60-99</td>
    <td class=d01>   57:40.11</td>
    <td class=d01>18:36/M</td>
  </tr>
</table>
 
<HR noshade>
<p>
<!-- 0c17  22.0 2e9 -->
</BODY>
</HTML>

I tried adding divs, but without much success.

P粉818306280, 330 days ago

All replies (1)

  • P粉463291248

    2024-02-27 00:58:21

    BeautifulSoup lets you search for elements other than div.
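
As a small illustration of that point, here is a sketch of three ways to locate elements without relying on a div: by tag name, by tag name plus class, and by CSS selector. The HTML string is a trimmed-down, hypothetical copy of the table in the question, not the full page.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down copy of the results table from the question
html = """
<H1>5k Run</H1>
<table class="racetable">
  <tr><td class=h11>Place</td><td class=h12>Name</td></tr>
  <tr><td class=d01>1</td><td class=d02>Runner 1</td></tr>
  <tr><td class=d01>2</td><td class=d02>Runner 2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Search by tag name only (html.parser lowercases tag names)
title = soup.find("h1").get_text()

# Search by tag name plus class (class_ avoids the Python keyword)
table = soup.find("table", class_="racetable")

# Search with a CSS selector: every name cell inside the table
names = [td.get_text() for td in soup.select("table.racetable td.d02")]

print(title)
print(len(table.find_all("tr")))
print(names)
```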

    Assuming you want to retrieve the runner rows from the HTML you posted, you can do something like this.

    from bs4 import BeautifulSoup
    
    file_path = 'scrap.html'
    
    # Simulate the response of an HTTP request by reading a local .html file
    with open(file_path, 'r', encoding='utf-8') as file:
        html_content = file.read()
    
    soup = BeautifulSoup(html_content, 'html.parser')
    table = soup.find('table', {"class": "racetable"})  # The table with the 'racetable' class
    rows_table = table.find_all('tr')[1:]  # All rows except the first one (the race title)
    
    # Column names from the header row, in order
    columns_name = [
        cell.get_text().strip() for cell in rows_table[0].find_all('td')
    ]
    
    runners = []
    for row in rows_table[1:]:  # Skip the header row that holds the column names
        # Collapse runs of whitespace so 'ANYTOWN  PA' becomes 'ANYTOWN PA'
        data = [
            " ".join(elem.get_text().split()) for elem in row.find_all('td')
        ]
        runner = {
            "place": data[columns_name.index("Place")],
            "name": data[columns_name.index("Name")],
            "city": data[columns_name.index("City")],
            "bib_no": data[columns_name.index("Bib No")],
            "age": data[columns_name.index("Age")],
            "gender": data[columns_name.index("Gender")],
            "age_group": data[columns_name.index("Age Group")],
            "total_time": data[columns_name.index("Total Time")],
            "pace": data[columns_name.index("Pace")]
        }
        print(runner)
        runners.append(runner)
    

    The printed result looks like this:

    {'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
    {'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
    {'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
    {'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
    {'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
    {'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
    {'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}
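
Since the question mentions several thousand files and a CSV step, the same idea could be wrapped in a function and run over a folder. The following is only a sketch under assumptions: `parse_race_table`, `write_csv`, and `convert_folder` are hypothetical helper names, every file is assumed to contain a `racetable` table with the same columns, and the folder layout is made up. It builds each row's dict by zipping the header cells with the data cells instead of looking up each column by name.

```python
import csv
import io
from pathlib import Path

from bs4 import BeautifulSoup


def parse_race_table(html):
    """Return one dict per runner row, keyed by the header row's cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="racetable")
    if table is None:  # page without a results table
        return []
    rows = [
        [" ".join(td.get_text().split()) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]
    # Drop single-cell rows such as the race title; the next row is the header
    rows = [cells for cells in rows if len(cells) > 1]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, cells)) for cells in body]


def write_csv(runners, out):
    """Write runner dicts to an open text file, one CSV row per runner."""
    if not runners:
        return
    writer = csv.DictWriter(out, fieldnames=list(runners[0]))
    writer.writeheader()
    writer.writerows(runners)


def convert_folder(folder, csv_path):
    """Parse every .html file in `folder` (hypothetical layout) into one CSV."""
    runners = []
    for path in sorted(Path(folder).glob("*.html")):
        runners.extend(parse_race_table(path.read_text(encoding="utf-8")))
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        write_csv(runners, out)


# Small in-memory demo with a trimmed-down copy of the question's table
sample = """
<table class="racetable">
  <tr><td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td></tr>
  <tr><td class=h11>Place</td><td class=h12>Name</td><td class=h12>City</td></tr>
  <tr><td class=d01>1</td><td class=d02>Runner 1</td><td class=d02>ANYTOWN  PA</td></tr>
</table>
"""
runners = parse_race_table(sample)
buffer = io.StringIO()
write_csv(runners, buffer)
print(buffer.getvalue())
```

Calling `convert_folder("results", "runners.csv")` (both paths hypothetical) would then write one CSV for all files. With its default settings, `csv.DictWriter` raises `ValueError` if a later file has a column the first one lacks, so mixed layouts would need separate handling.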
