Using Beautiful Soup to extract table data when there is no wrapping div

I'm trying to extract table data from a few thousand HTML files (saved site data), but the tables aren't wrapped in divs that would make them easy to target, and I'm new to Beautiful Soup. Right now I'm manually converting the HTML to CSV, editing it, and loading it into my database to create the tables, but I'd rather pull the data directly from the files I already have.

<body style="margin-top:140px;">    
<div id="container">
 <!-- Left div -->
 <div>
  &nbsp;
 </div>
 <!-- Center div -->
 <div>
  <!-- Image Link -->
  <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a>
 </div>
 <!-- Right div -->
 <div>
  &nbsp;
 </div>
</div>
<A Name = "Top"></A>
<H1>5k Run</H1>
<H1>Overall Finish List</H1>
<H2>September 24, 2022</H2>
<HR noshade>
<B><I> </I></B>
<HR noshade>
<table border=0 cellpadding=0 cellspacing=0 class="racetable">
  <tr>
    <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td>
  </tr>
  <tr>
    <td class=h11>Place</td>
    <td class=h12>Name</td>
    <td class=h12>City</td>
    <td class=h11>Bib No</td>
    <td class=h11>Age</td>
    <td class=h11>Gender</td>
    <td class=h11>Age Group</td>
    <td class=h11>Total Time</td>
    <td class=h11>Pace</td>
  </tr>
  <tr>
    <td class=d01>1</td>
    <td class=d02>Runner 1</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>390</td>
    <td class=d01>52</td>
    <td class=d01>M</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   18:43.93</td>
    <td class=d01>6:03/M</td>
  </tr>
  <tr>
    <td class=d01>2</td>
    <td class=d02>Runner 2</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>380</td>
    <td class=d01>33</td>
    <td class=d01>M</td>
    <td class=d01>1:19-39</td>
    <td class=d01>   19:31.27</td>
    <td class=d01>6:18/M</td>
  </tr>
  <tr>
    <td class=d01>3</td>
    <td class=d02>Runner 3</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>389</td>
    <td class=d01>65</td>
    <td class=d01>F</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   45:45.20</td>
    <td class=d01>14:46/M</td>
  </tr>
  <tr>
    <td class=d01>4</td>
    <td class=d02>Runner 4</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>381</td>
    <td class=d01>18</td>
    <td class=d01>F</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   53:28.84</td>
    <td class=d01>17:15/M</td>
  </tr>
  <tr>
    <td class=d01>5</td>
    <td class=d02>Runner 5</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>382</td>
    <td class=d01>41</td>
    <td class=d01>F</td>
    <td class=d01>1:40-59</td>
    <td class=d01>   53:30.48</td>
    <td class=d01>17:16/M</td>
  </tr>
  <tr>
    <td class=d01>6</td>
    <td class=d02>Runner 6</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>384</td>
    <td class=d01>14</td>
    <td class=d01>M</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   57:38.66</td>
    <td class=d01>18:36/M</td>
  </tr>
  <tr>
    <td class=d01>7</td>
    <td class=d02>Runner 7</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>385</td>
    <td class=d01>72</td>
    <td class=d01>F</td>
    <td class=d01>1:60-99</td>
    <td class=d01>   57:40.11</td>
    <td class=d01>18:36/M</td>
  </tr>
</table>
 
<HR noshade>
<p>
<!-- 0c17  22.0 2e9 -->
</BODY>
</HTML>

I've tried adding divs without much success.

P粉818306280, 277 days ago

Replies (1)

    P粉463291248, 2024-02-27 00:58:21

    BeautifulSoup does not need a div to work with: it can find any element by tag name, class, or other attributes, so you can target the table directly.
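As a small illustration of that point, the sketch below finds elements by tag name, by class, and by CSS selector without referring to any div. The HTML string is a trimmed stand-in for the page in the question, assumed here only to keep the example short.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page in the question (assumed sample, not the full file)
html = """
<H1>5k Run</H1>
<H2>September 24, 2022</H2>
<table class="racetable">
  <tr><td class=h11>Place</td><td class=h12>Name</td></tr>
  <tr><td class=d01>1</td><td class=d02>Runner 1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Look up by tag name alone (html.parser lowercases tag names)
race_date = soup.find("h2").get_text(strip=True)

# Look up by tag name plus class
table = soup.find("table", class_="racetable")

# Look up with a CSS selector: every cell with class d02 inside the table
names = [td.get_text(strip=True) for td in soup.select("table.racetable td.d02")]

print(race_date)
print(names)
```

Any of these lookups can serve as the starting point; the class on the table itself is usually the most stable hook when the page has no meaningful wrapper elements.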

    Assuming you want one record per runner from the HTML you posted, you could do something like this:

    from bs4 import BeautifulSoup
    
    file_path = 'scrap.html'
    
    # Simulate the response of an HTTP request by reading a saved .html file
    with open(file_path, 'r', encoding='utf-8') as file:
        html_content = file.read()
    
    soup = BeautifulSoup(html_content, 'html.parser')
    table = soup.find('table', {"class": "racetable"})  # Find the table with the 'racetable' class
    rows_table = table.find_all('tr')[1:]  # Skip the first row, which only holds the race title
    
    # The first remaining row is the header row: collect the column names from it
    columns_name = [
        cell.get_text().strip() for cell in rows_table[0].find_all('td')
    ]
    
    runners = []
    for row in rows_table[1:]:  # Every row after the header row is one runner
        # split() + join() trims each cell and collapses inner runs of spaces ("ANYTOWN  PA" -> "ANYTOWN PA")
        data = [
            ' '.join(elem.get_text().split()) for elem in row.find_all('td')
        ]
        runner = {
            "place": data[columns_name.index("Place")],
            "name": data[columns_name.index("Name")],
            "city": data[columns_name.index("City")],
            "bib_no": data[columns_name.index("Bib No")],
            "age": data[columns_name.index("Age")],
            "gender": data[columns_name.index("Gender")],
            "age_group": data[columns_name.index("Age Group")],
            "total_time": data[columns_name.index("Total Time")],
            "pace": data[columns_name.index("Pace")]
        }
        print(runner)
        runners.append(runner)
    

    The printed result looks like this:

    {'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
    {'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
    {'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
    {'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
    {'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
    {'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
    {'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}
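
Since the question involves a few thousand files headed for a database, the same parsing could be wrapped in a function and the rows written straight to CSV. This is only a minimal sketch under stated assumptions: the names `parse_race_table` and `export_folder` are hypothetical, and it assumes every file contains a `racetable` table laid out like the sample (a title row, then a header row, then data rows). Keys are taken from the header row instead of being hard-coded, so a file with extra or missing columns still parses.

```python
import csv
from pathlib import Path

from bs4 import BeautifulSoup


def parse_race_table(html_content):
    """Return one dict per runner from the table with class 'racetable'."""
    soup = BeautifulSoup(html_content, "html.parser")
    table = soup.find("table", class_="racetable")
    if table is None:  # No results table in this file: nothing to extract
        return []
    rows = table.find_all("tr")[1:]  # Skip the title row
    if not rows:
        return []
    # The header row supplies the keys, e.g. "Bib No", "Total Time"
    header = [" ".join(td.get_text().split()) for td in rows[0].find_all("td")]
    runners = []
    for row in rows[1:]:
        # split() + join() trims each cell and collapses inner runs of spaces
        cells = [" ".join(td.get_text().split()) for td in row.find_all("td")]
        runners.append(dict(zip(header, cells)))
    return runners


def export_folder(folder, csv_path):
    """Parse every .html file in `folder`, write all runners to one CSV,
    and return the number of rows written."""
    all_rows = []
    for path in sorted(Path(folder).glob("*.html")):
        html_content = path.read_text(encoding="utf-8")
        for runner in parse_race_table(html_content):
            runner["source_file"] = path.name  # Record which file the row came from
            all_rows.append(runner)
    if not all_rows:
        return 0
    # Ordered union of keys, so files with differing columns still fit one CSV
    fieldnames = []
    for row in all_rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(all_rows)
    return len(all_rows)
```

With the files in a folder, a call such as `export_folder("results_html", "runners.csv")` (both names are placeholders) would produce a single CSV that most databases can import directly. The added `source_file` column is a design choice to keep races distinguishable once they are merged.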
