cari

Rumah  >  Soal Jawab  >  teks badan

Gunakan sup Cantik untuk mengekstrak data apabila div tidak wujud

Saya cuba mengekstrak data jadual daripada beberapa ribu fail html atau data tapak, tetapi jadual tidak mempunyai div untuk memudahkan ini, dan saya baru mengenali sup yang cantik. Sekarang saya sedang mengedit secara manual semua html yang ditukar kepada csv dan meletakkannya ke dalam pangkalan data saya untuk mencipta jadual, tetapi saya lebih suka mengambil apa yang saya sudah ada.

<
<body style="margin-top:140px;">    
<div id="container">
 <!-- Left div -->
 <div>
  &nbsp;
 </div>
 <!-- Center div -->
 <div>
  <!-- Image Link -->
  <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a>
 </div>
 <!-- Right div -->
 <div>
  &nbsp;
 </div>
</div>
<A Name = "Top"></A>
<H1>5k Run</H1>
<H1>Overall Finish List</H1>
<H2>September 24, 2022</H2>
<HR noshade>
<B><I> </I></B>
<HR noshade>
<table border=0 cellpadding=0 cellspacing=0 class="racetable">
  <tr>
    <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td>
  </tr>
  <tr>
    <td class=h11>Place</td>
    <td class=h12>Name</td>
    <td class=h12>City</td>
    <td class=h11>Bib No</td>
    <td class=h11>Age</td>
    <td class=h11>Gender</td>
    <td class=h11>Age Group</td>
    <td class=h11>Total Time</td>
    <td class=h11>Pace</td>
  </tr>
  <tr>
    <td class=d01>1</td>
    <td class=d02>Runner 1</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>390</td>
    <td class=d01>52</td>
    <td class=d01>M</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   18:43.93</td>
    <td class=d01>6:03/M</td>
  </tr>
  <tr>
    <td class=d01>2</td>
    <td class=d02>Runner 2</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>380</td>
    <td class=d01>33</td>
    <td class=d01>M</td>
    <td class=d01>1:19-39</td>
    <td class=d01>   19:31.27</td>
    <td class=d01>6:18/M</td>
  </tr>
  <tr>
    <td class=d01>3</td>
    <td class=d02>Runner 3</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>389</td>
    <td class=d01>65</td>
    <td class=d01>F</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   45:45.20</td>
    <td class=d01>14:46/M</td>
  </tr>
  <tr>
    <td class=d01>4</td>
    <td class=d02>Runner 4</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>381</td>
    <td class=d01>18</td>
    <td class=d01>F</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   53:28.84</td>
    <td class=d01>17:15/M</td>
  </tr>
  <tr>
    <td class=d01>5</td>
    <td class=d02>Runner 5</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>382</td>
    <td class=d01>41</td>
    <td class=d01>F</td>
    <td class=d01>1:40-59</td>
    <td class=d01>   53:30.48</td>
    <td class=d01>17:16/M</td>
  </tr>
  <tr>
    <td class=d01>6</td>
    <td class=d02>Runner 6</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>384</td>
    <td class=d01>14</td>
    <td class=d01>M</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   57:38.66</td>
    <td class=d01>18:36/M</td>
  </tr>
  <tr>
    <td class=d01>7</td>
    <td class=d02>Runner 7</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>385</td>
    <td class=d01>72</td>
    <td class=d01>F</td>
    <td class=d01>1:60-99</td>
    <td class=d01>   57:40.11</td>
    <td class=d01>18:36/M</td>
  </tr>
</table>
 
<HR noshade>
<p>
<!-- 0c17  22.0 2e9 -->
</BODY>
</HTML>
>

Saya telah mencuba menambah div tanpa banyak kejayaan.

P粉818306280P粉818306280341 hari yang lalu496

membalas semua(1)saya akan balas

  • P粉463291248

    P粉4632912482024-02-27 00:58:21

    BeautifulSoup membolehkan anda mencari di luar div.

    Andaikan html yang anda paparkan ingin mendapatkan semula sesuatu yang kelihatan seperti pelari, anda boleh melakukan sesuatu seperti ini.

    from bs4 import BeautifulSoup
    
    file_path = 'scrap.html'
    
    with open(file_path, 'r',
              encoding='utf-8') as file:  # We simulate a return from an html request by just opening an .html file
        html_content = file.read()
    
    soup = BeautifulSoup(html_content, 'html.parser')
    table = soup.find('table', {"class": "racetable"})  # We are looking for the table with the 'racetable' class
    rows_table = table.find_all('tr')[1:]  # All lines in the table without the first one
    
    columns_name = [
        row.get_text() for row in rows_table[0].find_all('td')
    ]  # We get the name of each column in a list
    
    runners = []
    for row in rows_table[1:]:  # We repeat on all the lines except the first one which is the one with the name of the columns
        data = [
            elem.get_text().strip() for elem in row.find_all('td')
        ]
        runner = {
            "place": data[columns_name.index("Place")],
            "name": data[columns_name.index("Name")],
            "city": data[columns_name.index("City")],
            "bib_no": data[columns_name.index("Bib No")],
            "age": data[columns_name.index("Age")],
            "gender": data[columns_name.index("Gender")],
            "age_group": data[columns_name.index("Age Group")],
            "total_time": data[columns_name.index("Total Time")],
            "pace": data[columns_name.index("Pace")]
        }
        print(runner)
        runners.append(runner)
    

    Hasil cetakan kelihatan seperti ini

    {'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN  PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
    {'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN  PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
    {'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN  PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
    {'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN  PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
    {'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN  PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
    {'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN  PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
    {'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN  PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}

    balas
    0
  • Batalbalas