P粉662614213 2023-08-02 10:42:21
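Converting an HTML file with nested tables to CSV while preserving the structure can be a bit tricky. BeautifulSoup is a great library for parsing HTML, but it may need some extra handling to process nested tables correctly.
To get the desired output, you can use BeautifulSoup together with some custom Python code to parse the HTML, extract the data, and organize it correctly in CSV format. Here is a step-by-step approach to help you achieve this:
Parse the HTML file using BeautifulSoup.
Here is a Python code snippet to help you get started: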
from bs4 import BeautifulSoup
import csv


def extract_nested_table_data(table_cell):
    # Helper function to extract the data from a nested table cell
    nested_table = table_cell.find('table')
    if not nested_table:
        return ''

    # Process the nested table and extract its data as plain text
    nested_data = []
    for row in nested_table.find_all('tr'):
        nested_cells = row.find_all(['td', 'th'])
        nested_data.append([cell.get_text(strip=True) for cell in nested_cells])

    # Detach the nested table so the cell's own text is not counted twice
    nested_table.extract()

    # Convert nested_data to a formatted plain text representation
    return '\n'.join(','.join(row) for row in nested_data)


def convert_html_to_csv(html_filename, csv_filename):
    with open(html_filename, 'r', encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')

    parent_table = soup.find('table')
    # Keep only the rows that belong to the parent table, not to nested tables
    rows = [row for row in parent_table.find_all('tr')
            if row.find_parent('table') is parent_table]
    # The first parent row is treated as the header row
    headers = [cell.get_text(strip=True)
               for cell in rows[0].find_all(['td', 'th'], recursive=False)]

    with open(csv_filename, 'w', newline='', encoding='utf-8') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(headers)

        for row in rows[1:]:  # Skipping the header row
            row_data = []
            for cell in row.find_all(['td', 'th'], recursive=False):
                # Extract data from the nested table (if it exists) first,
                # then append it to the cell's own text
                nested_data = extract_nested_table_data(cell)
                row_data.append(cell.get_text(strip=True) + nested_data)
            csv_writer.writerow(row_data)


if __name__ == '__main__':
    html_filename = 'input.html'
    csv_filename = 'output.csv'
    convert_html_to_csv(html_filename, csv_filename)
This code joins the cells of each nested-table row with commas. If that does not suit your data, adjust the separator accordingly; in particular, consider another delimiter if the nested table data itself contains commas. Complex HTML structures may require further adjustments to this code, depending on the specifics of your data. Nonetheless, this should serve as a good starting point for tackling the task.
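To illustrate the point about delimiters, here is a minimal sketch using only the standard library. The helpers `format_nested_rows` and `write_rows`, the separators `' | '` and `' ; '`, and the sample data are all assumptions made for illustration; they are not part of the snippet above. It shows that a nested value containing a comma still lands in a single CSV column, because `csv.writer` quotes such fields.

```python
import csv
import io


def format_nested_rows(nested_rows, cell_sep=' | ', row_sep=' ; '):
    # Hypothetical helper: flatten nested-table rows into one string,
    # using separators that are unlikely to appear in the data itself.
    return row_sep.join(cell_sep.join(cells) for cells in nested_rows)


def write_rows(rows):
    # Write rows to an in-memory CSV and return the resulting text.
    buffer = io.StringIO(newline='')
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Made-up sample: one nested cell contains a comma ("Large, red").
nested = [['Size', 'Qty'], ['Large, red', '3']]
csv_text = write_rows([
    ['Item', 'Details'],
    ['Shirt', format_nested_rows(nested)],
])

# Reading the CSV back yields the nested value as a single field.
parsed = list(csv.reader(io.StringIO(csv_text, newline='')))
print(parsed[1])
```

Keeping the nested separators distinct from the CSV delimiter makes it easier to split the nested value apart again later, if you ever need to.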