How does Python process Excel files?

WBOY (reposted)
2023-05-08 17:58:15

『Problem Description』

The Excel file to process this time has two sheets, and the values in one sheet have to be calculated from the data in the other. The catch is that the sheet to be calculated contains not only plain values but also formulas. Let's take a look:

[Image: screenshot of the workbook with its two sheets, CP and DS; the formula cells are outlined in blue]

As shown in the picture above, the workbook has two sheets, CP and DS. Following certain business rules, we have to calculate the values of the corresponding cells in DS from the data in CP. The cells in the blue boxes contain formulas; the other areas contain plain values.

What goes wrong if we follow the approach described earlier, reading the whole sheet into a DataFrame in one batch and then writing it back in one batch? That version of the code looks like this:

import pandas as pd
import xlwings as xw

# Path of the file to process
fpath = "data/DS_format.xlsm"

# Read the CP and DS sheets into pandas DataFrames
cp_df = pd.read_excel(fpath, sheet_name="CP", header=[0])
ds_df = pd.read_excel(fpath, sheet_name="DS", header=[0, 1])

# Calculation omitted...

# Save the result back to Excel
app = xw.App(visible=False, add_book=False)
ds_format_workbook = app.books.open(fpath)
ds_worksheet = ds_format_workbook.sheets["DS"]
ds_worksheet.range("A1").expand().options(index=False).value = ds_df
ds_format_workbook.save()
ds_format_workbook.close()
app.quit()

The problem with the code above is that pd.read_excel() does not read formulas. For a cell that contains a formula, it reads the formula's calculated result (or NaN if there is no result). When we then write the whole DataFrame back in one batch, every cell that used to hold a formula is overwritten with that calculated value or NaN, and the formula is lost.
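To see this behaviour in isolation, here is a minimal sketch. It assumes openpyxl is installed and uses a throwaway file, demo.xlsx, with made-up column names (a, b, total) that have nothing to do with the workbook in this article. Because the file is written by openpyxl and never opened in Excel, the formula cell has no stored result, so pandas should read it as NaN:

```python
import os
import tempfile

import openpyxl
import pandas as pd

# Build a tiny workbook with one formula cell (all names here are made up)
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "DS"
ws.append(["a", "b", "total"])
ws.append([1, 2, "=A2+B2"])

path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
wb.save(path)

# pandas reads stored results, never the formula text
df = pd.read_excel(path, sheet_name="DS")
print(df)
```

Writing df back over the sheet would replace "=A2+B2" with whatever ended up in the total column, which is the loss described above.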

So we have a problem. How do we solve it? Two ideas come to mind:

  • When writing the DataFrame back to Excel, don't write it in one batch. Instead, iterate over rows and columns and write only the calculated cells, leaving the formula cells untouched;

  • When reading Excel, find a way to read the formula itself, rather than its calculated result, for cells that contain formulas.

I tried both ideas. Let's go through them in turn.

『Option 1』

The following code iterates over the DataFrame and writes values cell by cell, so the cells with formulas are never touched:

import datetime

# Write to Excel based on ds_df, touching only the cells that should change
for row_idx, row in ds_df.iterrows():
    total_capabity_val = row[('Total', 'Capabity')].strip()
    total_capabity1_val = row[('Total', 'Capabity.1')].strip()
    # Skip the rows labelled Total and 1Gb  Eqv.
    if total_capabity_val != 'Total' and total_capabity_val != '1Gb  Eqv.':
        # Set the values for Delta and LOI
        if total_capabity1_val == 'LOI' or total_capabity1_val == 'Delta':
            ds_worksheet.range((row_idx + 3, 3)).value = row[('Current week', 'BOH')]
            print(f"DS sheet: row {row_idx + 3}, column 3 set to {row[('Current week', 'BOH')]}")
        # Set the values for Demand and Supply
        if total_capabity1_val == 'Demand' or total_capabity1_val == 'Supply':
            cp_datetime_columns = cp_df.columns[53:]
            for col_idx in range(4, len(ds_df.columns)):
                ds_datetime = ds_df.columns.get_level_values(1)[col_idx]
                ds_month = ds_df.columns.get_level_values(0)[col_idx]
                if type(ds_datetime) == str and ds_datetime != 'TTL' and ds_datetime != 'Total' and (ds_datetime in cp_datetime_columns):
                    ds_worksheet.range((row_idx + 3, col_idx + 1)).value = row[(f'{ds_month}', f'{ds_datetime}')]
                    print(f"DS sheet: row {row_idx + 3}, column {col_idx + 1} set to {row[(f'{ds_month}', f'{ds_datetime}')]}")
                elif type(ds_datetime) == datetime.datetime and (ds_datetime in cp_datetime_columns):
                    ds_worksheet.range((row_idx + 3, col_idx + 1)).value = row[(f'{ds_month}', ds_datetime)]
                    print(f"DS sheet: row {row_idx + 3}, column {col_idx + 1} set to {row[(f'{ds_month}', ds_datetime)]}")

The code above does solve the problem: the formulas in the formula cells are preserved. However, as the advice on processing Excel with Python at the beginning of this article warns, it has a serious performance problem. It calls the Excel API once per cell, which makes writing very slow. On my old Mac it took 40 minutes to run, which was simply unacceptable, so this option had to be abandoned.

『Option 2』

This option tries to keep the formula itself when reading a cell that contains one. The only way to find out whether that is possible is to go through the API of each Python Excel library. I looked carefully at pandas' read_excel() and found no parameter for it. openpyxl, however, does have an API that supports it:

import openpyxl

# data_only=False keeps formulas as text instead of their calculated results
ds_format_workbook = openpyxl.load_workbook(fpath, data_only=False)
ds_worksheet = ds_format_workbook['DS']
ds_df = pd.DataFrame(ds_worksheet.values)

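The key is the data_only parameter. With data_only=True, openpyxl returns the value of each cell (for a formula cell, the result stored in the file the last time Excel calculated it). With data_only=False, it returns the formula itself.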
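The difference between the two settings can be checked with a small self-contained sketch. The workbook here is a made-up one-row file written by openpyxl into a temporary directory, not the DS_format.xlsm from this article. Since Excel never calculated it, no result is stored in the file, so data_only=True should yield None for the formula cell:

```python
import os
import tempfile

import openpyxl

# Build a throwaway workbook with a single formula in C1
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "DS"
ws.append([1, 2, "=A1+B1"])
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
wb.save(path)

# Same file, loaded twice with the two data_only settings
with_formulas = openpyxl.load_workbook(path, data_only=False)["DS"]["C1"].value
stored_values = openpyxl.load_workbook(path, data_only=True)["DS"]["C1"].value

print(with_formulas)  # the formula text
print(stored_values)  # the stored result, if the file has one
```

For a file that has been saved by Excel, the second read would normally return the last calculated number instead of None.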

I was overjoyed, thinking I had found the solution, but the structure of the DataFrame built from openpyxl's output came as a shock. The header of my table is a fairly complex two-level header with merged and split cells. When such a header is read through openpyxl into a DataFrame, it is not turned into a pandas multi-level column index; the columns simply get a plain numeric index 0, 1, 2, 3...

My calculations on the DataFrame rely on the multi-level index, so with openpyxl's output the subsequent calculations cannot proceed.
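The header problem can be reproduced without any Excel file. The sketch below uses made-up rows in the shape that a worksheet's values would take (two header rows, then data) and also shows one possible workaround: filling the merged header cells forward and building a MultiIndex by hand. This workaround is only an illustration under those assumptions; it is not the route taken in this article, and the row contents are hypothetical:

```python
import pandas as pd

# Made-up rows: two header rows, then one data row.
# A merged header cell keeps its text only in its first cell; the rest are None.
rows = [
    ("Total", None, "Current week", "Jan"),
    ("Capabity", "Capabity.1", "BOH", "W1"),
    ("8Gb", "Demand", 10, "=SUM(C3:C3)"),
]

# The problem: header rows become ordinary data and the columns are 0, 1, 2, 3
raw = pd.DataFrame(rows)

# Possible workaround: fill merged cells forward, then build a two-level index
top_level = pd.Series(rows[0]).ffill()
columns = pd.MultiIndex.from_arrays([top_level, list(rows[1])])
ds_like_df = pd.DataFrame(rows[2:], columns=columns)

print(raw.columns.tolist())
print(ds_like_df.columns.tolist())
```

Whether this reproduces exactly the columns that pd.read_excel(..., header=[0, 1]) builds for a real workbook with merged and split header cells is untested, which is one reason to look for a simpler route.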

openpyxl doesn't work, so what about xlwings? After searching through the xlwings API documentation, I did find something, as shown below:

[Image: screenshot of the xlwings API documentation for the Range.formula property]

The Range class provides a property called formula, which gets or sets the formula of a range.

Seeing this felt like striking gold, so I quickly started writing code. Perhaps out of habit, or perhaps because I had been scared off by how slow it was to operate on Excel row by row, column by column, and cell by cell, my first thought was to do everything in batches: read all the formulas in the sheet at once, and later write them all back at once. So my initial code looked like this:

# Use xlwings to read the formulas
app = xw.App(visible=False, add_book=False)
ds_format_workbook = app.books.open(fpath)
ds_worksheet = ds_format_workbook.sheets["DS"]
# First read and keep all the formulas in one go
formulas = ds_worksheet.used_range.formula

# Calculation omitted...

# Write all the formulas back in one go
ds_worksheet.used_range.formula = formulas

But I was wrong. I had assumed that ds_worksheet.used_range.formula would return only the cells that contain formulas. In fact it returns every cell in the range; it just keeps the formula for the cells that have one. So when I write the formulas back, the other cells are written back too, overwriting the values I had calculated in the DataFrame and written to Excel.
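The effect can be pictured with plain Python sequences, without Excel. In the sketch below, saved stands in for what a .formula read of a small range might look like (that constants come back as strings is an assumption), and merge_formulas is a hypothetical helper that keeps a saved cell only when it really is a formula. This is just an illustration of the idea of separating formula cells from the rest:

```python
def merge_formulas(saved_formulas, new_values):
    """Combine two same-shaped 2D grids cell by cell.

    A cell from saved_formulas wins only if it is a formula (a string that
    starts with "="); otherwise the cell from new_values is used.
    """
    merged = []
    for formula_row, value_row in zip(saved_formulas, new_values):
        merged.append([
            f if isinstance(f, str) and f.startswith("=") else v
            for f, v in zip(formula_row, value_row)
        ])
    return merged


# Stand-in for a .formula read of a 2x3 range: the old constants come back
# alongside the formulas, which is why writing it back undoes new values.
saved = (("10", "20", "=A1+B1"),
         ("30", "40", "=A2+B2"))

# Values calculated afterwards; the last column is only a placeholder
computed = [[11, 21, None],
            [31, 41, None]]

merged = merge_formulas(saved, computed)
print(merged)
```

Writing saved back unchanged would restore "10", "20", "30" and "40"; the merged grid keeps the newly calculated numbers and only the two formulas.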

So the formula cells have to be handled separately rather than all at once, and the code ends up like this:

# Use xlwings to read the formulas
app = xw.App(visible=False, add_book=False)
ds_format_workbook = app.books.open(fpath)
ds_worksheet = ds_format_workbook.sheets["DS"]

# Keep the formulas that are in the Excel sheet
# Find the Total row in DS; from the Total row onwards, every row holds formulas
row = ds_df.loc[ds_df[('Total','Capabity')]=='Total ']
total_row_index = row.index.values[0]
# Convert to an Excel row number (the DataFrame treats the two header rows as the
# column index and counts data rows from 0; Excel counts from the header, starting at 1).
# DataFrame row i sits in Excel row i + 3, so + 2 gives the row just above the Total row.
excel_total_row_idx = int(total_row_index+2)
# Index of the last row in the Excel sheet
excel_last_row_idx = ds_worksheet.used_range.rows.count
# Keep the formulas of the columns that are calculated per date
I_col_formula = ds_worksheet.range(f'I3:I{excel_total_row_idx}').formula
N_col_formula = ds_worksheet.range(f'N3:N{excel_total_row_idx}').formula
T_col_formula = ds_worksheet.range(f'T3:T{excel_total_row_idx}').formula
U_col_formula = ds_worksheet.range(f'U3:U{excel_total_row_idx}').formula
Z_col_formula = ds_worksheet.range(f'Z3:Z{excel_total_row_idx}').formula
AE_col_formula = ds_worksheet.range(f'AE3:AE{excel_total_row_idx}').formula
AK_col_formula = ds_worksheet.range(f'AK3:AK{excel_total_row_idx}').formula
AL_col_formula = ds_worksheet.range(f'AL3:AL{excel_total_row_idx}').formula
# Keep the formulas of every row from the Total row to the end
total_to_last_formula = ds_worksheet.range(f'A{excel_total_row_idx+1}:AL{excel_last_row_idx}').formula

# Calculation omitted...

# Save the result back to Excel
# Assigning ds_df to the sheet as a whole overwrites the existing formulas with values
ds_worksheet.range("A1").expand().options(index=False).value = ds_df
# Restore the formulas from the copies kept earlier
ds_worksheet.range(f'I3:I{excel_total_row_idx}').formula = I_col_formula
ds_worksheet.range(f'N3:N{excel_total_row_idx}').formula = N_col_formula
ds_worksheet.range(f'T3:T{excel_total_row_idx}').formula = T_col_formula
ds_worksheet.range(f'U3:U{excel_total_row_idx}').formula = U_col_formula
ds_worksheet.range(f'Z3:Z{excel_total_row_idx}').formula = Z_col_formula
ds_worksheet.range(f'AE3:AE{excel_total_row_idx}').formula = AE_col_formula
ds_worksheet.range(f'AK3:AK{excel_total_row_idx}').formula = AK_col_formula
ds_worksheet.range(f'AL3:AL{excel_total_row_idx}').formula = AL_col_formula
ds_worksheet.range(f'A{excel_total_row_idx+1}:AL{excel_last_row_idx}').formula = total_to_last_formula

ds_format_workbook.save()
ds_format_workbook.close()
app.quit()

After testing, the code above solves the problem perfectly: it meets my needs, and the performance is no problem at all.
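The save-then-restore pattern repeats the same line for every column, so it could be folded into a few small helpers. The sketch below is one way to do that, not the code that was actually tested above. The function names are hypothetical, and the only assumption about sheet is that sheet.range(address) returns an object with a readable and writable formula attribute, as an xlwings Sheet does:

```python
def formula_addresses(columns, first_row, last_row):
    """Build 'I3:I20'-style addresses for a list of column letters."""
    return [f"{col}{first_row}:{col}{last_row}" for col in columns]


def save_formulas(sheet, addresses):
    """Read the formulas of each address, one call per range."""
    return {addr: sheet.range(addr).formula for addr in addresses}


def restore_formulas(sheet, saved):
    """Write the saved formulas back, one call per range."""
    for addr, formula in saved.items():
        sheet.range(addr).formula = formula


# Hypothetical usage with the names from the code above:
# addresses = formula_addresses(
#     ["I", "N", "T", "U", "Z", "AE", "AK", "AL"], 3, excel_total_row_idx)
# addresses.append(f"A{excel_total_row_idx + 1}:AL{excel_last_row_idx}")
# saved = save_formulas(ds_worksheet, addresses)
# ds_worksheet.range("A1").expand().options(index=False).value = ds_df
# restore_formulas(ds_worksheet, saved)
```

The number of calls into Excel stays the same as in the hand-written version (one read and one write per range), so the performance should be comparable; only the repetition goes away.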

Statement:
This article is reproduced from yisu.com. If there is any infringement, please contact admin@php.cn to have it removed.