Five simple and effective Python scripts for cleaning your data
In machine learning, we should talk less about "data cleaning" and more about "data preparation". The first script converts a PDF to CSV, which saves me a lot of time when I need to scrape data from white papers, e-books, or other PDF documents.
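    import pandas as pd
    import tabula

    # Get the file
    pdf_filename = input("Enter the full path and filename: ")

    # Extract the tables from the PDF
    # (recent tabula-py versions return a list of DataFrames, one per table)
    frames = tabula.read_pdf(pdf_filename, encoding='utf-8', pages='all')

    # Create a CSV file from the extracted tables
    pd.concat(frames, ignore_index=True).to_csv('pdf_conversion.csv', index=False)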
This is a relatively simple method to quickly extract data, which can be imported into tools such as machine learning databases, Tableau or Count.
Many systems offer an export-to-CSV option but no way to merge the data before exporting it. This can leave you with more than five files in a folder, all containing the same type of data. This Python script solves the problem by taking these files and merging them into one.
    from time import strftime
    import glob

    import pandas as pd

    # Define the path to the folder containing the CSV files
    path = input('Please enter the full folder path: ')

    # Make sure the path ends with a slash
    if not path.endswith("/"):
        path = path + "/"

    # Get the CSV files as a list
    csv_files = glob.glob(path + '*.csv')

    # Open each CSV file and merge them into one
    merged_file = pd.concat([pd.read_csv(c) for c in csv_files])

    # Create the new file (no colons in the timestamp, since Windows does not allow them in filenames)
    merged_file.to_csv(path + 'merged_{}.csv'.format(strftime("%m-%d-%yT%H-%M-%S")), index=False)
    print('Merge complete.')
The final output is a single CSV file containing all the data from the CSV files you exported from the source system.
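As a rough sketch of the same merge, the folder path can be handled with `pathlib`, which avoids the manual trailing-slash check. The function name `merge_csv_folder` and the sample file names are hypothetical, chosen only for illustration, and the demo writes two small made-up CSVs to a temporary folder instead of reading real exports.

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_csv_folder(folder):
    """Concatenate every CSV in `folder` into a single DataFrame.

    Hypothetical helper for illustration; not part of pandas.
    """
    # sorted() gives a deterministic file order; glob order is otherwise arbitrary
    csv_files = sorted(Path(folder).glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {folder}")
    # ignore_index=True renumbers the rows instead of repeating each file's index
    return pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)


# Demo with two made-up export files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]}).to_csv(
        Path(tmp) / "export_1.csv", index=False
    )
    pd.DataFrame({"id": [3], "amount": [30.0]}).to_csv(
        Path(tmp) / "export_2.csv", index=False
    )
    merged = merge_csv_folder(tmp)

print(merged)
```

Returning the DataFrame rather than writing it straight to disk keeps the merged file out of the source folder, so a second run does not pick up the previous output as an input.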
If you need to remove duplicate data rows from a CSV file, this can help you quickly perform a cleaning operation. When a machine learning dataset has duplicate data, this can directly impact the results in a visualization tool or machine learning project.
    import pandas as pd

    # Get the filename
    filename = input('filename: ')

    # Define the CSV column name to check for duplicates
    duplicate_header = input('header name: ')

    # Read the contents of the file
    file_contents = pd.read_csv(filename)

    # Remove duplicate rows (no inplace=True, so the deduplicated DataFrame is returned)
    deduplicated_data = file_contents.drop_duplicates(subset=[duplicate_header], keep="last")

    # Create the new file
    deduplicated_data.to_csv('deduplicated_data.csv')
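To see what `keep="last"` does, here is a minimal sketch on a small in-memory DataFrame. The column names and values are made up for illustration. It also shows a detail that is easy to trip over: `drop_duplicates` returns a new DataFrame by default, but returns `None` when called with `inplace=True`.

```python
import pandas as pd

# Made-up sample data: account 1 appears twice
df = pd.DataFrame(
    {"Account_ID": [1, 2, 1], "status": ["old", "active", "new"]}
)

# Returns a new DataFrame; for each Account_ID only the last row is kept
deduplicated = df.drop_duplicates(subset=["Account_ID"], keep="last")
print(deduplicated)

# With inplace=True the DataFrame is modified directly and the call returns None
in_place_result = df.drop_duplicates(
    subset=["Account_ID"], keep="last", inplace=True
)
print(in_place_result)  # None
```

Assigning the result of an `inplace=True` call to a variable and then calling `.to_csv` on it fails, because the variable holds `None`.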
Files exported from other systems sometimes contain a single column of data that we need as two columns.
    import pandas as pd

    # Get the filename and define the columns
    filename = input('filename: ')
    col_to_split = input('column name: ')
    col_name_one = input('first new column: ')
    col_name_two = input('second new column: ')

    # Load the CSV data into a DataFrame
    df = pd.read_csv(filename)

    # Split the column on the first comma only, so at most two columns are produced
    df[[col_name_one, col_name_two]] = df[col_to_split].str.split(",", n=1, expand=True)

    # Create the new CSV file
    df.to_csv('split_data.csv')
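The behavior of the split is easier to see on a few rows. The sketch below uses made-up values and hypothetical column names (`name`, `last`, `rest`). Passing `n=1` limits the split to the first comma, so a value with extra commas still produces only two parts, and a value with no comma gets a missing value in the second column.

```python
import pandas as pd

# Made-up sample data: one value has two commas, one has none
df = pd.DataFrame({"name": ["Smith,John", "Doe,Jane,Jr", "Cher"]})

# n=1 splits on the first comma only, so at most two columns come back
parts = df["name"].str.split(",", n=1, expand=True)
df[["last", "rest"]] = parts
print(df)
```

Without `n=1`, the second row would yield three parts, and assigning three columns to two new column names raises an error.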
Suppose you have a list of accounts and the orders associated with them, and want to view the order history along with the associated account details. A good way to do this is to join the two datasets into a single CSV file.
    import pandas as pd

    # Get the filenames and define the user inputs
    left_filename = input('LEFT filename: ')
    right_filename = input('RIGHT filename: ')
    join_type = input('join type (outer, inner, left, right): ')
    join_column_name = input('column name (e.g. Account_ID): ')

    # Read the files into DataFrames
    df_left = pd.read_csv(left_filename)
    df_right = pd.read_csv(right_filename)

    # Join the DataFrames
    joined_data = pd.merge(left=df_left, right=df_right, how=join_type, on=join_column_name)

    # Create the new CSV file
    joined_data.to_csv('joined_data.csv')
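The choice of join type decides which rows survive, which is easiest to see on a small example. The sketch below uses made-up accounts and orders. The column names `name` and `order_id` are assumptions for illustration, while `Account_ID` follows the prompt in the script. The `indicator=True` option of `pd.merge` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Made-up sample data: accounts 2 and 3 have no orders,
# and order 103 belongs to an account that is not in the accounts list
accounts = pd.DataFrame({"Account_ID": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"Account_ID": [1, 1, 4], "order_id": [101, 102, 103]})

# inner: only accounts that have at least one matching order
inner = pd.merge(accounts, orders, how="inner", on="Account_ID")

# left: every account, with missing order_id where there are no orders
left = pd.merge(accounts, orders, how="left", on="Account_ID")

# outer: everything from both sides; _merge records the source of each row
outer = pd.merge(accounts, orders, how="outer", on="Account_ID", indicator=True)

print(len(inner), len(left), len(outer))  # 2 4 5
print(outer)
```

Checking the `_merge` column after an outer join is a quick way to spot orders with no account or accounts with no orders before deciding which join type to export.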
These scripts help automate data cleaning, so that the cleaned data can then be loaded into a machine learning model for processing. Pandas is the library of choice for manipulating data because it offers so many options.