You have probably heard the famous data science saying:
In a data science project, 80% of the time is spent on data processing.
If you haven't heard it, remember this: data cleaning is the foundation of the data science workflow. Machine learning models can only perform as well as the data you feed them. Messy data leads to poor performance or even incorrect results, while clean data is a prerequisite for good model performance. Of course, clean data does not guarantee good performance: the correct choice of model (the remaining 20%) also matters, but without clean data even the most powerful model cannot reach the expected level.
In this article, we will walk through the problems that data cleaning needs to solve and show possible solutions, so that you can learn how to perform data cleaning step by step.
Missing values
When a data set contains missing data, some analysis can be done before filling it in, because the position of the empty cells itself can tell us something useful. For example:
- NA values appear only at the tail or in the middle of the data set. This suggests a technical issue during the data collection process. It may be necessary to analyze the collection process for that particular sample sequence and try to identify the source of the problem.
- If the number of NAs in a column exceeds 70-80%, the column can be deleted.
- If an NA value sits in a column that corresponds to an optional question on a form, that column can additionally be encoded as answered (1) or not answered (0).
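The answered/not-answered encoding from the last bullet can be sketched with pandas (the column name optional_q and the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical survey data: NaN means the optional question was skipped
df = pd.DataFrame({"optional_q": ["yes", None, "no", None]})

# Encode answered (1) vs. not answered (0) as an extra feature
df["optional_q_answered"] = df["optional_q"].notna().astype(int)
print(df["optional_q_answered"].tolist())  # [1, 0, 1, 0]
```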
The Python library missingno can be used to check for the situations above, and it is very simple to use. For example, the white lines in the image below are NAs:
import missingno as msno
msno.matrix(df)
There are many methods for filling in missing values, such as:
- Mean, median, mode
- kNN
- Zero or constant, etc.
Different methods each have advantages and disadvantages, and there is no "best" technique that works in all situations. For details, please refer to our previously published articles.
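As a short sketch of two of these options, mean imputation and kNN imputation, using scikit-learn's imputers (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0],
                   "income": [40.0, 55.0, np.nan, 70.0]})

# Mean imputation: replace each NaN with its column mean
mean_imp = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(mean_imp.fit_transform(df), columns=df.columns)

# kNN imputation: estimate each NaN from the k most similar rows
knn_imp = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imp.fit_transform(df), columns=df.columns)
print(df_mean["age"].tolist())
```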
Outliers
Outliers are very large or very small values relative to other points in the data set. Their presence greatly affects the performance of mathematical models. Let’s look at this simple example:
In the left graph there are no outliers, and our linear model fits the data points very well. In the right graph there is an outlier; as the model tries to cover all the points of the data set, its presence changes the fit, and the model ends up fitting poorly for at least half of the points.
Before handling outliers, we need a way to decide what counts as anomalous, which requires a mathematical definition of "too large" or "too small".
Any value greater than Q3 + 1.5 × IQR or less than Q1 − 1.5 × IQR can be regarded as an outlier. The IQR (interquartile range) is the difference between Q3 and Q1 (IQR = Q3 − Q1).
You can use the following function to check the number of outliers in the data set:
def number_of_outliers(df):
    df = df.select_dtypes(exclude='object')
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    return ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()
One way to deal with outliers is to clip them to the boundary values Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. The lower_upper_range function below uses numpy to find the range outside which a value counts as an outlier, and np.clip then clips the column values to that range:
import numpy as np

def lower_upper_range(datacolumn):
    Q1, Q3 = np.percentile(datacolumn, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

# clip every numeric column to its outlier boundaries
for col in df.select_dtypes(include='number').columns:
    lowerbound, upperbound = lower_upper_range(df[col])
    df[col] = np.clip(df[col], a_min=lowerbound, a_max=upperbound)
Data Inconsistency
The outlier problem concerns numeric features; now let's look at character-type (categorical) features. Inconsistent data means that the unique classes of a column have different representations. For example, a gender column that contains both m/f and male/female appears to have four classes when there are really only two.
There is currently no automatic solution for this problem, so manual analysis is required. pandas' unique function is made for exactly this kind of analysis. Let's look at an example with car brands:
df['CarName'] = df['CarName'].str.split().str[0]
print(df['CarName'].unique())
Variants such as maxda/mazda, Nissan/nissan, porcshce/porsche and toyouta/toyota can be merged:
df.loc[df['CarName'] == 'maxda', 'CarName'] = 'mazda'
df.loc[df['CarName'] == 'Nissan', 'CarName'] = 'nissan'
df.loc[df['CarName'] == 'porcshce', 'CarName'] = 'porsche'
df.loc[df['CarName'] == 'toyouta', 'CarName'] = 'toyota'
df.loc[df['CarName'] == 'vokswagen', 'CarName'] = 'volkswagen'
df.loc[df['CarName'] == 'vw', 'CarName'] = 'volkswagen'
Invalid data
Invalid data is a value that is not logically correct at all. For example:
- A person's age is 560;
- A certain operation took -8 hours;
- A person's height is 1200 cm, etc.;
For numeric columns, pandas' describe function can be used to identify such errors:
df.describe()
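describe() only surfaces suspicious minimum and maximum values; the actual cleanup still has to be done by hand. A minimal sketch of flagging out-of-range entries as missing, assuming a hypothetical height_cm column and a plausible valid range:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height_cm": [179, 1799, 165, 120]})

# Mark logically impossible heights as missing so they can be
# imputed together with the other NA values later
valid = df["height_cm"].between(50, 250)
df.loc[~valid, "height_cm"] = np.nan
print(int(df["height_cm"].isna().sum()))  # 1
```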
There may be two reasons for invalid data:
1. Data collection errors: for example, ranges were not validated during input. When entering a height, 1799 cm was mistakenly typed instead of 179 cm, but the program did not check the range of the data.
2. Data processing errors: some columns of the data set may have been processed by a function. For example, a function computes age from the date of birth, but a bug in the function produces incorrect output.
Both kinds of random errors can be treated as missing values and imputed together with the other NAs.
Duplicate data
The duplicate data problem arises when the data set contains identical rows. This can be caused by an error when combining data (the same row coming from multiple sources), by repeated operations (a user submitting an answer twice), and so on. The ideal way to handle the problem is to delete the duplicated rows.
Duplicated rows can be inspected with pandas' duplicated function:
df.loc[df.duplicated()]
Once identified, the duplicates can be removed with pandas' drop_duplicates function:
df.drop_duplicates()
Data leakage
Before building a model, the data set is split into a training set and a test set. The test set is unseen data used to evaluate model performance. If, during the data cleaning or preprocessing steps, the model somehow "sees" the test set, this is called data leakage. Therefore, the data should be split before the cleaning and preprocessing steps:
Take missing-value imputation as an example. A numeric column contains NAs that are imputed with the mean. If this is done before the split, the mean of the entire data set is used; if it is done after the split, the training and test sets each use their own mean.
The problem with the first case is that the imputed values in the test set are related to the training set, because the mean was computed over the whole data set. So when the model is built on the training set, it has also "seen" the test set. But the goal of the split is to keep the test set completely independent and to use it for performance evaluation as if it were new data. Therefore, the data set must be split before such operations.
Although processing the training and test sets separately is less efficient (the same operations have to be performed twice), it is the correct approach. Because data leakage matters so much, and to avoid writing the same code twice, you can use scikit-learn's Pipeline. Simply put, a pipeline bundles all the processing steps that the input data is sent through; once the steps are configured, the same sequence can be applied to the training set and the test set alike, reducing both code duplication and the chance of mistakes.
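A minimal sketch of this split-then-fit order with a scikit-learn Pipeline (the toy data and the choice of imputer plus scaler as steps are assumptions for illustration; the key point is that fit_transform runs on the training set only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 1, 1, 0, 1])

# Split first, so the test set stays unseen by every preprocessing step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Fit on the training set only; the test set is merely transformed
# with the statistics learned from the training set
X_train_t = pipe.fit_transform(X_train)
X_test_t = pipe.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Because the imputer's mean and the scaler's statistics are learned from X_train alone, no information from the test set leaks into training, while the same two steps are still applied to both sets without duplicating code.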
The above is the detailed content of A complete guide to data cleaning with Python. For more information, please follow other related articles on the PHP Chinese website!
