Handling Outliers in Python - IQR Method
Before drawing insights from real-world data, it is important to check that the data is consistent and free from errors. Real data often contains errors, and some values differ markedly from the rest; these values are known as outliers. Outliers can distort data analysis, leading to wrong insights and, in turn, poor decisions by stakeholders. Dealing with outliers is therefore a critical step in the data preprocessing stage of a data science project. In this article, we will look at how to handle outliers using the interquartile range (IQR) method.
Outliers are data points that differ significantly from the majority of the data points in a dataset. They fall outside the expected or usual range of values for a particular variable. Outliers occur for various reasons, for example errors during data entry or sampling errors. In machine learning, outliers can cause models to make inaccurate predictions.
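To make the effect concrete, here is a minimal sketch of how a single bad entry can pull a summary statistic away from the rest of the data. The prices below are invented for illustration and are not taken from the dataset used later in this article.

```python
import pandas as pd

# Hypothetical house prices; the last value simulates a data-entry error
# (extra zeros typed by mistake). None of these values come from housePrice.csv.
prices = pd.Series([250_000, 265_000, 240_000, 255_000, 260_000, 25_000_000])

mean_with = prices.mean()               # mean including the suspicious value
mean_without = prices.iloc[:-1].mean()  # mean of the first five values only
median_with = prices.median()           # median including the suspicious value

print(f"mean with outlier:    {mean_with:,.0f}")
print(f"mean without outlier: {mean_without:,.0f}")
print(f"median with outlier:  {median_with:,.0f}")
```

In this toy sample the mean moves by millions because of one value, while the median barely changes. Quartile-based rules such as the IQR method are less sensitive to extreme values for the same reason.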
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
plt.style.use('ggplot')
df_house_price = pd.read_csv(r'C:\Users\Admin\Desktop\csv files\housePrice.csv')
df_house_price.head()
sns.boxplot(df_house_price['Price'])
plt.title('Box plot showing outliers in prices')
plt.show()
Q1 = df_house_price['Price'].quantile(0.25)
Q3 = df_house_price['Price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
The upper bound is 12872625000.0, so any price above this value is treated as an outlier.
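The same bound calculation can be tried on a small series where the arithmetic is easy to follow by hand. This is only a sketch: the values are hypothetical, and the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical values chosen for illustration, not taken from housePrice.csv.
values = pd.Series([10, 12, 13, 15, 16, 18, 20, 22, 95])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Points further than 1.5 * IQR outside the quartiles are flagged.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = values[(values < lower_bound) | (values > upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"lower bound={lower_bound}, upper bound={upper_bound}")
print("flagged values:", outliers.tolist())
```

Here only the value 95 falls outside the bounds. The 1.5 multiplier is the conventional choice; a larger multiplier flags fewer points.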
Remove outlier values in the price column
filt = (df_house_price['Price'] >= lower_bound) & (df_house_price['Price'] <= upper_bound)
df = df_house_price[filt]
df.head()
sns.boxplot(df['Price'])
plt.title('Box plot after removing outliers')
plt.show()
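If the same filtering is needed for more than one column, the steps above could be wrapped in a small helper. The sketch below is one possible way to do this: the function name `remove_outliers_iqr`, the parameter `k`, and the sample DataFrame are assumptions made for illustration and are not part of the original walkthrough.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return the rows of df whose `column` value lies within the IQR bounds.

    Hypothetical helper: k is the multiplier applied to the IQR (1.5 by default).
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() includes both bounds by default, like the >= / <= filter.
    # Rows with a missing value in `column` are dropped as well.
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Hypothetical sample data, not taken from housePrice.csv.
sample = pd.DataFrame({
    "Price": [10, 12, 13, 15, 16, 18, 20, 22, 95],
    "Area": [50, 55, 60, 62, 70, 75, 80, 85, 90],
})

cleaned = remove_outliers_iqr(sample, "Price")
print(f"rows before: {len(sample)}, rows after: {len(cleaned)}")
```

The helper returns a filtered copy and leaves the input DataFrame unchanged, so the original data stays available for comparison.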
The IQR method is simple, is robust to extreme values, and does not assume that the data is normally distributed. Its disadvantages are that it only handles univariate data and that it can remove valid data points if the data is skewed or has heavy tails.
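The caveat about skewed data can be checked with a quick simulation. The sketch below is an illustration under stated assumptions rather than a benchmark: it draws synthetic samples with NumPy, and the helper name `iqr_flagged_fraction` is hypothetical. Every point in both samples is a valid draw from its distribution, yet the rule flags a larger share of the skewed sample.

```python
import numpy as np
import pandas as pd


def iqr_flagged_fraction(values: pd.Series, k: float = 1.5) -> float:
    """Fraction of values outside the IQR bounds (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    flagged = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(flagged.mean())


# Synthetic data with a fixed seed so the run is repeatable.
rng = np.random.default_rng(seed=0)
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))  # symmetric
skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))         # right-skewed

frac_symmetric = iqr_flagged_fraction(symmetric)
frac_skewed = iqr_flagged_fraction(skewed)

print(f"flagged in symmetric sample: {frac_symmetric:.2%}")
print(f"flagged in skewed sample:    {frac_skewed:.2%}")
```

When a column such as price has a long right tail, it is worth inspecting the flagged rows before dropping them, since many of them may be genuine observations.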
Thank you for reading.
Follow me on LinkedIn and GitHub for more.