


EXPLORATORY DATA ANALYSIS (EDA) WITH PYTHON: UNCOVERING INSIGHTS FROM DATA.
INTRODUCTION
Exploratory Data Analysis (EDA) is a crucial part of data analysis because it enables analysts to uncover insights and prepare data for further modeling. In this article, we'll dive into various EDA techniques and tools available in Python to enhance your data understanding, from cleaning and processing your dataset to visualizing your findings and telling stories with data.
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is a method of analyzing datasets to understand their main characteristics. It involves summarizing data features, detecting patterns, and uncovering relationships through visual and statistical techniques. EDA helps in gaining insights and formulating hypotheses for further analysis.
Exploratory Data Analysis (EDA) in Python employs various techniques that are essential for uncovering insights from data. One of the foundational techniques involves data visualization using libraries such as Matplotlib and Seaborn. These tools allow data scientists to create different types of plots, including scatter plots, histograms, and box plots, which are critical for understanding the distribution and relationships within datasets.
By visualizing data, analysts can identify trends, outliers, and patterns that may not be evident through numerical analysis alone.
Another crucial technique in EDA is data cleaning and manipulation, primarily facilitated by the Pandas library. This involves processing datasets by handling missing values, filtering data, and employing aggregation functions to summarize insights. Functions like ‘groupby’ enable users to segment data into meaningful categories, facilitating a clearer analysis. Additionally, incorporating statistical methods such as correlation analysis provides further understanding of the relationships between variables, helping to formulate hypotheses that can be tested in more structured analysis.
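As a quick illustration of these two techniques, here is a minimal sketch that aggregates a tiny, hypothetical DataFrame with ‘groupby’ and computes a correlation matrix (the column names and values are invented for illustration only):

import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    'fuel':  ['Diesel', 'Petrol', 'Diesel', 'Petrol'],
    'price': [7.5, 4.2, 9.1, 3.8],
    'km':    [45000, 60000, 30000, 72000],
})

# Segment the data into meaningful categories and summarize each group
print(df.groupby('fuel')['price'].mean())

# Correlation between numeric variables (Pearson by default)
print(df[['price', 'km']].corr())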
HOW TO PERFORM EDA USING PYTHON
STEP 1: IMPORT PYTHON LIBRARIES
The first step in machine learning with Python is understanding and exploring our data using libraries. You can get the dataset from the Kaggle website: https://www.kaggle.com/datasets/sukhmanibedi/cars4u
Import all libraries required for our analysis, such as those for data loading, statistical analysis, visualizations, data transformations, and merging and joining.
Pandas and NumPy are used for data manipulation and numerical calculations.
Matplotlib and Seaborn are used for data visualizations.
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# To ignore warnings
import warnings
warnings.filterwarnings('ignore')
STEP 2: READING THE DATASET
The Python Pandas library offers a wide range of options for loading data into a pandas DataFrame from file formats such as .csv, .xlsx, .sql, .pickle, .html, .txt, etc.
Most data is available in the tabular format of CSV files, which is popular and easy to access. Using the read_csv() function, such data can be converted to a pandas DataFrame.
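Beyond CSV, a few of the other loaders look like this (the file names are hypothetical; each call returns a DataFrame, except read_html, which returns a list of DataFrames):

df_excel  = pd.read_excel("cars.xlsx")   # Excel workbook
df_pickle = pd.read_pickle("cars.pkl")   # pickled DataFrame
tables    = pd.read_html("cars.html")    # every <table> found in the page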
In this article, a dataset of used-car prices is used as an example. We analyze the used cars' prices and show how EDA identifies the factors influencing them. We have stored the data in the DataFrame data.
data = pd.read_csv("used_cars.csv")
ANALYZING THE DATA
Before we make any inferences, we get to know our data by examining all the variables in it.
The main goal of data understanding is to gain general insights about the data, covering the number of rows and columns, the values in the data, the datatypes, and the missing values in the dataset.
shape – displays the number of observations (rows) and features (columns) in the dataset
There are 7253 observations and 14 variables in our dataset
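A quick check, assuming the DataFrame data loaded above:

data.shape
# (7253, 14) -> 7253 rows (observations) and 14 columns (variables)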
head() will display the top 5 observations of the dataset
data.head()
tail() will display the last 5 observations of the dataset
data.tail()
info() helps us understand the data types and related information, including the number of records in each column, whether values are null or not null, the data type, and the memory usage of the dataset
data.info()
data.info() shows that the variables Mileage, Engine, Power, Seats, New_Price, and Price have missing values. Numeric variables like Mileage and Power are of datatype float64 or int64. Categorical variables like Location, Fuel_Type, Transmission, and Owner_Type are of object data type.
CHECK FOR DUPLICATION
nunique() returns the number of unique values in each column; based on these counts and the data description, we can identify the continuous and categorical columns in the data. Duplicated data can then be handled or removed based on further analysis.
data.nunique()
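To act on the duplication check itself, pandas provides duplicated() and drop_duplicates(); a minimal sketch:

# Count fully duplicated rows, then drop them, keeping the first occurrence
print(data.duplicated().sum())
data = data.drop_duplicates(keep='first')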
MISSING VALUES CALCULATION
isnull() is widely used in all pre-processing steps to identify null values in the data
In our example, data.isnull().sum() is used to get the number of missing records in each column
data.isnull().sum()
The below code helps to calculate the percentage of missing values in each column
(data.isnull().sum()/(len(data)))*100
The percentage of missing values for the columns New_Price and Price is ~86% and ~17%, respectively.
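The count and percentage can also be combined into a single summary table; a small sketch using only the calls shown above:

missing = pd.DataFrame({
    'missing_count':   data.isnull().sum(),
    'missing_percent': data.isnull().sum() / len(data) * 100,
})
print(missing.sort_values('missing_percent', ascending=False))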
STEP 3: DATA REDUCTION
Some columns or variables can be dropped if they do not add value to our analysis.
In our dataset, the column S.No. contains only ID values; we assume it has no predictive power for the dependent variable.
Remove S.No. column from data
data = data.drop(['S.No.'], axis = 1)
data.info()
We start our Feature Engineering as we need to add some columns required for analysis.
STEP 4: FEATURE ENGINEERING
Feature engineering refers to the process of using domain knowledge to select and transform the most relevant variables from raw data when creating a predictive model using machine learning or statistical modeling. The main goal of Feature engineering is to create meaningful data from raw data.
STEP 5: CREATING FEATURES
We will work with the variables Year and Name in our dataset. In the sample data, the column “Year” shows the manufacturing year of the car.
It would be difficult to use the car's age if it is left in year format, and the age of the car is a contributing factor to car price.
Let's introduce a new column, “Car_Age”, to hold the age of the car:
from datetime import date

# date.today().year gives the current year; subtracting the manufacturing
# year gives the car's age
data['Car_Age'] = date.today().year - data['Year']
data.head()
Car names will not be great predictors of the price in our current data, but we can process this column to extract important information: the brand and model names. Let's split Name and introduce the new variables “Brand” and “Model”.
data['Brand'] = data.Name.str.split().str.get(0)
# Model: join the second and third words of the name (the third may be missing)
data['Model'] = (data.Name.str.split().str.get(1).fillna('') + ' '
                 + data.Name.str.split().str.get(2).fillna('')).str.strip()
data[['Name','Brand','Model']]
STEP 6: DATA CLEANING/WRANGLING
Some variable names are not relevant and not easy to understand. Some data may have data entry errors, and some variables may need data type conversion. We need to fix these issues in the data.
In the example, the brand names ‘Isuzu’ and ‘ISUZU’ duplicate each other, and ‘Mini’ and ‘Land’ look incorrect.
These need to be corrected.
print(data.Brand.unique())
print(data.Brand.nunique())
searchfor = ['Isuzu', 'ISUZU', 'Mini', 'Land']
data[data.Brand.str.contains('|'.join(searchfor))].head(5)
data["Brand"].replace({"ISUZU": "Isuzu", "Mini": "Mini Cooper","Land":"Land Rover"}, inplace=True)
We have now done the fundamental data analysis, feature engineering, and data clean-up.
Let's move on to the EDA process.
Read about fundamentals of exploratory data analysis: https://www.analyticsvidhya.com/blog/2021/11/fundamentals-of-exploratory-data-analysis/
STEP 7: EDA (EXPLORATORY DATA ANALYSIS)
Exploratory Data Analysis refers to the crucial process of performing initial investigations on data to discover patterns and check assumptions with the help of summary statistics and graphical representations.
• EDA can be leveraged to check for outliers, patterns, and trends in the given data.
• EDA helps to find meaningful patterns in data.
• EDA provides in-depth insights into the data sets to solve our business problems.
• EDA gives clues about how to impute missing values in the dataset
STEP 8: STATISTICS SUMMARY
The statistics summary gives a quick and simple description of the data.
It can include count, mean, standard deviation, median, mode, minimum value, maximum value, and range.
The statistics summary gives a high-level idea of whether the data has outliers or data entry errors, and of how the data is distributed, e.g., normally distributed or left/right skewed.
In Python, this can be achieved using describe().
describe() provides a statistics summary of data belonging to numerical datatypes such as int and float:
data.describe().T
From the statistics summary, we can infer the findings below:
• Years range from 1996 to 2019, a wide range showing that used cars include both the latest and older models.
• The average kilometers driven on used cars is ~58k KM. The range shows a huge difference between min and max; the max value of 650,000 KM is evidence of an outlier. This record can be removed (see the sketch after this list).
• The min value of Mileage is 0. Cars won't be sold with 0 mileage, so this looks like a data entry issue.
• It looks like Engine and Power have outliers, and the data is right-skewed.
• The average number of seats in a car is 5. Seat count is an important feature in price contribution.
• The max price of a used car is 160k, which is oddly high for a used car; this may be an outlier or a data entry issue.
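As a sketch of how that extreme record could be inspected and dropped, assuming the column is named Kilometers_Driven as in this dataset (the 600,000 KM cutoff is an illustrative assumption, not a rule derived from the data):

# Inspect the suspiciously large odometer reading noted above
print(data[data['Kilometers_Driven'] > 600000])

# Drop it; the threshold here is an assumed, illustrative cutoff
data = data[data['Kilometers_Driven'] <= 600000]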
describe(include='all') provides a statistics summary of all columns, including those of object and category datatypes:
data.describe(include='all')
Before we do EDA, let's separate the numerical and categorical variables for easy analysis.
cat_cols=data.select_dtypes(include=['object']).columns
num_cols = data.select_dtypes(include=np.number).columns.tolist()
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)
Also read about standard deviation in Excel and Sheets: https://www.analyticsvidhya.com/blog/2024/06/standard-deviation-in-excel/
STEP 9: EDA UNIVARIATE ANALYSIS
Analyzing/visualizing the dataset by taking one variable at a time:
Data visualization is essential; we must decide what charts to plot to better understand the data. In this article, we visualize our data using Matplotlib and Seaborn libraries.
Matplotlib is a Python 2D plotting library used to draw basic charts.
Seaborn is a Python library built on top of Matplotlib that uses just a few lines of code to create and style statistical plots from Pandas and NumPy data.
Univariate analysis can be done for both Categorical and Numerical variables.
Categorical variables can be visualized using a Count plot, Bar Chart, Pie Plot, etc.
Numerical Variables can be visualized using Histogram, Box Plot, Density Plot, etc.
In our example, we perform univariate analysis using a histogram and a box plot for each continuous variable.
In the figure below, a histogram and box plot are used to show the pattern of each variable, as some variables have skewness and outliers.
for col in num_cols:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(grid = False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x = data[col])
    plt.show()
Price and Kilometers_Driven are right-skewed; this data will need to be transformed, and the outliers will be handled during imputation. The categorical variables are visualized using count plots, which reveal the pattern of factors influencing car price.
fig, axes = plt.subplots(3, 2, figsize = (18, 18))
fig.suptitle('Bar plot for all categorical variables in the dataset')
sns.countplot(ax = axes[0, 0], x = 'Fuel_Type', data = data, color = 'blue',
              order = data['Fuel_Type'].value_counts().index)
sns.countplot(ax = axes[0, 1], x = 'Transmission', data = data, color = 'blue',
              order = data['Transmission'].value_counts().index)
sns.countplot(ax = axes[1, 0], x = 'Owner_Type', data = data, color = 'blue',
              order = data['Owner_Type'].value_counts().index)
sns.countplot(ax = axes[1, 1], x = 'Location', data = data, color = 'blue',
              order = data['Location'].value_counts().index)
# For Brand and Model, restrict the plot to the 20 most frequent categories
sns.countplot(ax = axes[2, 0], x = 'Brand', data = data, color = 'blue',
              order = data['Brand'].value_counts().head(20).index)
sns.countplot(ax = axes[2, 1], x = 'Model', data = data, color = 'blue',
              order = data['Model'].value_counts().head(20).index)
axes[1][1].tick_params(labelrotation = 45)
axes[2][0].tick_params(labelrotation = 90)
axes[2][1].tick_params(labelrotation = 90)
From the count plots, we can make the observations below (see the sketch after the list for reproducing the percentages):
• Mumbai has the highest number of cars available for purchase, followed by Hyderabad and Coimbatore
• ~53% of cars have Diesel as their fuel type, which suggests diesel cars deliver higher performance
• ~72% of cars have manual transmission
• ~82% of cars are first-owner cars. This shows most buyers prefer to purchase first-owner cars
• ~20% of cars belong to the brand Maruti, followed by 19% belonging to Hyundai
• WagonR ranks first among all models which are available for purchase.
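A minimal sketch for reproducing these shares, assuming the column names used above:

# Share of each category as a fraction of all cars
print(data['Fuel_Type'].value_counts(normalize=True))
print(data['Transmission'].value_counts(normalize=True))
print(data['Owner_Type'].value_counts(normalize=True))
print(data['Brand'].value_counts(normalize=True).head(5))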
CONCLUSION:
Exploratory data analysis (EDA) uncovers insights and knowledge from datasets by detecting outliers, key patterns, and relationships among variables. It involves collecting, cleaning, and transforming data to unveil its attributes.
Happy reading, and let's explore the future of data science together!