
House_Price_Prediction

Patricia Arquette
2024-11-03 12:28:29

In the world of real estate, determining property prices involves numerous factors, from location and size to amenities and market trends. Simple linear regression, a foundational technique in machine learning, provides a practical way to predict housing prices based on key features like the number of rooms or square footage.

In this article, I delve into the process of applying simple linear regression to a housing dataset, from data preprocessing and feature selection to building a model that can offer valuable price insights. Whether you’re new to data science or seeking to deepen your understanding, this project serves as a hands-on exploration of how data-driven predictions can shape smarter real estate decisions.
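To make the idea concrete before touching the real data, here is a minimal sketch of simple linear regression on made-up numbers (not the housing dataset). The names `rooms` and `price` are hypothetical, and the closed-form least-squares formulas are used here instead of a library model.

```python
import numpy as np

# Made-up data: price is exactly 50 * rooms + 100 (no noise),
# so the fitted line should recover that slope and intercept.
rooms = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = 50.0 * rooms + 100.0

# Closed-form least-squares estimates for price = intercept + slope * rooms
slope = np.cov(rooms, price, bias=True)[0, 1] / np.var(rooms)
intercept = price.mean() - slope * rooms.mean()

# Prediction for a hypothetical 6-room home
predicted = intercept + slope * 6.0
print(slope, intercept, predicted)
```

With real data the points never sit exactly on a line, so the slope and intercept are only the best straight-line summary of the relationship.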

First things first, you start by importing your libraries:

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Read the CSV from the directory where you stored the data
data = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')
data


# Check whether any columns contain null values
data.info()

# Drop rows with null values so that every column has the same non-null count
data.dropna(inplace = True)
data.info()
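If you would rather see the null counts directly instead of reading them off `data.info()`, a minimal sketch on a toy frame looks like this. The values are made up and `toy` is a hypothetical stand-in for the housing data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Number of null values in each column
null_counts = toy.isna().sum()

# dropna() without inplace=True returns a new frame and leaves toy unchanged
cleaned = toy.dropna()

print(null_counts)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Dropping rows is the simplest option; it is reasonable when only a small share of rows is affected.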

# Separate the features (X) from the target (y) before splitting into training and test sets

from sklearn.model_selection import train_test_split

X = data.drop(['median_house_value'], axis = 1)
y = data['median_house_value']
y


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
# Rejoin the training features and target so their correlations can be examined together
train_data = X_train.join(y_train)
train_data
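The split above is random, so the rows in `train_data` (and every number computed from them) change on each run. A minimal sketch, assuming you want repeatable results, passes `random_state` to `train_test_split`. The toy frames and the seed value 42 are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (made-up values)
toy_X = pd.DataFrame({"median_income": np.arange(10, dtype=float)})
toy_y = pd.Series(np.arange(10, dtype=float) * 2.0, name="median_house_value")

# Fixing random_state makes the shuffle, and therefore the split, repeatable
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
X_tr2, _, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))            # sizes of the two parts
print(X_tr.index.equals(X_tr2.index))  # same rows selected both times

# join() aligns on the index, so each row keeps its own target value
rejoined = X_tr.join(y_tr)
```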


# Plot a histogram of every numeric column in the training data
train_data.hist(figsize=(15, 8))

# One-hot encode the non-numeric columns so they can be included in the correlation analysis

train_data_encoded = pd.get_dummies(train_data, drop_first=True)
correlation_matrix = train_data_encoded.corr()
print(correlation_matrix)


train_data_encoded.corr()


plt.figure(figsize=(15,8))
sns.heatmap(train_data_encoded.corr(), annot=True, cmap = "inferno")
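A full heatmap can be hard to read when the only question is which features track the target. A minimal sketch, using a made-up toy frame rather than the housing data, pulls out just the target's column of the correlation matrix and sorts it by absolute value.

```python
import pandas as pd

# Toy frame with made-up values; the column names mirror the housing data
toy = pd.DataFrame({
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "latitude": [5.0, 3.0, 4.0, 1.0, 2.0],
    "median_house_value": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["median_house_value"]
    .drop("median_house_value")             # the target's correlation with itself is always 1
    .sort_values(key=abs, ascending=False)  # rank by strength, ignoring sign
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top, where a plain descending sort would push them to the bottom.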


Value counts for the ocean_proximity column:

ocean_proximity
INLAND 5183
NEAR OCEAN 2108
NEAR BAY 1783
ISLAND 5
Name: count, dtype: int64

# Log-transform the right-skewed count features (adding 1 avoids taking log(0))
train_data['total_rooms'] = np.log(train_data['total_rooms'] + 1)
train_data['total_bedrooms'] = np.log(train_data['total_bedrooms'] + 1)
train_data['population'] = np.log(train_data['population'] + 1)
train_data['households'] = np.log(train_data['households'] + 1)
# Re-plot the histograms to check the transformed distributions
train_data.hist(figsize=(15, 8))
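To see why the log transform helps, here is a minimal sketch on a made-up, heavily right-skewed series. It uses `np.log1p`, which computes `log(x + 1)` and is equivalent to the expression used above; the `population` values are invented so that the effect is easy to see.

```python
import numpy as np
import pandas as pd

# Made-up counts spanning several orders of magnitude, including a zero
population = pd.Series([0.0, 9.0, 99.0, 999.0, 9999.0])

# log1p(x) is log(x + 1), so a zero count maps to 0 instead of -inf
logged = np.log1p(population)

# Compare the skewness before and after the transform
print("skew before:", population.skew())
print("skew after: ", logged.skew())
```

Here the logged values are evenly spaced, so the skew drops to roughly zero. Real columns will not come out this clean, but the long right tail is usually pulled in.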

0.5092972905670141

# Convert the ocean_proximity categories into binary columns using one-hot encoding
train_data.ocean_proximity.value_counts()

# Create one binary (0 or 1) column for each category above
pd.get_dummies(train_data.ocean_proximity)

0.4447616558596853

# Join the binary columns to the training data, then drop the original ocean_proximity column
train_data = train_data.join(pd.get_dummies(train_data.ocean_proximity)).drop(['ocean_proximity'], axis=1)


train_data
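Whatever encoding is applied to the training data has to be applied to the test data too, and a rare category such as ISLAND may not appear in a small test split. A minimal sketch, assuming toy frames with made-up rows, aligns the test columns to the training columns with `reindex`. The names `train`, `test`, `train_dummies` and `test_dummies` are hypothetical.

```python
import pandas as pd

# Toy frames (made-up rows); the test split happens to contain no ISLAND rows
train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "ISLAND", "INLAND"]})
test = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND"]})

# dtype=int gives 0/1 columns instead of booleans
train_dummies = pd.get_dummies(train["ocean_proximity"], dtype=int)

# Reindex so the test frame has exactly the training columns;
# categories missing from the test split are filled with 0
test_dummies = (
    pd.get_dummies(test["ocean_proximity"], dtype=int)
    .reindex(columns=train_dummies.columns, fill_value=0)
)

print(train_dummies)
print(test_dummies)
```

Without the alignment step, a model fitted on the training columns would be handed a test frame with a different number of features.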

# Recheck the correlations now that ocean_proximity has been encoded
plt.figure(figsize=(18, 8))
sns.heatmap(train_data.corr(), annot=True, cmap ='twilight')

0.5384474921332503

Training a machine learning model is not the easiest of processes. To keep improving the results above, you can add more hyperparameters to the param_grid, such as max_features (if you are tuning a tree-based model). This gives the search more combinations to try and may raise the best estimator's score.
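The tuning code itself is not shown here, so the following is only a sketch of what a `param_grid` search can look like. It assumes a `RandomForestRegressor` tuned with scikit-learn's `GridSearchCV` on synthetic data; the estimator, the grid values and the `*_demo` names are all illustrative assumptions, not the setup that produced the scores above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data: the target depends mostly on the first feature (made-up)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 0.5, size=200)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Each extra key multiplies the number of combinations the search tries
param_grid = {
    "n_estimators": [20, 50],
    "max_features": [1, 2, 3],
    "min_samples_split": [2, 4],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training set
    scoring="r2",
)
search.fit(X_demo_train, y_demo_train)

# The best combination, refitted on the whole training set
best_forest = search.best_estimator_
test_r2 = best_forest.score(X_demo_test, y_demo_test)
print(search.best_params_, test_r2)
```

A larger grid costs more fitting time, so it usually pays to widen one hyperparameter at a time and watch whether the held-out score actually moves.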

If you got this far, please like and share your comments below; your opinion really matters. Thank you! ❤️
