Authors: Maureen Kitang'a, Samuel Kyalo, Priscila Kamiri, Leo Kariuki, Jimcollins Wamae & Steve Githinji
This project is a part of the Data Science (DSF-FT) Course at Moringa School. The full project description can be found here.
Real estate developers are interested in identifying factors that influence the sale price of homes in King County, as well as developing models to predict the sale price of homes based on these factors. This information can be used to optimize the design and marketing of new properties, identify investment opportunities, and make data-driven decisions about the development and sale of properties.
The stakeholder is a real estate developer in a northwestern county who identifies and acquires land for new development projects and oversees the construction process from start to finish. The developer wants to gain market insights that will lead to an increase in house sales.
This project uses the King County House Sales dataset. The dataset contains information on over 21,000 home sales in King County, Washington, USA between May 2014 and May 2015. The dataset includes a variety of features such as the number of bedrooms and bathrooms, the size of the property, the location of the property, and various other attributes that may influence the sale price of a home.
In this section, we build the models.
We start with a baseline model, which serves as the benchmark for comparing the performance of the later models. After that, we build our multiple linear regression models.
The target variable is price. Therefore, we look at the correlation coefficients for all of the predictor variables to find the one with the highest correlation with price.
The variable most strongly correlated with price is sqm_living.
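The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data frame is a small invented sample, and the column names other than price and sqm_living are assumptions.

```python
import pandas as pd

# Toy data, invented for illustration only; not taken from the King County dataset.
df = pd.DataFrame({
    "price":      [300000, 450000, 520000, 610000, 750000, 900000],
    "sqm_living": [90, 130, 150, 170, 210, 260],
    "bedrooms":   [2, 3, 3, 4, 3, 5],
    "yr_built":   [1990, 1975, 2005, 1960, 2010, 1998],
})

# Pearson correlation of every predictor with the target.
correlations = df.corr()["price"].drop("price")

# Rank by absolute value so strong negative correlations are not overlooked.
strongest = correlations.abs().idxmax()

print(correlations.sort_values(key=abs, ascending=False))
print("Most strongly correlated predictor:", strongest)
```

Ranking by absolute value is a design choice: a predictor with a strong negative correlation would be just as useful for a baseline model as one with a strong positive correlation.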
The model is statistically significant, but it explains only 49.3% of the variance in price. Still, it is a good starting point for our analysis. On average, each prediction is off by about $173,829.
- The coefficient of sqm_living is about $3,023. This means that each additional square meter of living space is associated with an increase of about $3,023 in the price of the house.
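To make the three reported quantities concrete (the coefficient, the share of variance explained, and the typical prediction error), here is a minimal sketch of a one-predictor regression. It uses NumPy least squares on invented numbers rather than the notebook's own code and data, and it assumes that the "off by" figure is a mean absolute error, as the later model comparison suggests.

```python
import numpy as np

# Toy data, invented for illustration; not the King County data.
sqm_living = np.array([90, 130, 150, 170, 210, 260], dtype=float)
price = np.array([300000, 450000, 520000, 610000, 750000, 900000], dtype=float)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(sqm_living), sqm_living])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = coef

predicted = X @ coef
residuals = price - predicted

# R-squared: share of the variance in price explained by the model.
r_squared = 1 - np.sum(residuals**2) / np.sum((price - price.mean())**2)

# Mean absolute error: the typical size of a prediction error, in dollars.
mae = np.mean(np.abs(residuals))

print(f"slope: {slope:.2f} dollars per square meter")
print(f"R-squared: {r_squared:.3f}, MAE: {mae:.2f}")
```

The slope plays the role of the sqm_living coefficient: it is the change in predicted price for one extra square meter of living space.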
We will now iterate on the baseline model by building a multiple linear regression model with more than one independent variable.
All of the coefficients are statistically significant, and the model explains about 65% of the variance in price. On average, each prediction is off by about $140,692.20.
We will now build another iterated model.
## Final Model Regression Results
The results showed that:
- Each additional bathroom is associated with an increase of $16,410.00 in home price.
- Each additional square metre of living space is associated with an increase of $2,813.50 in home price.
- Each additional square metre of lot size is associated with an increase of $2.9292 in home price.
- Each additional square metre of basement is associated with a decrease of $663.40 in home price.
- Each additional year in the year the home was built is associated with a decrease of $838.71 in home price.
- Compared to the reference category zipcode_98103, zipcode_98004 has the largest associated increase in home price, at $462,900.
- Compared to the reference category zipcode_98103, zipcode_98092 has the largest associated decrease in home price, at $381,700.
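The "compared to zipcode_98103" wording implies that zipcode was one-hot encoded with zipcode_98103 left out as the reference category. A minimal sketch of that idea follows, assuming this is how the encoding was done; the prices are invented and the fit uses NumPy least squares rather than the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; prices are not real King County figures.
df = pd.DataFrame({
    "zipcode": ["98103", "98103", "98004", "98004", "98092", "98092"],
    "price":   [500000, 520000, 950000, 990000, 250000, 270000],
})

# One-hot encode zipcode, then drop the reference category so that every
# remaining coefficient is measured relative to zipcode_98103.
dummies = pd.get_dummies(df["zipcode"], prefix="zipcode", dtype=float)
dummies = dummies.drop(columns="zipcode_98103")

# Intercept column plus the remaining dummy columns.
X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])
coef, *_ = np.linalg.lstsq(X, df["price"].to_numpy(dtype=float), rcond=None)

# coef[0] is the mean price in the reference zipcode; the rest are offsets from it.
effects = pd.Series(coef[1:], index=dummies.columns)
print("Reference (zipcode_98103) mean price:", round(coef[0]))
print(effects.round())
```

With only zipcode dummies in the model, each coefficient is simply the difference between that zipcode's mean price and the reference zipcode's mean price. In the full model the same reading holds, but with the other features held constant.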
Multiple Linear Model 2 was chosen as the final model because it explained about 74% of the variance in price, about 10 percentage points more than Multiple Linear Model 1. It also had a lower Mean Absolute Error, by about $32,000.
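The comparison between a smaller and a larger model can be sketched as below. This is an illustration under stated assumptions: the data are synthetic, `fit_and_score` is a hypothetical helper written for this example, and both scores are computed on the same data used for fitting.

```python
import numpy as np

def fit_and_score(X, y):
    """Fit OLS with an intercept; return (R-squared, mean absolute error)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
    return r_squared, np.mean(np.abs(residuals))

# Synthetic data, generated for illustration; not the King County data.
rng = np.random.default_rng(0)
n = 200
sqm_living = rng.uniform(50, 400, n)
bathrooms = rng.integers(1, 5, n).astype(float)
price = 3000 * sqm_living + 20000 * bathrooms + rng.normal(0, 50000, n)

# Smaller model: one predictor. Larger model: two predictors.
r2_small, mae_small = fit_and_score(sqm_living.reshape(-1, 1), price)
r2_large, mae_large = fit_and_score(np.column_stack([sqm_living, bathrooms]), price)

print(f"Model 1: R-squared={r2_small:.3f}, MAE={mae_small:,.0f}")
print(f"Model 2: R-squared={r2_large:.3f}, MAE={mae_large:,.0f}")
```

When scored on the fitting data, adding predictors can never lower R-squared, so the size of the improvement and the change in Mean Absolute Error are what make the comparison informative.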
In the final model, each additional bathroom is associated with the highest increase in sale price.
A one-unit increase in sqm_living has the second-highest associated increase in price.
Compared to zipcode_98103, zipcode_98004 has the largest associated increase in home price, at $462,900.
When building new houses, the real estate developer should therefore prioritize:
- increasing the number of bathrooms
- the size (in square meters) of the living space
- building houses in the postal area of zipcode_98004
A further study may be required with a larger dataset for better insights.
More factors, such as the population in each area, may also be considered.
See the full analysis in the Jupyter Notebook or review this presentation.
Contact me: ndungek66@gmail.com