Boston House Market Price Prediction

Introduction

Housing is one of life's essential needs, along with food, water, and other fundamentals. As living standards improve, the demand for housing continues to rise. The housing market significantly impacts a country's economy and currency. Various factors influence housing sale prices, including the property's area, location, construction materials, and age, as well as the number of bedrooms and garages.

A house price prediction model provides valuable insights into current market valuations for home buyers, property investors, and housebuilders. It assists in identifying features that match a buyer's budget and enables investors and builders to make informed decisions.

Objective

The goal of this project is to predict housing prices in Boston suburbs based on various locality features. This includes identifying the most important features that affect house prices, preprocessing the data, and building regression models to predict prices for unseen data.

Dataset

The dataset used for this project consists of records describing suburbs or towns in the Boston Standard Metropolitan Statistical Area (SMSA). It includes various attributes, such as crime rates, the proportion of residential land, and the median value of owner-occupied homes.

Data Dictionary

Attribute	Description
CRIM	Per capita crime rate by town
ZN	Proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS	Proportion of non-retail business acres per town
CHAS	Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX	Nitric oxides concentration (parts per 10 million)
RM	Average number of rooms per dwelling
AGE	Proportion of owner-occupied units built before 1940
DIS	Weighted distances to five Boston employment centers
RAD	Index of accessibility to radial highways
TAX	Full-value property-tax rate per $10,000
PTRATIO	Pupil-teacher ratio by town
LSTAT	Percentage of lower-status population
MEDV	Median value of owner-occupied homes in $1000s

Exploratory Data Analysis (EDA)

EDA was conducted to understand the structure of the dataset, detect patterns, and identify any anomalies. This step informed the data preprocessing and modeling strategies.

Steps for EDA:

Distribution of MEDV
- Examined the distribution of the target variable, MEDV (median value of homes), to understand its central tendency, spread, and potential skewness.
Correlation Heatmap
- Analyzed the correlation between MEDV and other features, identifying strong relationships that guided feature selection.
Univariate Analysis
- Conducted individual analysis of each feature to detect patterns, potential outliers, and skewness.
Bivariate Analysis
- Explored relationships between features that exhibited significant correlations (>= 0.7 or <= -0.7) with each other and with MEDV.
Skewness Check and Log Transformation
- Checked the skewness of MEDV and applied a log transformation to normalize its distribution when necessary.
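
The skewness check and the correlation filter described in the steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's code: the generated values are made up, and only the column names and the 0.7 threshold come from this README.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: all values are made up for illustration.
rng = np.random.default_rng(0)
n = 200
lstat = rng.uniform(2, 35, n)
df = pd.DataFrame({
    "LSTAT": lstat,
    "RM": 8.0 - 0.08 * lstat + rng.normal(0, 0.3, n),
    "MEDV": np.exp(3.6 - 0.04 * lstat + rng.normal(0, 0.15, n)),
})

# Skewness of the target before and after a log transform.
skew_raw = df["MEDV"].skew()
skew_log = np.log(df["MEDV"]).skew()

# Pairs of columns whose absolute Pearson correlation is at least 0.7.
corr = df.corr()
threshold = 0.7
strong_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]

print(f"skew raw={skew_raw:.3f}, log={skew_log:.3f}")
print(strong_pairs)
```

Because the logarithm is concave, the log-transformed target is always less right-skewed than the raw one, which is why the transform is a common first remedy for a long right tail.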

Feature Engineering

Several transformations were performed to enhance the predictive power of the dataset:

Polynomial Features: Created polynomial features for LSTAT to capture non-linear relationships.
Log Transformation: Applied log transformations to skewed features like CRIM and DIS, creating new features (CRIM_log, DIS_log).
Interaction Features: Created interaction terms (CRIM_LSTAT, AGE_LSTAT) to capture combined effects.
Binning and Encoding: Binned AGE into categories (New, Moderate, Old) and encoded them into dummy variables.
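
As a rough sketch of the four transformations listed above, assuming a pandas DataFrame with the columns from the data dictionary: the sample values, the polynomial degree, the `LSTAT_sq` name, the use of a plain natural log, and the AGE cut points (35 and 70) are all assumptions made for illustration, since this README does not state them.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; only the column names come from the data dictionary.
df = pd.DataFrame({
    "CRIM": [0.006, 0.027, 0.73, 8.98, 0.11],
    "DIS": [4.09, 4.97, 6.06, 1.59, 2.50],
    "AGE": [25.0, 65.2, 78.9, 100.0, 45.8],
    "LSTAT": [4.98, 9.14, 12.43, 29.93, 5.33],
})

# Polynomial feature: squared LSTAT (degree 2 and the name are assumptions).
df["LSTAT_sq"] = df["LSTAT"] ** 2

# Log transforms of the right-skewed features (natural log is an assumption).
df["CRIM_log"] = np.log(df["CRIM"])
df["DIS_log"] = np.log(df["DIS"])

# Interaction terms as simple products of the two features.
df["CRIM_LSTAT"] = df["CRIM"] * df["LSTAT"]
df["AGE_LSTAT"] = df["AGE"] * df["LSTAT"]

# Bin AGE into three categories; the cut points 35 and 70 are assumptions.
df["AGE_bin"] = pd.cut(df["AGE"], bins=[0, 35, 70, 100],
                       labels=["New", "Moderate", "Old"], include_lowest=True)

# One-hot encode the bins, dropping the first level to avoid a redundant column.
df = pd.get_dummies(df, columns=["AGE_bin"], prefix="AGE", drop_first=True)

print(df.columns.tolist())
```

Dropping the first dummy level ("New") keeps the encoded columns from being perfectly collinear with the intercept, which matters for the VIF analysis later on.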

Data Preparation for Modeling

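Numerical features were standardized with StandardScaler so that they share a similar scale. Ordinary least squares predictions do not depend on feature scale, but the Lasso and Ridge penalties used later do, and standardization also makes coefficient magnitudes comparable across features.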
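
A minimal sketch of the scaling step, assuming a train/test split is made before scaling (the split itself, its 80/20 ratio, and the synthetic feature values are assumptions; this README does not describe them). Fitting the scaler on the training portion only is a common way to keep test-set statistics out of preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (values are made up).
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(6.3, 0.7, 100),      # RM-like scale
    rng.normal(400.0, 170.0, 100),  # TAX-like scale
])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then reuse it on the test split
# so that no test-set statistics leak into preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```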

Initial Model Training and Evaluation

An initial linear regression model was trained using the preprocessed data. The model was evaluated using key performance metrics:

Mean Absolute Error (MAE): 0.3098
Root Mean Squared Error (RMSE): 0.4321
R-squared (R²) Score: 0.7901
Mean Absolute Percentage Error (MAPE): 211.40%
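
For reference, the four metrics can be computed as in the sketch below. The helper name `regression_metrics` and the sample values are hypothetical. The example also illustrates one possible explanation for a MAPE above 100% alongside a high R²: if the target is on a transformed scale with values near zero (an assumption; this README does not state the scale), dividing by those values inflates the percentage error.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R^2, and MAPE (in percent) for two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mape = 100.0 * np.mean(np.abs(err / y_true))
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE": mape}

# Made-up targets on a standardized scale, where some values sit near zero.
y_true = np.array([-1.2, -0.05, 0.02, 0.8, 1.5])
y_pred = np.array([-1.0, 0.10, -0.10, 0.7, 1.3])

metrics = regression_metrics(y_true, y_pred)
print({k: round(float(v), 4) for k, v in metrics.items()})
```

In this toy case the absolute errors are small and R² is high, yet the two targets close to zero dominate the MAPE. MAE and RMSE are usually the safer summaries on such a scale.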

Variance Inflation Factor (VIF) Analysis

Calculated VIF values to detect multicollinearity among features. High VIF values were addressed by feature engineering, and the recalculated VIF values showed significant improvement.
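
The VIF of feature j is 1 / (1 - R_j²), where R_j² comes from regressing that feature on all the others. The sketch below computes it from the diagonal of the inverse correlation matrix, which is an equivalent formulation chosen here to keep the example short; the project itself may well have used `variance_inflation_factor` from Statsmodels. The helper name `vif_table`, the synthetic columns, and the remedy of dropping a column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features: TAX_like is built to be strongly collinear with RAD_like.
rng = np.random.default_rng(1)
n = 300
rad = rng.normal(0, 1, n)
X = pd.DataFrame({
    "RAD_like": rad,
    "TAX_like": 0.95 * rad + rng.normal(0, 0.2, n),
    "RM_like": rng.normal(0, 1, n),
})

def vif_table(frame):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    inv_corr = np.linalg.inv(frame.corr().to_numpy())
    return pd.Series(np.diag(inv_corr), index=frame.columns, name="VIF")

vif_before = vif_table(X)
# Dropping (or combining) one of the collinear columns brings the VIFs down.
vif_after = vif_table(X.drop(columns="TAX_like"))

print(vif_before.round(2))
print(vif_after.round(2))
```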

Cross-Validation Results

Performed cross-validation to evaluate the model's generalization:

Cross-Validation RMSE: 0.5729 ± 0.2420
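The standard deviation is about 42% of the mean, so the error varies noticeably from fold to fold, and the cross-validated RMSE is higher than the single-split RMSE of 0.4321 reported above.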
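
A cross-validated RMSE in the "mean ± standard deviation" form can be produced along these lines. This is a sketch on synthetic data: the number of folds (5), the shuffling, and the data are assumptions, as this README does not state the cross-validation setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (made up); 5 shuffled folds is an assumption.
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(0, 0.5, 150)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports errors as negative scores, so flip the sign.
neg_rmse = cross_val_score(LinearRegression(), X, y, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse_per_fold = -neg_rmse

print(f"CV RMSE: {rmse_per_fold.mean():.4f} ± {rmse_per_fold.std():.4f}")
# Ratio of spread to mean: a quick way to judge fold-to-fold consistency.
print(f"std / mean = {rmse_per_fold.std() / rmse_per_fold.mean():.2f}")
```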

Regularized Regression Models

Explored Lasso and Ridge regression models to improve predictive performance:

Lasso Regression: Best alpha = 0.01, RMSE = 0.6162
Ridge Regression: Best alpha = 10, RMSE = 0.6200

Neither model improved on the initial linear regression model (both RMSE values are higher than either RMSE reported for linear regression), but they provided insights into feature selection and multicollinearity handling.
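
One common way to find the best alpha for each model is a cross-validated grid search, sketched below on synthetic data. The candidate alphas, the 5-fold setup, and the data are assumptions; this README reports only the selected values (0.01 for Lasso and 10 for Ridge).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (made up): only the first two of six features matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = 1.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.5, 200)

# Candidate alphas are an assumption; the README only reports the best ones.
alphas = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100]}

results = {}
for name, model in [("Lasso", Lasso(max_iter=10_000)), ("Ridge", Ridge())]:
    search = GridSearchCV(model, alphas, cv=5,
                          scoring="neg_root_mean_squared_error")
    search.fit(X, y)
    # Store the chosen alpha, its cross-validated RMSE, and the fitted coefficients.
    results[name] = (search.best_params_["alpha"], -search.best_score_,
                     search.best_estimator_.coef_)
    print(f"{name}: best alpha={results[name][0]}, CV RMSE={results[name][1]:.4f}")

# Lasso can shrink coefficients of irrelevant features exactly to zero.
print("Lasso zeroed features:", int(np.sum(results["Lasso"][2] == 0)))
```

The count of zeroed coefficients is what makes Lasso useful for feature selection even when its error is no better than that of plain linear regression.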

Checking Linear Regression Assumptions

Diagnostic plots were used to check the assumptions of linear regression: linearity, independence of errors, homoscedasticity, and normality of residuals.
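
The diagnostic plots can be complemented with simple numeric checks on the residuals. The sketch below is not the notebook's method: it uses synthetic data that satisfies the assumptions by construction, and the particular statistics chosen (Durbin-Watson, the correlation between fitted values and absolute residuals, and Shapiro-Wilk) are assumptions about what a reasonable check might look like.

```python
import numpy as np
from scipy import stats

# Synthetic data that satisfies the assumptions by construction (made up).
rng = np.random.default_rng(11)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.6, -0.4]) + rng.normal(0, 0.3, n)

# Ordinary least squares fit, then fitted values and residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Independence: Durbin-Watson statistic, close to 2 when residuals are uncorrelated.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Homoscedasticity: absolute residuals should not trend with the fitted values.
spread_corr = np.corrcoef(fitted, np.abs(resid))[0, 1]

# Normality: Shapiro-Wilk test (a large p-value is consistent with normal residuals).
shapiro_stat, shapiro_p = stats.shapiro(resid)

print(f"Durbin-Watson={dw:.2f}, corr(fitted, |resid|)={spread_corr:.2f}, "
      f"Shapiro p={shapiro_p:.3f}")
```

Linearity is usually judged from a residuals-versus-fitted plot, where a curved pattern would suggest a missing non-linear term.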

Final Cross-Validation Results

The final model performance was consistent with the initial results, confirming that the linear regression model was well-suited for predicting housing prices with a low average prediction error.

Conclusion

The analysis of the Boston housing data has provided valuable insights into the factors influencing house prices. The linear regression model demonstrated good predictive performance and identified the most significant predictors:

Significant Predictors: NOX, RM, RAD, TAX, PTRATIO, DIS_log, and CRIM_LSTAT.
Model Performance: The linear regression model outperformed Lasso and Ridge regression models.

Recommendations

Utilize Linear Regression for Prediction: The linear regression model, with its lower RMSE, is recommended for predicting housing prices.
Leverage Lasso for Feature Selection: While Lasso did not outperform linear regression, its feature selection capability can simplify models by identifying the most significant predictors.

Policy and Planning Insights:

Improve Air Quality: Reducing nitric oxide (NOX) pollution can potentially increase property values.
Promote Larger Living Spaces: Developing properties with more rooms (RM) can boost property values.
Enhance Infrastructure: Improving accessibility to radial highways (RAD) positively impacts property prices.
Invest in Educational Quality: Improving (that is, lowering) the pupil-teacher ratio (PTRATIO) can raise property values.
Urban Planning: Shorter distances to employment centers (DIS_log) increase property values, suggesting that urban planning should focus on reducing commuting times.

Technologies Used

Python
Pandas
NumPy
Matplotlib
Seaborn
Scikit-learn
Statsmodels

License

This project is licensed under the MIT License. See the LICENSE file for more details.
