House Pricing Prediction

This project involves building machine learning models to predict house prices in Ames, Iowa, using a dataset introduced by Dean De Cock. The dataset includes comprehensive details about residential properties, and the goal is to create a reliable predictive tool to assist real estate agents and enhance the decision-making process for property pricing. This project was part of the Kaggle competition House Prices: Advanced Regression Techniques.

Problem Definition

The real estate industry requires accurate property valuation to match buyers with properties that suit their budgets. This project addresses this need by predicting house prices based on various features. The objectives are to:

Classify properties into appropriate price ranges.
Minimize prediction errors, especially underestimates of house prices.
Enhance customer satisfaction by aligning property suggestions with budgets.
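
As a minimal sketch of how the first two objectives could be expressed in code, the snippet below maps sale prices to price ranges and measures how often a prediction falls into a cheaper range than the true one. The cut-off values and the helper names `price_class` and `underestimation_rate` are assumptions for illustration; the thresholds actually used in the project are documented in the report, not here.

```python
from bisect import bisect_right

# Hypothetical class boundaries in USD (assumption, not the project's real cut-offs).
PRICE_BOUNDS = [150_000, 250_000]
PRICE_LABELS = ["low", "medium", "high"]


def price_class(sale_price: float) -> str:
    """Map a sale price to a price-range label (lower bounds are inclusive)."""
    return PRICE_LABELS[bisect_right(PRICE_BOUNDS, sale_price)]


def underestimation_rate(true_labels, predicted_labels) -> float:
    """Share of samples whose predicted class is cheaper than the true class."""
    rank = {label: i for i, label in enumerate(PRICE_LABELS)}
    under = sum(rank[p] < rank[t] for t, p in zip(true_labels, predicted_labels))
    return under / len(true_labels)


# Toy prices, invented for the example.
classes = [price_class(p) for p in [120_000, 180_000, 320_000, 250_000]]
print(classes)

# One of four predictions ("medium" for a "high" house) is an underestimate.
rate = underestimation_rate(
    ["high", "medium", "low", "high"],
    ["medium", "medium", "medium", "high"],
)
print(rate)
```

Tracking underestimates separately from overall accuracy makes the asymmetric cost of the two error types visible during model comparison.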

For further details, see the Business Understanding section in the report.

Data Understanding

The dataset consists of 1460 rows with 81 features, covering qualitative and quantitative aspects of residential properties. Key observations:

The dataset is imbalanced, with fewer high-priced houses than medium- and low-priced ones.
The target variable (sale price) correlates strongly with features such as Total Square Footage, Overall Quality, and Neighborhood.
[Figure: correlation graph highlighting key relationships and redundancies among features]

Refer to the Data Understanding section in the report for additional insights.
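
The two observations above (class imbalance and feature correlation) can be checked with a few lines of code. The following is a dependency-free sketch on invented toy values, not the project's notebook code; the `pearson` helper is written here only for illustration.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values (assumption): total square footage and sale price of five houses.
total_sf = [1500, 2100, 2600, 3400, 4000]
sale_price = [130_000, 175_000, 210_000, 305_000, 360_000]
r = pearson(total_sf, sale_price)
print(f"correlation(TotalSF, SalePrice) = {r:.3f}")

# Toy class labels (assumption) showing an under-represented "high" class.
class_counts = Counter(["low", "low", "medium", "medium", "medium", "high"])
print(class_counts.most_common())
```

Pearson correlation applies to numeric features only; a categorical feature such as Neighborhood would need a different association measure or an encoding step first.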

Data Preparation

Feature Selection: Attributes were selected based on statistical significance and their relationship to the target feature.
Data Cleaning: Addressed null values and inconsistencies, e.g., replacing "NA" with appropriate substitutes based on context.
Resampling: To address the class imbalance, the SMOTENC resampling technique was applied to oversample the minority class (high-priced houses).
[Figure: class distribution after resampling]
Normalization & Aggregation:
- Features were normalized using StandardScaler.
- Related attributes were aggregated into new features such as TotalSF, which combines square footage across different property areas.
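
The cleaning, aggregation, and normalization steps above can be sketched without any libraries. This is an illustration under stated assumptions: the three columns summed into TotalSF, the "NoGarage" substitute, and the `standardize` helper are hypothetical choices, and the project itself uses StandardScaler rather than hand-written scaling. The helper mirrors the arithmetic StandardScaler performs (subtract the mean, divide by the population standard deviation).

```python
from statistics import fmean, pstdev

# Toy rows (assumption) using column names from the Ames dataset.
rows = [
    {"TotalBsmtSF": 850, "1stFlrSF": 850, "2ndFlrSF": 0, "GarageType": "Attchd"},
    {"TotalBsmtSF": 900, "1stFlrSF": 950, "2ndFlrSF": 750, "GarageType": "NA"},
    {"TotalBsmtSF": 1100, "1stFlrSF": 1100, "2ndFlrSF": 900, "GarageType": "Detchd"},
]

for row in rows:
    # Assumption: "NA" here means "no garage", so it becomes an explicit category.
    if row["GarageType"] == "NA":
        row["GarageType"] = "NoGarage"
    # Assumption: TotalSF is the sum of basement, first-floor and second-floor area.
    row["TotalSF"] = row["TotalBsmtSF"] + row["1stFlrSF"] + row["2ndFlrSF"]


def standardize(values):
    """Scale values to zero mean and unit variance (population std, ddof=0)."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # constant columns are left unscaled
    return [(v - mean) / std for v in values]


total_sf_scaled = standardize([row["TotalSF"] for row in rows])
print([round(v, 3) for v in total_sf_scaled])
```

In a real pipeline the mean and standard deviation should be computed on the training split only and then reused for the test split, so that no information leaks from the test data.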

For more details, see the Data Preparation section in the report.

Implemented Models

We tested various machine learning models, including:

Decision Tree
Random Forest
Gradient Boosting
Naive Bayes
Neural Network

Best Performers:

Random Forest: Highest accuracy and balanced prediction across classes.
Gradient Boosting: Comparable accuracy with faster prediction times.

Model	Accuracy (avg)	Precision (avg)	Recall (avg)	F1-Score (avg)
Decision Tree	0.78	0.76	0.74	0.75
Random Forest	0.90	0.90	0.90	0.89
Gradient Boosting	0.90	0.88	0.90	0.82
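
For readers who want to see how averaged scores like those in the table are derived, here is a sketch that computes accuracy and macro-averaged precision, recall, and F1 from a confusion matrix. The table does not say which averaging scheme was used, so macro averaging is an assumption, and the matrix below is invented rather than taken from the project's results.

```python
def macro_metrics(confusion):
    """Compute accuracy and macro-averaged precision, recall and F1.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Invented 3-class matrix (rows: true low/medium/high, columns: predicted).
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
scores = macro_metrics(toy_confusion)
print({name: round(value, 3) for name, value in scores.items()})
```

Because F1 is computed per class and then averaged, the averaged F1 does not have to equal the harmonic mean of the averaged precision and recall.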

Confusion Matrices:

Random Forest: Robust identification of high-value properties.
Gradient Boosting: Precise classification for low-value properties.

ROC Curves for Best Models:
The models demonstrate strong performance, particularly for the high-price class due to resampling.
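
The area under a ROC curve can be computed for one class at a time by treating that class as positive and all others as negative. The sketch below assumes this one-vs-rest setup for the high-price class and uses made-up scores; `roc_auc` is an illustrative helper based on the pairwise-ranking definition of AUC, not the project's evaluation code.

```python
def roc_auc(labels, scores) -> float:
    """AUC as the probability that a positive sample outranks a negative one.

    labels: 1 for the positive class (e.g. high-priced), 0 for all other classes.
    scores: predicted probability of the positive class. Ties count as half.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up example: two other-class houses followed by two high-priced houses.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the positive class.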

Refer to the Modelling and Validation section in the report for more details on model evaluation, tuning and performance.

Results and Evaluation

The Random Forest model demonstrated superior performance in terms of:

Accuracy: 90%
Balanced classification across price ranges.
Practical prediction times suitable for real-world application.

Gradient Boosting offered comparable quality with faster predictions. Both models benefited from resampling to correct the class imbalance.

For additional analysis, refer to the Evaluation section in the report.
