GitHub - Sim98B/Housing: Predicting California House block's price

Predicting the price of California house blocks. The full dataset is available on Kaggle: https://www.kaggle.com/datasets/kathuman/housing

Problem Setting

Every instance refers to a house block in California.
Features refer to both structural and population characteristics of each block.
The block's median price is the target variable.

Analysis

There were 20640 instances and 10 features.
207 null values were detected and dropped; they were just 1.003% of the instances.
5.21% of the remaining 20433 instances were outliers, falling outside the 1.5 × interquartile range bounds, and were removed; we worked with 19369 instances.
Multicollinearity was detected among four features: total rooms, total bedrooms, block population and number of households per block. To avoid carrying this multicollinearity into the models, 2 new features were created, bedrooms density and household density, neither of which correlates strongly with any other feature.
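
The cleaning steps above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual code. The column names follow the usual layout of the Kaggle CSV and are assumed here. The choice of which columns to screen for outliers, the two density formulas, and the decision to drop the four original columns are all guesses, since the README does not spell them out.

```python
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values in `columns` all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    inside = df[columns].ge(q1 - k * iqr) & df[columns].le(q3 + k * iqr)
    return df[inside.all(axis=1)]


def add_density_features(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the four collinear count columns with two ratios (assumed definitions)."""
    out = df.copy()
    # Assumption: bedrooms density = bedrooms per room.
    out["bedrooms_density"] = out["total_bedrooms"] / out["total_rooms"]
    # Assumption: household density = people per household.
    out["household_density"] = out["population"] / out["households"]
    # Assumption: the original counts are dropped once the ratios exist.
    return out.drop(columns=["total_rooms", "total_bedrooms", "population", "households"])


# Toy stand-in for housing.csv: one missing value and one extreme price.
raw = pd.DataFrame({
    "median_house_value": [100000, 120000, 110000, 130000, 125000, 115000, 900000, 105000],
    "total_rooms": [1000, 1200, 1100, 1300, 1250, 1150, 1400, 1050],
    "total_bedrooms": [200, 240, None, 260, 250, 230, 280, 210],
    "population": [500, 600, 550, 650, 625, 575, 700, 525],
    "households": [180, 220, 200, 240, 230, 210, 260, 190],
})

no_nulls = raw.dropna()                                         # drop rows with null values
clean = drop_iqr_outliers(no_nulls, ["median_house_value"])     # 1.5 * IQR rule
features = add_density_features(clean)                          # derived ratio features

print(len(raw), len(no_nulls), len(clean))
print(features.head())
```

Ratios are a common way to keep the information carried by correlated counts while removing their shared dependence on block size.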

Modeling the problem

The whole dataset was split into train, test and validation sets of 80%, 10% and 10% respectively.
First, several models were trained with default parameters to select the top 3.
Then the hyperparameters of these models were tuned to get the best performance.
Finally, an ensemble was implemented to outperform the individual regressors.
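
The workflow above might look roughly like this with scikit-learn. It is a sketch on synthetic data. The candidate models, the parameter grids, the use of explained variance as the selection score, and the choice of an averaging `VotingRegressor` as the ensemble are all assumptions made for illustration; the README does not say which ones the notebook uses.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned housing features and target.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# 80% train; the remaining 20% is halved into 10% validation and 10% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Step 1: fit candidates with default parameters and rank them on the validation set.
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(random_state=0),
    "gbr": GradientBoostingRegressor(random_state=0),
}
scores = {
    name: explained_variance_score(y_val, model.fit(X_train, y_train).predict(X_val))
    for name, model in candidates.items()
}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]

# Step 2: tune the top 3 with small, hypothetical grids and cross-validation.
grids = {
    "linear": {"fit_intercept": [True, False]},
    "ridge": {"alpha": [0.1, 1.0, 10.0]},
    "tree": {"max_depth": [3, 6, None]},
    "forest": {"max_depth": [6, None]},
    "gbr": {"learning_rate": [0.05, 0.1]},
}
tuned = []
for name in top3:
    search = GridSearchCV(candidates[name], grids[name], cv=3, scoring="explained_variance")
    search.fit(X_train, y_train)
    tuned.append((name, search.best_estimator_))

# Step 3: average the tuned models and score once on the untouched test set.
ensemble = VotingRegressor(estimators=tuned).fit(X_train, y_train)
test_score = explained_variance_score(y_test, ensemble.predict(X_test))
print(top3, round(test_score, 3))
```

Keeping the test set out of both model selection and tuning is what makes the final score an estimate of performance on never-seen data.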

Conclusion

The ensemble seems to be the best model and the one that should be adopted, achieving an explained variance of 80.188% on never-seen data.
The residuals of the ensemble's predictions don't follow a normal distribution; there is a general tendency for prices to be underestimated. The main problem is the underestimation of lower prices, because it could lead to a negative price estimate.
Despite this, the model has a mean error of $42578, which corresponds to a mean percentage error of 16.228%.
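
For reference, the figures quoted above can be computed along these lines. The README does not state which error definition was used, so this sketch assumes mean absolute error and mean absolute percentage error. The numbers are toy values, not the notebook's predictions.

```python
from statistics import fmean, pvariance


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true), as a fraction."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average size of the error, in the target's units (dollars here)."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_absolute_percentage_error(y_true, y_pred):
    """Average error relative to the true value, in percent."""
    return 100.0 * fmean(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))


def mean_residual(y_true, y_pred):
    """Average of (true - predicted); a positive value means prices are underestimated."""
    return fmean(t - p for t, p in zip(y_true, y_pred))


# Toy values for illustration only.
y_true = [100000.0, 200000.0, 300000.0, 400000.0]
y_pred = [90000.0, 210000.0, 270000.0, 380000.0]

ev = explained_variance(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
bias = mean_residual(y_true, y_pred)
print(ev, mae, mape, bias)
```

A positive mean residual is the kind of systematic underestimation described above. Explained variance ignores such a constant offset, so the two are worth reporting together.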

