Predicting the median price of California housing blocks. The full dataset is available on Kaggle: https://www.kaggle.com/datasets/kathuman/housing
Problem Setting
- Every instance refers to a housing block in California.
- Features describe both structural and population characteristics of each block.
- The block's median price is the target variable.
Analysis
- There were 20640 instances and 10 features
- 207 null values were detected and dropped; they accounted for just 1.003% of the instances
- 5.21% of the instances were outliers, i.e. values lying more than 1.5 times the interquartile range beyond the quartiles, and were removed; we worked with the remaining 19369 instances
- Multicollinearity was detected among four features: total rooms, total bedrooms, block population and number of households per block. To avoid carrying this multicollinearity into the models, 2 new features were created, bedrooms density and household density, neither of which correlated strongly with any other feature
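
The outlier filter and the two derived features can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code. The column names, the choice of quartile method, and above all the definitions of the two densities (bedrooms per room, people per household) are assumptions made for the example; the analysis itself does not state how they were computed.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Keep only the rows whose value in `column` lies inside the fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]


def add_density_features(row):
    """Return a copy of `row` with two ratio features added.

    ASSUMPTION: the exact definitions are hypothetical. Here
    bedrooms_density = total_bedrooms / total_rooms and
    household_density = population / households.
    """
    out = dict(row)
    out["bedrooms_density"] = row["total_bedrooms"] / row["total_rooms"]
    out["household_density"] = row["population"] / row["households"]
    return out
```

Ratios like these replace four strongly correlated counts with two scale-free quantities, which is one common way to reduce multicollinearity without discarding the information entirely.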
Modeling the problem
- The whole dataset was split into train, test and validation sets of 80%, 10% and 10% respectively
- First, several models were trained with default parameters to select the top 3
- Then the hyperparameters of these models were tuned to get the best performance
- An ensemble was implemented to outperform the individual regressors
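
The steps above can be outlined in a few lines of plain Python. This is only a sketch under stated assumptions: the shuffling seed, the helper names, and the use of a simple prediction average as the ensemble are all hypothetical, since the text does not say which ensembling technique or which models were used.

```python
import random


def split_80_10_10(rows, seed=0):
    """Shuffle the rows and split them into train, test and validation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(0.8 * len(rows))
    n_test = int(0.1 * len(rows))
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    validation = rows[n_train + n_test:]  # whatever is left, roughly 10%
    return train, test, validation


def top_k(scores, k=3):
    """Return the names of the k models with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def average_ensemble(predictions):
    """Average the per-instance predictions of several models.

    ASSUMPTION: plain averaging stands in for the unspecified ensemble.
    `predictions` is a list with one list of predictions per model.
    """
    return [sum(values) / len(values) for values in zip(*predictions)]
```

For instance, `top_k` would be called with a dictionary mapping each candidate model's name to its validation score, and `average_ensemble` with the predictions of the three tuned models on the same instances.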
Conclusion
- The ensemble seems to be the best model and the one that should be adopted, achieving an explained variance of 80.188% on unseen data
- The ensemble's residuals don't follow a normal distribution; there is a general tendency for prices to be underestimated. The main problem is the underestimation of lower prices, because it could lead to negative price estimates
- Despite this, the model has a mean error of $42578, which corresponds to a mean percentage error of 16.228%
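
For reference, the quantities quoted in the conclusion can be computed along these lines. The sketch assumes that "mean error" means the mean absolute error and that the percentage figure is the mean absolute percentage error; the text does not confirm either reading, and the function names are illustrative.

```python
from statistics import fmean, pvariance


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true), as a fraction (multiply by 100 for %)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - pvariance(residuals) / pvariance(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted prices."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error relative to the true price, in percent."""
    return 100 * fmean(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))


def mean_residual(y_true, y_pred):
    """Average of (true - predicted); a positive value means underestimation."""
    return fmean(t - p for t, p in zip(y_true, y_pred))


def count_negative_predictions(y_pred):
    """Number of predicted prices below zero, which are never valid."""
    return sum(1 for p in y_pred if p < 0)
```

Checking `mean_residual` separately on the cheapest blocks, together with `count_negative_predictions`, is one simple way to quantify the underestimation problem described above.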