- Load the CSV file into a pandas DataFrame containing the condominium data retrieved by the web scraper. The CSV file can be downloaded here from the previous web scraping project.
- Perform some basic data exploration and visualization.
- Clean up the data and drop unused columns.
- Use pd.get_dummies() to convert categorical data to indicator columns.
- Define the explanatory variables and the dependent variable (price_sqm in THB).
- Split the data into training and test sets.
- Find the optimal model parameters for each model.
- Define a generic function that runs 10-fold cross-validation and evaluates estimator performance.
- Using make_pipeline, apply RobustScaler() for data scaling, then pass the scaled data to the regressor model.
- The benchmark is R-squared from 10-fold CV:
  - Gradient Boosting regression - 0.7743
  - Neural network models: Multi-layer Perceptron (MLPRegressor) - 0.6647
  - K-Neighbors Regressor - 0.6183
  - Ridge Regression - 0.6130
  - Lasso Regression - 0.6126
  - Random forest regressor - 0.6122
  - Ordinary least squares regression - 0.6116
  - ElasticNet - 0.6116
  - Decision Trees - 0.5517
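
The workflow above (dummy encoding, train/test split, a RobustScaler pipeline, and a generic 10-fold cross-validation helper scored by R-squared) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, every column except price_sqm is a made-up placeholder, the helper name cv_r2 is hypothetical, and the model settings are arbitrary, so the scores it prints will not match the benchmark figures listed above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the scraped CSV. Only price_sqm comes from the
# write-up; district, floors and year_built are hypothetical columns.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "district": rng.choice(["A", "B", "C"], size=n),
    "floors": rng.integers(5, 50, size=n),
    "year_built": rng.integers(1995, 2020, size=n),
})
district_premium = df["district"].map({"A": 0.0, "B": 20000.0, "C": 45000.0})
df["price_sqm"] = (
    60000
    + district_premium
    + 500 * df["floors"]
    + 800 * (df["year_built"] - 1995)
    + rng.normal(0, 5000, size=n)
)

# Convert the categorical column to indicator columns, then separate the
# explanatory variables (X) from the dependent variable (y).
X = pd.get_dummies(df.drop(columns="price_sqm"), columns=["district"], dtype=float)
y = df["price_sqm"]

# Hold out a test set; cross-validation below uses the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def cv_r2(estimator, X, y, n_splits=10):
    """Return mean and std of R-squared over n_splits-fold cross-validation.

    The estimator is wrapped in a pipeline so that RobustScaler is fitted
    on the training folds only, inside each split.
    """
    pipeline = make_pipeline(RobustScaler(), estimator)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=folds, scoring="r2")
    return scores.mean(), scores.std()


# Two of the listed model families, with arbitrary illustrative settings.
models = {
    "Ridge": Ridge(alpha=1.0),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    results[name] = cv_r2(model, X_train, y_train)
    mean_r2, std_r2 = results[name]
    print(f"{name}: R-squared = {mean_r2:.4f} (+/- {std_r2:.4f})")
```

Putting the scaler inside the pipeline, rather than scaling the whole dataset up front, keeps each validation fold out of the scaling statistics, so the cross-validated R-squared is not inflated by leakage. The same helper can be reused for any of the other regressors by adding them to the models dictionary.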
Even though this dataset is quite small with many features, and we can only predict the price per square meter for each condo, this study is useful for buyers, resellers, agents and even developers who need a 'fair price' as a starting point based on current actual market data.
We have not yet used the price history data in this project, which could be very useful for visualizing the trend in each area (which areas are growing rapidly and which are reaching a plateau).
In the scraping step, we should acquire all listings available in each condo, not only the average price per sqm. This should greatly increase the number of records, and it would be very useful for estimating the price of every single room in the future.
Thanks to these people for inspiration: