This is a Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence) which focuses on how car resale prices are affected by variables from the Craigslist Cars and Trucks dataset on Kaggle (Version 10).
For a detailed walk-through, please view the source code in this order:
- Exploratory Data Analysis, Part 1 & Part 2
- Data cleaning and Feature engineering
- Machine learning model exploration
- XGBoost tree
- Conclusion
- Video Presentation
Main problem: How do different variables such as [odometer, year, condition, fuel, title_status, transmission, drive, state, manufacturer] affect the resale price of a car?
Subproblem 1: How do different models (tree regressor, linear regression, tree classifier) respond to the variables?
Subproblem 2: Does applying feature engineering to the dataset change the way these models respond to it?
Subproblem 3: Is there a better model (XGBoost) that we can use to predict the price of a car more accurately?
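As a quick orientation, here is a minimal sketch of how the columns named in the main problem might be pulled from the Kaggle CSV. The file name `vehicles.csv` and the column selection are assumptions for illustration, not the notebooks' exact code:

```python
import pandas as pd

# Load the Craigslist cars/trucks dataset (assumed saved locally as vehicles.csv)
df = pd.read_csv("vehicles.csv")

# Keep only the response variable and the predictors named in the main problem
cols = ["price", "odometer", "year", "condition", "fuel",
        "title_status", "transmission", "drive", "state", "manufacturer"]
df = df[cols]

print(df.shape)
```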
- Linear Regression
- Tree Regression
- Tree Classifier
- XGBoost
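The sketch below shows how the three baseline models listed above might be fitted and compared. The simple `dropna`, the one-hot encoding, and the 5-band price discretisation are illustrative assumptions rather than the tuned setup used in the notebooks:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# For this baseline sketch, simply drop incomplete rows and one-hot encode categoricals
data = df.dropna()
X = pd.get_dummies(data.drop(columns="price"), drop_first=True)
y = data["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Regression baselines, scored with R^2 on the held-out set
for model in (LinearRegression(), DecisionTreeRegressor(max_depth=6)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "R^2:", model.score(X_test, y_test))

# The tree classifier needs price discretised into bands first
y_class = pd.qcut(y, q=5, labels=False, duplicates="drop")
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X, y_class, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(max_depth=6).fit(Xc_train, yc_train)
print("DecisionTreeClassifier accuracy:", clf.score(Xc_test, yc_test))
```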
The video presentation can be accessed on YouTube here. The slides and the video file are also uploaded to the repository for easy viewing here.
- The dataset provided on Kaggle is very dirty in general (see the cleaning sketch after this list):
  a. Prices contain many erroneous data points
  b. Many categorical variables have missing values
- Simply fitting the data points into any machine learning model with this dataset, as seen in many Kaggle notebooks, yields poor results:
  a. By performing some feature engineering (see the sketch after this list), we are able to utilise variables that previously had low correlation with price
- Classification (especially with XGBoost; see the sketch after this list) can generate better predictions than rudimentary models like linear regression and the tree classifier
- It is possible to predict the resale price of a car with even greater accuracy, and more can be explored with other methods such as neural networks
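To make the first point concrete, here is a minimal cleaning sketch; the price cut-offs are illustrative assumptions, not the notebooks' exact thresholds:

```python
# Drop clearly erroneous prices: free/placeholder listings and extreme outliers
# (the 500 and 100,000 cut-offs are illustrative, not the notebooks' exact values)
df = df[(df["price"] > 500) & (df["price"] < 100_000)]

# Inspect how much is still missing in each variable
print(df.isna().mean().sort_values(ascending=False))
```

Feature engineering can then recover signal from the weakly correlated or incomplete variables. A sketch of the kinds of imputation and discretisation involved (the column choices here are illustrative):

```python
# Imputation: give missing categoricals an explicit "missing" level instead of dropping rows
for col in ["condition", "drive", "title_status"]:
    df[col] = df[col].fillna("missing")

# Median-impute odometer, which has a long right tail
df["odometer"] = df["odometer"].fillna(df["odometer"].median())

# Discretisation: bucket the skewed odometer reading into quartile bands
df["odometer_band"] = pd.qcut(df["odometer"], q=4, labels=["low", "mid", "high", "very_high"])

# Derive car age from year, which is often more informative than the raw year
df["age"] = df["year"].max() - df["year"]
```

Finally, a sketch of framing the problem as classification over price bands with XGBoost; the hyperparameters are illustrative defaults rather than the tuned values in the notebook:

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Encode predictors and discretise price into 5 bands, as in the baseline sketch above
X = pd.get_dummies(df.drop(columns="price"), drop_first=True)
y_class = pd.qcut(df["price"], q=5, labels=False, duplicates="drop")

Xb_train, Xb_test, yb_train, yb_test = train_test_split(X, y_class, test_size=0.2, random_state=42)

# XGBoost handles remaining missing values natively
xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
xgb.fit(Xb_train, yb_train)
print("XGBoost accuracy:", xgb.score(Xb_test, yb_test))
```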
- Regression models may not always be the best approach, and other models can give a better fit
- XGBoost and the logic behind the model (concepts of Precision, Recall, and F1 Score; see the sketch after this list)
- Feature engineering techniques such as imputation, discretization... (more in the notebooks)
- Working with each other on GitHub and some of the utilities that support remote collaboration
- Not to blindly fit a dataset into models, as some variables can be heavily skewed
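For reference, precision is TP/(TP+FP), recall is TP/(TP+FN), and the F1 Score is their harmonic mean, 2PR/(P+R). A minimal sketch of computing all three for the price-band classifier, continuing from the XGBoost sketch above (the variable names are the same illustrative ones):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

y_pred = xgb.predict(Xb_test)

# Macro-averaging treats every price band equally, regardless of its size
print("Precision:", precision_score(yb_test, y_pred, average="macro"))
print("Recall:   ", recall_score(yb_test, y_pred, average="macro"))
print("F1 score: ", f1_score(yb_test, y_pred, average="macro"))

# Per-band breakdown of all three metrics
print(classification_report(yb_test, y_pred))
```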
We are from Lab group SC5, Project Group 1
- @HiIAmTzeKean (Data cleaning, Machine model, Conclusions)
- @Ki-ann (XGBoost)
- @onghaixiang (EDA)
- https://www.kaggle.com/austinreese/craigslist-carstrucks-data
- https://www.projectpro.io/article/8-feature-engineering-techniques-for-machine-learning/423
- https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
- https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/