This is a Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence) which focuses on how car resale prices are affected by variables from the Craigslist Cars and Trucks dataset on Kaggle (Version 10).
For a detailed walk-through, please view the source code in this order:
- Exploratory Data Analysis, Part 1 & Part 2
- Data cleaning and Feature engineering
- Machine learning model exploration
- XGBoost tree
- Conclusion
- Video Presentation
Main problem: How do different variables such as [odometer, year, condition, fuel, title_status, transmission, drive, state, manufacturer] affect the resale price of a car?
Subproblem 1: How do different models (tree regressor, linear regression, tree classifier) respond to the variables?
Subproblem 2: Does applying feature engineering to the dataset change the way these models respond to it?
Subproblem 3: Is there a better model (XGBoost) that we can use to predict the price of a car more accurately?
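As a quick orientation, here is a minimal sketch of how the columns named in the main problem might be pulled from the Kaggle CSV. The file name `vehicles.csv` and the column selection are assumptions for illustration, not the notebooks' exact code:

```python
import pandas as pd

# Load the Craigslist cars/trucks dataset (assumed saved locally as vehicles.csv)
df = pd.read_csv("vehicles.csv")

# Keep only the response variable and the predictors named in the main problem
cols = ["price", "odometer", "year", "condition", "fuel",
        "title_status", "transmission", "drive", "state", "manufacturer"]
df = df[cols]

print(df.shape)
```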
- Linear Regression
- Tree Regression
- Tree Classifier
- XGBoost
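The sketch below shows how the three baseline models listed above might be fitted and compared. The simple `dropna`, the one-hot encoding, and the 5-band price discretisation are illustrative assumptions rather than the tuned setup used in the notebooks:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# For this baseline sketch, simply drop incomplete rows and one-hot encode categoricals
data = df.dropna()
X = pd.get_dummies(data.drop(columns="price"), drop_first=True)
y = data["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Regression baselines, scored with R^2 on the held-out set
for model in (LinearRegression(), DecisionTreeRegressor(max_depth=6)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "R^2:", model.score(X_test, y_test))

# The tree classifier needs price discretised into bands first
y_class = pd.qcut(y, q=5, labels=False, duplicates="drop")
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X, y_class, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(max_depth=6).fit(Xc_train, yc_train)
print("DecisionTreeClassifier accuracy:", clf.score(Xc_test, yc_test))
```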
The video presentation can be accessed on YouTube here. The slides and the video file are also uploaded to the repository for easy viewing here.
- The dataset provided on Kaggle is very dirty in general (see the cleaning sketch after this list):
  a. Prices contain many erroneous data points
  b. Many categorical variables have missing values
- Simply fitting the data points into any machine learning model with this dataset, as seen in many Kaggle notebooks, yields poor results:
  a. By performing some feature engineering (see the sketch after this list), we are able to utilise variables that previously had low correlation with price
- Classification (especially with XGBoost; see the sketch after this list) can generate better predictions than rudimentary models like linear regression and the tree classifier
- It is possible to predict the resale price of a car with even greater accuracy, and more can be explored with other methods such as neural networks
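To make the first point concrete, here is a minimal cleaning sketch; the price cut-offs are illustrative assumptions, not the notebooks' exact thresholds:

```python
# Drop clearly erroneous prices: free/placeholder listings and extreme outliers
# (the 500 and 100,000 cut-offs are illustrative, not the notebooks' exact values)
df = df[(df["price"] > 500) & (df["price"] < 100_000)]

# Inspect how much is still missing in each variable
print(df.isna().mean().sort_values(ascending=False))
```

Feature engineering can then recover signal from the weakly correlated or incomplete variables. A sketch of the kinds of imputation and discretisation involved (the column choices here are illustrative):

```python
# Imputation: give missing categoricals an explicit "missing" level instead of dropping rows
for col in ["condition", "drive", "title_status"]:
    df[col] = df[col].fillna("missing")

# Median-impute odometer, which has a long right tail
df["odometer"] = df["odometer"].fillna(df["odometer"].median())

# Discretisation: bucket the skewed odometer reading into quartile bands
df["odometer_band"] = pd.qcut(df["odometer"], q=4, labels=["low", "mid", "high", "very_high"])

# Derive car age from year, which is often more informative than the raw year
df["age"] = df["year"].max() - df["year"]
```

Finally, a sketch of framing the problem as classification over price bands with XGBoost; the hyperparameters are illustrative defaults rather than the tuned values in the notebook:

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# Encode predictors and discretise price into 5 bands, as in the baseline sketch above
X = pd.get_dummies(df.drop(columns="price"), drop_first=True)
y_class = pd.qcut(df["price"], q=5, labels=False, duplicates="drop")

Xb_train, Xb_test, yb_train, yb_test = train_test_split(X, y_class, test_size=0.2, random_state=42)

# XGBoost handles remaining missing values natively
xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
xgb.fit(Xb_train, yb_train)
print("XGBoost accuracy:", xgb.score(Xb_test, yb_test))
```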
- Regression models may not always be the best approach, and other models can give a better fit
- XGBoost and the logic behind the model (concepts of Precision, Recall, and F1 Score; see the sketch after this list)
- Feature engineering techniques such as imputation, discretization... (more in the notebooks)
- Working with each other on GitHub and some of the utilities that support remote collaboration
- Not to blindly fit a dataset into models, as some variables can be heavily skewed
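For reference, precision is TP/(TP+FP), recall is TP/(TP+FN), and the F1 Score is their harmonic mean, 2PR/(P+R). A minimal sketch of computing all three for the price-band classifier, continuing from the XGBoost sketch above (the variable names are the same illustrative ones):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

y_pred = xgb.predict(Xb_test)

# Macro-averaging treats every price band equally, regardless of its size
print("Precision:", precision_score(yb_test, y_pred, average="macro"))
print("Recall:   ", recall_score(yb_test, y_pred, average="macro"))
print("F1 score: ", f1_score(yb_test, y_pred, average="macro"))

# Per-band breakdown of all three metrics
print(classification_report(yb_test, y_pred))
```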
We are from Lab group SC5, Project Group 1
- @HiIAmTzeKean (Data cleaning, Machine model, Conclusions)
- @Ki-ann (XGBoost)
- @onghaixiang (EDA)
- https://www.kaggle.com/austinreese/craigslist-carstrucks-data
- https://www.projectpro.io/article/8-feature-engineering-techniques-for-machine-learning/423
- https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
- https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/