This project applies computer vision techniques to identify digits in a dataset of tens of thousands of handwritten images.
https://www.kaggle.com/c/digit-recognizer
Data preparation:
• Load Data
• Check for null/missing values
• Check for unbalanced labels
• Data normalization
• Label encoding (one-hot encoding to convert the categorical labels to one-hot vectors)
• Split training and validation sets
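The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses a random NumPy array as a stand-in for the Kaggle `train.csv` (assumed here to hold one label column plus 784 pixel columns with values 0-255), and the 10% validation share and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for train.csv (assumption): labels 0-9, 784 pixels in 0-255.
n_samples, n_pixels, n_classes = 1000, 784, 10
labels = rng.integers(0, n_classes, size=n_samples)
pixels = rng.integers(0, 256, size=(n_samples, n_pixels)).astype(np.float64)

# Check for null/missing values.
n_missing = int(np.isnan(pixels).sum())

# Check for unbalanced labels: number of samples per digit.
class_counts = np.bincount(labels, minlength=n_classes)

# Data normalization: scale pixel intensities from [0, 255] to [0, 1].
X = pixels / 255.0

# One-hot encoding: row i has a single 1 in column labels[i].
Y = np.eye(n_classes)[labels]

# Shuffle, then hold out 10% for validation (hypothetical ratio).
idx = rng.permutation(n_samples)
n_val = n_samples // 10
val_idx, train_idx = idx[:n_val], idx[n_val:]
X_train, X_val = X[train_idx], X[val_idx]
Y_train, Y_val = Y[train_idx], Y[val_idx]

print("missing values:", n_missing)
print("samples per class:", class_counts)
print("train/validation shapes:", X_train.shape, X_val.shape)
```

With the real data, the arrays would instead be read from `train.csv`. Classifiers that expect integer class labels can recover them from the one-hot matrix with `Y.argmax(axis=1)`.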
Models:
• Multiple Linear Regression
• Support Vector Machine (SVM) with Principal Component Analysis (PCA)
• eXtreme Gradient Boosting (XGBoost) with parameter tuning
• Random Forest Classifier
• K Nearest Neighbors Classifier (KNN) with Principal Component Analysis (PCA)
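As an illustration of one of the models listed above, the following sketch chains PCA and an SVM in a scikit-learn pipeline. It is only a sketch under stated assumptions: it runs on scikit-learn's small built-in 8x8 digits dataset rather than the Kaggle data, and the number of components, kernel, `C`, and split ratio are illustrative values, not the parameters used in this project.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Built-in 8x8 digits (pixel values 0-16), used here as a small stand-in dataset.
digits = load_digits()
X = digits.data / 16.0   # normalize to [0, 1]
y = digits.target        # SVC expects integer class labels, not one-hot vectors

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# PCA reduces dimensionality before the SVM; all parameter values are assumptions.
model = make_pipeline(
    PCA(n_components=30, random_state=0),
    SVC(kernel="rbf", C=10.0, gamma="scale"),
)
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

Wrapping both steps in one pipeline means PCA is fitted on the training split only, so the validation data does not leak into the projection.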
Each model is evaluated on the validation data using both its F1 score and its accuracy.
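A small sketch of how the two metrics can be computed with scikit-learn is shown below. The labels are made up for illustration, and macro averaging of the per-class F1 scores is an assumption, since the averaging scheme is not stated here.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only (not project results).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Accuracy: fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# Macro F1: unweighted mean of the per-class F1 scores (assumed averaging scheme).
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"accuracy: {acc:.3f}")
print(f"macro F1: {macro_f1:.3f}")
```

In this toy case the per-class F1 scores are 1, 2/3, and 0.8, so the macro F1 differs slightly from the accuracy of 5/6. Reporting both is useful when classes are not perfectly balanced.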
Accuracy of each model on the validation data:
- SVM with PCA : 97.9%
- KNN with PCA : 97.6%
- XGBoost with parameter tuning : 96.2%
- Random Forest Classifier : 88.7%
- Multiple Linear Regression : 85.1%
These values are indicative only: they depend heavily on the choice of parameters such as the PCA component range, the seed in KNN, the number of estimators in Random Forest, and many others.
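To give a feel for this sensitivity, the sketch below varies the number of PCA components in front of a KNN classifier and records the validation accuracy for each setting. It again uses scikit-learn's built-in 8x8 digits dataset as a stand-in, so the numbers it prints are not comparable to the results above. The component values and `n_neighbors=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

digits = load_digits()
X_train, X_val, y_train, y_val = train_test_split(
    digits.data / 16.0, digits.target,
    test_size=0.2, random_state=0, stratify=digits.target,
)

# Validation accuracy for each (hypothetical) number of PCA components.
results = {}
for n_components in (5, 10, 20, 40):
    model = make_pipeline(
        PCA(n_components=n_components, random_state=0),
        KNeighborsClassifier(n_neighbors=5),
    )
    model.fit(X_train, y_train)
    results[n_components] = model.score(X_val, y_val)

for n_components, acc in results.items():
    print(f"PCA components = {n_components:2d}: validation accuracy = {acc:.3f}")
```

A fuller search would typically use cross-validation (for example `GridSearchCV`) rather than a single split, so that the chosen parameters are less tied to one particular validation set.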