Created a model that classifies employee promotions with 89.49% accuracy.
Used two datasets, loaded with the pandas library in Python: PromosSample.csv as historical data (54,808 examples) and test.csv as current data (23,490 examples).
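As a minimal sketch, the two datasets can be loaded with pandas like this (the tiny in-memory CSV below is only a stand-in to show the call shape; in the project the paths would be `PromosSample.csv` and `test.csv`):

```python
import io
import pandas as pd

def load_datasets(historical_path, current_path):
    """Load the historical (labeled) and current (unlabeled) datasets."""
    return pd.read_csv(historical_path), pd.read_csv(current_path)

# Tiny in-memory stand-in for PromosSample.csv, just to show the call shape.
sample = io.StringIO("employee_id,age,is_promoted\n1,30,0\n2,41,1\n")
historical = pd.read_csv(sample)
print(historical.shape)
```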
Applied Decision Tree Classifier, Logistic Regression, Support Vector Classifier, Random Forest Classifier, Bernoulli Naive Bayes, and KNeighborsClassifier, and used GridSearchCV to tune the best model.
Python version: Python 3.7.11
Packages: pandas, seaborn, matplotlib, numpy, scikit-learn, pickle, and imbalanced-learn (for SMOTE)
Machine Learning Yearning by Andrew Ng
Calculating the missing value ratio
Binary classification project with similar dataset
Used historical dataset with 13 columns:
Column name | Variable type
---|---
employee_id | Numerical |
department | Categorical |
region | Categorical |
education | Categorical |
gender | Categorical |
recruitment_channel | Categorical |
no_of_trainings | Numerical |
age | Numerical |
previous_year_rating | Numerical |
length_of_service | Numerical |
awards_won? | Categorical |
avg_training_score | Numerical |
is_promoted | Categorical |
After pulling the data, I cleaned up both datasets (historical and current) to reduce noise. The following changes were made:
- Removed duplicates, if any, based on the "employee_id" column,
- Checked null values and their ratios; none of the variables were removed since their ratios are quite small,
- Filled missing values in the historical dataset: 'previous_year_rating' with the mean based on 'awards_won?', 'education' and 'recruitment_channel' with the most frequent value based on 'department', and 'gender' with the mode based on 'awards_won?',
- Filled missing values in the current dataset: 'previous_year_rating' with the mean based on 'awards_won?' from the historical dataset, and 'education' with the most frequent value based on 'department' from the historical dataset,
- Replaced Bachelors with Bachelor's in the 'education' column for consistency,
Before:
After:
- Replaced FEMALE, Female, and female values with f, and MALE, Male, and male with m in the 'gender' column for consistency,
Before:
After:
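The cleaning steps above can be sketched in pandas roughly as follows; the toy frame below stands in for the historical dataset and uses the column names from the table above:

```python
import pandas as pd

# Toy stand-in for the historical dataset (column names from the table above).
df = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 4],
    "education": ["Bachelors", None, None, "Master's & above", "Bachelors"],
    "gender": ["FEMALE", "Male", "Male", "female", "m"],
    "awards_won?": [0, 1, 1, 1, 0],
    "previous_year_rating": [3.0, 5.0, 5.0, None, None],
})

# 1. Drop duplicate employees.
df = df.drop_duplicates(subset="employee_id")

# 2. Check the missing-value ratio per column.
missing_ratio = df.isnull().mean()

# 3. Fill previous_year_rating with the group mean over awards_won?.
df["previous_year_rating"] = df.groupby("awards_won?")["previous_year_rating"] \
    .transform(lambda s: s.fillna(s.mean()))

# 4. Normalize labels for consistency.
df["education"] = df["education"].replace("Bachelors", "Bachelor's")
df["gender"] = df["gender"].str.lower().str[0]  # FEMALE/Female/female -> 'f', etc.
```

The same group-based fills would be applied to the current dataset, using the group statistics computed from the historical data.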
Visualized the cleaned data to see the trends:
- Created a donut chart for the is_promoted data. It looks like the data is imbalanced and needs to be balanced.
- Created pie charts and a stacked bar chart for categorical variables, e.g. the department variable:
- Created bar graphs and a stacked bar chart for numerical variables, e.g. the previous_year_rating variable:
Categorical variables were encoded, numerical ones were normalized, and the 'employee_id' column was removed from both datasets.
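A minimal sketch of this preprocessing step, assuming one-hot encoding for categoricals and min-max scaling for numericals (the report does not state which encoder or scaler was used):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "department": ["Sales", "HR", "Sales", "Technology"],
    "age": [25, 40, 33, 52],
})

# Drop the identifier; it carries no predictive signal.
df = df.drop(columns="employee_id")

# One-hot encode categorical columns (one possible encoding choice).
df = pd.get_dummies(df, columns=["department"])

# Normalize numerical columns to [0, 1].
df["age"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()
```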
The data were balanced by applying SMOTE and visualized with a donut chart:
Data were split into train (80%) and test (20%) sets.
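The 80/20 split can be done with scikit-learn's `train_test_split`; a stratified split (an assumption here, since the report does not say) keeps the class ratio the same in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 80/20 split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(len(X_train), len(X_test))
```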
I used six models (Decision Tree Classifier, Logistic Regression, Support Vector Classifier, Random Forest Classifier, Bernoulli Naive Bayes, and KNeighborsClassifier) to predict promotions and evaluated them using accuracy.
The Random Forest Classifier performed better than the other models in this project, but after tuning the parameters its accuracy dropped to 78%, so I used the Decision Tree model, which had the second-highest score.
Model | Cross Validation Accuracy Score
---|---
Decision Tree | 0.9273
Logistic Regression | 0.7748
Support Vector Classifier | 0.8448
Random Forest Classifier | 0.9497
Naive Bayes | 0.6917
K-Neighbors | 0.8362
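The comparison above can be sketched with `cross_val_score`; the synthetic data below stands in for the prepared training set, so the scores it produces are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
    "Naive Bayes": BernoulliNB(),
    "K-Neighbors": KNeighborsClassifier(),
}

# Mean 5-fold cross-validation accuracy per model.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="accuracy").mean()
          for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.4f}")
```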
Using GridSearchCV, I found the optimal hyperparameters and achieved a best cross-validation accuracy of 88.43%.
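A sketch of the GridSearchCV step; the parameter grid below is hypothetical, since the report does not list the actual ranges searched:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Hypothetical grid; the actual searched ranges are not given in the report.
param_grid = {"max_depth": [5, 10, None], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```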
Applied the Decision Tree model with the optimal hyperparameters and got an 89.49% test accuracy score.
The model was pickled and saved to disk as model_file.p.
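Saving and restoring the model with pickle, using the file name mentioned above (the model trained here on synthetic data is only a stand-in):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Save the trained model to disk under the name mentioned above.
with open("model_file.p", "wb") as f:
    pickle.dump(model, f)

# Later, load it back to predict on the current dataset.
with open("model_file.p", "rb") as f:
    restored = pickle.load(f)
```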
The 'previous_year_rating', 'avg_training_score', and 'length_of_service' features have the most impact on whether an employee gets promoted.
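For a tree model, this kind of ranking comes straight from `feature_importances_`, which sums to 1 across features; a sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Larger importance means the feature drives more of the tree's splits.
ranking = np.argsort(model.feature_importances_)[::-1]
print(ranking, model.feature_importances_[ranking])
```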
When we apply the model to the current dataset, we can expect 13,902 current employees to get promotions while 9,588 do not; the donut chart shows the distribution.
The Confusion Matrix below shows that our model needs to be improved to predict promotions better.
We estimate the bias as 8.05% and the variance as 2.46% (10.51 − 8.05). The classifier fits the training set poorly, with 8.05% error, but its error on the test set is only slightly higher than the training error.
The classifier therefore has high bias but low variance.
We can say that the algorithm is underfitting.
Data | Accuracy Score (%) | Error (%)
---|---|---
Training | 91.95 | 8.05 |
Test | 89.49 | 10.51 |
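The arithmetic behind this estimate, in the Machine Learning Yearning sense (bias approximated by the training error, variance by the gap between test and training error):

```python
# Bias ~ training error; variance ~ test error minus training error.
train_err = round(100 - 91.95, 2)          # training error (%)
test_err = round(100 - 89.49, 2)           # test error (%)
variance = round(test_err - train_err, 2)  # gap between test and training error
print(train_err, test_err, variance)
```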
Adding input features might help reduce the model's bias.
Error analysis should be performed to understand the underlying causes of the errors (misclassifications).
Thanks for reading :)