This repo contains the code and presentation for the take-home exercise for the Data Scientist position at Dataiku. Directories are organized by purpose for modeling; for example, data/processed contains the pickle files produced during preprocessing.
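A minimal sketch of the pickle round-trip used for the artifacts in data/processed. The file name and columns below are illustrative, not actual artifacts from the repo:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative processed frame (hypothetical columns, not the real dataset).
df = pd.DataFrame({"age": [25, 40], "weeks_worked_per_year": [52, 30]})

# Write and reload a pickle the same way the processing notebook would.
path = Path(tempfile.mkdtemp()) / "example.pkl"
df.to_pickle(path)
reloaded = pd.read_pickle(path)

assert reloaded.equals(df)  # round-trip preserves dtypes and values
```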
The repo is designed to run the following notebooks in sequence: code-1 (EDA), code-2 (data processing), code-3 (modeling with XGBoost and LightGBM), and code-4 (inference on the test data). Please note that I used Kaggle's free cloud infrastructure for the XGBoost modeling and some of the inference.
Both XGBoost and LightGBM were optimized for their F1-scores. The XGBoost models produced higher precision and accuracy, while the LightGBM models produced higher recall; in summary, different models can be leveraged for different business use cases. Majority voting across the models produced results comparable to XGBoost alone, and a simple weighted aggregation of the models was explored to assess the impact of introducing a slight bias toward recall.
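The two ensembling strategies above can be sketched as follows. The prediction arrays and weights are illustrative placeholders, not the repo's actual model outputs or tuned weights:

```python
import numpy as np

# Hypothetical binary predictions from three trained models (illustrative only).
xgb_preds = np.array([1, 0, 1, 1, 0])
lgbm_a_preds = np.array([1, 1, 1, 0, 0])
lgbm_b_preds = np.array([0, 1, 1, 1, 0])
stacked = np.vstack([xgb_preds, lgbm_a_preds, lgbm_b_preds])

# Majority vote: predict 1 when at least two of the three models agree.
majority = (stacked.sum(axis=0) >= 2).astype(int)

# Weighted aggregation: slightly upweight the higher-recall LightGBM models,
# nudging the combined prediction toward recall (example weights only).
weights = np.array([0.30, 0.35, 0.35])
weighted = (weights @ stacked >= 0.5).astype(int)
```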
Overlapping characteristics driving the models' performance were observed: industry occupation, type of worker, age, sex (male, female), additional net worth (capital gains/losses, stocks), number of weeks worked per year, and company size.
Python 3.8 was used. See the requirements.txt file for additional setup information.