dataiku-exercise_Jan2025

This repo contains the code and presentation for the take-home exercise for the Data Scientist position at Dataiku. Directories have been organized so the relevant modeling artifacts are easy to find; for example, data/processed contains a variety of pickle files.

The general architecture of this repo is designed to run the following notebooks in sequence: code-1 (EDA), code-2 (data processing), code-3 (XGBoost and LightGBM modeling), and code-4 (inference on test data). Please note that I used Kaggle's free cloud infrastructure for the XGBoost modeling and some additional inference.
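
A minimal sketch of running the notebooks in order from a single script, assuming they sit at the repo root and are named code-1.ipynb through code-4.ipynb (adjust the filenames if they differ):

```python
# Sketch: execute the four notebooks in sequence with nbconvert.
# Notebook filenames are assumptions; adjust to match the actual files.
import subprocess

NOTEBOOKS = ["code-1.ipynb", "code-2.ipynb", "code-3.ipynb", "code-4.ipynb"]

for nb in NOTEBOOKS:
    # --inplace writes the executed output back into the same notebook
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
        check=True,
    )
```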

Both XGBoost and LightGBM were optimized on F1-score. The XGBoost models produced higher precision and accuracy, while the LightGBM models produced higher recall; in summary, there are models that can be leveraged for different business use cases. Majority voting across the models produced results comparable to XGBoost alone, and a simple weighted model aggregation was explored to determine the impact of a slight bias towards recall.
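
As a rough illustration (not the exact code in the notebooks) of the two ensembling strategies described above, the sketch below contrasts a plain majority vote with a weighted average of predicted probabilities that can be tilted towards recall; the model outputs and weights are dummy values:

```python
# Illustrative only: majority vote vs. a recall-leaning weighted vote over
# binary classifiers. Probabilities and weights below are made up.
import numpy as np

def majority_vote(preds):
    """preds: list of 0/1 prediction arrays, one per model."""
    return (np.vstack(preds).mean(axis=0) >= 0.5).astype(int)

def weighted_vote(probas, weights, threshold=0.5):
    """probas: list of predicted-positive-probability arrays, one per model.
    Extra weight on a higher-recall model (or a threshold below 0.5)
    shifts the ensemble towards recall."""
    blended = np.average(np.vstack(probas), axis=0, weights=weights)
    return (blended >= threshold).astype(int)

# Dummy probabilities from three models
xgb_p  = np.array([0.70, 0.20, 0.60])
lgbm_p = np.array([0.60, 0.45, 0.30])
xgb2_p = np.array([0.80, 0.10, 0.40])

hard = [(p >= 0.5).astype(int) for p in (xgb_p, lgbm_p, xgb2_p)]
print(majority_vote(hard))                       # plain majority vote
print(weighted_vote([xgb_p, lgbm_p, xgb2_p],     # LightGBM weighted up and
                    weights=[1.0, 1.5, 1.0],     # threshold nudged down to
                    threshold=0.45))             # favor recall
```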

Overlapping characteristics driving the models' performance were observed: industry and occupation, type of worker, age, sex (male, female), additional net worth (capital gains/losses, stocks), number of weeks worked per year, and company size.
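
A rough sketch (not the repo's actual code) of how such overlap can be surfaced: fit an XGBoost and a LightGBM classifier and intersect their top feature importances. The feature names and data below are synthetic placeholders:

```python
# Illustrative only: find features ranked highly by both models.
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
feature_names = ["age", "weeks_worked", "capital_gains", "occupation_code", "company_size"]
X = pd.DataFrame(rng.normal(size=(500, len(feature_names))), columns=feature_names)
y = (X["age"] + X["weeks_worked"] > 0).astype(int)  # synthetic target

xgb = XGBClassifier(n_estimators=50).fit(X, y)
lgbm = LGBMClassifier(n_estimators=50).fit(X, y)

def top_features(model, names, k=3):
    # Rank features by the model's built-in importance scores
    return set(pd.Series(model.feature_importances_, index=names).nlargest(k).index)

# Features appearing in both top-k lists are the overlapping drivers
print(top_features(xgb, feature_names) & top_features(lgbm, feature_names))
```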

Python 3.8 was used. See the requirements.txt file for additional setup information.
