
End-to-end Machine Learning pipeline to classify bank credits for customers based on historical loan data. This project involves data preprocessing, feature selection, training the ML model with custom metrics, finding the best hyperparameters and ensuring model interpretability.


miguelmoralh/credit_classification


Credit Classification Project

This project tackles a multi-class credit classification task using historical loan data. The pipeline covers data cleaning, missing-value imputation, categorical encoding, feature selection (bivariate dependence feature selection using normalized mutual information, then a custom recursive feature elimination with cross-validation), hyperparameter and model optimization with Optuna, model training, and probability calibration.
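As a rough illustration of the bivariate dependence step, the sketch below scores discrete features against the target with normalized mutual information and keeps those above a threshold. It is not the repository's implementation: the function names, the toy feature values, the choice of normalizing by the mean of the two entropies, and the 0.1 threshold are all assumptions made for the example.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def normalized_mutual_info(x, y):
    """Mutual information divided by the mean of the two entropies.

    Returns 0.0 for independent variables and 1.0 when one variable
    fully determines the other (and vice versa).
    """
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = sum(
        (c / n) * log((c * n) / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )
    denom = (entropy(x) + entropy(y)) / 2
    return mi / denom if denom > 0 else 0.0


# Toy data; feature names and values are hypothetical.
target = ["good", "good", "standard", "standard", "poor", "poor"]
employment = ["salaried", "salaried", "self", "self", "student", "student"]
noise = ["a", "b", "a", "b", "a", "b"]

features = {"employment": employment, "noise": noise}
scores = {name: normalized_mutual_info(col, target) for name, col in features.items()}

# Assumed selection rule: keep features whose score reaches the threshold.
THRESHOLD = 0.1
selected = [name for name, score in scores.items() if score >= THRESHOLD]
```

Continuous features would need to be binned before being scored this way, and in keeping with the rest of the pipeline the scores should be computed on the training split only.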

The model is evaluated with ROC AUC and a confusion matrix, and explained with SHAP values for feature-importance insights. The pipeline can be adapted to other datasets with minor modifications.
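For readers unfamiliar with multi-class confusion matrices, here is a minimal standard-library sketch of how one is built and how per-class recall is read from it. The class labels and predictions are made up for illustration, and the helper names are hypothetical; the repository's evaluation.py may compute this differently.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix):
    """Diagonal entry divided by the row total, one value per true class."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


# Hypothetical class labels and predictions.
labels = ["Poor", "Standard", "Good"]
y_true = ["Poor", "Poor", "Standard", "Standard", "Standard", "Good"]
y_pred = ["Poor", "Standard", "Standard", "Standard", "Good", "Good"]

matrix = confusion_matrix(y_true, y_pred, labels)
recall = per_class_recall(matrix)
```

Off-diagonal cells show which classes get confused with each other, which a single aggregate score hides.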

To train the model, configure your dataset and run main.py. All feature selection, model optimization, and calibration decisions are made on the training set only, to avoid data leakage.

Key scripts:

  • data_cleaning.py: Class that cleans the dataset (type conversion, feature removal).
  • imputer.py: Class to handle missing values (median for numeric, "Missing" for categorical).
  • categorical_encoder.py: Class to encode categorical features (manual mapping for ordinal, LabelEncoder for non-ordinal).
  • 01_main_dependence_fs.py: Bivariate feature selection using Normalized Mutual Information.
  • 02_main_rfe_fs.py: Custom Recursive Feature Elimination with cross-validation.
  • 03_main_hyp_opt.py: Hyperparameter and model optimization using Optuna.
  • main.py: Trains and calibrates the model.
  • evaluation.py: Evaluates the trained model (ROC AUC, confusion matrix).
  • model_explainability.py: Explains model predictions using SHAP values.
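The imputation and ordinal-encoding steps listed above can be sketched as follows. This is an assumption-laden illustration, not the code in imputer.py or categorical_encoder.py: the class name, column names, category order, and the choice to map "Missing" to -1 are all hypothetical. The point it shows is that the medians are learned from the training rows only and then reused on the test rows.

```python
from statistics import median


class MedianMissingImputer:
    """Fill numeric gaps with the training median, categorical gaps with "Missing"."""

    def __init__(self, numeric_cols, categorical_cols):
        self.numeric_cols = numeric_cols
        self.categorical_cols = categorical_cols
        self.medians_ = {}

    def fit(self, rows):
        # Learn medians from the rows passed in (the training split).
        for col in self.numeric_cols:
            observed = [r[col] for r in rows if r[col] is not None]
            self.medians_[col] = median(observed)
        return self

    def transform(self, rows):
        # Return filled copies; the input rows are left unchanged.
        out = []
        for r in rows:
            filled = dict(r)
            for col in self.numeric_cols:
                if filled[col] is None:
                    filled[col] = self.medians_[col]
            for col in self.categorical_cols:
                if filled[col] is None:
                    filled[col] = "Missing"
            out.append(filled)
        return out


# Hypothetical manual mapping for an ordinal feature.
CREDIT_MIX_ORDER = {"Missing": -1, "Bad": 0, "Standard": 1, "Good": 2}


def encode_ordinal(rows, col, mapping):
    """Replace category strings in `col` with their integer rank."""
    return [{**r, col: mapping[r[col]]} for r in rows]


train = [
    {"income": 30000, "credit_mix": "Good"},
    {"income": None, "credit_mix": "Bad"},
    {"income": 50000, "credit_mix": None},
    {"income": 90000, "credit_mix": "Standard"},
]
test = [{"income": None, "credit_mix": None}]

imputer = MedianMissingImputer(["income"], ["credit_mix"]).fit(train)
train_clean = encode_ordinal(imputer.transform(train), "credit_mix", CREDIT_MIX_ORDER)
test_clean = encode_ordinal(imputer.transform(test), "credit_mix", CREDIT_MIX_ORDER)
```

Calling fit on the training rows and only transform on the test rows is what keeps test-set statistics out of the preprocessing.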
