This repo contains the Machine Learning Project of the neuefische Data Science, Machine Learning & AI Bootcamp 2024 in Hamburg. The team members are:
- Tetyana Samoylenko: https://github.com/TetyanaSam
- Christian Reimann: https://github.com/christian-reimann
- Jakob Koscholke: https://github.com/jottemka
Our goal is to develop a fraud detection system for the Tunisian Company of Electricity and Gas (STEG), a public, non-administrative company responsible for delivering electricity and gas across Tunisia. The company has suffered losses on the order of 200 million Tunisian Dinars due to fraudulent manipulation of meters by consumers. Using the clients' billing history, we want to provide a data product with the following key aspects:
- Goal: identify clients involved in fraudulent activities while leaving non-fraudulent clients aside
- Value of Product: enhance the company's revenues by reducing the losses caused by fraudulent activities, and avoid reputational damage
- Evaluation Metric: ROC-AUC, but True Positive Rate and True Negative Rate will also be inspected
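The three metrics named above can be computed with scikit-learn. The following is a minimal sketch, not the project's actual evaluation code: the helper name `evaluate`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def evaluate(y_true, y_score, threshold=0.5):
    """Return ROC-AUC, True Positive Rate and True Negative Rate."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # ROC-AUC is threshold-free and works on the raw scores.
    auc = roc_auc_score(y_true, y_score)
    # TPR and TNR need hard labels, so the scores are thresholded first.
    # The threshold of 0.5 is an assumption, not taken from the project.
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"roc_auc": auc, "tpr": tp / (tp + fn), "tnr": tn / (tn + fp)}


# Toy labels and scores, invented purely for illustration (1 = fraud).
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.2, 0.8, 0.3, 0.6, 0.9, 0.05]
metrics = evaluate(y_true, y_score)
print(metrics)
```

Reporting the True Positive Rate and True Negative Rate next to ROC-AUC shows how a model behaves at one concrete operating point, which ROC-AUC alone does not reveal.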
The data for this project can be found on Zindi:
The following column documentation was provided by STEG. Unfortunately, the documentation is poor: some columns are left unexplained, and most explanations add little beyond the column name.
- Client_id: Unique id for client
- District: District where the client is
- Client_catg: Category client belongs to
- Region: Area where the client is
- Creation_date: Date client joined
- Target: fraud: 1, not fraud: 0
- Invoice_date: Date of the invoice
- Tarif_type: Type of tax
- Counter_number:
- Counter_statue: takes up to 5 values such as working fine, not working, on hold status, etc.
- Counter_code:
- Reading_remarque: notes that the STEG agent takes during their visit to the client (e.g., if the counter shows something wrong, the agent gives a bad score)
- Counter_coefficient: An additional coefficient to be added when standard consumption is exceeded
- Consommation_level_1: Consumption_level_1
- Consommation_level_2: Consumption_level_2
- Consommation_level_3: Consumption_level_3
- Consommation_level_4: Consumption_level_4
- Old_index: Old index
- New_index: New index
- Months_number: Month number
- Counter_type: Type of counter
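Because the billing history contains many invoices per client while the goal is to flag clients, the invoice rows need to be condensed to one row per client at some point. The sketch below shows one possible way to do this with pandas. It uses the column names exactly as documented above (the names in the actual files may differ), an invented three-row table, and a hypothetical `index_diff` feature; none of this is confirmed to be the project's own feature engineering.

```python
import pandas as pd

# Tiny invented invoice table; the real data has to be obtained from Zindi.
invoices = pd.DataFrame(
    {
        "Client_id": ["c1", "c1", "c2"],
        "Invoice_date": ["2019-01-15", "2019-05-15", "2019-03-01"],
        "Consommation_level_1": [100, 140, 300],
        "Old_index": [1000, 1100, 500],
        "New_index": [1100, 1240, 800],
    }
)
invoices["Invoice_date"] = pd.to_datetime(invoices["Invoice_date"])

# Assumption: the difference between the two meter readings is a useful
# per-invoice signal. The column documentation does not confirm this.
invoices["index_diff"] = invoices["New_index"] - invoices["Old_index"]

# Collapse the invoice history into one feature row per client.
client_features = invoices.groupby("Client_id").agg(
    n_invoices=("Invoice_date", "count"),
    mean_level_1=("Consommation_level_1", "mean"),
    total_index_diff=("index_diff", "sum"),
)
print(client_features)
```

The resulting table has one row per `Client_id` and could then be joined with the client-level columns such as `District` and `Target`.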
Model | ROC-AUC | True Positive Rate | True Negative Rate | Elapsed Time in Seconds |
---|---|---|---|---|
"Decision Tree" | 0.813529 | 0.824833 | 0.801415 | 6.615412 |
"Random Forest" | 0.867407 | 0.792444 | 0.771914 | 52.600034 |
"Extra Trees" | 0.862694 | 0.784062 | 0.767552 | 50.269403 |
"AdaBoost" | 0.659015 | 0.625867 | 0.605927 | 34.564976 |
"LightGBM" | 0.727527 | 0.674011 | 0.649917 | 4.244878 |
"XGBoost" | 0.763917 | 0.691182 | 0.687353 | 2.807551 |
"CatBoost" | 0.782227 | 0.711931 | 0.695365 | 47.220879 |
"Naive Bayes" | 0.598696 | 0.967253 | 0.08631 | 1.98632 |
"Logistic Regression" | 0.623952 | 0.596349 | 0.58452 | 3.465537 |
Models excluded due to long processing time or excessive effort:
- K-Nearest Neighbors (sklearn and faiss)
- Support Vector Machines
- Deep Neural Networks
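A baseline comparison like the one in the table above can be produced with a simple loop that fits each model, times it, and records the metrics. The sketch below is an assumption about how such a benchmark might look, not the project's code: it uses synthetic data from `make_classification`, only three of the listed models, a single train/test split, and it measures fit plus prediction time. How the elapsed times in the table were actually measured is not stated.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real client features.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = []
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)
    elapsed = time.perf_counter() - start
    results.append(
        {
            "model": name,
            "roc_auc": roc_auc_score(y_test, scores),
            # Recall of the positive class is the True Positive Rate.
            "tpr": recall_score(y_test, y_pred, pos_label=1),
            # Recall of the negative class is the True Negative Rate.
            "tnr": recall_score(y_test, y_pred, pos_label=0),
            "elapsed_s": elapsed,
        }
    )

for row in results:
    print(row)
```

Keeping the loop generic makes it easy to add further estimators that follow the scikit-learn interface, such as the gradient boosting libraries listed in the table.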
Model | ROC-AUC | True Positive Rate | True Negative Rate | min_samples_split | min_samples_leaf | max_depth | criterion | n_jobs | n_estimators | min_child_weight | learning_rate |
---|---|---|---|---|---|---|---|---|---|---|---|
"DecisionTree" | 0.863855 | 0.806765 | 0.797137 | 16 | 1 | 46 | "gini" | null | null | null | null |
"RandomForest" | 0.847049 | 0.772724 | 0.750683 | 10 | 1 | 26 | null | -1 | 225 | null | null |
"ExtraTrees" | 0.793471 | 0.716113 | 0.708899 | 19 | 1 | 26 | "entropy" | -1 | 225 | null | null |
"XGBoost" | 0.888584 | 0.829137 | 0.778349 | null | null | 11 | null | -1 | 441 | 1 | 0.406 |
- Decision Tree: this model strikes a good balance, combining a high ROC-AUC score with high True Positive and True Negative Rates, so that in many cases both fraudulent and non-fraudulent activity are detected as such.
- XGBoost: this model achieved an even higher ROC-AUC score, at the cost of a lower True Negative Rate.
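The tuned hyperparameters reported above (for example `min_samples_split`, `min_samples_leaf`, `max_depth` and `criterion` for the decision tree) could be found with a search such as the one sketched below. Which search method the project used is not stated, so the use of `RandomizedSearchCV`, the parameter ranges, the number of iterations, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real client features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hypothetical search ranges, chosen only to cover the reported values.
param_distributions = {
    "min_samples_split": list(range(2, 21)),
    "min_samples_leaf": list(range(1, 11)),
    "max_depth": list(range(2, 51)),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="roc_auc",  # optimise the project's main metric
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Scoring the search with `roc_auc` keeps the tuning aligned with the main evaluation metric; the True Positive Rate and True Negative Rate of the best estimator would still need to be checked separately.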
To set up the virtual environment, you can either use the Makefile and run the following command:
make setup
After that, activate your environment with the following command:
source .venv/bin/activate
Or create the virtual environment and install the required packages manually with the following commands:
pyenv local 3.11.3
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
On Windows, create the virtual environment and install the required packages with the following commands.
For PowerShell CLI:
pyenv local 3.11.3
python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt
For Git Bash CLI:
pyenv local 3.11.3
python -m venv .venv
source .venv/Scripts/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
If you encounter an error when trying to run pip install --upgrade pip, try using the following command instead:
python.exe -m pip install --upgrade pip
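After following the setup steps on either platform, a short Python check can confirm that the interpreter is running inside a virtual environment and matches the Python version pinned above. This is only a sketch: the helper names `in_virtualenv` and `python_matches` are hypothetical, and the check assumes an environment created with `python -m venv`.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


def python_matches(expected=(3, 11)):
    """Return True if the major and minor version match the expected pair."""
    return sys.version_info[:2] == tuple(expected)


print("virtual environment active:", in_virtualenv())
print("Python 3.11:", python_matches())
```

If either line prints `False`, the environment was probably not activated, or pyenv selected a different interpreter.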