Credit Card Fraud Detection with Interpretable Bayesian Network

Overview

This project develops a credit card fraud detection system built on a Bayesian Network, prioritizing interpretability and adaptability. Unlike "black-box" machine learning models, a Bayesian Network represents the relationships between transaction features and the likelihood of fraud as a probabilistic graph. The model can therefore not only predict fraud but also explain why a transaction was flagged as potentially fraudulent.

The dataset is the publicly available Kaggle Credit Card Fraud Dataset, which contains anonymized credit card transactions whose features V1 to V28 are Principal Component Analysis (PCA) components. The project also trains an XGBoost classifier, both to strengthen detection and to provide a performance comparison.

The approach is designed to be domain-agnostic: the same methodology can be adapted to anomaly detection tasks beyond fraud detection.

Project Goals

Develop a robust fraud detection system combining Bayesian Networks and XGBoost.
Prioritize interpretability to understand the factors driving fraud predictions.
Create a domain-agnostic pipeline adaptable to different datasets.
Handle severe class imbalance (0.172% fraud cases) using SMOTE-NC.
Demonstrate a complete data science workflow, from preprocessing to model evaluation.
Explore potential causal relationships using do-calculus and Bayesian inference.

Methodology

The pipeline follows a structured, iterative approach:

1. Data Ingestion and Preprocessing (`01_data_ingestion_and_preprocessing.py`)

Loads the creditcard.csv dataset.
Standardizes numerical features (V1 to V28, Time, Amount) using StandardScaler.
Saves the processed dataset as preprocessed_data.csv.
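
As a rough illustration of the standardization step, the sketch below applies the same z-score formula that StandardScaler uses (mean 0, population standard deviation 1) to a toy column. It is dependency-free and is not the repository's code; the helper names `fit_scaler` and `transform` and the sample values are assumptions made for the example.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Return (mean, std) using the population standard deviation (ddof=0)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        std = 1.0  # constant column: avoid dividing by zero
    return mean, std

def transform(values, mean, std):
    """Apply the z-score transform with previously fitted statistics."""
    return [(v - mean) / std for v in values]

amounts = [10.0, 20.0, 30.0, 40.0]  # toy stand-in for the Amount column
mean, std = fit_scaler(amounts)
scaled = transform(amounts, mean, std)
print(scaled)
```

Keeping the fit and transform steps separate makes it possible to reuse the fitted statistics on new data.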

2. Train/Test Split (`02_train_test_split.py`)

Splits the dataset into 80% training and 20% testing.
Ensures stratified sampling due to class imbalance.
Saves X_train.csv, X_test.csv, y_train.csv, y_test.csv.
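
The idea behind a stratified 80/20 split can be sketched without any libraries: split each class separately so both sets keep the same fraud ratio. The function name `stratified_split`, the seed, and the toy labels are assumptions for illustration; the script itself may rely on scikit-learn instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 990 + [1] * 10  # toy data: 1% fraud
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With a rare positive class, a plain random split could leave the test set with very few fraud cases, or none at all; splitting per class avoids that.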

3. Feature Engineering & Resampling (Train) (`03_feature_engineering_resampling_train.py`)

Discretizes continuous features using quantile-based binning (pd.qcut).
Creates interaction features between key variables.
Uses Chi-Squared tests to select statistically significant features.
Applies SMOTE-NC to balance class distribution.
Saves processed data as resampled_train_data_no_ohe.csv.
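
To make the chi-squared selection step concrete, here is a minimal sketch that computes the chi-squared statistic between one discretized feature and the class label from a contingency table. The data, the bin labels, and the function name `chi_squared` are hypothetical, and the comparison against a critical value (5.991 for 2 degrees of freedom at the 0.05 level) stands in for whatever significance test the script actually applies.

```python
from collections import Counter

def chi_squared(feature, target):
    """Return (statistic, degrees_of_freedom) for two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    row_totals = Counter(feature)
    col_totals = Counter(target)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n  # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Toy binned feature: fraud (class 1) concentrates in the "high" bin.
pairs = ([("low", 0)] * 30 + [("mid", 0)] * 28 + [("mid", 1)] * 2
         + [("high", 0)] * 20 + [("high", 1)] * 10)
feature = [bin_label for bin_label, _ in pairs]
target = [cls for _, cls in pairs]

stat, dof = chi_squared(feature, target)
CRITICAL_DOF2_ALPHA05 = 5.991  # chi-squared critical value, 2 dof, alpha = 0.05
print(stat, dof, stat > CRITICAL_DOF2_ALPHA05)
```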

4. Bayesian Network DAG Structure Learning (`04_dag_structure_learning_train.py`)

Uses HillClimbSearch with BicScore to learn a Directed Acyclic Graph (DAG).
Constructs a whitelist of high-impact features to improve DAG accuracy.
Saves the learned Bayesian Network structure (learned_dag_edges.txt).
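
Hill climbing changes one edge at a time and keeps a change only when the network score improves. The sketch below shows how a BIC-style local score can be computed for one node from counts; it is a dependency-free approximation of the idea, not pgmpy's implementation, and the function name `bic_local_score`, the variable names, and the toy data are assumptions.

```python
import math
from collections import Counter

def bic_local_score(rows, node, parents):
    """BIC contribution of `node` given `parents` for discrete rows (list of dicts)."""
    n = len(rows)
    joint = Counter((tuple(r[p] for p in parents), r[node]) for r in rows)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in rows)

    # Maximum-likelihood log-likelihood: sum of N_ijk * log(N_ijk / N_ij).
    log_lik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())

    # Free parameters: (parent configurations) * (node states - 1).
    node_card = len({r[node] for r in rows})
    parent_configs = math.prod(len({r[p] for r in rows}) for p in parents)
    n_params = parent_configs * (node_card - 1)
    return log_lik - 0.5 * math.log(n) * n_params

# Toy data: Class depends on A, while B is independent of Class.
rows = []
for a, c, count in [(0, 0, 40), (0, 1, 10), (1, 0, 10), (1, 1, 40)]:
    for i in range(count):
        rows.append({"A": a, "B": i % 2, "Class": c})

print("no parents:", bic_local_score(rows, "Class", []))
print("parent A:  ", bic_local_score(rows, "Class", ["A"]))
print("parent B:  ", bic_local_score(rows, "Class", ["B"]))
```

On this toy data the edge from A to Class raises the score, while the edge from B to Class lowers it because the penalty term outweighs a likelihood that does not improve.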

5. Parameter Learning (`05_parameter_learning_train.py`)

Estimates Conditional Probability Distributions (CPDs) for the learned Bayesian Network.
Updates and saves the model as bayesian_network_model.pkl.
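
For discrete data, estimating a CPD amounts to counting. The sketch below assumes maximum-likelihood estimation (the README does not say which estimator is used) and works on a hypothetical two-variable example; `estimate_cpd` and `Amount_bin` are illustrative names.

```python
from collections import Counter, defaultdict

def estimate_cpd(rows, node, parents):
    """Maximum-likelihood P(node | parents) from discrete rows (list of dicts)."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[tuple(r[p] for p in parents)][r[node]] += 1
    return {
        pa: {state: c / sum(state_counts.values()) for state, c in state_counts.items()}
        for pa, state_counts in counts.items()
    }

rows = (
    [{"Amount_bin": "low", "Class": 0}] * 98
    + [{"Amount_bin": "low", "Class": 1}] * 2
    + [{"Amount_bin": "high", "Class": 0}] * 90
    + [{"Amount_bin": "high", "Class": 1}] * 10
)
cpd = estimate_cpd(rows, "Class", ["Amount_bin"])
print(cpd[("high",)][1])  # estimated P(Class=1 | Amount_bin=high)
```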

6. Feature Engineering (Test) (`06_feature_engineering_test.py`)

Loads X_test.csv and applies the same bin edges as the training set.
Creates interaction features based on the Bayesian DAG structure.
Saves the processed test data as feature_engineered_test_data_no_ohe.csv.
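
Reusing the training bin edges on the test set can be sketched as follows: the cut points are learned from training values only, then applied unchanged to new values. This is a standard-library approximation of quantile binning rather than the script's pd.qcut call, and the function names and sample values are assumptions.

```python
from bisect import bisect_left
from statistics import quantiles

def fit_bin_edges(train_values, n_bins=4):
    """Interior quantile cut points, learned from training data only."""
    return quantiles(train_values, n=n_bins, method="inclusive")

def apply_bins(values, edges):
    """Map values to bin indices 0..len(edges) using right-closed bins.

    Values outside the training range fall into the first or last bin.
    """
    return [bisect_left(edges, v) for v in values]

train_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
edges = fit_bin_edges(train_values)

test_values = [0.5, 3.0, 4.5, 100.0]  # includes values outside the training range
test_bins = apply_bins(test_values, edges)
print(edges, test_bins)
```

Recomputing quantiles on the test set would shift the bin boundaries, so the same bin index could mean different value ranges in training and testing.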

7. Inference & Evaluation (`07_inference_and_evaluation.py`)

Performs Bayesian inference using Variable Elimination.
Finds the optimal fraud detection threshold using Precision-Recall curves.
Reports AUPRC, plus precision, recall, and F1-score at the chosen threshold.
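
One way to pick a threshold from the precision-recall trade-off is to scan candidate thresholds and keep the one with the highest F1. The README does not state which criterion defines "optimal", so maximizing F1 is an assumption here, as are the function name and the toy scores. scikit-learn's precision_recall_curve would give the same precision and recall pairs more efficiently.

```python
def best_f1_threshold(y_true, scores):
    """Try every distinct score as a threshold; return (threshold, precision, recall, f1)."""
    best = (None, 0.0, 0.0, 0.0)
    total_pos = sum(y_true)
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        if tp == 0:
            continue  # precision and F1 are undefined or zero here
        precision = tp / (tp + fp)
        recall = tp / total_pos
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[3]:
            best = (t, precision, recall, f1)
    return best

y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]  # toy fraud probabilities
threshold, precision, recall, f1 = best_f1_threshold(y_true, scores)
print(threshold, precision, recall, f1)
```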

8. Visualization (`08_visualization_improvements.py`)

Visualizes the Bayesian DAG using Graphviz.
Plots the Markov Blanket of the Class node.
Generates a Precision-Recall curve (AUPRC).
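
The Markov blanket of a node consists of its parents, its children, and the other parents of its children. A minimal sketch over an edge list follows; the edges are hypothetical and do not come from learned_dag_edges.txt.

```python
def markov_blanket(edges, node):
    """Return the parents, children, and co-parents of `node` in a DAG edge list."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    co_parents = {u for u, v in edges if v in children and u != node}
    return parents | children | co_parents

# Hypothetical DAG, written as (parent, child) pairs.
edges = [
    ("V14", "Class"),
    ("Class", "V10"),
    ("V12", "V10"),
    ("V4", "V12"),
    ("Amount", "V4"),
]
print(sorted(markov_blanket(edges, "Class")))
```

Given the values of its Markov blanket, the Class node is conditionally independent of every other node in the network, which is why this subgraph is a natural one to plot.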

9. Causal Inference (`09_causal_inference.py`)

Performs causal inference using do-calculus.
Tests intervention effects on fraud probability.
Saves results as causal_inference_results.csv.
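
The difference between conditioning on a feature and intervening on it can be shown on a three-node toy network in which Z influences both X and Class. In this graph, do-calculus reduces to the backdoor adjustment formula P(Class | do(X=x)) = sum over z of P(Class | x, z) P(z). All probabilities and variable names below are invented for illustration and are unrelated to the learned model.

```python
# Toy network: Z -> X, Z -> Class, X -> Class (Z confounds X and Class).
p_z = {0: 0.5, 1: 0.5}                      # P(Z=z)
p_x1_given_z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
p_fraud_given_xz = {                        # P(Class=1 | X=x, Z=z), keyed by (x, z)
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.5, (1, 1): 0.7,
}

def p_x_given_z(x, z):
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]

def observational(x):
    """P(Class=1 | X=x): ordinary conditioning, which also shifts belief about Z."""
    joint = sum(p_fraud_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    evidence = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return joint / evidence

def interventional(x):
    """P(Class=1 | do(X=x)): backdoor adjustment over Z, using the marginal P(Z)."""
    return sum(p_fraud_given_xz[(x, z)] * p_z[z] for z in p_z)

print("P(Class=1 | X=1)     =", observational(1))
print("P(Class=1 | do(X=1)) =", interventional(1))
```

The observational probability is higher because seeing X=1 makes Z=1 more likely, whereas setting X=1 by intervention leaves the distribution of Z unchanged.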

10. XGBoost Training & Evaluation (`10_train_xgboost.py`)

Trains an XGBoost model using causally significant features.
Computes SHAP values for interpretability.
Saves the trained model as xgboost_model.pkl.
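
SHAP values are Shapley values of a model's prediction. The sketch below computes them exactly by brute force for a three-feature toy function, replacing "missing" features with a single baseline point. It only illustrates the definition: it is exponential in the number of features, it is not the algorithm the shap library uses for tree models, and `toy_model` is a hypothetical stand-in for the trained classifier.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest stay at the baseline.
        return model([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(features):
    a, b, c = features
    return 2.0 * a + a * b  # feature c is ignored by the model

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the unused feature receives zero.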

Results Summary

Bayesian Network Performance (at optimal threshold)

Precision: 0.03
Recall: 0.78
F1-score: 0.06

XGBoost Performance (at optimal threshold)

AUPRC: 0.8103
Precision: 0.87
Recall: 0.78
F1-score: 0.82
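
As a quick consistency check, the F1-scores above can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The inputs are the rounded figures from this summary, so the results are approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

bn_f1 = f1_score(0.03, 0.78)    # Bayesian Network figures from this summary
xgb_f1 = f1_score(0.87, 0.78)   # XGBoost figures from this summary
print(f"Bayesian Network F1: {bn_f1:.2f}")
print(f"XGBoost F1: {xgb_f1:.2f}")
```

Both models reach the same recall, so the gap in F1 comes entirely from precision: at its chosen threshold the Bayesian Network flags far more legitimate transactions as fraud.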

Causal Inference Insights

Causal inference results are stored in causal_inference_results.csv.
Bayesian Networks provided insights into feature relationships affecting fraud likelihood.

Future Work

Dynamic Bayesian Networks (DBNs): Extend the model to capture temporal fraud patterns.
Hybrid Model: Combine Bayesian Networks with Neural Networks for fraud detection.
Alternative Discretization: Explore decision-tree-based binning instead of quantile binning.
Ensemble Methods: Combine Bayesian inference with XGBoost predictions.
Integration with Real-Time Systems: Deploy a streaming fraud detection model using Kafka & FastAPI.

Dataset Information

The dataset is available on Kaggle: Credit Card Fraud Detection Dataset

V1-V28: PCA components.
Time: Seconds since first transaction.
Amount: Transaction amount.
Class: 1 = Fraud, 0 = Legitimate.

Installation & Dependencies

Install required packages:

pip install -r requirements.txt

Dependencies:

Python 3.8+
pandas, numpy, scikit-learn, imbalanced-learn (imported as imblearn), pgmpy, networkx, matplotlib, xgboost, shap, scipy, pydot

Additional Graphviz Installation (Required for DAG visualization)

sudo apt-get install graphviz graphviz-dev  # Debian-based systems

Usage

Run the full pipeline:

python main.py

License

This project is licensed under the MIT License - see the LICENSE file for details.
