An XGBoost-based fraud detection model to identify money laundering in mobile transactions using the PaySim synthetic dataset.
Fraud detection in financial transactions is crucial for preventing financial losses and maintaining trust in financial systems. This project leverages the power of XGBoost, a scalable and efficient gradient boosting framework, to build a robust model capable of identifying fraudulent transactions effectively.
Using the PaySim synthetic dataset, which simulates mobile transactions, this project encompasses the entire machine learning pipeline:
- Data Preprocessing
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Training and Hyperparameter Tuning
- Model Evaluation
- Deployment Preparation
The PaySim dataset is a synthetic simulation of mobile money transactions, which includes information about both legitimate and fraudulent transactions. It is designed to mimic real-world mobile banking transactions, making it ideal for training and testing fraud detection models.
Key Features:
- Id: Transaction identifier
- step: Hours elapsed since the start of the simulation
- type: Transaction type (e.g., CASH_IN, CASH_OUT, TRANSFER)
- amount: Transaction amount
- nameOrig: Customer identifier initiating the transaction
- oldBalanceOrig: Original balance before the transaction
- newBalanceOrig: New balance after the transaction
- nameDest: Recipient identifier
- oldBalanceDest: Original balance of the recipient before the transaction
- newBalanceDest: New balance of the recipient after the transaction
- isFraud: Indicator of fraudulent transaction (1 for fraud, 0 otherwise)
- isFlaggedFraud: Indicator if the transaction was flagged as fraud
The project includes extensive feature engineering to enhance model performance, such as:
- Time-Based Features: Hour of day, day of week, and cumulative time features.
- Ratio Features: Transaction amount ratios relative to original and new balances.
- Cumulative Features: Counts and cumulative amounts of transactions per user.
- Aggregated Features: Counts and amounts of transactions per step and day for both originators and recipients.
- Categorical Encoding: One-hot encoding for categorical features like transaction type and recipient type.
- Python 3.7 or higher
- pip
git clone https://github.com/bartublack/PaySim-Fraud-Detection-XGBoost.git
cd PaySim-Fraud-Detection-XGBoost
Install the required packages using the requirements.txt
file:
pip install -r requirements.txt
Alternatively, you can install the dependencies manually:
pip install matplotlib==3.8.4 numpy==2.2.1 pandas==2.2.3 scikit_learn==1.3.1 seaborn==0.13.2 xgboost==2.1.3
The project includes a comprehensive Jupyter Notebook PaySim_Fraud_Detection_XGBoost.ipynb
that demonstrates the entire workflow from data loading to model evaluation.
-
Launch Jupyter Notebook:
jupyter notebook
-
Open the Notebook:
Navigate to
PaySim_Fraud_Detection_XGBoost.ipynb
in your browser and run the cells sequentially.
For a more streamlined approach, you can run the Python script paysim_fraud_detection_xgboost.py
, which includes data preprocessing, feature engineering, model training, and evaluation.
-
Execute the Script:
python paysim_fraud_detection_xgboost.py
After training, the model achieves high accuracy and AUC scores, effectively distinguishing between fraudulent and legitimate transactions. The project also generates several visualization plots saved in the images
directory, including:
- Feature Importance Charts
- Confusion Matrices
- Decision Tree Structures
These visuals aid in understanding model performance and feature relevance.
PaySim-Fraud-Detection-XGBoost/
│
├── data/
│ └── train.csv # Raw dataset
│
├── images/
│ ├── base_feature_importance.png
│ ├── confusion_matrix.png
│ ├── feature_engineered_confusion_matrix.png
│ ├── final_feature_importance.png
│ ├── decision_tree_structure.png
│ └── cross_validation_confusion_matrix.png
│
├── paysim_fraud_detection_xgboost.py # Main script for data processing and model training
├── PaySim_Fraud_Detection_XGBoost.ipynb # Jupyter Notebook
├── requirements.txt # Python dependencies
├── README.md
├── LICENSE
└── .gitignore