This project focuses on detecting fraudulent credit card transactions using machine learning techniques. The goal is to predict whether a given transaction is legitimate or fraudulent based on various features of the transaction. The dataset used in this project includes anonymized features for privacy, such as the transaction amount, time, and other factors that may contribute to identifying fraudulent activities.
- Project Overview
- Project Structure
- Setup and Installation
- Dependencies
- Running the Project
- Model Evaluation
- Conclusion
- Handling Imbalanced Dataset with SMOTE
- Contact
In this project, machine learning models are applied to a real-world dataset of credit card transactions to detect fraud. The entire process follows a typical data science pipeline:
- Data Loading and Exploration
- Data Preprocessing (Handling missing values, scaling, etc.)
- Handling Imbalanced Dataset (SMOTE)
- Model Training (Logistic Regression, Random Forest)
- Evaluation (Accuracy, Precision, Recall, Confusion Matrix)
At the end of the project, we obtain a trained model, performance evaluation metrics, and a detailed report summarizing the results.
├── data/ # Data files
│ ├── raw/ # Raw data files
│ └── processed/ # Processed data files
├── notebooks/ # Jupyter notebooks for exploratory analysis
├── src/ # Source code for the project
│ ├── data_loader.py # Functions for loading the data
│ ├── preprocess.py # Functions for data preprocessing
│ ├── model.py # Functions for training models
│ ├── evaluate.py # Functions for evaluating the model
│ └── utils.py # Utility functions for data handling
├── evaluation_report.txt # Evaluation results and interpretation
├── requirements.txt # List of dependencies
├── main.py # Main script to execute the project
└── README.md # Project overview and documentation
To get started with the project, follow the steps below:
git clone https://github.com/marcellin-d/Fraud-Detection-in-Online-Transactions.git
cd Fraud-Detection-in-Online-Transactions
python -m venv venv
- On Windows: venv\Scripts\activate
- On macOS/Linux: source venv/bin/activate
pip install -r requirements.txt
The following Python libraries are required to run this project:
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- imbalanced-learn (imported as imblearn; provides SMOTE)
To install all dependencies at once:
pip install -r requirements.txt
To run the project and generate results, execute the main.py script. This script handles the entire pipeline, from data loading to model evaluation.
python main.py
The script will output the following:
- Data loading confirmation, including the shape of the dataset.
- Data preprocessing steps, such as handling missing values and scaling the "Amount" column.
- Model training results for Logistic Regression and Random Forest classifiers.
- Evaluation metrics such as accuracy, precision, recall, F1-score, and confusion matrix.
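The preprocessing step mentioned in the list above can be illustrated with a minimal, standard-library sketch. It assumes that missing values are handled by dropping incomplete rows and that "Amount" is standardized to zero mean and unit variance; the project may use a different strategy or scaler, and the sample rows are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical rows for illustration only; None marks a missing value.
rows = [
    {"Amount": 10.0, "Class": 0},
    {"Amount": None, "Class": 0},
    {"Amount": 250.0, "Class": 1},
    {"Amount": 40.0, "Class": 0},
]

# Handle missing values by dropping incomplete rows (one possible strategy).
complete = [row for row in rows if all(v is not None for v in row.values())]

# Standardize "Amount": subtract the mean, then divide by the standard deviation.
amounts = [row["Amount"] for row in complete]
mu, sigma = mean(amounts), pstdev(amounts)
for row in complete:
    row["Amount"] = (row["Amount"] - mu) / sigma

print(complete)
```

After scaling, the "Amount" values are centered on zero, which keeps a single large-valued column from dominating models such as Logistic Regression.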
After running the model, review the detailed evaluation in the evaluation_report.txt file. The evaluation includes:
- Accuracy: 94.34%
- Precision for both classes (fraud and non-fraud)
- Recall for both classes
- F1-Score for a balanced measure of precision and recall
- Confusion Matrix: Provides insights into true positives, false positives, true negatives, and false negatives.
Accuracy: 0.9434
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
0 | 0.92 | 0.97 | 0.94 | 56463 |
1 | 0.97 | 0.91 | 0.94 | 56839 |
Accuracy | | | 0.94 | 113302 |
Macro avg | 0.95 | 0.94 | 0.94 | 113302 |
Weighted avg | 0.95 | 0.94 | 0.94 | 113302 |
Class | Predicted 0 | Predicted 1 |
---|---|---|
Actual 0 | 55008 | 1455 |
Actual 1 | 4955 | 51884 |
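As a cross-check, the reported metrics can be recomputed directly from the confusion matrix. The sketch below uses only the four counts from the table above; the helper names `class_metrics` and `accuracy` are illustrative and are not taken from the project's evaluate.py.

```python
# Confusion matrix counts copied from the table above.
# Keys are (actual class, predicted class).
confusion = {
    (0, 0): 55008,  # actual 0, predicted 0 (true negatives)
    (0, 1): 1455,   # actual 0, predicted 1 (false positives)
    (1, 0): 4955,   # actual 1, predicted 0 (false negatives)
    (1, 1): 51884,  # actual 1, predicted 1 (true positives)
}


def class_metrics(confusion, label):
    """Return (precision, recall, f1, support) for one class label."""
    labels = {actual for actual, _ in confusion}
    true_pos = confusion[(label, label)]
    predicted_as_label = sum(confusion[(other, label)] for other in labels)
    support = sum(confusion[(label, other)] for other in labels)
    precision = true_pos / predicted_as_label
    recall = true_pos / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(n for (actual, pred), n in confusion.items() if actual == pred)
    return correct / sum(confusion.values())


print(f"Accuracy: {accuracy(confusion):.4f}")
for label in (0, 1):
    p, r, f1, n = class_metrics(confusion, label)
    print(f"Class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")
```

The recomputed values agree with the classification report: accuracy 0.9434, and per-class precision, recall, and F1 of 0.92/0.97/0.94 for class 0 and 0.97/0.91/0.94 for class 1.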
This project demonstrates the use of machine learning for credit card fraud detection. Using models such as Logistic Regression and Random Forest, the pipeline reaches an accuracy of 94.34% on the evaluation set, with a recall of 0.91 for the fraud class (4,955 of 56,839 fraud samples were missed). The per-class metrics and the confusion matrix show where the model still makes errors, which is the information needed to tune a fraud detection system for real-world use.
One of the main challenges encountered in this project was the imbalanced dataset. The dataset contains far more non-fraudulent transactions (Class 0) than fraudulent transactions (Class 1). This imbalance can lead to biased models that predict the majority class more frequently, undermining the detection of fraud.
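The effect of class imbalance on a naive model can be shown with a small sketch. The label counts below are hypothetical and chosen only for illustration; they are not the project's data.

```python
# Hypothetical label counts chosen only to illustrate the effect;
# they are not taken from the project's dataset.
y_true = [0] * 995 + [1] * 5

# A "model" that always predicts the majority class (non-fraudulent).
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

fraud_total = sum(t == 1 for t in y_true)
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = fraud_caught / fraud_total

print(f"accuracy = {accuracy:.3f}")          # 0.995
print(f"fraud recall = {fraud_recall:.3f}")  # 0.000
```

The always-legitimate predictor scores 99.5% accuracy while catching no fraud at all, which is why recall on the fraud class matters more than raw accuracy here.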
To address this issue, we employed SMOTE (Synthetic Minority Over-sampling Technique) from the imblearn library. SMOTE generates synthetic samples of the minority class (fraudulent transactions) by interpolating between existing examples, thereby balancing the dataset and improving the model's ability to correctly identify fraudulent transactions.
Here is a code snippet demonstrating how SMOTE is applied to balance the dataset:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# `data` is assumed to be a pandas DataFrame that already holds the loaded
# transactions, including the 'Class' label column
X = data.drop('Class', axis=1)  # Features
y = data['Class']               # Target variable (0 = non-fraudulent, 1 = fraudulent)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to the training split only, so the test set keeps its original class distribution
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

# X_res and y_res now contain equal numbers of both classes and are ready for model training
This technique helps ensure the model doesn't become biased toward predicting the majority class (non-fraudulent transactions), ultimately improving fraud detection performance.
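The interpolation idea behind SMOTE can be illustrated with a simplified, standard-library sketch. This is not the imblearn implementation: it always interpolates toward the single nearest neighbor, whereas SMOTE picks randomly among several nearest neighbors. The function names and sample points are hypothetical.

```python
import random


def interpolate(sample, neighbor, rng):
    """Return a synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


def nearest_neighbor(sample, candidates):
    """Return the closest other point by squared Euclidean distance."""
    others = [c for c in candidates if c is not sample]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(sample, c)))


def oversample(minority, n_new, seed=42):
    """Create n_new synthetic points from a list of minority-class samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        sample = rng.choice(minority)
        neighbor = nearest_neighbor(sample, minority)
        synthetic.append(interpolate(sample, neighbor, rng))
    return synthetic


# Hypothetical two-feature fraud samples, for illustration only.
fraud = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0]]
new_points = oversample(fraud, n_new=4)
print(new_points)
```

Each synthetic point lies on the line segment between an existing fraud sample and its neighbor, so the new samples stay inside the region already occupied by the minority class instead of duplicating existing rows.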
For questions or suggestions, feel free to reach out:
- Name: Marcellin DJAMBO
- Email: djambomarcellin@gmail.com
- LinkedIn: My LinkedIn Profile