Credit card fraud is a critical challenge for financial institutions worldwide. This project uses machine learning to classify transactions as fraudulent or legitimate. The models are trained on anonymized transaction data, which preserves privacy while still allowing accurate and reliable classification.
- Source: Kaggle
- Size: 280K+ transactions with labeled data ('Fraud' or 'Legit').
- Features: Includes numerical features (V1-V28), transaction amount, and class labels.
- Programming Language: Python
- Libraries:
  - NumPy
  - Pandas
  - Matplotlib
  - Seaborn
  - Scikit-learn
- Tools:
  - Jupyter Notebook for analysis and model development.
  - GitHub for version control.
- Data Preprocessing:
  - Handling class imbalance by undersampling the majority (legitimate) class.
  - Feature scaling to standardize transaction amounts and anonymized variables.
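The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual notebook code: the column names `Class` and `Amount`, the helper `undersample_majority`, and the synthetic `demo` frame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def undersample_majority(df, label_col="Class", random_state=42):
    """Randomly drop majority-class rows until both classes have equal counts.

    Hypothetical helper; `label_col` is an assumed column name.
    """
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = int(counts.min())
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label].sample(
        n=n_minority, random_state=random_state
    )
    # Shuffle so the two classes are interleaved rather than stacked.
    combined = pd.concat([minority, majority])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Synthetic stand-in for the real data: 1,000 rows, 20 of them labeled fraud (1).
rng = np.random.default_rng(0)
labels = np.zeros(1000, dtype=int)
labels[:20] = 1
demo = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "Amount": rng.exponential(scale=80.0, size=1000),
    "Class": labels,
})

balanced = undersample_majority(demo)

# Standardize the amount column to zero mean and unit variance.
scaler = StandardScaler()
balanced["Amount"] = scaler.fit_transform(balanced[["Amount"]]).ravel()
```

In a full pipeline, the scaler would usually be fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.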
- Modeling:
  - Logistic Regression (baseline model)
  - Random Forest (model with tuned hyperparameters)
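As a rough sketch of the two models, the snippet below fits a Logistic Regression baseline and a Random Forest with scikit-learn. The data is synthetic, and the Random Forest hyperparameter values are placeholders chosen for illustration, not the tuned values used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic balanced data standing in for the undersampled transactions.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Baseline: a linear model with default regularization.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X, y)

# Random Forest; these hyperparameter values are illustrative assumptions.
forest = RandomForestClassifier(
    n_estimators=200, max_depth=8, min_samples_leaf=2, random_state=0
)
forest.fit(X, y)

# Probability of the positive (fraud) class for each row.
baseline_scores = baseline.predict_proba(X)[:, 1]
forest_scores = forest.predict_proba(X)[:, 1]
```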
- Evaluation:
  - Stratified splitting to preserve the class ratio in the training and testing sets.
  - Metrics: Area Under the Precision-Recall Curve (AUPRC), Precision, and Recall.
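The evaluation setup can be illustrated with a short sketch on synthetic data. Here `average_precision_score` is used as one common summary of the area under the precision-recall curve; whether the project computes AUPRC this way or via `precision_recall_curve` plus `auc` is not stated, so treat that choice as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# stratify=y keeps the positive rate nearly identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # ranking scores for AUPRC
preds = model.predict(X_test)               # hard labels at the 0.5 threshold

auprc = average_precision_score(y_test, scores)
precision = precision_score(y_test, preds, zero_division=0)
recall = recall_score(y_test, preds, zero_division=0)

print(f"AUPRC={auprc:.3f}  precision={precision:.3f}  recall={recall:.3f}")
```

AUPRC is computed from the predicted probabilities, while precision and recall depend on the chosen decision threshold, so the three numbers answer slightly different questions.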
- Visualization:
  - Confusion Matrix, Correlation Heatmap, Feature Importance Bar Chart
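A minimal sketch of the three plots is shown below, using synthetic data and plain Matplotlib. The feature names `V1` to `V6` and the figure layout are assumptions for the example; `seaborn.heatmap` could be substituted for the `imshow` calls.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
features = pd.DataFrame(X, columns=[f"V{i}" for i in range(1, 7)])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(features, y)

# Computed on the training data for brevity; a held-out test set is the usual choice.
cm = confusion_matrix(y, model.predict(features))  # rows: true, columns: predicted
corr = features.corr()                             # pairwise Pearson correlations
importance = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values()

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].imshow(cm, cmap="Blues")
axes[0].set_title("Confusion matrix")
for (row, col), count in np.ndenumerate(cm):
    axes[0].text(col, row, str(count), ha="center", va="center")

axes[1].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
axes[1].set_title("Correlation heatmap")

axes[2].barh(list(importance.index), importance.to_numpy())
axes[2].set_title("Feature importance")

fig.tight_layout()
```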
- AUPRC Score: 0.99
- Precision: 0.96
- Recall: 0.94
Together, these scores indicate that the model produces few false positives and few false negatives, both of which are costly in fraud detection.
- Exploratory Data Analysis (EDA): Understand the dataset distribution and detect patterns in fraudulent transactions.
- Data Preprocessing: Handle missing values and class imbalance, and scale features.
- Model Training: Train a baseline Logistic Regression model and a Random Forest model with hyperparameter tuning.
- Model Evaluation: Evaluate using precision, recall, and AUPRC score.
- Optimization: Fine-tune parameters and optimize for deployment readiness.
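One way the fine-tuning step might look is a cross-validated grid search, sketched below on synthetic data. The parameter grid, the fold count, and the choice of `GridSearchCV` itself are assumptions for illustration; the project does not state which search method or values it uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Small illustrative grid; real searches usually cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="average_precision",  # selects the candidate with the best AUPRC-style score
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

best_params = search.best_params_
best_model = search.best_estimator_  # refitted on all of X, y with best_params
```

Scoring with `average_precision` keeps the search aligned with the AUPRC metric reported above, rather than optimizing plain accuracy.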