Credit card companies must detect fraudulent transactions so that customers are not charged for purchases they did not make. Data science, coupled with machine learning, is well suited to this problem, and its importance cannot be overemphasized.
The dataset covers credit card transactions made by European cardholders in September 2013. It contains 492 frauds out of 284,807 transactions recorded over two days. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172 percent of all transactions.
This project demonstrates an end-to-end process for detecting fraudulent credit card transactions using machine learning techniques, with a focus on handling imbalanced datasets. The dataset used is publicly available and contains anonymized credit card transaction records.
- Dataset Overview
- Installation
- Data Exploration
  - Fraud Case Distribution
  - Proportion Analysis
  - Distribution of Time and Amount Features
  - Missing Values Check
- Feature Scaling and Engineering
  - Scaling of Time and Amount Features
  - Outlier Removal Using IQR
  - Feature Reordering
- Modeling and Evaluation
  - Train-Test Split
  - Metrics Used
  - Stratified k-Fold Cross-Validation
- Handling Imbalanced Data
  - Undersampling with NearMiss
  - Oversampling with SMOTE
- Visualizations
  - Distribution of Features
  - Boxplots
  - Feature Distributions by Class
  - ROC Curves
- How to Run
- Conclusion
The dataset contains 31 columns:
- Time: Seconds elapsed between each transaction and the first transaction in the dataset.
- Amount: Transaction amount.
- V1-V28: Principal components derived from a PCA transformation to anonymize data.
- Class: Target variable (0 = Genuine, 1 = Fraudulent).
The dataset is highly imbalanced: fraudulent transactions make up only 0.172 percent of the total (492 of 284,807).
Ensure you have Python installed (>= 3.8). Install the required dependencies:
pip install scikit-learn==0.24.2 imbalanced-learn numpy pandas matplotlib seaborn
- Fraudulent and valid transactions were counted and compared.
- Proportion of fraudulent transactions was calculated.
- KDE plots were created to visualize the distribution of Time and Amount.
- Confirmed there were no missing values in the dataset.
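The exploration checks above can be sketched with pandas roughly as follows. This is a minimal illustration rather than the project's actual script: the helper name `summarize_classes` is hypothetical, and a tiny synthetic frame stands in for the real data, which you would load from its CSV file instead.

```python
import pandas as pd


def summarize_classes(df: pd.DataFrame, target: str = "Class") -> dict:
    """Return class counts, the fraud percentage, and the total number of missing values."""
    counts = df[target].value_counts()
    n_fraud = int(counts.get(1, 0))
    n_genuine = int(counts.get(0, 0))
    return {
        "genuine": n_genuine,
        "fraud": n_fraud,
        "fraud_pct": 100.0 * n_fraud / len(df),
        "missing": int(df.isnull().sum().sum()),
    }


# Tiny synthetic stand-in for the real dataset (assumption, for illustration only).
toy = pd.DataFrame(
    {
        "Time": [0.0, 1.0, 2.0, 3.0, 4.0],
        "Amount": [10.0, 250.0, 3.5, 99.9, 12.0],
        "Class": [0, 0, 0, 1, 0],
    }
)

summary = summarize_classes(toy)
print(summary)
```

On the real dataset the same helper should report 492 frauds out of 284,807 rows, matching the figures quoted in the introduction.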
- Used RobustScaler to scale Time and Amount to reduce the impact of outliers.
- Outliers were removed using the IQR method.
- Time and Amount were placed at the beginning of the dataset for easier access.
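The three preprocessing steps above might look something like the sketch below. The function names `scale_time_amount` and `remove_iqr_outliers` are hypothetical, and the README does not say which columns the IQR filter was applied to or which multiplier was used, so the conventional 1.5 x IQR fence on a single column is an assumption here.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler


def scale_time_amount(df: pd.DataFrame) -> pd.DataFrame:
    """Scale Time and Amount with RobustScaler and move them to the front."""
    out = df.copy()
    # RobustScaler centers on the median and divides by the IQR,
    # so extreme amounts have little influence on the scaling.
    scaled = RobustScaler().fit_transform(out[["Time", "Amount"]])
    out["Time"] = scaled[:, 0]
    out["Amount"] = scaled[:, 1]
    front = ["Time", "Amount"]
    rest = [c for c in out.columns if c not in front]
    return out[front + rest]


def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]


# Synthetic example with one extreme Amount (assumption, for illustration only).
toy = pd.DataFrame(
    {
        "V1": [0.1, -0.3, 0.7, 0.2, -1.5],
        "Time": [0.0, 10.0, 20.0, 30.0, 40.0],
        "Amount": [1.0, 2.0, 3.0, 4.0, 1000.0],
        "Class": [0, 0, 0, 0, 1],
    }
)

scaled_df = scale_time_amount(toy)
filtered_df = remove_iqr_outliers(toy, "Amount")
print(list(scaled_df.columns))
print(len(toy), "->", len(filtered_df))
```

Note that in this toy example the extreme amount belongs to the fraudulent row, so filtering on it would discard the only positive sample. On imbalanced data it is worth checking how many fraud cases an outlier filter removes before applying it.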
- Dataset was split into 80% training and 20% testing subsets.
- Accuracy, Precision, Recall, F1-Score, and AUC-ROC were used for evaluation.
- Stratified k-fold cross-validation was applied for robust model evaluation.
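A minimal sketch of this evaluation setup is shown below. It makes several assumptions that the README does not state: synthetic data from `make_classification` replaces the real dataset, logistic regression stands in for whatever models the project trains, the split is stratified, and five folds are used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic imbalanced data (assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# 80/20 split; stratify=y keeps the fraud ratio similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified k-fold on the training set: every fold preserves the class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_recalls = []
for train_idx, val_idx in skf.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    preds = model.predict(X_train[val_idx])
    fold_recalls.append(recall_score(y_train[val_idx], preds))

# Final fit on the full training set, scored once on the held-out test set.
final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = final_model.predict(X_test)
y_score = final_model.predict_proba(X_test)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc_roc": roc_auc_score(y_test, y_score),
}
print(metrics)
```

With a positive class this rare, accuracy alone is misleading: a model that labels every transaction as genuine would still score above 99% on the real data, which is why recall, precision, F1, and AUC-ROC are reported alongside it.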
- The NearMiss undersampling technique was used to reduce the size of the majority class.
- The SMOTE oversampling technique was employed to synthetically increase the size of the minority class.
- KDE plots for Time and Amount.
- Boxplots for features.
- Distributions of all features for Fraudulent vs Genuine transactions.
- ROC curves for various models.
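As one example of these plots, a ROC curve for a single model could be drawn along the following lines. Again this is only a sketch under assumptions: the data is synthetic, logistic regression is a stand-in for the project's models, and the non-interactive `Agg` backend is selected so the code also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the fraud class

# roc_curve returns false/true positive rates over all decision thresholds.
fpr, tpr, _ = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"Logistic regression (AUC = {roc_auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve")
ax.legend(loc="lower right")
# Call fig.savefig(<path>) or plt.show() to view the figure.
```

To compare several models, repeat the `roc_curve` and `ax.plot` calls once per model on the same axes.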
- Clone the repository:
git clone https://github.com/your-repository/credit-card-fraud-detection.git
cd credit-card-fraud-detection
- Run the script:
python cc_code.py
This project demonstrated effective techniques for handling imbalanced data and building a robust fraud detection system. It highlights the importance of data preprocessing, feature engineering, and evaluation strategies for real-world machine learning applications.