This project focuses on fraud detection in retail transactions using the Data Mining Cup 2019 dataset. It includes data analysis, preprocessing, feature engineering, and the evaluation of various machine learning models.
- The `data` folder contains the training and testing data in CSV format.
- `DMC-2019-realclass.csv` contains the real class labels for the testing data.
- `train.csv` is the main dataset used for training the models.
- Class distribution: This dataset has a severe class imbalance with fraudulent transactions forming a minority.
- Feature distribution: Investigate the distribution of features like trustLevel, totalScanTimeInSeconds, grandTotal, lineItemVoids, etc.
- Correlation analysis: Explore the correlation between features and the target variable (fraud).
- Data cleaning: Handle missing values and inconsistencies in the data.
- Feature engineering: Create new features like TotalItemsScanned and various ratios.
- Downsampling: Address class imbalance by downsampling the majority class.
- Several machine learning models, including Logistic Regression, Random Forest, Decision Tree, K-Nearest Neighbors, and Gradient Boosting, are trained and evaluated.
- Models are evaluated with precision, recall, F1-score, and ROC AUC, supported by visualizations.
- The best model is selected and used to make predictions on the test data.
- Real class labels are compared to predicted labels to assess model performance.
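The downsampling and evaluation steps listed above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's actual code: the function names `downsample_majority` and `precision_recall_f1`, the toy rows, and the assumption that the label column is called `fraud` with values 0 and 1 are all hypothetical. The project itself most likely relies on pandas and scikit-learn for the same operations.

```python
import random


def downsample_majority(rows, label_key="fraud", seed=42):
    """Randomly drop majority-class rows until both classes have equal size.

    `rows` is a list of dicts; `label_key` is assumed to hold 0 or 1.
    """
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    # Whichever class is smaller is treated as the minority.
    minority, majority = sorted((fraud, legit), key=len)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): 5 fraudulent rows out of 100.
rows = [{"trustLevel": i % 6 + 1, "fraud": 1 if i < 5 else 0} for i in range(100)]
balanced = downsample_majority(rows)

# Toy labels (hypothetical): real classes vs. model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Downsampling discards most of the non-fraud rows, so it should be applied to the training data only; the comparison against the real class labels is meaningful only on the untouched test distribution.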
Clone the repository:
git clone https://github.com/zaza-ipynb/Data-Mining-Cup-2019.git