---
title: "Credit Fraud Detection Analysis"
author: "Arjun Sekhar"
date: "2023-12-26"
output: pdf_document
---
# Introduction
The Credit Fraud Detection data is derived from a Kaggle project (see link: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data) with the aim of exploring and modelling transactions made in September 2013 by European cardholders. The data covers transactions across two calendar days and comprises 492 fraudulent transactions out of 284,807 transactions in total. Note that the data contains only numeric variables, which are the result of a Principal Component Analysis (PCA) transformation. This means the original raw features are unknown for the purposes of this exercise.
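As a quick arithmetic check of the imbalance described above, the implied fraud rate can be computed directly from the quoted counts (a minimal sketch using only the numbers in the dataset description).
```{r}
# Implied fraud rate from the counts quoted in the dataset description
fraud_count <- 492
total_count <- 284807
round(fraud_count / total_count, 5)  # roughly 0.00173, i.e. about 0.17% of transactions
```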
# Data Preparation
As preparation, we load the packages that will be used throughout this analysis, as outlined in the R chunk below.
```{r, warning=FALSE, message=FALSE}
library(dplyr)
library(tidyverse)
library(pander)
library(corrplot)
library(caTools)
library(DHARMa)
library(caret)
library(pROC)
library(PRROC)
library(glmnet)
```
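If any of these packages are not yet installed, a sketch along the following lines could be used to install the missing ones beforehand (the package list simply mirrors the `library()` calls above; the chunk is not evaluated when knitting).
```{r, eval=FALSE}
# Install any packages from the list above that are not yet available
pkgs <- c("dplyr", "tidyverse", "pander", "corrplot", "caTools",
          "DHARMa", "caret", "pROC", "PRROC", "glmnet")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```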
The next step involves reading in the `creditcard.csv` data set using the `read.csv()` function. Our aim here is to understand the data and the variables at our disposal. This will segue into the exploratory data analysis (EDA), which in turn informs the model building strategy.
```{r}
# Set the working directory
setwd("/Users/arjunsekhar/OneDrive/Knowledge/Courses/Kaggle/credit-fraud-analysis")
# Input the data
credit <- read.csv('creditcard.csv', header = TRUE)
pander(head(credit))
```
From the data presented, a total of 284,807 transactions across 31 variables are recorded, representing credit card transactions made in September 2013 by European cardholders. As mentioned in the context of this task, the data has been transformed extensively with Principal Component Analysis (PCA), which limits the information available about the original features.
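As a small optional verification (not part of the original workflow), the dimensions and column names quoted above can be confirmed directly.
```{r}
# Verify the number of transactions and variables quoted above
dim(credit)    # expected: 284807 rows and 31 columns
names(credit)  # Time, the PCA components V1-V28, Amount and Class
```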
\newpage
# Exploratory Data Analysis (EDA)
## A) Spread of values in 'Class' column
Starting with the `Class` column, we convert it to a factor and label its levels as `Non Fraudulent` and `Fraudulent` transactions. The resulting summary can then be presented as a table using `pander`, which displays the information neatly in the LaTeX output.
```{r}
credit$Class <- as.factor(credit$Class)
levels(credit$Class) <- c('Non Fraudulent', 'Fraudulent')
credit_class <- credit %>%
group_by(Class) %>%
summarise(Total = n()) %>%
mutate(Frequency = round(Total/sum(Total), 5)) %>%
arrange(desc(Frequency))
pander(credit_class)
```
From the above we can see that the data is highly imbalanced. While the dataset description foreshadows this finding, it also matches everyday intuition, since we would expect the vast majority of transactions to be non-fraudulent. Because raw accuracy is misleading on such imbalanced data, the Area Under the Curve (AUC) will be used as the performance measure instead.
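To illustrate why raw accuracy is a poor yardstick here, consider a trivial classifier that labels every transaction as non-fraudulent: the sketch below (an illustrative aside, not part of the original modelling) shows it would still achieve very high accuracy while catching no fraud at all.
```{r}
# Accuracy of a trivial "predict everything as non-fraudulent" baseline
baseline_accuracy <- mean(credit$Class == 'Non Fraudulent')
round(baseline_accuracy, 5)  # roughly 0.99827, despite detecting zero fraudulent transactions
```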
## B) Missing Values
The next step is to check the extent of missing values in this data set.
```{r}
# Quantity of Missing Values
credit %>%
is.na() %>%
sum()
```
From the above we can see that there are no missing values. Although such a clean result is rare in practice, in this context it can be attributed to the preparation carried out before the data was published.
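Should a per-column breakdown be wanted as a double-check, it could be obtained as follows (a brief optional aside).
```{r}
# Missing values per column; every entry should be zero for this data set
colSums(is.na(credit))
```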
## C) Correlation Matrix
```{r}
# Check for duplicated rows
credit %>% duplicated() %>% sum()
# Copy the data and encode Class numerically for the correlation calculation
credit2 <- credit
credit2$Class <- as.numeric(credit2$Class)
# Spearman correlation matrix across all variables
credit_correlation <- cor(credit2, method = 'spearman')
corrplot(credit_correlation, method = 'shade', tl.cex = 0.65)
```
Because the author applied Principal Component Analysis before publishing this data, the plot above shows that most variables are not correlated with one another.
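To make this concrete, we could also list the variables with the strongest absolute Spearman correlation with `Class`, reusing the `credit_correlation` matrix from the chunk above (an optional aside, not part of the original analysis).
```{r}
# Variables most strongly correlated (in absolute terms) with Class
class_correlation <- credit_correlation[, 'Class']
head(sort(abs(class_correlation), decreasing = TRUE), 6)  # the first entry is Class itself
```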
\newpage
# Data Modelling
## A) Splitting the data set into train and test sets
The next step prepares the data for model building and the subsequent analysis. The first goal is to standardise the `Amount` variable with `scale()`, which centres it and rescales it to unit variance so that extreme values do not dominate the analysis. Note that `set.seed()` is used so that the random split is reproducible. The `Time` column is then dropped, and the data is split into train and test sets, with 80% of observations in the former and 20% in the latter.
```{r}
# Standardising the data
credit$Amount <- scale(credit$Amount)
credit3 <- credit[,c(-1)]
pander(head(credit3))
# Random Number Generator
set.seed(1234)
# Split the data
credit_data <- sample.split(credit3$Class, SplitRatio = 0.8)
train_set <- subset(credit3, credit_data == TRUE)
test_set <- subset(credit3, credit_data == FALSE)
# Dimensions
dim(train_set)
dim(test_set)
```
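Since `sample.split()` from `caTools` stratifies on the response, a quick optional check (not in the original analysis) confirms that the fraud proportion is preserved in both subsets.
```{r}
# Confirm that the class proportions are preserved across the split
round(prop.table(table(train_set$Class)), 5)
round(prop.table(table(test_set$Class)), 5)
```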
## B) Fit Logistic Regression Model
The next step involves fitting a logistic regression model to the training data. This is our initial model building method because traditional fraud analytics treats fraud detection as a binary classification problem. A Receiver Operating Characteristic (ROC) curve will then be employed to investigate the Area Under the Curve (AUC).
```{r, warning=FALSE}
# Logistic Regression using Train Data
credit_log_train <- glm(Class ~ ., data = train_set, family = 'binomial')
summary(credit_log_train)
```
Viewing the plots of this logistic model, we can see the following:
```{r}
## Plots of the Train Data
par(mfrow=c(2,2))
plot(credit_log_train)
```
We can see from the above that both the Q-Q and residuals plots have the majority of points following a linear trend, although some values deviate from normality. Given there are no substantial outliers in the context of this exercise, we will move on to the test set in order to transition into the analysis of the Area Under the Curve (AUC) using the Receiver Operating Characteristic (ROC).
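As an optional point of reference before moving on (not part of the original workflow), the in-sample ROC and AUC of the training-set model could be computed with the same `pROC` functions used later for the test set.
```{r, warning=FALSE, message=FALSE}
# In-sample ROC/AUC of the training-set logistic model (reference only)
train_probabilities <- predict(credit_log_train, type = 'response')
roc(train_set$Class, train_probabilities, plot = TRUE)
```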
```{r, warning=FALSE}
# Logistic Regression using Test Data
credit_log_test <- glm(Class ~ ., data = test_set, family = 'binomial')
summary(credit_log_test)
```
Having now fitted the model on the test data, we can visualise its residuals and check normality. After this step, we create the Receiver Operating Characteristic (ROC) curve. We want the resulting AUC to be high, and it also provides a baseline against which other algorithms can be compared when investigating model suitability.
```{r}
## Plots of the Test Data
par(mfrow=c(2,2))
plot(credit_log_test)
```
The test data shows results very similar to the train data: the majority of points follow a linear trend in the normality checks. Since we are assessing the model against the binary response variable `Class`, we proceed with visualising the ROC curve and its Area Under the Curve.
```{r}
# Predicted fraud probabilities from the test-set model
credit_log_prediction <- predict(credit_log_test, newdata = test_set, type = 'response')
roc(test_set$Class, credit_log_prediction, plot = TRUE)
```
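If a single numerical summary is preferred, the AUC (and, optionally, a confidence interval) can be extracted from the same `pROC` object; the brief sketch below reuses the prediction vector from the chunk above.
```{r, warning=FALSE, message=FALSE}
# Extract the AUC and a 95% confidence interval from the ROC analysis above
credit_log_roc <- roc(test_set$Class, credit_log_prediction)
auc(credit_log_roc)
ci.auc(credit_log_roc)
```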