---
title: "R Notebook"
output: html_notebook
---
```{r}
required_packages <- c(
  # Add the packages you will use to this list; any that are missing are
  # installed automatically
  "readr",
  "data.table",
  "dplyr",
  "ggplot2",
  "tidyr",
  "purrr",
  "broom",
  "caret" # provides createDataPartition(), used for the train/test split
)
packages_to_install <- required_packages[!(required_packages %in%
                                            installed.packages()[, 1])]
if (length(packages_to_install) > 0) {
  install.packages(packages_to_install)
}
suppressPackageStartupMessages({
  invisible(sapply(required_packages, require, character.only = TRUE))
})
```
```{r}
df <- fread("creditcard.csv")
```
```{r}
df <- df %>%
  data.frame() %>%
  mutate(Class = as.factor(Class))
```
```{r}
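# Set seed so the random split is reproducible
set.seed(42)
# Stratified 80/20 split on Class; createDataPartition() comes from caret
trainIndex <- caret::createDataPartition(df$Class,
                                         p = 0.8, # Proportion of training data
                                         list = FALSE,
                                         times = 1)
# Note: these subsets are taken before the indicator features are added and
# the unused columns are dropped further down, so they hold the original columns
df_train <- df[trainIndex, ]
df_test <- df[-trainIndex, ]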
```
```{r}
df <- df %>%
  mutate(A1 = ifelse(V1 < 2, 1, 0),
         A2 = ifelse(V2 > 1.5, 1, 0),
         A3 = ifelse(V3 < 2, 1, 0),
         A4 = ifelse(V4 > 2.5, 1, 0),
         A5 = ifelse(V5 < -1.7, 1, 0),
         A6 = ifelse(V6 < -1.7, 1, 0),
         A7 = ifelse(V7 < -1.8, 1, 0),
         A9 = ifelse(V9 < -2, 1, 0),
         A10 = ifelse(V10 < -1.5, 1, 0),
         A11 = ifelse(V11 > 2.5, 1, 0),
         A12 = ifelse(V12 < -2.5, 1, 0),
         A14 = ifelse(V14 < -1.5, 1, 0),
         A16 = ifelse(V16 < -2, 1, 0),
         A17 = ifelse(V17 < -1, 1, 0),
         A18 = ifelse(V18 < -2.5, 1, 0),
         A19 = ifelse(V19 > 2, 1, 0),
         A21 = ifelse(V21 > 0.7, 1, 0))
```
Let us plot the distribution of the `Class` variable:
```{r}
ggplot(df) +
  geom_bar(aes(x = Class), width = 0.5) +
  ggtitle("Distribution of classes") +
  theme(plot.title = element_text(hjust = 0.5))
table(df$Class)
```
```{r}
df <- df %>%
  select(-V8,
         -V13,
         -V15,
         -V20,
         -V22,
         -V23,
         -V24,
         -V25,
         -V26,
         -V27,
         -V28)
```
```{r}
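# Median and 75th percentile of V1 within each class.
# map() needs a function, so the quantiles are computed per nested data frame
# (.x) rather than once on the whole of df$V1.
lul <- df %>%
  group_by(Class) %>%
  nest() %>%
  mutate(qs = map(data, ~ t(quantile(.x$V1, probs = c(0.5, 0.75)))))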
```
The class counts plotted above show that `Class` is highly imbalanced. The dataset should therefore be resampled so that the classes are balanced. Two common approaches are undersampling (removing instances from the majority class) and oversampling (replicating instances from the minority class).
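To make the two resampling strategies concrete, here is a minimal base-R sketch on toy data. The data frame `toy`, its 95/5 class counts, and all variable names are assumptions made for illustration; they are not taken from `creditcard.csv`, and this is not necessarily how the resampling will be done later in this notebook.

```r
# Toy data standing in for the real dataset (hypothetical values):
# 95 rows of the majority class (Class = 0) and 5 of the minority (Class = 1).
set.seed(42)
toy <- data.frame(
  V1    = rnorm(100),
  Class = factor(c(rep(0, 95), rep(1, 5)))
)

# Row indices of each class
idx_major <- which(toy$Class == "0")
idx_minor <- which(toy$Class == "1")

# Undersampling: keep every minority row and a random, equally sized
# subset of the majority rows (drawn without replacement).
keep_major <- idx_major[sample.int(length(idx_major), length(idx_minor))]
toy_under <- toy[c(idx_minor, keep_major), ]

# Oversampling: keep every majority row and draw minority rows with
# replacement until both classes have the same size.
extra_minor <- idx_minor[sample.int(length(idx_minor), length(idx_major),
                                    replace = TRUE)]
toy_over <- toy[c(idx_major, extra_minor), ]

table(toy_under$Class)
table(toy_over$Class)
```

Both tables report equal counts per class: 5 each after undersampling and 95 each after oversampling. Undersampling discards most of the majority data, while oversampling repeats minority rows, so either one is normally applied to the training set only.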
Let us begin by splitting our dataset into training and test sets. We use the `mlr` package in R for this exercise.
```{r}
#
```
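With classes this imbalanced, the split should be stratified so that both parts keep roughly the same class ratio. The following is a package-independent sketch in base R, not the `mlr` workflow mentioned above; the helper `stratified_split()` and the toy data frame `demo` are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: returns training-set row indices, sampling the same
# fraction `p` from every level of `y` so the class ratio is preserved.
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

# Toy data (hypothetical): 950 rows of class 0 and 50 rows of class 1
set.seed(42)
demo <- data.frame(
  V1    = rnorm(1000),
  Class = factor(c(rep(0, 950), rep(1, 50)))
)

train_idx  <- stratified_split(demo$Class, p = 0.8)
demo_train <- demo[train_idx, ]
demo_test  <- demo[-train_idx, ]

# Class proportions should match in both parts
prop.table(table(demo_train$Class))
prop.table(table(demo_test$Class))
```

Sampling within each class separately is what guarantees that the rare class appears in the test set at all; a plain random split of a heavily imbalanced dataset can leave it with very few minority rows.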