---
title: "Market Basket Analysis (Association Rules Mining)"
author: "Gerald Bryan"
date: "11/11/2020"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
      smooth_scroll: true
    number_sections: true
    theme: flatly
    highlight: tango
    df_print: paged
---
# Introduction
<br>
## Market Basket Analysis
**Market Basket Analysis** (MBA) is a data mining technique that retailers use to increase sales by better understanding customer purchasing patterns. In machine learning terms, MBA is an unsupervised learning technique for analyzing transactional data, most commonly the purchasing patterns of customers. For example:

{T-shirt, Trousers} ⇒ {Jacket}

The rule above states that if someone buys a T-shirt and trousers, a jacket is also likely to be purchased. The example suggests that MBA matters mainly in retail and sales, but MBA, also known as Association Rules Mining, is a powerful tool in many other scenarios as well.

In this example I use MBA to find which student characteristics are associated with alcohol consumption, using the ["Student Alcohol Consumption"](https://www.kaggle.com/uciml/student-alcohol-consumption) dataset from Kaggle.
## Apriori Algorithm
When we talk about Market Basket Analysis or Association Rules Mining, one algorithm comes to mind first: the **Apriori algorithm**.

According to Wikipedia:

>The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.

For more information about the Apriori algorithm, see [here](https://www.geeksforgeeks.org/apriori-algorithm/) or [here](https://en.wikipedia.org/wiki/Apriori_algorithm).
```{r, echo=FALSE}
Transaction <- c("T1","T2","T3","T4")
Items <- c("{Tooth brush, Tooth paste, Mouth wash}", "{Jam , Peanut butter, Bread}", "{cereal, milk}","{T-shirt, Trousers}")
a <- data.frame(Transaction,Items)
a
```
The table above shows four transactions from a supermarket. The item set is

$$
I = \{\text{Tooth brush}, \text{Tooth paste}, \text{Mouth wash}, \text{Jam}, \text{Peanut butter}, \text{Bread}, \text{cereal}, \text{milk}, \text{T-shirt}, \text{Trousers}\}
$$

and the set of transactions is

$$
T = \{T1, T2, T3, T4\}
$$

For example,

$$
T1 = \{\text{Tooth brush}, \text{Tooth paste}, \text{Mouth wash}\}.
$$

An association rule is then defined as

$$
X \Rightarrow Y, \quad \text{where } X \subset I,\ Y \subset I \text{ and } X \cap Y = \emptyset
$$

and one rule that can be formed from transaction 1 (T1) is

$$
\{\text{Tooth brush}, \text{Tooth paste}\} \Rightarrow \{\text{Mouth wash}\}
$$
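
The definition can also be checked mechanically. The chunk below is a minimal base R sketch, not part of the original analysis: `is_valid_rule` is a hypothetical helper, and the `toy_` objects simply copy the items from the table above. It tests that both sides of the rule are drawn from the item set and share no item, and then whether T1 contains every item of the rule.

```{r}
# Item universe and transaction T1, copied from the toy table above
toy_items <- c("Tooth brush", "Tooth paste", "Mouth wash", "Jam", "Peanut butter",
               "Bread", "cereal", "milk", "T-shirt", "Trousers")
toy_t1 <- c("Tooth brush", "Tooth paste", "Mouth wash")

# Candidate rule X => Y
rule_lhs <- c("Tooth brush", "Tooth paste")
rule_rhs <- c("Mouth wash")

# Hypothetical helper: TRUE when X and Y are subsets of I and have no item in common
is_valid_rule <- function(lhs, rhs, items) {
  all(lhs %in% items) && all(rhs %in% items) && length(intersect(lhs, rhs)) == 0
}

# Is the rule well formed?
is_valid_rule(rule_lhs, rule_rhs, toy_items)

# Does transaction T1 contain every item of the rule?
all(c(rule_lhs, rule_rhs) %in% toy_t1)
```

A rule such as {Jam} ⇒ {Jam} would fail the check, because its two sides overlap.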
# Library Packages
```{r, message=FALSE, warning=FALSE}
library(arules)     # For mining association rules
library(arulesViz)  # For visualizing association rules
library(tidyr)      # For tidying the data
library(tidyverse)  # For data manipulation and visualization (a collection of R packages)
```
In this project we use four libraries:

- arules : mines association rules
- arulesViz : visualizes association rules
- tidyr : tidies the data
- tidyverse : data manipulation and visualization (a collection of R packages, which also includes tidyr)
# Data
```{r}
data <- read.csv("data/student-por.csv")
head(data)
```
```{r}
str(data)
```
The student alcohol consumption dataset originally consists of 649 observations and 33 variables. To use it for market basket analysis, every variable must be categorical (a factor in R). In the next section I therefore convert all variables to factors and also merge some of them.
# Feature Engineering and Data preparation
Feature engineering is, put simply, the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms.

For more information, see [here](https://en.wikipedia.org/wiki/Feature_engineering).
```{r}
# Alcohol consumption: average the workday (Dalc) and weekend (Walc) scores,
# then label an average of 2.5 or more as "High"
data$alc_cons <- (data$Dalc + data$Walc) / 2
data$alc_cons <- ifelse(data$alc_cons >= 2.5, "High", "Low")
```
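
To see how this threshold behaves, the same two steps can be run on a few made-up scores. This is only an illustrative sketch: the `toy_` vectors are hypothetical values on the dataset's 1 (very low) to 5 (very high) scale, not rows from the data.

```{r}
# Hypothetical scores on the 1 (very low) to 5 (very high) scale
toy_dalc <- c(1, 2, 5)   # workday consumption
toy_walc <- c(1, 3, 4)   # weekend consumption

# Same two steps as above: average the scores, then cut at 2.5
toy_mean  <- (toy_dalc + toy_walc) / 2
toy_label <- ifelse(toy_mean >= 2.5, "High", "Low")

data.frame(toy_dalc, toy_walc, toy_mean, toy_label)
```

Because the comparison is `>=`, a student whose average is exactly 2.5 lands in the "High" group.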
```{r}
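# Parents' education: average the father's (Fedu) and mother's (Medu) education levels,
# then label an average above 2 as "High Education"
data$parents_ed <- (data$Fedu + data$Medu) / 2
data$parents_ed <- ifelse(data$parents_ed > 2, "High Education", "Low Education")
```
```{r}
# Grades: did the final grade (G3) improve on the first-period grade (G1)?
data$grade_imp <- ifelse(data$G1 < data$G3, "Improve", "Not Improve")
# Average of the three grades, labelled against a cut-off of 12
data$grade_ave <- (data$G1 + data$G2 + data$G3) / 3
data$grade <- ifelse(data$grade_ave >= 12, "Above Average", "Below Average")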
```
```{r}
# Turn the remaining variables into two-level categories.
# Derived columns are created before the columns they depend on are overwritten.
data$age <- ifelse(data$age >= 19, "19-22", "15-18")
# personality uses the numeric freetime and goout scores
data$personality <- ifelse(data$freetime >= 3 & data$goout >= 3, "Extrovert", "Introvert")
data$famsize <- ifelse(data$famsize == "GT3", "Big", "Small")
# like_school uses the numeric absences and failures counts
data$like_school <- ifelse(data$absences >= 3 & data$failures > 2, "Yes", "No")
data$ed_support <- ifelse(data$famsup == "yes" | data$schoolsup == "yes", "Yes", "No")
data$failures <- ifelse(data$failures == 0, "No", "Yes")
data$traveltime <- ifelse(data$traveltime > 2, "Long", "Short")
data$famrel <- ifelse(data$famrel >= 3, "Good", "Bad")
data$health <- ifelse(data$health >= 3, "Good", "Bad")
data$address <- ifelse(data$address == "U", "Urban", "Rural")
data$parents_guidance <- ifelse(data$Mjob == "at_home" | data$Fjob == "at_home", "Yes", "No")
data$Pstatus <- ifelse(data$Pstatus == "A", "Apart", "Together")
data$studytime <- ifelse(data$studytime >= 3, "Long", "Short")
data$freetime <- ifelse(data$freetime >= 3, "Many", "Few")
```
```{r}
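# Drop the raw columns that were merged into the new features
data <- data %>%
  select(-c(goout, absences, reason, Dalc, Walc, Fjob, Mjob, guardian, G1, G2, G3, grade_ave, schoolsup, famsup, Medu, Fedu))
```
```{r}
# Convert every character column to a factor, then keep only the factor columns
data <- data %>%
  mutate_if(is.character, as.factor)
data <- data %>%
  select_if(is.factor)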
```
```{r}
str(data)
```
After the feature engineering steps, in which I created new features that may improve the model, removed some unnecessary variables and, most importantly, converted every variable to a factor, we get the "clean" data for the association rules / market basket analysis, with 649 observations and 26 variables.

The variables are defined as follows:

- **school : The school the student attends (MS : Mousinho da Silveira, GP : Gabriel Pereira)**
- **sex : The gender of the student (M : Male, F : Female)**
- **age : Age group of the student (15-18, 19-22)**
- **address : The area where the student lives (Rural, Urban)**
- **famsize : The student's family size (Big = more than 3 persons, Small = 3 persons or fewer)**
- **Pstatus : Parents' cohabitation status (Together, Apart)**
- **traveltime : The student's travel time to school (Long = 30 minutes or more, Short = less than 30 minutes)**
- **studytime : Time the student spends studying (Long = level 3 or above on the original scale, Short = below 3)**
- **failures : Whether the student has ever failed a class (Yes, No)**
- **paid : Whether the student paid for extra classes in Math or Portuguese (yes, no)**
- **activities : Whether the student takes part in extra-curricular activities (yes, no)**
- **nursery : Whether the student attended nursery school (yes, no)**
- **higher : Whether the student wants to take higher education (yes, no)**
- **internet : Whether the student has internet access at home (yes, no)**
- **romantic : Whether the student is in a romantic relationship (yes, no)**
- **famrel : The quality of the student's family relationships (Good, Bad)**
- **freetime : The student's amount of free time (Many, Few)**
- **health : The student's health condition (Good, Bad)**
- **alc_cons : The student's alcohol consumption level (High, Low)**
- **parents_ed : The education level of the student's parents (High Education, Low Education)**
- **grade_imp : Whether the student's grade improved, that is G1 < G3 (Improve, Not Improve)**
- **grade : Whether the average of the student's three grades is at least 12, the cut-off used as the overall average (Above Average, Below Average)**
- **personality : Personality of the student (Introvert, Extrovert), based on the freetime and going-out scores**
- **like_school : Whether the student likes school (Yes, No), based on absences and failures**
- **parents_guidance : Whether the student's father or mother has the job category at_home (Yes, No)**
- **ed_support : Whether the student has educational support from either the family or the school (Yes, No)**

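You may wonder where the "Transactions" and the "Items" are, since no variable carries either name. Don't worry: in this dataset each student (each row) plays the role of a transaction, and every variable-value pair, such as `sex=M` or `higher=no`, is an item. The variable "alc_cons" is the outcome we care about, so it is fixed as the right-hand side of the rules, while the rest of the variables supply the items on the left-hand side.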
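
To make this mapping concrete, the sketch below coerces a tiny data frame of factors into an arules `transactions` object. The data frame `toy_df` is hypothetical and only mimics the shape of the prepared data; it assumes that arules is installed.

```{r}
library(arules)

# Hypothetical mini data set: every column is a factor, as in the prepared data
toy_df <- data.frame(
  sex      = factor(c("M", "F", "F")),
  higher   = factor(c("no", "yes", "yes")),
  alc_cons = factor(c("High", "Low", "Low"))
)

# Coerce the data frame: each row becomes one transaction,
# each variable=value pair becomes one item
toy_trans <- as(toy_df, "transactions")

itemLabels(toy_trans)  # the item names that arules creates
inspect(toy_trans)     # the items contained in each row
```

Calling `apriori()` directly on a data frame of factors performs the same coercion internally, which is why the prepared data can be passed to it as is.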
# The Modelling Process
In this part we build the model from the data. We divide it into two parts: finding which factors lead to high alcohol consumption, and finding which factors lead to low alcohol consumption.

Before we go to the modelling process, it helps to know these terms first:

**1. Support**

**Support** is the percentage of transactions that contain all of the items in an itemset, for example *{Item A, Item B}*. The higher the support, the more frequently the itemset occurs. Rules with high support are preferred because they are likely to apply to a large number of future transactions.

Support is calculated as

$$
\text{Support}(\text{Item A} \Rightarrow \text{Item B}) = Pr(\text{Item A}, \text{Item B}) = \dfrac{\text{count}(\text{Item A}, \text{Item B})}{N}
$$

where $N$ represents the total number of transactions.
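
As a worked example, the formula can be evaluated by hand in base R. This is a minimal sketch on five hypothetical baskets; `toy_baskets` and the helper `itemset_support` are made-up names, not part of arules.

```{r}
# Hypothetical baskets, one character vector per transaction
toy_baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Hypothetical helper: share of baskets that contain every item in `itemset`
itemset_support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Baskets 1, 3 and 5 contain both bread and milk, so the count is 3 and N is 5
itemset_support(c("bread", "milk"), toy_baskets)
```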
**2. Confidence**

**Confidence** is the probability that a transaction containing the items on the left-hand side of the rule also contains the item on the right-hand side. The higher the confidence, the greater the chance that the item on the right-hand side will be purchased.

Confidence is calculated as

$$
\text{Confidence}(\text{Item A} \Rightarrow \text{Item B}) = \dfrac{\text{support}(\text{Item A}, \text{Item B})}{\text{support}(\text{Item A})}
$$
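
Because $N$ cancels out of the ratio, confidence can also be read as a simple conditional count. The base R sketch below uses five hypothetical baskets (`toy_baskets` is a made-up object) and the rule {bread} ⇒ {milk}.

```{r}
# Hypothetical baskets, one character vector per transaction
toy_baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Which baskets contain the left-hand side, and which contain both sides?
has_lhs  <- vapply(toy_baskets, function(b) "bread" %in% b, logical(1))
has_both <- vapply(toy_baskets, function(b) all(c("bread", "milk") %in% b), logical(1))

# Confidence of {bread} => {milk}: baskets with both items / baskets with bread
toy_confidence <- sum(has_both) / sum(has_lhs)
toy_confidence
```

Four baskets contain bread and three of those also contain milk, so the rule holds in three out of four cases.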
**3. Lift**

**Lift** is the support of the rule divided by the product of the individual probabilities of the left-hand and right-hand sides, that is, the support we would expect if there were no association between them.

Lift is calculated as

$$
\text{Lift}(A \Rightarrow B) = \dfrac{\text{support}(A, B)}{Pr(A)\,Pr(B)} = \dfrac{Pr(A, B)}{Pr(A)\,Pr(B)} = \dfrac{Pr(B \mid A)}{Pr(B)}
$$
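These are the implications of **lift**:

- When lift equals 1, the two sides are independent: there is no association at all.
- When lift is greater than 1, the item on the right-hand side is more likely to occur when the left-hand side is present.
- When lift is less than 1 (it can never fall below 0), the item on the right-hand side is less likely to occur when the left-hand side is present.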
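
The lift formula can be traced step by step in base R. The sketch below is illustrative only: `toy_baskets` and the helper `has_item` are hypothetical, and the rule examined is {bread} ⇒ {milk}.

```{r}
# Hypothetical baskets, one character vector per transaction
toy_baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Hypothetical helper: logical vector marking the baskets that contain `item`
has_item <- function(item) {
  vapply(toy_baskets, function(b) item %in% b, logical(1))
}

p_bread <- mean(has_item("bread"))                     # Pr(A)
p_milk  <- mean(has_item("milk"))                      # Pr(B)
p_both  <- mean(has_item("bread") & has_item("milk"))  # Pr(A, B)

# Lift of {bread} => {milk}: observed joint share over the share expected under independence
toy_lift <- p_both / (p_bread * p_milk)
toy_lift
```

Here the result is slightly below 1, so in this toy data buying bread makes milk marginally less likely than it is overall.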
Now let's do the modelling and see what we can take away from this dataset.
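
Before running on the full data, it may help to see on a toy example how `apriori()` can be told to keep only rules that end in one chosen item. The chunk below is a sketch under stated assumptions: `toy_students` is a hypothetical data frame of eight made-up students, and the thresholds are picked only so that this tiny example yields rules. The `appearance` argument pins the right-hand side to `alc_cons=High` and lets every other item appear on the left-hand side.

```{r}
library(arules)

# Hypothetical data: eight made-up students, all columns are factors
toy_students <- data.frame(
  sex      = factor(c("M", "M", "M", "F", "F", "F", "M", "F")),
  higher   = factor(c("no", "no", "yes", "yes", "yes", "yes", "no", "yes")),
  alc_cons = factor(c("High", "High", "Low", "Low", "Low", "Low", "High", "Low"))
)

# Mine rules whose right-hand side is fixed to alc_cons=High
toy_rules <- apriori(
  toy_students,
  parameter  = list(support = 0.1, confidence = 0.5, target = "rules", minlen = 2, maxlen = 3),
  appearance = list(rhs = "alc_cons=High", default = "lhs"),
  control    = list(verbose = FALSE)
)

# Every rule printed here ends in {alc_cons=High}
inspect(sort(toy_rules, by = "confidence"))
```

Setting `minlen = 2` excludes rules with an empty left-hand side, and `maxlen = 3` caps each rule at two items on the left plus the fixed item on the right.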
**5.2.1 High Alcohol Consumption**
```{r,warning= FALSE}
mba_high <- apriori(data, parameter = list(support = 0.01, confidence = 0.5, target = "rules", minlen = 2, maxlen = 3),
                    appearance = list(rhs = "alc_cons=High", default = "lhs"))
```
```{r}
inspect(head(sort(mba_high, by="confidence"),10))
```
Reading the first rule: male students who do not want to take higher education and who also have a high alcohol consumption make up 3.54% of the whole dataset (support). Within the group of male students who do not want to take higher education, 67.64% have a high alcohol consumption (confidence). A male student who does not want to take higher education is therefore 2.27 times (lift) more likely to have a high alcohol consumption than a student picked at random.

I also visualize the result above:
```{r, warning= FALSE, message=FALSE}
plot(mba_high)
```
```{r}
plot(mba_high[1:10], method = "graph")
```
```{r}
plot(mba_high[1:10], method="graph", control=list(layout=igraph::in_circle()))
```
**5.2.2 Low Alcohol Consumption**
```{r}
mba_low <- apriori(data, parameter = list(support = 0.5, confidence = 0.7, target = "rules", minlen = 2, maxlen = 3),
                   appearance = list(rhs = "alc_cons=Low", default = "lhs"))
```
```{r}
summary(mba_low)
```
```{r, warning=FALSE, message=FALSE}
plot(mba_low)
```
```{r}
inspect(head(sort(mba_low, by="confidence"),10))
```
Reading the first rule: students who have never failed a class, attended nursery school and also have a low alcohol consumption make up 50.69% of the whole dataset (support). Within the group of students who have never failed a class and attended nursery school, 73.93% have a low alcohol consumption (confidence). A student who has never failed a class and attended nursery school is 1.05 times (lift) more likely to have a low alcohol consumption than a student picked at random, which is only a very weak association.
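
As a rough sanity check on the two top rules, recall that lift is confidence divided by the overall share of the right-hand side, so dividing each reported confidence by its lift recovers that share. The figures below are approximate, because they are computed from the rounded confidence and lift values quoted in the text rather than from the data itself:

$$
Pr(\text{High}) \approx \dfrac{0.6764}{2.27} \approx 0.298, \qquad Pr(\text{Low}) \approx \dfrac{0.7393}{1.05} \approx 0.704
$$

The two shares add up to roughly 1, as they should for a two-level variable. The figures also suggest why the rules for low consumption have lifts so close to 1: with about 70% of students already in the "Low" group, a rule has little room to improve on the baseline.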
The plots below visualize the rules for low alcohol consumption:
```{r}
plot(mba_low[1:10], method="graph")
```
```{r}
plot(mba_low[1:10], method="grouped")
```
```{r}
plot(head(sort(mba_low,by="lift"),10),method="graph")
```
# Conclusion
Market basket analysis is a very useful technique for analyzing data. Traditionally it is applied only to transaction data, but it is not limited to that. You can apply it to many kinds of datasets, as long as you convert the variables to factors first. Hopefully this helps you with your own market basket analysis.
**Thank you :)**