# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py:percent
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.11.3
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---
# %% [md]
# # Online Purchase Propensity
# %% [md]
# ## Project Overview
# %% [md]
# This [Kaggle](https://www.kaggle.com/benpowis/customer-propensity-to-purchase-data) dataset
# tracks a day's worth of user visits to a fictitious online retail website: 607,055 rows, 25
# columns, and (excluding "UserID") entirely binary data. Features #2 through #24 track
# various user states and actions across the website, while #25 denotes whether the user session
# resulted in an order. The end goal is to prototype a couple of machine learning models to predict
# user propensity to purchase for our hypothetical client.
# %% [md]
# ## Data Exploration
# %%
# module imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sb
from hyperopt import fmin, hp, STATUS_OK, tpe, Trials
from imblearn.ensemble import BalancedRandomForestClassifier
from IPython.display import Markdown, display
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import BernoulliNB
from statsmodels.stats.outliers_influence import variance_inflation_factor
# %%
# markdown printer
def printmd(text):
    display(Markdown(text))
# %%
# importing data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
printmd('**Train info:**')
print(train.info(), '\n')
printmd('**Test info:**')
print(test.info())
# %% [md]
# Both dataframes are quite clean, with no missing or null values. We drop the "UserID" column as
# it won't be needed in the remaining analysis.
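#
# As a quick sanity check, the absence of missing values can also be confirmed directly:
# %%
# counting missing values across both dataframes
printmd('**Missing value counts:**')
print('train:', train.isnull().sum().sum())
print('test:', test.isnull().sum().sum())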
# %%
# dropping userid
train.drop("UserID", axis = 1, inplace = True)
test.drop("UserID", axis = 1, inplace = True)
# %%
# class balances
printmd('**Train balance:**')
print(train.ordered.value_counts()/train.shape[0], '\n')
printmd('**Test balance:**')
print(test.ordered.value_counts()/test.shape[0])
# %% [md]
# This dataset suffers from heavy class imbalance, something we will need to keep in mind when
# building predictive models sensitive to it. Note that the test dataset is composed entirely of
# "ordered" = 0 instances, which will have a significant impact on certain metrics used to assess
# model performance.
# %%
# correlation plot
fig, ax = plt.subplots()
fig.set_size_inches(10, 8)
sb.heatmap(train.corr(), cmap = 'winter', ax = ax)
ax.set_title('Column Correlation Matrix')
plt.show()
# %% [md]
# The above heatmap helps visualize predictor-predictor and predictor-response correlations at a
# glance. "checked_delivery_detail", "sign_in", and "saw_checkout" all show promising amounts of
# correlation with the response variable, albeit with similarly high levels of multicollinearity
# between themselves.
#
# A more quantitative method of assessing multicollinearity amongst the predictors is to calculate
# variance inflation factors (VIFs): for each predictor, VIF = 1 / (1 - R^2), where R^2 comes from
# regressing that predictor on all of the others.
# %%
# x-y splits
X_train = train.iloc[:, :-1]
y_train = train['ordered']
X_test = test.iloc[:, :-1]
y_test = test['ordered']
# %%
# calculating vifs
vifs = pd.DataFrame(X_train.columns, columns = ['col'])
vifs.loc[:, 'vif'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vifs = vifs.sort_values('vif', ascending = False).set_index('col', drop = True)
print(vifs)
# %% [md]
# The top four variables in this list would be the first candidates to address: "loc_uk",
# "device_mobile", "sign_in", and "saw_checkout". In a much wider dataset, feature engineering and
# reduction might be necessary to rein in our model; our current set of 23 features will be kept
# as is.
# %% [md]
# ## Building Models
# %% [md]
# ### 1. BernoulliNB (bnb)
# %% [md]
# The first model will serve as the baseline to beat: a naive Bayes classifier designed
# specifically for boolean features. For reasons covered later on, this model should perform well
# despite some hidden gotchas in this particular dataset.
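#
# As a rough sketch of what BernoulliNB does under the hood, each feature is modelled as a
# Bernoulli variable conditioned on the class; the toy data below is purely illustrative:
# %%
# toy illustration: per-feature conditional probabilities learned by BernoulliNB
toy_X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
toy_y = np.array([1, 1, 0, 0])
toy_bnb = BernoulliNB().fit(toy_X, toy_y)
print(np.exp(toy_bnb.feature_log_prob_))  # rows: classes, columns: P(feature = 1 | class), smoothed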
# %%
# model construction
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_pred = bnb.predict(X_test)
printmd('**Classification Report**')
print(classification_report(y_test, y_pred, digits = 4, zero_division = 0))
# %% [md]
# The impact of only having one class in the test dataset is evident in the class "1" f1 score (0)
# and macro average f1 score (.495). With no actual class "1" instances in the test set, every "1"
# prediction is a false positive, so class "1" precision is 0 and recall is reported as 0 as well
# (hence the zero_division argument); the macro average f1 is simply the unweighted mean of the two
# per-class scores, which is why it lands near half of the class "0" f1. For a better
# representation of predictive performance on a two-class test dataset, we turn to stratified cross
# validation of the training dataset.
# %%
# validation function
def model_validation(clf, X, y, format = 'report'):
    '''
    Stratified cross validation function, outputs either classification_report or f1_score.

    Arguments:
        clf (object): classifier, supports .fit / .predict
        X (array): feature array, n x d
        y (array): label array, n x 1
        format (string): output format, 'report' or 'score'

    Returns:
        result (string / float): validation metric, depends on format variable
    '''
    preds = []
    tests = []
    skf = StratifiedKFold(shuffle = True)
    for train_ind, test_ind in skf.split(X, y):
        train_X, test_X = X.values[train_ind], X.values[test_ind]
        train_y, test_y = y.values[train_ind], y.values[test_ind]
        clf.fit(train_X, train_y)
        pred_y = clf.predict(test_X)
        preds.append(pred_y)
        tests.append(test_y)
    if format == 'report':
        result = classification_report(np.hstack(tests), np.hstack(preds), digits = 4)
    else:
        result = f1_score(np.hstack(tests), np.hstack(preds), average = 'macro')
    return result
# %%
# bernoulli validation
printmd('**Classification Report**')
bnb = BernoulliNB()
print(model_validation(bnb, X_train, y_train))
# %% [md]
# Now that the naive Bayes model is predicting on instances from both classes, we get a more
# realistic look at predictive performance. Our cross-validated model's precision for class "1"
# does leave something to be desired, something we hope to improve with another model.
#
# It was mentioned above that the heavy class imbalance would impact models sensitive to it.
# BernoulliNB is a probabilistic, binary classification model (in this specific instance) and is
# therefore mostly unaffected. Any deterministic models we attempt to implement will need to take
# steps to address the imbalance.
# %% [md]
# ### 2. BalancedRandomForestClassifier (brf)
#
# Enter "imbalanced-learn" (also "imblearn"), a Python library built for the explicit purpose
# of dealing with class imbalanced datasets. Touted for it's high degree of compatibility with
# scikit-learn, "imbalanced-learn" contains a hodge podge of resampling and
# sklearn-wrapped methods tailor-built for handling unbalanced datasets.
#
# In contrast to sklearn's RandomForestClassifier, brf includes parameters to control random
# undersampling. Each weak learner in the bagged model is trained on an undersampled (with
# replacement) subset of the original dataset, albeit with a more favorable / equal balance of
# classes.
# %%
# parameter space
params = {
    'n_estimators': hp.quniform('n_estimators', 20, 80, 1),
    'max_depth': hp.quniform('max_depth', 1, 15, 1),
    'min_samples_split': hp.quniform('min_samples_split', 2, 50, 1),
    'max_features': hp.quniform('max_features', 1, 23, 1),
    'sampling_strategy': hp.choice(
        'sampling_strategy',
        ['majority', 'not minority', 'not majority', 'all']
    ),
    'replacement': hp.choice('replacement', [True, False]),
    'class_weight': hp.choice('class_weight', [None, 'balanced', 'balanced_subsample'])
}
# %%
# optimization function
def objective(params):
    keys = list(params.keys())
    floats = ['n_estimators', 'max_depth', 'min_samples_split', 'max_features']
    params = {k: (int(params[k]) if k in floats else params[k]) for k in keys}
    brf = BalancedRandomForestClassifier(**params, n_jobs = -1, random_state = 42)
    score = model_validation(brf, X_train, y_train, 'score')
    return {'loss': -score, 'status': STATUS_OK}
# %% [md]
# To eke out as much predictive performance as possible, we will be relying on hyperopt for some
# hyperparameter tuning. The fmin function below will test combinations of hyperparameters drawn
# from the pre-defined parameter space, using the outcomes of earlier trials to guide it toward an
# optimal set. Compared to grid search and random search, this translates to less time spent
# testing combinations with low chances of improving performance.
#
# One quick warning: this notebook was run on a 24-core system and the cell below took 12 minutes
# and 57 seconds to run. Your execution time will scale based on how your system compares with mine.
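#
# To get a feel for what fmin will be drawing from, a single random draw can be taken from the
# parameter space directly (a quick illustration only; fmin handles the sampling itself):
# %%
# one random draw from the hyperopt search space
from hyperopt.pyll.stochastic import sample
print(sample(params))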
# %%
# tuning model
trials = Trials()
best = fmin(fn = objective, space = params, algo = tpe.suggest, max_evals = 50, trials = trials)
# map hp.choice indices and hp.quniform floats back to usable parameter values
choices = {
    'sampling_strategy': ['majority', 'not minority', 'not majority', 'all'],
    'replacement': [True, False],
    'class_weight': [None, 'balanced', 'balanced_subsample']
}
for name, options in choices.items():
    best[name] = options[best[name]]
for name in ['n_estimators', 'max_depth', 'min_samples_split', 'max_features']:
    best[name] = int(best[name])
printmd('**Best Parameters**')
print(best, '\n')
printmd('**Best Macro F1 Score**')
print(-trials.best_trial['result']['loss'])
# %% [md]
# This second model's macro f1 score marks an improvement of ~.055 over the previous one. Some
# notes about the best parameter sets seen across repeated runs of the hyperparameter tuning:
#
# 1. the chosen combination of parameters varies quite wildly from run to run
# 2. the most stable parameters are "n_estimators", "max_depth", and "min_samples_split":
#     a. "n_estimators" is usually between 40 and 60
#     b. "max_depth" is usually below 10
#     c. "min_samples_split" is usually between 15 and 25
# 3. the macro f1 score achieved is usually in the .959 to .960 range (see the quick check below)
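#
# To back up that last observation, the losses recorded in the Trials object can be inspected
# directly; here is a quick peek at the five best scores from this run:
# %%
# five best macro f1 scores across the 50 tuning trials
print(sorted((-loss for loss in trials.losses()), reverse = True)[:5])
# %% [md]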
#
# Let's generate a classification report so we can compare per class f1 scores.
# %%
# model report
brf = BalancedRandomForestClassifier(**best, n_jobs = -1, random_state = 42)
printmd('**Classification Report**')
print(model_validation(brf, X_train, y_train))
# %% [md]
# ## Closing Thoughts
# %% [md]
# Swapping models resulted in net-positive f1 score changes for both classes, though only
# marginally so for class "0". The gains for class "1" came from precision, meaning a higher share
# of correct "1" predictions among all "1" predictions; recall did suffer, but only slightly. In a
# business context this would mean fewer false positives and somewhat more false negatives,
# respectively, but overall better predictive ability for class "1". Depending on client
# requirements these models could be re-weighted, though the end result would be a trade-off
# between the two classes' predictive power; a rough sketch of one such adjustment follows below.
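#
# One way to tilt the balance between the two classes is to shift the decision threshold on the
# forest's predicted probabilities. This is a sketch only: the holdout split and the 0.3 cutoff
# below are illustrative choices, not tuned recommendations.
# %%
# illustrative threshold shift: trading class "1" precision for recall on a holdout split
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, stratify = y_train, random_state = 42)
brf_sketch = BalancedRandomForestClassifier(**best, n_jobs = -1, random_state = 42)
brf_sketch.fit(X_tr, y_tr)
proba_val = brf_sketch.predict_proba(X_val)[:, 1]
print(classification_report(y_val, (proba_val >= 0.3).astype(int), digits = 4))
# %% [md]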
#
# A further, post-project exploration would be to investigate more computationally complex models;
# XGBoost, LightGBM, and CatBoost come to mind.