-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy path05_workflows.Rmd
166 lines (126 loc) · 5 KB
/
05_workflows.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
---
title: "05_workflows"
output: html_document
---
layout: false
class: middle, center, inverse
## 4.4 `workflows`:<br><br>Modeling Workflows
---
background-image: url(https://www.tidymodels.org/images/workflows.png)
background-position: 97.5% 2.5%
background-size: 7%
layout: true
---
## 4.4 `workflows`: Modeling Workflows
`workflows` introduces a `workflow` object which bundles the preprocessing recipe and model specification to reduce code clatter. At the same time, it acts as a single entry point to your modeling pipeline.
```{r}
cls_wf <- workflow(preprocessor = mod_recipe, spec = log_cls)
cls_wf
```
.pull-right[.pull-right[.footnote[
*Note: You may even bundle (and later explore) various different combinations of preprocessing recipes and model specifications using the `workflowsets` package.*
]]]
???
- if no preprocessing is required (because the data is already perfect) `add_formula()` could be used (but you can only ever use one of the two)
- in later releases, `workflows` should also be able to encapsulate post-processing steps (e.g., modifying the probability cutoff for binary classification; or calibration of probabilities; or determination of euqivocal zones)
---
## 4.4 `workflows`: Modeling Workflows
When calling `fit()` on a `workflow` object, `tidymodels` performs the following steps for us:
1. It fits the `recipe` object to the training set and produces the in-sample estimates (`prep()`).
2. It applies the fitted recipe to the training set to process the predictors (`bake()`).
3. It trains the specified model on the transformed training set (`fit()`/`fit_xy()`).
```{r, results='hide'}
cls_wf_fit <- cls_wf %>%
fit(train_set)
cls_wf_fit
```
```
> Output on next slide
```
???
- workflows abstract away the need for `prep` and `bake`
---
## 4.4 `workflows`: Modeling Workflows
```
> == Workflow [trained] =========================================================================
> Preprocessor: Recipe
> Model: logistic_reg()
>
> -- Preprocessor -------------------------------------------------------------------------------
> 5 Recipe Steps
> * step_medianimpute() * step_dummy()
> * step_normalize() * step_smote()
> * step_other()
>
> -- Model --------------------------------------------------------------------------------------
> Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
>
> Coefficients:
> (Intercept) year age
> -2.76543 -0.44060 0.05288
> peak_name_Cho.Oyu peak_name_Everest peak_name_Manaslu
> 0.09953 1.19280 1.41512
> peak_name_other season_Spring ...
> 1.31090 0.02033 ...
>
> Degrees of Freedom: 96469 Total (i.e. Null); 96447 Residual
> Null Deviance: 127600
> Residual Deviance: 110100 AIC: 110200
```
???
output abbreviated
---
## 4.4 `workflows`: Modeling Workflows
Again, after having fitted the workflow, we can proceed to predicting the response in the test data. When calling `predict()` on a `workflow` object, `tidymodels` performs the following steps for us:
1. It applies the fitted recipe to the test set to process the predictors (`bake()`).
2. It applies the trained model to the preprocessed test set predictors to generate predictions (`predict()`).
```{r}
cls_wf_fit %>%
predict(new_data = test_set, type = "prob") %>%
glimpse
```
.footnote[
*Note: Call `extract_fit_engine()` or `extract_recipe()` to extract the fitted model or the estimated `recipe` object from the workflow.*
]
---
name: example-no-resampling
## Tidymodels: A Complete Example
**Step 1:** Split data into training and test set using `rsample`.
```{r}
set.seed(2021)
climbers_split <- initial_split(climbers_df, prop = 0.8, strata = died)
train_set <- training(climbers_split)
test_set <- testing(climbers_split)
```
**Step 2:** Define the relevant preprocessing steps using `recipe`.
```{r}
rec <- recipe(formula = died ~ ., data = train_set) %>%
update_role(member_id, new_role = "id") %>%
step_impute_median(age) %>%
step_normalize(all_numeric_predictors()) %>%
step_other(peak_name, citizenship, expedition_role, threshold = 0.05) %>%
step_dummy(all_predictors(), -all_numeric(), one_hot = F) %>%
themis::step_upsample(died, over_ratio = 0.2, seed = 2021, skip = T)
```
**Step 3:** Specify the desired machine learning model using `parsnip`.
```{r}
log_cls <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
```
---
## Tidymodels: A Complete Example
**Step 4:** Bring everything together using `workflows`.
```{r}
cls_wf <- workflow() %>%
add_recipe(rec) %>%
add_model(log_cls)
```
**Step 5:** Train the workflow (i.e. recipe plus model) and use it for prediction.
```{r}
cls_wf_fit <- cls_wf %>%
fit(train_set)
cls_wf_fit %>%
predict(new_data = test_set, type = "prob") %>%
glimpse
```