-
Notifications
You must be signed in to change notification settings - Fork 11
/
datetime-extraction.qmd
192 lines (145 loc) · 6.53 KB
/
datetime-extraction.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
---
pagetitle: "Feature Engineering A-Z | Datetime Value Extraction"
---
# Value Extraction {#sec-datetime-extraction}
::: {style="visibility: hidden; height: 0px;"}
## Value Extraction
:::
Sometimes date variables come in many data sets and problems.
The presence of a date time variable doesn't mean that we are working with a time series data set.
If you are, then read more about how to handle that in the [Time Series section](time-series).
All the extract methods are not learned, as there isn't anything to estimate.
Dates and date times are typically stored as integers.
With dates represented as the number of days since 1970-01-01, with negative values for earlier dates.
with date times representing the number of seconds since 1970-01-01.
This means that if you are lucky and the model you were trying to fit knew to convert dates and date times into integers, that "time since 1970-01-01" would be a helpful predictor.
If not then we need to do some more advanced work.
We have assumed for this chapter that we already have the date variables in the right format.
This would typically be
``` r
YYYY‐MM‐DD
```
for dates, and
``` r
YYYY‐MM‐DD HH:MM:SS
```
for date times.
::: callout-note
There are other formats, and it isn't too big of a deal, as long as the libraries you are using can handle them.
:::
There is also the wrinkle concerning time zones, leap years and leap seconds.
Many rules can mess you up.
It is for this reason that we recommend that you use a trusted datetime library to do the following calculations.
::: callout-caution
# TODO
find a good reference to messed up time, for the above paragraph
:::
Most libraries allow us to pull out standard measurements like
- year
- month
- day
- hour
- minute
- second
but we can include a couple more.
`quarter`, `semester`, `week`.
These are highly related to the above list, they are just in a different format.
`season` is another possibly nice feature, but we have to be careful as the seasons flip depending on where on the globe we are.
In our above list of features, each of them counts up from 1, until we reach the level level.
So minutes stop at 60 and start over.
This might not be what we want, so we can include finer detail by extracting `seconds in day`, `days in week`, and `days in year`.
It will be up to you to figure out if these are useful for you.
Generally, these become useful in the broader sense.
`seconds in day` is more finely grained than `hour`, so if you want to figure out the time of day, then the smaller measurements are helpful.
Likewise, some of these methods are better expressed as decimals, as periods such as `month` have different lengths, and will thus be different.
Talking about being `0.9` through the month will be more precise.
From these we can also do things like `weekends`, `weekdays`, `morning` and holidays such as Christmas and Easter.
Remember that there are libraries to extract these for you.
Some of the measurements here can be extracted as categorical rather than numeric, things like `day of week` can either be extracted as `1, 2, 3` or `Monday, Tuesday, Wednesday` or `month` that can be extracted.
We can think of the numeric extract as being an automatic [Label Encoding](categorical-label) of the categorical version.
Sometimes it will be worth extracting the categorical version and using it directly or embedding it with other methods in the [Categorical section](categorical).
It is worth noting that the categorical version of these variables is ordinal.
::: callout-note
Most of the advice here tracks to sub-second measurements too if that applies to your problem.
:::
The talk in this chapter is very Euro and US-focused.
Many cultures around the world divide the "year" and "day" up differently.
Always use the conventions that are most appropriate to the culture you are working with.
## Pros and Cons
### Pros
- Fast and easy computations
- Can provide good results
### Cons
- The numerical features generated are all increasing with time linearly
- There are a lot of extractions, and they correlate quite a bit
## R Examples
We will be using the `hotel_bookings` data set for these examples.
```{r}
#| label: hotel_bookings
#| echo: false #|
#| message: false
library(tidymodels)
hotel_bookings <- readr::read_csv("data/hotel_bookings.csv")
```
```{r}
#| label: reservations_status_date
library(recipes)
hotel_bookings |>
select(reservation_status_date)
```
{recipes} provide two steps for date time extraction.
`step_date()` handles dates, and `step_time()` handles the sub-day time features.
The steps work the same way, so we will only show how `step_date()` works here.
A couple of features are selected by default,
```{r}
#| label: step_date
date_rec <- recipe(is_canceled ~ reservation_status_date,
data = hotel_bookings) |>
step_date(reservation_status_date)
date_rec |>
prep() |>
bake(new_data = NULL) |>
glimpse()
```
But you can use the `features` argument to specify other types as well
```{r}
#| label: step_date-features
date_rec <- recipe(is_canceled ~ reservation_status_date,
data = hotel_bookings) |>
step_date(reservation_status_date,
features = c( "year", "doy", "week", "decimal", "semester",
"quarter", "dow", "month"))
date_rec |>
prep() |>
bake(new_data = NULL) |>
glimpse()
```
features that can be categorical will be so by default, but can be turned off by setting `label = FALSE`.
```{r}
#| label: step_date-label
date_rec <- recipe(is_canceled ~ reservation_status_date,
data = hotel_bookings) |>
step_date(reservation_status_date,
features = c( "year", "doy", "week", "decimal", "semester",
"quarter", "dow", "month"), label = FALSE)
date_rec |>
prep() |>
bake(new_data = NULL) |>
glimpse()
```
If we want to extract holiday features, we can use the `step_holiday()` function, which uses the {timeDate} library.
With known holidays listed in `timeDate::listHolidays()`.
```{r}
#| label: step_holiday
date_rec <- recipe(is_canceled ~ reservation_status_date,
data = hotel_bookings) |>
step_holiday(reservation_status_date,
holidays = c("BoxingDay", "CAFamilyDay", "JPConstitutionDay"))
date_rec |>
prep() |>
bake(new_data = NULL) |>
glimpse()
```
## Python Examples
I'm not aware of a good way to do this in a scikit-learn way.
Please file an [issue on github](https://github.com/EmilHvitfeldt/feature-engineering-az/issues/new?assignees=&labels=&projects=&template=general-issue.md&title) if you know of a good way.