39  Value Extraction

Sometimes date variables come in many data sets and problems. The presence of a date time variable doesn’t mean that we are working with a time series data set. If you are, then read more about how to handle that in Chapter 112. All the extract methods are not learned, as there isn’t anything to estimate.

Dates and date times are typically stored as integers. With dates represented as the number of days since 1970-01-01, with negative values for earlier dates. with date times representing the number of seconds since 1970-01-01.

This means that if you are lucky and the model you were trying to fit knew to convert dates and date times into integers, that β€œtime since 1970-01-01” would be a helpful predictor. If not then we need to do some more advanced work.

We have assumed for this chapter that we already have the date variables in the right format. This would typically be

YYYY‐MM‐DD

for dates, and

YYYY‐MM‐DD HH:MM:SS

for date times.

Note

There are other formats, and it isn’t too big of a deal, as long as the libraries you are using can handle them.

There is also the wrinkle concerning time zones, leap years and leap seconds. Many rules can mess you up. It is for this reason that we recommend that you use a trusted datetime library to do the following calculations.

TODO

find a good reference to messed up time, for the above paragraph

Most libraries allow us to pull out standard measurements like

but we can include a couple more. quarter, semester, week. These are highly related to the above list, they are just in a different format. season is another possibly nice feature, but we have to be careful as the seasons flip depending on where on the globe we are.

In our above list of features, each of them counts up from 1, until we reach the level level. So minutes stop at 60 and start over. This might not be what we want, so we can include finer detail by extracting seconds in day, days in week, and days in year. It will be up to you to figure out if these are useful for you. Generally, these become useful in the broader sense. seconds in day is more finely grained than hour, so if you want to figure out the time of day, then the smaller measurements are helpful. Likewise, some of these methods are better expressed as decimals, as periods such as month have different lengths, and will thus be different. Talking about being 0.9 through the month will be more precise.

From these we can also do things like weekends, weekdays, morning and holidays such as Christmas and Easter. Remember that there are libraries to extract these for you.

Some of the measurements here can be extracted as categorical rather than numeric, things like day of week can either be extracted as 1, 2, 3 or Monday, Tuesday, Wednesday or month that can be extracted. We can think of the numeric extract as being an automatic label encoding Chapter 19 of the categorical version. Sometimes it will be worth extracting the categorical version and using it directly or embedding it with other methods in Chapter 15. It is worth noting that the categorical version of these variables is ordinal.

Note

Most of the advice here tracks to sub-second measurements too if that applies to your problem.

The talk in this chapter is very Euro and US-focused. Many cultures around the world divide the β€œyear” and β€œday” up differently. Always use the conventions that are most appropriate to the culture you are working with.

39.2 Pros and Cons

39.2.1 Pros

  • Fast and easy computations
  • Can provide good results

39.2.2 Cons

  • The numerical features generated are all increasing with time linearly
  • There are a lot of extractions, and they correlate quite a bit

39.3 R Examples

We will be using the hotel_bookings data set for these examples.

library(recipes)

hotel_bookings |>
  select(reservation_status_date)
# A tibble: 119,390 Γ— 1
   reservation_status_date
   <date>                 
 1 2015-07-01             
 2 2015-07-01             
 3 2015-07-02             
 4 2015-07-02             
 5 2015-07-03             
 6 2015-07-03             
 7 2015-07-03             
 8 2015-07-03             
 9 2015-05-06             
10 2015-04-22             
# β„Ή 119,380 more rows

{recipes} provide two steps for date time extraction. step_date() handles dates, and step_time() handles the sub-day time features. The steps work the same way, so we will only show how step_date() works here. A couple of features are selected by default,

date_rec <- recipe(is_canceled ~ reservation_status_date, 
                   data = hotel_bookings) |>
  step_date(reservation_status_date)

date_rec |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()
Rows: 119,390
Columns: 5
$ reservation_status_date       <date> 2015-07-01, 2015-07-01, 2015-07-02, 201…
$ is_canceled                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0…
$ reservation_status_date_dow   <fct> Wed, Wed, Thu, Thu, Fri, Fri, Fri, Fri, …
$ reservation_status_date_month <fct> Jul, Jul, Jul, Jul, Jul, Jul, Jul, Jul, …
$ reservation_status_date_year  <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015…

But you can use the features argument to specify other types as well

date_rec <- recipe(is_canceled ~ reservation_status_date, 
                   data = hotel_bookings) |>
  step_date(reservation_status_date, 
            features = c( "year", "doy", "week", "decimal", "semester", 
                          "quarter", "dow", "month"))

date_rec |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()
Rows: 119,390
Columns: 10
$ reservation_status_date          <date> 2015-07-01, 2015-07-01, 2015-07-02, …
$ is_canceled                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
$ reservation_status_date_year     <int> 2015, 2015, 2015, 2015, 2015, 2015, 2…
$ reservation_status_date_doy      <int> 182, 182, 183, 183, 184, 184, 184, 18…
$ reservation_status_date_week     <int> 26, 26, 27, 27, 27, 27, 27, 27, 18, 1…
$ reservation_status_date_decimal  <dbl> 2015.496, 2015.496, 2015.499, 2015.49…
$ reservation_status_date_semester <int> 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ reservation_status_date_quarter  <int> 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3…
$ reservation_status_date_dow      <fct> Wed, Wed, Thu, Thu, Fri, Fri, Fri, Fr…
$ reservation_status_date_month    <fct> Jul, Jul, Jul, Jul, Jul, Jul, Jul, Ju…

features that can be categorical will be so by default, but can be turned off by setting label = FALSE.

date_rec <- recipe(is_canceled ~ reservation_status_date, 
                   data = hotel_bookings) |>
  step_date(reservation_status_date, 
            features = c( "year", "doy", "week", "decimal", "semester", 
                          "quarter", "dow", "month"), label = FALSE)

date_rec |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()
Rows: 119,390
Columns: 10
$ reservation_status_date          <date> 2015-07-01, 2015-07-01, 2015-07-02, …
$ is_canceled                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0…
$ reservation_status_date_year     <int> 2015, 2015, 2015, 2015, 2015, 2015, 2…
$ reservation_status_date_doy      <int> 182, 182, 183, 183, 184, 184, 184, 18…
$ reservation_status_date_week     <int> 26, 26, 27, 27, 27, 27, 27, 27, 18, 1…
$ reservation_status_date_decimal  <dbl> 2015.496, 2015.496, 2015.499, 2015.49…
$ reservation_status_date_semester <int> 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ reservation_status_date_quarter  <int> 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3…
$ reservation_status_date_dow      <int> 4, 4, 5, 5, 6, 6, 6, 6, 4, 4, 3, 1, 1…
$ reservation_status_date_month    <int> 7, 7, 7, 7, 7, 7, 7, 7, 5, 4, 6, 7, 7…

If we want to extract holiday features, we can use the step_holiday() function, which uses the {timeDate} library. With known holidays listed in timeDate::listHolidays().

date_rec <- recipe(is_canceled ~ reservation_status_date, 
                   data = hotel_bookings) |>
  step_holiday(reservation_status_date, 
               holidays = c("BoxingDay", "CAFamilyDay", "JPConstitutionDay"))

date_rec |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()
Rows: 119,390
Columns: 5
$ reservation_status_date                   <date> 2015-07-01, 2015-07-01, 201…
$ is_canceled                               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
$ reservation_status_date_BoxingDay         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reservation_status_date_CAFamilyDay       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reservation_status_date_JPConstitutionDay <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

39.4 Python Examples

I’m not aware of a good way to do this in a scikit-learn way. Please file an issue on github if you know of a good way.