Sometimes date variables come in many data sets and problems. The presence of a date time variable doesnβt mean that we are working with a time series data set. If you are, then read more about how to handle that in Chapter 112. All the extract methods are not learned, as there isnβt anything to estimate.
Dates and date times are typically stored as integers. With dates represented as the number of days since 1970-01-01, with negative values for earlier dates. with date times representing the number of seconds since 1970-01-01.
This means that if you are lucky and the model you were trying to fit knew to convert dates and date times into integers, that βtime since 1970-01-01β would be a helpful predictor. If not then we need to do some more advanced work.
We have assumed for this chapter that we already have the date variables in the right format. This would typically be
YYYYβMMβDD
for dates, and
YYYYβMMβDD HH:MM:SS
for date times.
Note
There are other formats, and it isnβt too big of a deal, as long as the libraries you are using can handle them.
There is also the wrinkle concerning time zones, leap years and leap seconds. Many rules can mess you up. It is for this reason that we recommend that you use a trusted datetime library to do the following calculations.
TODO
find a good reference to messed up time, for the above paragraph
Most libraries allow us to pull out standard measurements like
year
month
day
hour
minute
second
but we can include a couple more. quarter, semester, week. These are highly related to the above list, they are just in a different format. season is another possibly nice feature, but we have to be careful as the seasons flip depending on where on the globe we are.
In our above list of features, each of them counts up from 1, until we reach the level level. So minutes stop at 60 and start over. This might not be what we want, so we can include finer detail by extracting seconds in day, days in week, and days in year. It will be up to you to figure out if these are useful for you. Generally, these become useful in the broader sense. seconds in day is more finely grained than hour, so if you want to figure out the time of day, then the smaller measurements are helpful. Likewise, some of these methods are better expressed as decimals, as periods such as month have different lengths, and will thus be different. Talking about being 0.9 through the month will be more precise.
From these we can also do things like weekends, weekdays, morning and holidays such as Christmas and Easter. Remember that there are libraries to extract these for you.
Some of the measurements here can be extracted as categorical rather than numeric, things like day of week can either be extracted as 1, 2, 3 or Monday, Tuesday, Wednesday or month that can be extracted. We can think of the numeric extract as being an automatic label encoding Chapter 19 of the categorical version. Sometimes it will be worth extracting the categorical version and using it directly or embedding it with other methods in Chapter 15. It is worth noting that the categorical version of these variables is ordinal.
Note
Most of the advice here tracks to sub-second measurements too if that applies to your problem.
The talk in this chapter is very Euro and US-focused. Many cultures around the world divide the βyearβ and βdayβ up differently. Always use the conventions that are most appropriate to the culture you are working with.
39.2 Pros and Cons
39.2.1 Pros
Fast and easy computations
Can provide good results
39.2.2 Cons
The numerical features generated are all increasing with time linearly
There are a lot of extractions, and they correlate quite a bit
39.3 R Examples
We will be using the hotel_bookings data set for these examples.
{recipes} provide two steps for date time extraction. step_date() handles dates, and step_time() handles the sub-day time features. The steps work the same way, so we will only show how step_date() works here. A couple of features are selected by default,
If we want to extract holiday features, we can use the step_holiday() function, which uses the {timeDate} library. With known holidays listed in timeDate::listHolidays().