library(recipes)
library(rsample)
library(nycflights13)

flights <- flights |>
  filter(year == 2013, month == 1, day == 1)

set.seed(13630)
flights_split <- initial_split(flights)
flights_train <- training(flights_split)
flights_test <- testing(flights_split)
17 Unseen Levels
17.1 Unseen Levels
When you are dealing with categorical variables, it is understood that they can take many values, and we have various methods for dealing with them regardless of which values they take. One problem that will eventually come up is that you try to apply a trained preprocessor to data containing levels of a categorical variable that you haven't seen before. This can happen when you are fitting your model using resampled data, when you are applying your model to the testing data set, or, if you are unlucky, at some future time in production.
add diagram here
You need to think about this problem because some methods and models will complain, or even error, if you give them unseen levels. Some implementations allow you to deal with this at the method level. Other methods, such as the one covered in Chapter 24, don't care at all that you have unseen levels in your data.
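To make this concrete, here is a minimal sketch in base R on a made-up toy data set (the names toy_train, toy_new, and fit are purely illustrative). A linear model is fit on a factor with two levels and is then asked to predict for a level it has never seen. lm() is one example of a model that errors in this situation.

```r
# Toy data, invented for illustration only
toy_train <- data.frame(
  y = c(1, 2, 3, 4),
  group = factor(c("a", "b", "a", "b"))
)
toy_new <- data.frame(group = factor("c"))  # "c" was never seen in training

fit <- lm(y ~ group, data = toy_train)

# predict() stops with an error because of the new level;
# tryCatch() captures the error so the script keeps running
result <- tryCatch(
  predict(fit, newdata = toy_new),
  error = function(e) e
)

inherits(result, "error")   # did predict() fail?
conditionMessage(result)    # the message explaining why
```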
One surefire way to deal with this issue is to add a step to your data preprocessing pipeline that turns any unseen levels into "unseen". In practice, this method looks at your categorical variables during training, takes note of all the levels it sees, and saves them. Then, any time the preprocessing is applied to new data, it looks at the levels again, and if it finds a level it hasn't seen before, it labels it "unseen" (or any other meaningful label that doesn't conflict with the data). This way, any future levels are accounted for.
The above method will only work if the programming language you are modeling with has a factor-like data class.
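The bookkeeping described above can be sketched in a few lines of base R. This is only an illustration of the idea, not how any particular package implements it. The helpers learn_levels() and relabel_unseen() are hypothetical names made up for this sketch, as are the toy carrier vectors.

```r
# Hypothetical helper: record the levels seen during training
learn_levels <- function(x) {
  sort(unique(as.character(x)))
}

# Hypothetical helper: relabel anything that was not seen during training
relabel_unseen <- function(x, known_levels, label = "unseen") {
  x <- as.character(x)
  is_new <- !is.na(x) & !(x %in% known_levels)
  x[is_new] <- label
  # The factor keeps the training levels plus the extra label
  factor(x, levels = c(known_levels, label))
}

# Toy vectors, invented for illustration only
train_carrier <- c("AA", "DL", "UA", "AA")
new_carrier <- c("AA", "HA", "DL", "AS")

known <- learn_levels(train_carrier)
relabelled <- relabel_unseen(new_carrier, known)

as.character(relabelled)
levels(relabelled)
```

Because the result is a factor, the "unseen" level is part of the variable even when no observation carries it, which is why a factor-like data class is needed.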
add diagram here
17.2 R Examples
We will be using the nycflights13 data set. We are downsampling just a bit to only work on the first day and doing a test-train split.
Now we are committing the cardinal sin of looking at the testing data. In this case it is okay, because we are doing it for educational purposes.
flights_train |> pull(carrier) |> unique() |> sort()
[1] "9E" "AA" "B6" "DL" "EV" "F9" "FL" "MQ" "UA" "US" "VX" "WN"
flights_test |> pull(carrier) |> unique() |> sort()
[1] "9E" "AA" "AS" "B6" "DL" "EV" "FL" "HA" "MQ" "UA" "US" "VX" "WN"
Notice that the testing data includes the carriers "AS" and "HA", but the training data doesn't know that. Let us see what would happen if we were to calculate dummy variables without making any adjustments.
dummy_spec <- recipe(arr_delay ~ carrier, data = flights_train) |>
  step_dummy(carrier)

dummy_spec_prepped <- prep(dummy_spec)

bake(dummy_spec_prepped, new_data = flights_test)
Warning: ! There are new levels in `carrier`: "AS" and "HA".
ℹ Consider using step_novel() (`?recipes::step_novel()`) before `step_dummy()`
  to handle unseen values.
# A tibble: 211 × 12
   arr_delay carrier_AA carrier_B6 carrier_DL carrier_EV carrier_F9 carrier_FL
       <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1        12          0          0          0          0          0          0
 2         8          1          0          0          0          0          0
 3       -14          0          0          0          0          0          0
 4        -6          0          1          0          0          0          0
 5        -3          1          0          0          0          0          0
 6       -33          0          0          1          0          0          0
 7        -7          1          0          0          0          0          0
 8         5          0          1          0          0          0          0
 9        31          1          0          0          0          0          0
10       -10         NA         NA         NA         NA         NA         NA
# ℹ 201 more rows
# ℹ 5 more variables: carrier_MQ <dbl>, carrier_UA <dbl>, carrier_US <dbl>,
#   carrier_VX <dbl>, carrier_WN <dbl>
We get a warning, and if you look at the affected rows, you see that they contain NAs. Let us now use the function step_novel(), which implements the method described above.
novel_spec <- recipe(arr_delay ~ carrier, data = flights_train) |>
  step_novel(carrier) |>
  step_dummy(carrier)

novel_spec_prepped <- prep(novel_spec)

bake(novel_spec_prepped, new_data = flights_test)
# A tibble: 211 × 13
   arr_delay carrier_AA carrier_B6 carrier_DL carrier_EV carrier_F9 carrier_FL
       <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1        12          0          0          0          0          0          0
 2         8          1          0          0          0          0          0
 3       -14          0          0          0          0          0          0
 4        -6          0          1          0          0          0          0
 5        -3          1          0          0          0          0          0
 6       -33          0          0          1          0          0          0
 7        -7          1          0          0          0          0          0
 8         5          0          1          0          0          0          0
 9        31          1          0          0          0          0          0
10       -10          0          0          0          0          0          0
# ℹ 201 more rows
# ℹ 6 more variables: carrier_MQ <dbl>, carrier_UA <dbl>, carrier_US <dbl>,
#   carrier_VX <dbl>, carrier_WN <dbl>, carrier_new <dbl>
This time we get no warning and no NAs. The unseen carriers are instead captured by the additional carrier_new column.
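The label that step_novel() assigns can be changed with its new_level argument. Here is a minimal sketch on a made-up toy data set (toy_train and toy_test are invented for illustration and are not part of nycflights13), using "unseen" as the label.

```r
library(recipes)

# Toy data, invented for illustration only
toy_train <- data.frame(
  y = c(1, 2, 3, 4),
  carrier = factor(c("AA", "DL", "AA", "UA"))
)
toy_test <- data.frame(
  y = c(5, 6),
  carrier = factor(c("HA", "AA"))  # "HA" does not appear in toy_train
)

toy_spec <- recipe(y ~ carrier, data = toy_train) |>
  step_novel(carrier, new_level = "unseen") |>
  step_dummy(carrier)

baked <- bake(prep(toy_spec), new_data = toy_test)

# The dummy column for the novel level is named after the chosen label
baked$carrier_unseen
```

The only requirement on the label is the one mentioned earlier: it must not clash with a level that already exists in the data.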
17.3 Python Examples
I'm not aware of a good way to do this in scikit-learn. Please file an issue on GitHub if you know of one.