17  Unseen Levels

Categorical variables can take many values, and we have various methods for dealing with them regardless of which values they take. One problem you will eventually run into is applying a trained preprocessor to data that contains levels of a categorical variable you haven’t seen before. This can happen when you are fitting your model using resampled data, when you are applying your model to the testing data set, or, if you are unlucky, at some future time in production.

TODO

add diagram here

You need to think about this problem because some methods and models will complain, or even error, if you give them unseen levels. Some implementations let you deal with this at the method level. Other methods, such as hashing encoding, don’t care at all that you have unseen levels in your data.
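To make the point about hashing encoding concrete, here is a minimal sketch in plain Python. The function name `hash_encode`, the bucket count of 8, and the use of MD5 are illustrative assumptions, not the API of any particular library. A level is mapped to a bucket purely from its own text, so no list of training levels is ever stored or consulted.

```python
import hashlib


def hash_encode(level: str, n_buckets: int = 8) -> list[int]:
    """Return a one-hot vector over n_buckets, chosen by a stable hash.

    hashlib is used instead of the built-in hash(), because hash() on
    strings is randomized between Python processes by default.
    """
    digest = hashlib.md5(level.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets
    vector = [0] * n_buckets
    vector[bucket] = 1
    return vector


# A level seen during training and a level that only shows up later
# go through exactly the same code path.
seen = hash_encode("AA")
unseen = hash_encode("HA")
```

The price for this convenience is that two different levels can land in the same bucket, whether or not either of them was present during training.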

One surefire way to deal with this issue is to add a step to your data preprocessing pipeline that turns any unseen levels into "unseen". In practice, this method looks at your categorical variables during training, takes note of all the levels it sees, and saves them. Then, any time the preprocessing is applied to new data, it looks at the levels again, and if it sees a level it hasn’t seen before, it labels it "unseen" (or any other meaningful label that doesn’t conflict with the data). This way, any future levels are already accounted for.
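The procedure described above can be sketched in a few lines of plain Python. The class name `NovelLevelRecoder` and its `fit()`/`transform()` methods are hypothetical names chosen for this illustration; they are not taken from any library.

```python
class NovelLevelRecoder:
    """Remember the levels seen at fit time; relabel anything else."""

    def __init__(self, novel_label: str = "unseen"):
        self.novel_label = novel_label
        self.levels_ = None

    def fit(self, values):
        levels = set(values)
        # The replacement label must not collide with a real level.
        if self.novel_label in levels:
            raise ValueError(
                f"label {self.novel_label!r} already occurs in the data"
            )
        self.levels_ = levels
        return self

    def transform(self, values):
        if self.levels_ is None:
            raise RuntimeError("call fit() before transform()")
        # Known levels pass through; everything else gets the novel label.
        return [v if v in self.levels_ else self.novel_label for v in values]


train = ["AA", "B6", "DL", "AA"]
test = ["AA", "HA", "DL", "AS"]

recoder = NovelLevelRecoder().fit(train)
recoded = recoder.transform(test)
print(recoded)  # ['AA', 'unseen', 'DL', 'unseen']
```

Any encoder applied after this step only ever sees the training levels plus the one reserved label, so it has a known, fixed set of levels to work with.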

Note

The above method will only work if the programming language you are modeling with has a factor-like data class.

TODO

add diagram here

17.2 R Examples

We will be using the nycflights13 data set. We downsample it to only the first day of the year and then do a train-test split.

library(recipes)
library(rsample)
library(nycflights13)

flights <- flights |>
  filter(year == 2013, month == 1, day == 1)

set.seed(13630)
flights_split <- initial_split(flights)
flights_train <- training(flights_split)
flights_test <- testing(flights_split)

Now we are committing the cardinal sin of looking at the testing data. In this case it is okay, because we are doing it for educational purposes.

flights_train |> pull(carrier) |> unique() |> sort()
 [1] "9E" "AA" "B6" "DL" "EV" "F9" "FL" "MQ" "UA" "US" "VX" "WN"
flights_test |> pull(carrier) |> unique() |> sort()
 [1] "9E" "AA" "AS" "B6" "DL" "EV" "FL" "HA" "MQ" "UA" "US" "VX" "WN"

Notice that the testing data includes the carriers "AS" and "HA", which never appear in the training data. Let us see what happens if we calculate dummy variables without making any adjustment.

dummy_spec <- recipe(arr_delay ~ carrier, data = flights_train) |>
  step_dummy(carrier)

dummy_spec_prepped <- prep(dummy_spec)

bake(dummy_spec_prepped, new_data = flights_test)
# A tibble: 211 × 12
   arr_delay carrier_AA carrier_B6 carrier_DL carrier_EV carrier_F9 carrier_FL
       <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1        12          0          0          0          0          0          0
 2         8          1          0          0          0          0          0
 3       -14          0          0          0          0          0          0
 4        -6          0          1          0          0          0          0
 5        -3          1          0          0          0          0          0
 6       -33          0          0          1          0          0          0
 7        -7          1          0          0          0          0          0
 8         5          0          1          0          0          0          0
 9        31          1          0          0          0          0          0
10       -10         NA         NA         NA         NA         NA         NA
# ℹ 201 more rows
# ℹ 5 more variables: carrier_MQ <dbl>, carrier_UA <dbl>, carrier_US <dbl>,
#   carrier_VX <dbl>, carrier_WN <dbl>

We get a warning, and the affected rows, such as row 10 above, are filled with NAs. Let us now use step_novel(), which implements the method described above.

novel_spec <- recipe(arr_delay ~ carrier, data = flights_train) |>
  step_novel(carrier) |>
  step_dummy(carrier) 

novel_spec_prepped <- prep(novel_spec)

bake(novel_spec_prepped, new_data = flights_test)
# A tibble: 211 × 13
   arr_delay carrier_AA carrier_B6 carrier_DL carrier_EV carrier_F9 carrier_FL
       <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1        12          0          0          0          0          0          0
 2         8          1          0          0          0          0          0
 3       -14          0          0          0          0          0          0
 4        -6          0          1          0          0          0          0
 5        -3          1          0          0          0          0          0
 6       -33          0          0          1          0          0          0
 7        -7          1          0          0          0          0          0
 8         5          0          1          0          0          0          0
 9        31          1          0          0          0          0          0
10       -10          0          0          0          0          0          0
# ℹ 201 more rows
# ℹ 6 more variables: carrier_MQ <dbl>, carrier_UA <dbl>, carrier_US <dbl>,
#   carrier_VX <dbl>, carrier_WN <dbl>, carrier_new <dbl>

This time there are no NAs: row 10 is now fully encoded, and the output has an extra carrier_new column that collects the unseen carriers.

17.3 Python Examples

I’m not aware of a good way to do this in scikit-learn. Please file an issue on GitHub if you know of one.

See https://github.com/EmilHvitfeldt/feature-engineering-az/issues/40 for progress.
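Independent of scikit-learn, the underlying idea is small enough to sketch in plain Python. The following is only an illustration under stated assumptions, not a recommended or library-backed solution: `learn_levels()` and `to_dummies()` are hypothetical helper names, and the reserved label "new" is chosen to mirror the carrier_new column in the R output above.

```python
def learn_levels(values, novel_label="new"):
    """Return the sorted training levels plus one reserved slot."""
    levels = sorted(set(values))
    if novel_label in levels:
        raise ValueError(f"label {novel_label!r} already occurs in the data")
    return levels + [novel_label]


def to_dummies(values, levels, novel_label="new"):
    """Build one row of 0/1 indicators per value, ordered like `levels`."""
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        # Levels not learned during training fall into the reserved column.
        row[index.get(value, index[novel_label])] = 1
        rows.append(row)
    return rows


train_carriers = ["9E", "AA", "B6", "AA"]
test_carriers = ["AA", "HA", "B6"]

levels = learn_levels(train_carriers)  # ['9E', 'AA', 'B6', 'new']
dummies = to_dummies(test_carriers, levels)
# "HA" was never seen in training, so it lands in the last ("new") column:
# [[0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
```

Unlike the step_dummy() output above, this sketch keeps an indicator column for every level and does not drop a reference level.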