45  Missing Values Indicators

While imputation can be useful, as we saw in the Simple Imputation and Model Based Imputation chapters, it isn’t always enough by itself to extract all the information. As described in the Missing section, missing values come in different variants, and depending on the variant, imputation might not capture everything. Suppose you are working with non-MCAR data (data that is not Missing Completely At Random). Then some mechanism determines when missing values occur. This mechanism might be known or unknown. From a predictive standpoint, whether it is known doesn’t matter as much; what matters is whether the mechanism is related to the outcome.

This is where missing value indicators come in. Used in combination with imputation, missing value indicators will try to capture that signal. For each chosen variable, create another Boolean variable that is 1 when a missing value is seen, and 0 otherwise.
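To make the idea concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the variable names, the 30% missing rate, and the choice to shift the outcome by 2 whenever the predictor is missing. It compares how strongly the median-imputed predictor and its missing value indicator correlate with the outcome.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical setup: whether x is missing does not depend on x itself,
# but the outcome y is shifted upwards whenever x is missing.
x = rng.normal(size=n)
is_missing = rng.random(n) < 0.3
y = 2.0 * is_missing + rng.normal(size=n)

df = pd.DataFrame({"x": np.where(is_missing, np.nan, x), "y": y})

# Create the indicator first, then impute. The order matters, because
# imputation removes the NaN values the indicator is built from.
df["x_na"] = df["x"].isna().astype(int)
df["x"] = df["x"].fillna(df["x"].median())

corr_imputed = df["x"].corr(df["y"])
corr_indicator = df["x_na"].corr(df["y"])
print(f"imputed x vs y: {corr_imputed:.2f}")
print(f"indicator vs y: {corr_indicator:.2f}")
```

In this simulated setting, the imputed column alone carries almost none of the signal, while the indicator recovers it. Real data will rarely be this clean.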

Consider the following sample data set:

# A tibble: 5 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     6     3
2     4    NA     3
3     0    NA    NA
4    NA     3     5
5     5    NA     3

It will look like the data set below once missing value indicators have been added.

# A tibble: 5 × 6
      a     b     c  a_na  b_na  c_na
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     6     3     0     0     0
2     4    NA     3     0     1     0
3     0    NA    NA     0     1     1
4    NA     3     5     1     0     0
5     5    NA     3     0     1     0
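The same transformation can be sketched in a few lines of Python with pandas. This is an illustrative sketch rather than the code used to produce the tables above; the `_na` suffix is chosen to match the column names shown there.

```python
import numpy as np
import pandas as pd

# The sample data set from above, with NaN standing in for NA.
df = pd.DataFrame({
    "a": [1, 4, 0, np.nan, 5],
    "b": [6, np.nan, np.nan, 3, np.nan],
    "c": [3, 3, np.nan, 5, 3],
})

# One 0/1 indicator per column: 1 where the value is missing, 0 otherwise.
indicators = df.isna().astype(int).add_suffix("_na")

# Keep the original columns and append the indicators.
with_indicators = pd.concat([df, indicators], axis=1)
print(with_indicators)
```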

At this point, the indicators are potentially adding information; when the missingness is unrelated to the outcome, they are instead adding noise. This noise can be filtered out by other methods seen in this book. If indicators are created for variables with no missing data, they become zero variance predictors, which we can deal with as seen in the Zero Variance chapter.
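As a rough sketch of the zero variance case, the Python snippet below builds indicators for a small hypothetical data set in which column `a` has no missing values, then drops any indicator that only ever takes one value. The data and the `nunique()` check are assumptions for illustration, not the approach from the Zero Variance chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" is fully observed, "b" has missing values.
df = pd.DataFrame({
    "a": [1.0, 4.0, 0.0, 2.0, 5.0],
    "b": [6.0, np.nan, np.nan, 3.0, np.nan],
})

indicators = df.isna().astype(int).add_suffix("_na")

# An indicator with a single unique value has zero variance
# and carries no information, so it can be dropped.
constant = [col for col in indicators.columns if indicators[col].nunique() <= 1]
indicators = indicators.drop(columns=constant)

print(constant)
print(list(indicators.columns))
```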

45.2 Pros and Cons

45.2.1 Pros

  • No performance harm when added to variables with no missing data
  • Simple and interpretable

45.2.2 Cons

  • Will produce zero variance columns when used on data with no missing values
  • Can create a sizable increase in data set size

45.3 R Examples

TODO: find a better data set

From the recipes package, we can use the step_indicate_na() function to create indicator variables based on missing data. Since mtcars has no missing values, every indicator in the output is 0.

library(recipes)

na_ind_rec <- recipe(mpg ~ disp + vs + am, data = mtcars) |>
  step_indicate_na(all_predictors()) |>
  prep()
na_ind_rec |>
  bake(new_data = mtcars)
# A tibble: 32 × 7
    disp    vs    am   mpg na_ind_disp na_ind_vs na_ind_am
   <dbl> <dbl> <dbl> <dbl>       <int>     <int>     <int>
 1  160      0     1  21             0         0         0
 2  160      0     1  21             0         0         0
 3  108      1     1  22.8           0         0         0
 4  258      1     0  21.4           0         0         0
 5  360      0     0  18.7           0         0         0
 6  225      1     0  18.1           0         0         0
 7  360      0     0  14.3           0         0         0
 8  147.     1     0  24.4           0         0         0
 9  141.     1     0  22.8           0         0         0
10  168.     1     0  19.2           0         0         0
# ℹ 22 more rows

45.4 Python Examples

We are using the ames data set for examples. {sklearn} provides the MissingIndicator() transformer, which we can use.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.impute import MissingIndicator

ct = ColumnTransformer(
    [('na_indicator', MissingIndicator(), ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF', 'Mas_Vnr_Area'])],
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('na_indicator', MissingIndicator(),
                                 ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF',
                                  'Mas_Vnr_Area'])])
ct.transform(ames)
                   remainder__MS_SubClass  ... remainder__Latitude
0     One_Story_1946_and_Newer_All_Styles  ...              42.054
1     One_Story_1946_and_Newer_All_Styles  ...              42.053
2     One_Story_1946_and_Newer_All_Styles  ...              42.053
3     One_Story_1946_and_Newer_All_Styles  ...              42.051
4                Two_Story_1946_and_Newer  ...              42.061
...                                   ...  ...                 ...
2925                  Split_or_Multilevel  ...              41.989
2926  One_Story_1946_and_Newer_All_Styles  ...              41.988
2927                          Split_Foyer  ...              41.987
2928  One_Story_1946_and_Newer_All_Styles  ...              41.991
2929             Two_Story_1946_and_Newer  ...              41.989

[2930 rows x 70 columns]
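One detail worth keeping in mind: by default, MissingIndicator() uses features="missing-only", so it only creates indicator columns for features that contained missing values when it was fitted. If none of the selected columns have missing values, no indicator columns are produced. The sketch below uses a small made-up array (not the ames data) to show the difference between the default and features="all", and also shows the add_indicator argument of SimpleImputer(), which combines imputation and indicators in one step.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

# Made-up data: the first column is complete, the second has missing values.
X = np.array([
    [1.0, 6.0],
    [4.0, np.nan],
    [0.0, np.nan],
])

# Default: only features with missing values at fit time get an indicator.
default_ind = MissingIndicator().fit_transform(X)

# features="all": every feature gets an indicator, even complete ones.
all_ind = MissingIndicator(features="all").fit_transform(X)

# Median imputation with indicator columns appended to the imputed data.
imputed = SimpleImputer(strategy="median", add_indicator=True).fit_transform(X)

print(default_ind.shape)
print(all_ind.shape)
print(imputed)
```

With features="all", complete columns produce the zero variance indicators discussed earlier, so the default is usually the more economical choice.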