18  Dummy Encoding

When we have categorical variables that we want to turn into numerical values, one of the most common approaches is to create dummy variables. Dummy variables are variables that only take the values 0 and 1, indicating the absence or presence of each level of a categorical variable. This is best shown with an example.

Consider this short categorical variable of animals. We observe 3 unique values: β€œcat”, β€œdog”, and β€œhorse”.

[1] "dog"   "cat"   "horse" "dog"   "cat"  

With just this knowledge we can create the corresponding dummy variables. There should be 3 columns, one for each of the levels

     cat dog horse
[1,]   0   1     0
[2,]   1   0     0
[3,]   0   0     1
[4,]   0   1     0
[5,]   1   0     0

From this, we can make a couple of observations. Firstly, each of these variables has the same length as the original categorical variable. Secondly, the number of columns corresponds to the number of levels. Lastly, the values in each row sum to 1, since every row contains a single 1 and otherwise 0s. This means that even a small number of levels produces sparse data. Sparse data is data with many zeroes, meaning it takes less space to store where the non-zero values are than to store all the values. You can read more about how and when to care about sparse data in Chapter 144. For dummy variable creation, this means that depending on whether your software can handle sparse data, you might need to limit the number of levels in your categorical variables. One way to do this is to collapse levels together, which you can read about in Chapter 35.
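As a quick illustration, the dummy matrix above can be reproduced in a couple of lines. This is a sketch using pandas; any equivalent encoder would do.

```python
import pandas as pd

animals = pd.Series(["dog", "cat", "horse", "dog", "cat"])

# One 0/1 column per level; columns come out in sorted order
dummies = pd.get_dummies(animals, dtype=int)
print(dummies)
```

Note that every row sums to 1, matching the observation above.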

Dummy variable creation is a trained method. This means that during the training step, the levels of each categorical variable are saved, and then these and only these levels are used for dummy variable creation. If we assume that the above example data were used to train the preprocessor, and we pass in the values ["dog", "cat", "cat", "dog"] during future applications, we would expect the following dummy variables

     cat dog horse
[1,]   0   1     0
[2,]   1   0     0
[3,]   1   0     0
[4,]   0   1     0

The horse variable must be present too, even if it is all zeros, since the subsequent preprocessing steps and model expect the horse variable to be there. Likewise, you can run into problems if the value "duck" appears, as the preprocessor wouldn’t know what to do with it. These cases are talked about in Chapter 17.
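This trained behavior can be mimicked in pandas by fixing the categories up front, a sketch assuming the levels saved at training time were ["cat", "dog", "horse"]:

```python
import pandas as pd

# Levels saved during the training step
levels = ["cat", "dog", "horse"]

# New data seen when the preprocessor is applied
new_animals = ["dog", "cat", "cat", "dog"]

# Fixing the categories guarantees a horse column, even though
# "horse" never occurs in the new data
coded = pd.get_dummies(pd.Categorical(new_animals, categories=levels), dtype=int)
print(coded)

# An unseen value such as "duck" maps to NaN and produces an
# all-zero row, silently losing information instead of erroring
unseen = pd.get_dummies(pd.Categorical(["duck"], categories=levels), dtype=int)
print(unseen)
```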

18.2 Dummy or one-hot encoding

TODO

add diagram

The terms dummy encoding and one-hot encoding get thrown around interchangeably, but they have different and distinct meanings. One-hot encoding returns k variables when you have k different levels, as we have shown above

     cat dog horse
[1,]   0   1     0
[2,]   1   0     0
[3,]   0   0     1
[4,]   0   1     0
[5,]   1   0     0

Dummy encoding, on the other hand, returns k-1 variables, where the excluded level is typically the first one.

     dog horse
[1,]   1     0
[2,]   0     0
[3,]   0     1
[4,]   1     0
[5,]   0     0
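The k-1 matrix above can be sketched in pandas by dropping the first level:

```python
import pandas as pd

animals = pd.Series(["dog", "cat", "horse", "dog", "cat"])

# drop_first=True removes the "cat" column; an all-zero row
# now encodes "cat"
dummies = pd.get_dummies(animals, drop_first=True, dtype=int)
print(dummies)
```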

These two encodings store the same information, even though the dummy encoding has 1 less column, because we can deduce which observations are cat by finding the rows that are all zeros. The main reason to use dummy encoding over one-hot encoding is what some people call the dummy variable trap. When you use one-hot encoding, you increase the likelihood of running into a collinearity problem. In the above example, if you included an intercept in your model you would have intercept = cat + dog + horse, which is perfect collinearity and would cause some models to error as they aren’t able to handle that.

Note

An intercept is a variable that takes the value 1 for all entries.

Even if you don’t include an intercept you could still run into collinearity. Imagine that, in addition to the animal variable, you also create a one-hot encoding of a home variable taking the two values "house" and "apartment". You would get the following indicator variables

     cat dog horse house apartment
[1,]   0   1     0     0         1
[2,]   1   0     0     1         0
[3,]   0   0     1     0         1
[4,]   0   1     0     0         1
[5,]   1   0     0     1         0

And in this case, we have that house = cat + dog + horse - apartment, which again is an example of perfect collinearity. Unless you have a reason to do otherwise, I would suggest that you use dummy encoding in your models. This also leads to slightly smaller models, as each categorical variable produces 1 less variable. It is worth noting that the choice between dummy encoding and one-hot encoding does matter for some models, such as decision trees, depending on what types of rules they can use. Being able to write animal == "cat" is easier than saying animal != "dog" & animal != "horse". In practice this is unlikely to be an issue, as many tree-based models can work with categorical variables directly without the need for encoding.

18.3 Ordered factors

TODO

finish section

18.4 Contrasts

TODO

finish section

18.5 Pros and Cons

18.5.1 Pros

  • Versatile and commonly used
  • Easy interpretation
  • Will rarely lead to a decrease in performance

18.5.2 Cons

  • Does require fairly clean categorical levels
  • Can be quite memory intensive if you have many levels in your categorical variables and you are unable to use sparse representation
  • Provides a complete, but not necessarily compact set of variables

18.6 R Examples

We will be using the ames data set for these examples. The step_dummy() function allows us to perform dummy encoding and one-hot encoding.

library(recipes)
library(modeldata)
data("ames")

ames |>
  select(Sale_Price, MS_SubClass, MS_Zoning)
# A tibble: 2,930 Γ— 3
   Sale_Price MS_SubClass                         MS_Zoning               
        <int> <fct>                               <fct>                   
 1     215000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 2     105000 One_Story_1946_and_Newer_All_Styles Residential_High_Density
 3     172000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 4     244000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 5     189900 Two_Story_1946_and_Newer            Residential_Low_Density 
 6     195500 Two_Story_1946_and_Newer            Residential_Low_Density 
 7     213500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
 8     191500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
 9     236500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
10     189000 Two_Story_1946_and_Newer            Residential_Low_Density 
# β„Ή 2,920 more rows

We can take a quick look at the possible values MS_SubClass takes

ames |>
  count(MS_SubClass, sort = TRUE)
# A tibble: 16 Γ— 2
   MS_SubClass                                   n
   <fct>                                     <int>
 1 One_Story_1946_and_Newer_All_Styles        1079
 2 Two_Story_1946_and_Newer                    575
 3 One_and_Half_Story_Finished_All_Ages        287
 4 One_Story_PUD_1946_and_Newer                192
 5 One_Story_1945_and_Older                    139
 6 Two_Story_PUD_1946_and_Newer                129
 7 Two_Story_1945_and_Older                    128
 8 Split_or_Multilevel                         118
 9 Duplex_All_Styles_and_Ages                  109
10 Two_Family_conversion_All_Styles_and_Ages    61
11 Split_Foyer                                  48
12 Two_and_Half_Story_All_Ages                  23
13 One_and_Half_Story_Unfinished_All_Ages       18
14 PUD_Multilevel_Split_Level_Foyer             17
15 One_Story_with_Finished_Attic_All_Ages        6
16 One_and_Half_Story_PUD_All_Ages               1

And since MS_SubClass is a factor, we can verify that they match and that all the levels are observed

ames |> pull(MS_SubClass) |> levels()
 [1] "One_Story_1946_and_Newer_All_Styles"      
 [2] "One_Story_1945_and_Older"                 
 [3] "One_Story_with_Finished_Attic_All_Ages"   
 [4] "One_and_Half_Story_Unfinished_All_Ages"   
 [5] "One_and_Half_Story_Finished_All_Ages"     
 [6] "Two_Story_1946_and_Newer"                 
 [7] "Two_Story_1945_and_Older"                 
 [8] "Two_and_Half_Story_All_Ages"              
 [9] "Split_or_Multilevel"                      
[10] "Split_Foyer"                              
[11] "Duplex_All_Styles_and_Ages"               
[12] "One_Story_PUD_1946_and_Newer"             
[13] "One_and_Half_Story_PUD_All_Ages"          
[14] "Two_Story_PUD_1946_and_Newer"             
[15] "PUD_Multilevel_Split_Level_Foyer"         
[16] "Two_Family_conversion_All_Styles_and_Ages"

We will be using the step_dummy() step for this, which defaults to creating dummy variables

dummy_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  prep()

dummy_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
  glimpse()
Rows: 2,930
Columns: 21
$ MS_SubClass_One_Story_1945_and_Older                  <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_Story_with_Finished_Attic_All_Ages    <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_Unfinished_All_Ages    <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_Finished_All_Ages      <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Story_1946_and_Newer                  <dbl> 0, 0, 0, 0, 1, 1…
$ MS_SubClass_Two_Story_1945_and_Older                  <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_and_Half_Story_All_Ages               <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Split_or_Multilevel                       <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Split_Foyer                               <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Duplex_All_Styles_and_Ages                <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_Story_PUD_1946_and_Newer              <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_PUD_All_Ages           <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Story_PUD_1946_and_Newer              <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_PUD_Multilevel_Split_Level_Foyer          <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Family_conversion_All_Styles_and_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_Residential_High_Density                    <dbl> 0, 1, 0, 0, 0, 0…
$ MS_Zoning_Residential_Low_Density                     <dbl> 1, 0, 1, 1, 1, 1…
$ MS_Zoning_Residential_Medium_Density                  <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_A_agr                                       <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_C_all                                       <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_I_all                                       <dbl> 0, 0, 0, 0, 0, 0…

We can pull the factor levels for each variable by using tidy(). If a character vector had been present in the data set, the observed values would have been recorded instead.

dummy_rec |>
  tidy(1)
# A tibble: 243 Γ— 3
   terms       columns                                id         
   <chr>       <chr>                                  <chr>      
 1 MS_SubClass One_Story_1945_and_Older               dummy_Bp5vK
 2 MS_SubClass One_Story_with_Finished_Attic_All_Ages dummy_Bp5vK
 3 MS_SubClass One_and_Half_Story_Unfinished_All_Ages dummy_Bp5vK
 4 MS_SubClass One_and_Half_Story_Finished_All_Ages   dummy_Bp5vK
 5 MS_SubClass Two_Story_1946_and_Newer               dummy_Bp5vK
 6 MS_SubClass Two_Story_1945_and_Older               dummy_Bp5vK
 7 MS_SubClass Two_and_Half_Story_All_Ages            dummy_Bp5vK
 8 MS_SubClass Split_or_Multilevel                    dummy_Bp5vK
 9 MS_SubClass Split_Foyer                            dummy_Bp5vK
10 MS_SubClass Duplex_All_Styles_and_Ages             dummy_Bp5vK
# β„Ή 233 more rows

Setting one_hot = TRUE gives us the complete one-hot encoding results.

onehot_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  prep()

onehot_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
  glimpse()
Rows: 2,930
Columns: 23
$ MS_SubClass_One_Story_1946_and_Newer_All_Styles       <dbl> 1, 1, 1, 1, 0, 0…
$ MS_SubClass_One_Story_1945_and_Older                  <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_Story_with_Finished_Attic_All_Ages    <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_Unfinished_All_Ages    <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_Finished_All_Ages      <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Story_1946_and_Newer                  <dbl> 0, 0, 0, 0, 1, 1…
$ MS_SubClass_Two_Story_1945_and_Older                  <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_and_Half_Story_All_Ages               <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Split_or_Multilevel                       <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Split_Foyer                               <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Duplex_All_Styles_and_Ages                <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_Story_PUD_1946_and_Newer              <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_PUD_All_Ages           <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Story_PUD_1946_and_Newer              <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_PUD_Multilevel_Split_Level_Foyer          <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Family_conversion_All_Styles_and_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_Floating_Village_Residential                <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_Residential_High_Density                    <dbl> 0, 1, 0, 0, 0, 0…
$ MS_Zoning_Residential_Low_Density                     <dbl> 1, 0, 1, 1, 1, 1…
$ MS_Zoning_Residential_Medium_Density                  <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_A_agr                                       <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_C_all                                       <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_I_all                                       <dbl> 0, 0, 0, 0, 0, 0…

18.7 Python Examples

We are using the ames data set for these examples. {sklearn} provides the OneHotEncoder() class, which we can use. Below we see how it can be used with the MS_Zoning column.

Note

We are setting sparse_output=False in this example because we are having transform() return pandas data frames for better printing.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(sparse_output=False), ['MS_Zoning'])], 
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('onehot', OneHotEncoder(sparse_output=False),
                                 ['MS_Zoning'])])
ct.transform(ames).filter(regex=("MS_Zoning.*"))
      onehot__MS_Zoning_A_agr  ...  onehot__MS_Zoning_Residential_Medium_Density
0                         0.0  ...                                           0.0
1                         0.0  ...                                           0.0
2                         0.0  ...                                           0.0
3                         0.0  ...                                           0.0
4                         0.0  ...                                           0.0
...                       ...  ...                                           ...
2925                      0.0  ...                                           0.0
2926                      0.0  ...                                           0.0
2927                      0.0  ...                                           0.0
2928                      0.0  ...                                           0.0
2929                      0.0  ...                                           0.0

[2930 rows x 7 columns]

By default, OneHotEncoder() performs one-hot encoding; we can change this to dummy encoding by setting drop='first'.

ct = ColumnTransformer(
    [('dummy', OneHotEncoder(sparse_output=False, drop='first'), ['MS_Zoning'])], 
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('dummy',
                                 OneHotEncoder(drop='first',
                                               sparse_output=False),
                                 ['MS_Zoning'])])
ct.transform(ames).filter(regex=("MS_Zoning.*"))
      dummy__MS_Zoning_C_all  ...  dummy__MS_Zoning_Residential_Medium_Density
0                        0.0  ...                                          0.0
1                        0.0  ...                                          0.0
2                        0.0  ...                                          0.0
3                        0.0  ...                                          0.0
4                        0.0  ...                                          0.0
...                      ...  ...                                          ...
2925                     0.0  ...                                          0.0
2926                     0.0  ...                                          0.0
2927                     0.0  ...                                          0.0
2928                     0.0  ...                                          0.0
2929                     0.0  ...                                          0.0

[2930 rows x 6 columns]