"dog", "cat", "horse", "dog", "cat"
18 Dummy Encoding
18.1 Dummy Encoding
When we have categorical variables and want to turn them into numerical values, one of the most common approaches is to create dummy variables. Dummy variables are variables that only take the values 0 and 1 to indicate the absence or presence of each level of a categorical variable. This is best shown with an example.
Considering a short categorical variable of animals with the values "dog", "cat", "horse", "dog", "cat", we observe that there are 3 unique values: “cat”, “dog”, and “horse”.
With just this knowledge we can create the corresponding dummy variables. There should be 3 columns, one for each of the levels.
cat | dog | horse |
---|---|---|
0 | 1 | 0 |
1 | 0 | 0 |
0 | 0 | 1 |
0 | 1 | 0 |
1 | 0 | 0 |
From this, we can make a few observations. First, the length of each of these variables is equal to the length of the original categorical variable. Second, the number of columns corresponds to the number of levels. Lastly, the sum of all the values in each row equals 1, since every row contains a single 1 and the remaining values are 0s. This means that even for a small number of levels, you get sparse data. Sparse data is data where there are a lot of zeroes, meaning that it would take less space to store where the non-zero values are instead of all the values. You can read more about how and when to care about sparse data in Chapter 144. What this means for dummy variable creation is that, depending on whether your software can handle sparse data, you might need to limit the number of levels in your categorical variables. One way to do this would be to collapse levels together, which you can read about in Chapter 35.
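The observations above can be checked with a few lines of plain Python. This is only a minimal sketch: `one_hot` is a hypothetical helper written for illustration, not a function from any library, and sorting the levels alphabetically is an assumption made so that the columns match the table.

```python
# Example data from the text.
animals = ["dog", "cat", "horse", "dog", "cat"]


def one_hot(values):
    """Return the sorted levels and one 0/1 row per input value."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


levels, rows = one_hot(animals)

print(levels)  # ['cat', 'dog', 'horse']
for row in rows:
    print(row)

# Each row contains exactly one 1, so every row sums to 1.
row_sums = [sum(row) for row in rows]
print(row_sums)  # [1, 1, 1, 1, 1]
```

The number of rows equals the length of the input, and the number of columns equals the number of levels.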
Dummy variable creation is a trained method. This means that during the training step, the levels of each categorical variable are saved, and then these and only these values are used for dummy variable creation. If we assumed that the above example data were used to train the preprocessor, and we passed in the values ["dog", "cat", "cat", "dog"] during future applications, we would expect the following dummy variables:
cat | dog | horse |
---|---|---|
0 | 1 | 0 |
1 | 0 | 0 |
1 | 0 | 0 |
0 | 1 | 0 |
The horse variable must be here too, even if it contains only zeros, as the subsequent preprocessing steps and model expect the horse variable to be present. Likewise, you can run into problems if the value "duck" was used, as the preprocessor wouldn’t know what to do. These cases are discussed in Chapter 17 about unseen levels.
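The idea of a trained encoder can be sketched in plain Python. The `DummyEncoder` class below is hypothetical and exists only for illustration. Raising an error on unseen levels is one possible design choice assumed here, not the behavior of any particular library.

```python
class DummyEncoder:
    """Hypothetical trained encoder: levels are learned once in fit()."""

    def fit(self, values):
        # Save the levels seen during training, in sorted order.
        self.levels_ = sorted(set(values))
        return self

    def transform(self, values):
        # Fail loudly on levels that were not present during training.
        unseen = sorted(set(values) - set(self.levels_))
        if unseen:
            raise ValueError(f"unseen levels: {unseen}")
        return [[1 if v == level else 0 for level in self.levels_] for v in values]


encoder = DummyEncoder().fit(["dog", "cat", "horse", "dog", "cat"])

# "horse" never occurs in the new data, but its column is still produced.
new_rows = encoder.transform(["dog", "cat", "cat", "dog"])
print(new_rows)  # [[0, 1, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0]]

# "duck" was not seen during training, so this sketch refuses to encode it.
try:
    encoder.transform(["duck"])
    duck_error = None
except ValueError as err:
    duck_error = str(err)
print(duck_error)  # unseen levels: ['duck']
```

Because the columns come from the saved levels rather than from the new data, the output always has the same shape that downstream steps expect.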
18.2 Dummy or one-hot encoding
The terms dummy encoding and one-hot encoding are often used interchangeably, but they have distinct meanings. One-hot encoding returns k variables when you have k different levels, as we have shown above:
cat | dog | horse |
---|---|---|
0 | 1 | 0 |
1 | 0 | 0 |
0 | 0 | 1 |
0 | 1 | 0 |
1 | 0 | 0 |
Dummy encoding, on the other hand, returns k-1 variables, where the excluded one is typically the first one.
dog | horse |
---|---|
1 | 0 |
0 | 0 |
0 | 1 |
1 | 0 |
0 | 0 |
These two encodings store the same information, even though the dummy encoding has 1 less column, because we can deduce which observations are cat by finding the rows with all zeros. The main reason to use dummy encoding is to avoid what some people call the dummy variable trap. When you use one-hot encoding, you increase the likelihood of running into a collinearity problem. With the above example, if you included an intercept in your model, you would have intercept = cat + dog + horse, which gives perfect collinearity and would cause some models to error, as they aren’t able to handle that.
An intercept is a variable that takes the value 1 for all entries.
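Both claims, that the dropped level can be recovered and that the one-hot columns add up to the intercept, can be checked with a short plain-Python sketch. The variable names are illustrative, and dropping the first level in alphabetical order is an assumption that matches the table above.

```python
animals = ["dog", "cat", "horse", "dog", "cat"]
levels = sorted(set(animals))  # ['cat', 'dog', 'horse']

# One-hot encoding: k columns for k levels.
one_hot = [[int(a == level) for level in levels] for a in animals]

# Dummy encoding: drop the first level ("cat"), which becomes the reference.
dummy = [row[1:] for row in one_hot]
print(dummy)  # [[1, 0], [0, 0], [0, 1], [1, 0], [0, 0]]

# Rows that are all zero in the dummy encoding belong to the reference level.
recovered_cat = [all(x == 0 for x in row) for row in dummy]
print(recovered_cat)  # [False, True, False, False, True]

# With an intercept column of 1s, the one-hot columns add up to the intercept.
intercept = [1] * len(animals)
one_hot_sums = [sum(row) for row in one_hot]
print(one_hot_sums == intercept)  # True
```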
Even if you don’t include an intercept, you could still run into collinearity. Imagine that, in addition to the animal variable, you also create a one-hot encoding of the home variable, which takes the two values "house" and "apartment". You would get the following indicator variables:
cat | dog | horse | house | apartment |
---|---|---|---|---|
0 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 |
0 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 |
In this case, we have that house = cat + dog + horse - apartment, which again is an example of perfect collinearity. Unless you have a reason to do otherwise, I would suggest that you use dummy encoding in your models. Additionally, this leads to slightly smaller models, as each categorical variable produces 1 less variable. It is worth noting that the choice between dummy encoding and one-hot encoding does matter for some models, such as decision trees, depending on what types of rules they can use. Being able to write animal == "cat" is easier than saying animal != "dog" & animal != "horse". This is unlikely to be an issue in practice, as many tree-based models can work with categorical variables directly without the need for encoding.
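The linear dependency between the two one-hot encoded variables can be verified directly. The sketch below rebuilds the indicator columns from the table in plain Python. The `indicator` helper is hypothetical, and the `home` values are read off the table rows.

```python
animal = ["dog", "cat", "horse", "dog", "cat"]
home = ["apartment", "house", "apartment", "apartment", "house"]


def indicator(values, level):
    """Return a 0/1 column marking where `values` equals `level`."""
    return [int(v == level) for v in values]


cat, dog, horse = (indicator(animal, lv) for lv in ("cat", "dog", "horse"))
house, apartment = (indicator(home, lv) for lv in ("house", "apartment"))

# Right-hand side of house = cat + dog + horse - apartment, row by row.
rhs = [c + d + h - a for c, d, h, a in zip(cat, dog, horse, apartment)]
identity_holds = rhs == house
print(identity_holds)  # True
```

The identity does not depend on these particular rows: each one-hot encoded variable sums to 1 in every row, so the two sets of columns are always linearly dependent.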
18.3 Ordered factors
finish section
18.4 Contrasts
finish section
18.5 Pros and Cons
18.5.1 Pros
- Versatile and commonly used
- Easy interpretation
- Will rarely lead to a decrease in performance
18.5.2 Cons
- Does require fairly clean categorical levels
- Can be quite memory intensive if you have many levels in your categorical variables and you are unable to use sparse representation
- Provides a complete, but not necessarily compact set of variables
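To make the memory point concrete, here is a rough plain-Python sketch comparing a dense layout with a sparse one. It assumes a one-hot encoding, where each row has exactly one non-zero value, so storing the column position of that value is enough. Real sparse matrix formats are more general than this.

```python
animals = ["dog", "cat", "horse", "dog", "cat"]
levels = sorted(set(animals))

# Dense: one stored cell per (row, level) pair.
dense = [[int(a == level) for level in levels] for a in animals]
dense_cells = len(dense) * len(levels)

# Sparse: keep only the column index of the single 1 in each row.
sparse = [levels.index(a) for a in animals]

# The dense matrix can be rebuilt from the sparse form, so nothing is lost.
rebuilt = [[int(j == col) for j in range(len(levels))] for col in sparse]

print(dense_cells)       # 15
print(len(sparse))       # 5
print(rebuilt == dense)  # True
```

The dense size grows with the number of levels, while this sparse form stays at one stored value per row.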
18.6 R Examples
We will be using the ames data set for these examples. The step_dummy() function allows us to perform dummy encoding and one-hot encoding.
library(recipes)
library(modeldata)
data("ames")
ames |>
  select(Sale_Price, MS_SubClass, MS_Zoning)
# A tibble: 2,930 × 3
Sale_Price MS_SubClass MS_Zoning
<int> <fct> <fct>
1 215000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density
2 105000 One_Story_1946_and_Newer_All_Styles Residential_High_Density
3 172000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density
4 244000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density
5 189900 Two_Story_1946_and_Newer Residential_Low_Density
6 195500 Two_Story_1946_and_Newer Residential_Low_Density
7 213500 One_Story_PUD_1946_and_Newer Residential_Low_Density
8 191500 One_Story_PUD_1946_and_Newer Residential_Low_Density
9 236500 One_Story_PUD_1946_and_Newer Residential_Low_Density
10 189000 Two_Story_1946_and_Newer Residential_Low_Density
# ℹ 2,920 more rows
We can take a quick look at the possible values MS_SubClass takes:
ames |>
  count(MS_SubClass, sort = TRUE)
# A tibble: 16 × 2
MS_SubClass n
<fct> <int>
1 One_Story_1946_and_Newer_All_Styles 1079
2 Two_Story_1946_and_Newer 575
3 One_and_Half_Story_Finished_All_Ages 287
4 One_Story_PUD_1946_and_Newer 192
5 One_Story_1945_and_Older 139
6 Two_Story_PUD_1946_and_Newer 129
7 Two_Story_1945_and_Older 128
8 Split_or_Multilevel 118
9 Duplex_All_Styles_and_Ages 109
10 Two_Family_conversion_All_Styles_and_Ages 61
11 Split_Foyer 48
12 Two_and_Half_Story_All_Ages 23
13 One_and_Half_Story_Unfinished_All_Ages 18
14 PUD_Multilevel_Split_Level_Foyer 17
15 One_Story_with_Finished_Attic_All_Ages 6
16 One_and_Half_Story_PUD_All_Ages 1
And since MS_SubClass is a factor, we can verify that they match and that all the levels are observed:
ames |> pull(MS_SubClass) |> levels()
[1] "One_Story_1946_and_Newer_All_Styles"
[2] "One_Story_1945_and_Older"
[3] "One_Story_with_Finished_Attic_All_Ages"
[4] "One_and_Half_Story_Unfinished_All_Ages"
[5] "One_and_Half_Story_Finished_All_Ages"
[6] "Two_Story_1946_and_Newer"
[7] "Two_Story_1945_and_Older"
[8] "Two_and_Half_Story_All_Ages"
[9] "Split_or_Multilevel"
[10] "Split_Foyer"
[11] "Duplex_All_Styles_and_Ages"
[12] "One_Story_PUD_1946_and_Newer"
[13] "One_and_Half_Story_PUD_All_Ages"
[14] "Two_Story_PUD_1946_and_Newer"
[15] "PUD_Multilevel_Split_Level_Foyer"
[16] "Two_Family_conversion_All_Styles_and_Ages"
We will be using the step_dummy() step for this, which defaults to creating dummy variables.
dummy_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  prep()
dummy_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
  glimpse()
Rows: 2,930
Columns: 21
$ MS_SubClass_One_Story_1945_and_Older <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_Story_with_Finished_Attic_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_Unfinished_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_Finished_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Story_1946_and_Newer <dbl> 0, 0, 0, 0, 1, 1…
$ MS_SubClass_Two_Story_1945_and_Older <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_and_Half_Story_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Split_or_Multilevel <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Split_Foyer <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Duplex_All_Styles_and_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_Story_PUD_1946_and_Newer <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_PUD_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Story_PUD_1946_and_Newer <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_PUD_Multilevel_Split_Level_Foyer <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Family_conversion_All_Styles_and_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_Residential_High_Density <dbl> 0, 1, 0, 0, 0, 0…
$ MS_Zoning_Residential_Low_Density <dbl> 1, 0, 1, 1, 1, 1…
$ MS_Zoning_Residential_Medium_Density <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_A_agr <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_C_all <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_I_all <dbl> 0, 0, 0, 0, 0, 0…
We can pull the factor levels for each variable by using tidy(). If a character vector was present in the data set, it would record the observed values.
dummy_rec |>
  tidy(1)
# A tibble: 243 × 3
terms columns id
<chr> <chr> <chr>
1 MS_SubClass One_Story_1945_and_Older dummy_Bp5vK
2 MS_SubClass One_Story_with_Finished_Attic_All_Ages dummy_Bp5vK
3 MS_SubClass One_and_Half_Story_Unfinished_All_Ages dummy_Bp5vK
4 MS_SubClass One_and_Half_Story_Finished_All_Ages dummy_Bp5vK
5 MS_SubClass Two_Story_1946_and_Newer dummy_Bp5vK
6 MS_SubClass Two_Story_1945_and_Older dummy_Bp5vK
7 MS_SubClass Two_and_Half_Story_All_Ages dummy_Bp5vK
8 MS_SubClass Split_or_Multilevel dummy_Bp5vK
9 MS_SubClass Split_Foyer dummy_Bp5vK
10 MS_SubClass Duplex_All_Styles_and_Ages dummy_Bp5vK
# ℹ 233 more rows
Setting one_hot = TRUE gives us the complete one-hot encoding results.
onehot_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  prep()
onehot_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
  glimpse()
Rows: 2,930
Columns: 23
$ MS_SubClass_One_Story_1946_and_Newer_All_Styles <dbl> 1, 1, 1, 1, 0, 0…
$ MS_SubClass_One_Story_1945_and_Older <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_Story_with_Finished_Attic_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_Unfinished_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_Finished_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Story_1946_and_Newer <dbl> 0, 0, 0, 0, 1, 1…
$ MS_SubClass_Two_Story_1945_and_Older <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_and_Half_Story_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Split_or_Multilevel <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Split_Foyer <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Duplex_All_Styles_and_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_Story_PUD_1946_and_Newer <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_One_and_Half_Story_PUD_All_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Story_PUD_1946_and_Newer <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_PUD_Multilevel_Split_Level_Foyer <dbl> 0, 0, 0, 0, 0, 0…
$ MS_SubClass_Two_Family_conversion_All_Styles_and_Ages <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_Floating_Village_Residential <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_Residential_High_Density <dbl> 0, 1, 0, 0, 0, 0…
$ MS_Zoning_Residential_Low_Density <dbl> 1, 0, 1, 1, 1, 1…
$ MS_Zoning_Residential_Medium_Density <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_A_agr <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_C_all <dbl> 0, 0, 0, 0, 0, 0…
$ MS_Zoning_I_all <dbl> 0, 0, 0, 0, 0, 0…
18.7 Python Examples
We are using the ames data set for these examples. {sklearn} provides the OneHotEncoder() class, which we can use. Below we see how it can be used with the MS_Zoning column.
We are setting sparse_output=False in this example because we are having transform() return pandas data frames for better printing.
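from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(sparse_output=False), ['MS_Zoning'])],
    remainder="passthrough")
# Assumption: request pandas output here so that .filter() works on the result;
# the original may configure this globally instead.
ct.set_output(transform="pandas")
ct.fit(ames)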
ColumnTransformer(remainder='passthrough', transformers=[('onehot', OneHotEncoder(sparse_output=False), ['MS_Zoning'])])
ct.transform(ames).filter(regex=("MS_Zoning.*"))
onehot__MS_Zoning_A_agr ... onehot__MS_Zoning_Residential_Medium_Density
0 0.0 ... 0.0
1 0.0 ... 0.0
2 0.0 ... 0.0
3 0.0 ... 0.0
4 0.0 ... 0.0
... ... ... ...
2925 0.0 ... 0.0
2926 0.0 ... 0.0
2927 0.0 ... 0.0
2928 0.0 ... 0.0
2929 0.0 ... 0.0
[2930 rows x 7 columns]
By default, OneHotEncoder() performs one-hot encoding. We can change this to dummy encoding by setting drop='first'.
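ct = ColumnTransformer(
    [('dummy', OneHotEncoder(sparse_output=False, drop='first'), ['MS_Zoning'])],
    remainder="passthrough")
# Assumption: request pandas output here so that .filter() works on the result;
# the original may configure this globally instead.
ct.set_output(transform="pandas")
ct.fit(ames)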
ColumnTransformer(remainder='passthrough', transformers=[('dummy', OneHotEncoder(drop='first', sparse_output=False), ['MS_Zoning'])])
ct.transform(ames).filter(regex=("MS_Zoning.*"))
dummy__MS_Zoning_C_all ... dummy__MS_Zoning_Residential_Medium_Density
0 0.0 ... 0.0
1 0.0 ... 0.0
2 0.0 ... 0.0
3 0.0 ... 0.0
4 0.0 ... 0.0
... ... ... ...
2925 0.0 ... 0.0
2926 0.0 ... 0.0
2927 0.0 ... 0.0
2928 0.0 ... 0.0
2929 0.0 ... 0.0
[2930 rows x 6 columns]