19 Label Encoding

19.1 Label Encoding

Label encoding (also called integer encoding) is a method that maps the categorical levels into the integers 1 through n where n is the number of levels.

Note

Some implementations map to the values 0 through n - 1. This chapter will talk about this method as if it were mapped to 1 through n.

This is a trained method since the preprocessor needs to keep a record of the possible levels and their corresponding integer values. Unseen levels can be encoded outside the range, as either 0 or n + 1, allowing them to be handled with minimal extra work.
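To make the "trained" part concrete, here is a minimal sketch of the idea in plain Python. The class name LabelEncoderSketch and its methods are hypothetical and not part of any library; the sketch assumes alphabetical ordering of the levels and sends unseen levels to 0.

```python
from dataclasses import dataclass, field


@dataclass
class LabelEncoderSketch:
    """Hypothetical 1-based label encoder, for illustration only."""

    unseen_value: int = 0  # assumption: unseen levels are encoded as 0
    mapping: dict = field(default_factory=dict)

    def fit(self, values):
        # Record each level once, in alphabetical order, numbered 1 through n.
        levels = sorted(set(values))
        self.mapping = {level: i for i, level in enumerate(levels, start=1)}
        return self

    def transform(self, values):
        # Levels that were not seen during fit get the out-of-range value.
        return [self.mapping.get(value, self.unseen_value) for value in values]


train = ["Studio", "Apartment", "Loft", "Duplex", "Loft"]
encoder = LabelEncoderSketch().fit(train)
encoded = encoder.transform(["Loft", "Studio", "Penthouse"])

print(encoder.mapping)
print(encoded)
```

"Penthouse" was never seen during training, so it falls outside the range 1 through 4.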

This method is often not ideal, as the ordering of the levels matters a lot for the performance of the model that needs to make sense of the generated numeric variable. A variable with the levels “Studio”, “Apartment”, “Loft”, and “Duplex” has 4! = 4 * 3 * 2 * 1 = 24 different orderings. Since the number of permutations is calculated with factorials, this number goes up fast. With just 10 levels, we are looking at 3,628,800 different orderings. Even if some orderings are better than others, it would be very slow to iterate through them all to find the good ones. If you have prior information about the levels, then you should use the method in Chapter 20 instead.
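The counts above can be checked with a few lines of Python using only the standard library. The variable names are illustrative.

```python
from itertools import permutations
from math import factorial

levels = ["Studio", "Apartment", "Loft", "Duplex"]

# Enumerate every possible ordering of the four levels.
orderings = list(permutations(levels))
n_orderings = len(orderings)

# For 10 levels we only count the orderings instead of listing them.
n_ten = factorial(10)

print(n_orderings)
print(n_ten)
```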

If you are working with an implementation that supports factors, then the level order of the factor will be used. Otherwise, the ordering will most likely be alphabetical or in order of occurrence. You should check the documentation of your implementation to figure out which.
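The two fallback orderings can give quite different integers for the same data. The following sketch, in plain Python with made-up values, compares them side by side.

```python
values = ["Studio", "Loft", "Apartment", "Loft", "Duplex", "Studio"]

# Alphabetical ordering of the distinct levels.
alphabetical = sorted(set(values))

# Order of first occurrence; dict keys keep insertion order.
by_occurrence = list(dict.fromkeys(values))

alpha_codes = {level: i for i, level in enumerate(alphabetical, start=1)}
occurrence_codes = {level: i for i, level in enumerate(by_occurrence, start=1)}

print(alpha_codes)
print(occurrence_codes)
```

Here "Studio" is encoded as 4 under alphabetical ordering but as 1 under order of occurrence.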

The performance of this method will depend a lot on the model. If you are working with a linear model, then you are most likely out of luck, as it wouldn’t be able to use a variable where the values 2, 6, and 10 provide evidence one way and the rest provide evidence the other way. Tree-based models will be able to do better, but would do even better if the levels had been given a meaningful order to begin with.
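As a rough illustration of this point, the sketch below builds a toy variable with 10 levels where the codes 2, 6, and 10 have a target of 1 and the rest have a target of 0. It is a simplified stand-in for real models: the linear model is represented by the R-squared of a straight-line fit, and the effort a tree needs is represented by the number of threshold splits required to separate the two groups. The helper functions and the "better" ordering are assumptions made for this example.

```python
def r_squared(x, y):
    # R-squared of a simple least-squares line of y on x.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def splits_needed(codes, target):
    # Number of places the target changes when walking along the sorted codes.
    ordered = [t for _, t in sorted(zip(codes, target))]
    return sum(a != b for a, b in zip(ordered, ordered[1:]))


positive = {2, 6, 10}
codes = list(range(1, 11))
target = [1 if code in positive else 0 for code in codes]

linear_r2 = r_squared(codes, target)
splits_label = splits_needed(codes, target)

# Hypothetical better ordering: the three positive levels get the codes 1 to 3.
ranked = sorted(codes, key=lambda code: (code not in positive, code))
recode = {old: new for new, old in enumerate(ranked, start=1)}
better_codes = [recode[code] for code in codes]

better_r2 = r_squared(better_codes, target)
splits_better = splits_needed(better_codes, target)

print(round(linear_r2, 3), splits_label)
print(round(better_r2, 3), splits_better)
```

With the arbitrary ordering, the straight line explains almost none of the variation and five splits are needed to isolate the three levels. With the better ordering, a single split is enough.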

19.2 Pros and Cons

19.2.1 Pros

Only produces a single numeric variable for each categorical variable
Has a way to handle unseen levels, although poorly

19.2.2 Cons

Ordering of the levels matters a lot!
Will very often give inferior performance compared to other methods.

19.3 R Examples

We will be using the ames data set for these examples. The step_integer() function allows us to perform label encoding.

library(recipes)
library(modeldata)
data("ames")

ames |>
  select(Sale_Price, MS_SubClass, MS_Zoning)

# A tibble: 2,930 × 3
   Sale_Price MS_SubClass                         MS_Zoning               
        <int> <fct>                               <fct>                   
 1     215000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 2     105000 One_Story_1946_and_Newer_All_Styles Residential_High_Density
 3     172000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 4     244000 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 
 5     189900 Two_Story_1946_and_Newer            Residential_Low_Density 
 6     195500 Two_Story_1946_and_Newer            Residential_Low_Density 
 7     213500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
 8     191500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
 9     236500 One_Story_PUD_1946_and_Newer        Residential_Low_Density 
10     189000 Two_Story_1946_and_Newer            Residential_Low_Density 
# ℹ 2,920 more rows

Looking at the levels of MS_SubClass, we see that the levels are set in a specific way. The order isn’t alphabetical, but there isn’t one clear ordering principle either. The data documentation at http://jse.amstat.org/v19n3/decock/DataDocumentation.txt does not clarify the ordering.

ames |> pull(MS_SubClass) |> levels()

 [1] "One_Story_1946_and_Newer_All_Styles"      
 [2] "One_Story_1945_and_Older"                 
 [3] "One_Story_with_Finished_Attic_All_Ages"   
 [4] "One_and_Half_Story_Unfinished_All_Ages"   
 [5] "One_and_Half_Story_Finished_All_Ages"     
 [6] "Two_Story_1946_and_Newer"                 
 [7] "Two_Story_1945_and_Older"                 
 [8] "Two_and_Half_Story_All_Ages"              
 [9] "Split_or_Multilevel"                      
[10] "Split_Foyer"                              
[11] "Duplex_All_Styles_and_Ages"               
[12] "One_Story_PUD_1946_and_Newer"             
[13] "One_and_Half_Story_PUD_All_Ages"          
[14] "Two_Story_PUD_1946_and_Newer"             
[15] "PUD_Multilevel_Split_Level_Foyer"         
[16] "Two_Family_conversion_All_Styles_and_Ages"

We will be using the step_integer() step for this, which defaults to 1-based indexing.

label_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_integer(all_nominal_predictors()) |>
  prep()

label_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning"))

# A tibble: 2,930 × 2
   MS_SubClass MS_Zoning
         <int>     <int>
 1           1         3
 2           1         2
 3           1         3
 4           1         3
 5           6         3
 6           6         3
 7          12         3
 8          12         3
 9          12         3
10           6         3
# ℹ 2,920 more rows

19.4 Python Examples

We are using the ames data set for these examples. {sklearn} provides the OrdinalEncoder() class, which we can use for this. Below we see how it can be used with the MS_Zoning column. We call this method label encoding only when categories='auto', as the levels are then automatically labeled 0 to n_categories - 1.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

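ct = ColumnTransformer(
    [('label', OrdinalEncoder(categories='auto'), ['MS_Zoning'])],
    remainder="passthrough")

# Return a pandas DataFrame so the result can be filtered by column name
ct.set_output(transform="pandas")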

ct.fit(ames)

ColumnTransformer(remainder='passthrough',
                  transformers=[('label', OrdinalEncoder(), ['MS_Zoning'])])

ct.transform(ames).filter(regex="label.*")

      label__MS_Zoning
0                  5.0
1                  4.0
2                  5.0
3                  5.0
4                  5.0
...                ...
2925               5.0
2926               5.0
2927               5.0
2928               5.0
2929               5.0

[2930 rows x 1 columns]
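Unseen levels can also be handled on the Python side. The sketch below uses the handle_unknown and unknown_value arguments of OrdinalEncoder() on a small made-up example rather than the ames data. The choice of -1 as the out-of-range value is an assumption for this illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Toy training data with a single categorical column.
train = [["Studio"], ["Apartment"], ["Loft"], ["Duplex"]]

# Levels not seen during fit are encoded as -1, which is outside 0 to n - 1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# "Penthouse" did not appear in the training data.
encoded = encoder.transform([["Loft"], ["Penthouse"]])

print(encoder.categories_)
print(encoded)
```

The levels are sorted alphabetically during fitting, so "Loft" is encoded as 2.0 and the unseen "Penthouse" as -1.0.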