20  Ordinal Encoding

This method is similar to Chapter 19, except that we manually specify the mapping. This method is generally used for ordinal variables as they are encoded with a natural ordering. Where this method shines compared to integer encoding is that we allow arbitrary values for encoding, thus we can have (cold = 1, warm = 5, hot = 20). But we might as well use (cold = -1, warm = 0, hot = 1) or (cold = 1.618, warm = 2.718, hot = 3.141). Although you would have a hard time justifying the latter. Nothing is stopping you from using this method with an unordered categorical variable, you just need to spend some time justifying your levels.

Note

This book’s framing of ordinal encoding is more general than other sources, in so far as it is described as manually giving values to levels of a categorical, whether it is ordered or not.

This method feels like it but isn’t a trained method. This is because we are providing the record of the possible values and their corresponding integer value. Unseen levels can be manually specified, but it isn’t entirely obvious what their value should be.

TODO

add diagram

Manually setting values for your levels comes with some upsides and downsides. Assuming that you have the domain expertise to apply numeric values for the levels, removes a lot of the guesswork. This can be very effective if the numeric values selected for the levels have some intrinsic meaning. It can remove a lot of the guesswork and trial and error that we see in integer encoding. The downside is the other side of the coin. We need to have the domain expertise to be able to give the levels meaningful values, otherwise, we are doing no better than integer encoding.

20.2 Pros and Cons

20.2.1 Pros

  • Only produces a single numeric variable for each categorical variable
  • Preserves the natural ordering of ordered values

20.2.2 Cons

  • Will very often give inferior performance compared to other methods
  • Unseen levels need to be manually specified

20.3 R Examples

We will be using the ames data set for these examples. The step_dummy() function allows us to perform dummy encoding and one-hot encoding.

library(recipes)
library(modeldata)
data("ames")

ames |>
  select(Lot_Shape, Land_Slope)
# A tibble: 2,930 Γ— 2
   Lot_Shape          Land_Slope
   <fct>              <fct>     
 1 Slightly_Irregular Gtl       
 2 Regular            Gtl       
 3 Slightly_Irregular Gtl       
 4 Regular            Gtl       
 5 Slightly_Irregular Gtl       
 6 Slightly_Irregular Gtl       
 7 Regular            Gtl       
 8 Slightly_Irregular Gtl       
 9 Slightly_Irregular Gtl       
10 Regular            Gtl       
# β„Ή 2,920 more rows

Looking at the levels of Lot_Shape and Land_Slope we see that they match the levels in the documentation http://jse.amstat.org/v19n3/decock/DataDocumentation.txt. Furthermore, these variables are listed as ordinal, they just aren’t denoted like this in this data set.

ames |> pull(Lot_Shape) |> levels()
[1] "Regular"              "Slightly_Irregular"   "Moderately_Irregular"
[4] "Irregular"           
ames |> pull(Land_Slope) |> levels()
[1] "Gtl" "Mod" "Sev"

We will fix that by turning them into ordered factors.

ames <- ames |>
  mutate(across(.cols = c(Lot_Shape, Land_Slope), .fns = as.ordered))

to perform ordinal encoding we will use the step_ordinalscore() step. This defaults to giving each level values between 1 and n, much like step_integer().

ordinal_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_ordinalscore(Lot_Shape, Land_Slope) |>
  prep()

ordinal_rec |>
  bake(new_data = NULL, starts_with("Lot_Shape"), starts_with("Land_Slope"))
# A tibble: 2,930 Γ— 2
   Lot_Shape Land_Slope
       <int>      <int>
 1         2          1
 2         1          1
 3         2          1
 4         1          1
 5         2          1
 6         2          1
 7         1          1
 8         2          1
 9         2          1
10         1          1
# β„Ή 2,920 more rows

What we can do is define a special transformation function for each of the steps. One way is to use the case_when() function

Lot_Shape_transformer <- function(x) {
  case_when(
    x == "Regular" ~ 0, 
    x == "Slightly_Irregular" ~ -1,
    x == "Moderately_Irregular" ~ -5,
    x == "Irregular" ~ -10
  )
}

If you have the values for each of the levels as a vector or data, you can write the function to use that information as well.

Land_Slope_values <- c(Gtl = 0, Mod = 1, Sev = 5)

Land_Slope_transformer <- function(x) {
  Land_Slope_values[x]
}

With these functions, we can now apply them to the respective columns by using the convert argument.

ordinal_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_ordinalscore(Lot_Shape, convert = Lot_Shape_transformer) |>
  step_ordinalscore(Land_Slope, convert = Land_Slope_transformer) |>
  prep()

ordinal_rec |>
  bake(new_data = NULL, starts_with("Lot_Shape"), starts_with("Land_Slope")) |>
  distinct()
# A tibble: 11 Γ— 2
   Lot_Shape Land_Slope
       <int>      <int>
 1        -1          0
 2         0          0
 3        -5          1
 4        -1          1
 5         0          1
 6        -5          0
 7         0          5
 8        -1          5
 9       -10          0
10        -5          5
11       -10          5

20.4 Python Examples

I’m not aware of a good way to do this in a scikit-learn way. Please file an issue on github if you know of a good way.