This method is similar to the one in Chapter 19, except that we manually specify the mapping. It is generally used for ordinal variables, since their levels come with a natural ordering. Where this method shines compared to integer encoding is that it allows arbitrary values for the encoding. We can have (cold = 1, warm = 5, hot = 20), but we might as well use (cold = -1, warm = 0, hot = 1) or (cold = 1.618, warm = 2.718, hot = 3.141), although you would have a hard time justifying the latter. Nothing is stopping you from using this method with an unordered categorical variable; you just need to spend some time justifying the values you assign to its levels.
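As a minimal sketch of the idea in base R, the mapping can be written as a named numeric vector and applied by name lookup. The temperature observations below are made up for illustration; only the (cold = 1, warm = 5, hot = 20) scores come from the text above.

```r
# Hypothetical observations of a temperature variable (illustrative data only).
temperature <- c("cold", "hot", "warm", "cold", "hot")

# The manually chosen mapping from level to numeric value.
scores <- c(cold = 1, warm = 5, hot = 20)

# Look up each observation's score by name; unname() drops the level names.
encoded <- unname(scores[temperature])
encoded
#> [1]  1 20  5  1 20
```

Swapping in a different scores vector, such as c(cold = -1, warm = 0, hot = 1), changes the encoding without touching the rest of the code.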
Note
This book's framing of ordinal encoding is more general than that of other sources, insofar as it is described as manually giving values to the levels of a categorical variable, whether it is ordered or not.
This method feels like a trained method, but it isn't. This is because we provide the record of the possible levels and their corresponding numeric values ourselves, rather than estimating them from the data. Unseen levels can be manually specified, but it isn't entirely obvious what their values should be.
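To make the unseen-level problem concrete, here is a small base R sketch. The level "scorching" and the fallback value are assumptions chosen purely for illustration; what value an unseen level should get remains a judgment call.

```r
# The manually chosen mapping, as before.
scores <- c(cold = 1, warm = 5, hot = 20)

# New data containing a level that is not part of the mapping.
new_values <- c("warm", "scorching", "cold")

# Looking up a name that is not in the mapping returns NA.
raw <- unname(scores[new_values])

# One possible choice (an assumption): give unseen levels a fallback value,
# here the same score as "hot".
fallback <- 20
filled <- ifelse(is.na(raw), fallback, raw)
```

Here raw contains a missing value for "scorching", while filled replaces it with the fallback. Leaving the value missing and handling it in a later imputation step would be an equally defensible choice.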
TODO
add diagram
Manually setting values for your levels comes with some upsides and downsides. If you have the domain expertise to assign numeric values to the levels, you can remove a lot of the guesswork and trial and error that we see in integer encoding. This can be very effective if the numeric values selected for the levels have some intrinsic meaning. The downside is the other side of the same coin: we need the domain expertise to give the levels meaningful values; otherwise, we are doing no better than integer encoding.
20.2 Pros and Cons
20.2.1 Pros
Only produces a single numeric variable for each categorical variable
Preserves the natural ordering of ordered values
20.2.2 Cons
Will very often give inferior performance compared to other methods
Unseen levels need to be manually specified
20.3 R Examples
We will be using the ames data set for these examples.
Looking at the levels of Lot_Shape and Land_Slope, we see that they match the levels in the documentation (http://jse.amstat.org/v19n3/decock/DataDocumentation.txt). Furthermore, these variables are listed as ordinal there; they just aren't denoted as such in this data set.
To perform ordinal encoding, we will use the step_ordinalscore() step. By default, it gives each level a value between 1 and n, much like step_integer().
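The following is a sketch of how this could look, assuming the recipes and modeldata packages are installed. Because step_ordinalscore() expects ordered factors, the sketch first marks both variables as ordered with step_mutate(); that conversion step is an assumption about how to prepare the data. The slope_scores() helper and its values are hypothetical, included only to show how the convert argument can supply custom values.

```r
library(recipes)

data("ames", package = "modeldata")

# Inspect the level order that the encoding will follow.
levels(ames$Lot_Shape)
levels(ames$Land_Slope)

ord_rec <- recipe(Sale_Price ~ Lot_Shape + Land_Slope, data = ames) |>
  # Mark the factors as ordered, keeping their existing level order.
  step_mutate(
    Lot_Shape = as.ordered(Lot_Shape),
    Land_Slope = as.ordered(Land_Slope)
  ) |>
  # Default conversion: levels become 1, 2, ..., n.
  step_ordinalscore(Lot_Shape, Land_Slope) |>
  prep()

encoded <- bake(ord_rec, new_data = NULL)

# Hypothetical custom scores for the three Land_Slope levels (made-up values).
slope_scores <- function(x) c(0, 1, 10)[as.integer(x)]

custom_rec <- recipe(Sale_Price ~ Land_Slope, data = ames) |>
  step_mutate(Land_Slope = as.ordered(Land_Slope)) |>
  step_ordinalscore(Land_Slope, convert = slope_scores) |>
  prep()

custom <- bake(custom_rec, new_data = NULL)
```

Note that convert applies the same function to every selected variable, so variables that need different custom scores would each need their own step_ordinalscore() call.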