Leaf encoding, also called decision tree encoding, is a method where a single decision tree is fit using a target, typically the outcome, and a single categorical variable as the predictor. The encoding is then done by using the predictions of the tree to replace the categorical labels.
This should work in both classification and regression settings, but the two serve different purposes. In a classification setting, we replace a categorical predictor with another categorical predictor that typically has fewer levels. In a regression setting, the categorical predictor is replaced with a numeric variable. In some ways, this feels much like the target encoding explored in Chapter 23.
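To make the regression case concrete, here is a minimal sketch of the idea in plain Python. It is not the implementation used by any particular library: the function name `fit_leaf_encoding`, its `max_depth` and `min_n` parameters, and the toy data are all assumptions made for illustration. It grows a small regression tree on one categorical predictor by ordering the levels by their mean target and searching for the binary split with the lowest squared error, then maps each level to the mean of its leaf.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_leaf_encoding(levels, target, max_depth=2, min_n=2):
    """Return {level: leaf mean} from a small tree on one categorical predictor.

    Hypothetical helper for illustration, not a library API.
    """
    # Collect the target values observed for each level.
    groups = {}
    for lvl, y in zip(levels, target):
        groups.setdefault(lvl, []).append(y)

    def grow(lvls, depth):
        ys = [y for l in lvls for y in groups[l]]
        if depth == max_depth or len(lvls) == 1:
            return {l: mean(ys) for l in lvls}
        # Order levels by mean target, then try every cut point in that order.
        ordered = sorted(lvls, key=lambda l: mean(groups[l]))
        best = None
        for i in range(1, len(ordered)):
            left = [y for l in ordered[:i] for y in groups[l]]
            right = [y for l in ordered[i:] for y in groups[l]]
            if len(left) < min_n or len(right) < min_n:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, i)
        # Stop if no valid split exists or no split reduces the error.
        if best is None or best[0] >= sse(ys):
            return {l: mean(ys) for l in lvls}
        i = best[1]
        return {**grow(ordered[:i], depth + 1), **grow(ordered[i:], depth + 1)}

    return grow(list(groups), 0)


# Toy data: four levels, two clearly separated groups of target values.
toy_levels = ["a", "a", "b", "b", "c", "c", "d", "d"]
toy_target = [1, 1, 2, 2, 10, 10, 11, 11]

# With a single split, a and b share one leaf value, c and d the other.
encoding = fit_leaf_encoding(toy_levels, toy_target, max_depth=1)
```

Raising `max_depth` lets the tree separate more levels, which is why the tree size directly controls how many distinct values the encoding produces.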
Suppose we use leaf encoding on the MS_SubClass predictor of the ames data set, using the numeric target Sale_Price. A possible fitted tree on that data would yield the following encoding table.
| leaf | MS_SubClass |
|---|---|
| 99353.69 | One_Story_1945_and_Older |
| 99353.69 | One_and_Half_Story_Unfinished_All_Ages |
| 99353.69 | PUD_Multilevel_Split_Level_Foyer |
| 144435.77 | One_and_Half_Story_Finished_All_Ages |
| 144435.77 | Split_Foyer |
| 144435.77 | Two_Story_PUD_1946_and_Newer |
| 144435.77 | Split_or_Multilevel |
| 144435.77 | Duplex_All_Styles_and_Ages |
| 144435.77 | Two_Family_conversion_All_Styles_and_Ages |
| 144435.77 | Two_Story_1945_and_Older |
| 144435.77 | One_Story_with_Finished_Attic_All_Ages |
| 144435.77 | One_and_Half_Story_PUD_All_Ages |
| 190646.00 | One_Story_1946_and_Newer_All_Styles |
| 190646.00 | One_Story_PUD_1946_and_Newer |
| 190646.00 | Two_and_Half_Story_All_Ages |
| 239364.29 | Two_Story_1946_and_Newer |
This table has 4 different values, meaning that the tree has 4 different leaves. Prediction then happens by using this lookup table: each level of MS_SubClass is replaced with the value of its leaf.
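Applying the encoding is then just a lookup. The sketch below uses four rows taken from the table above, one per leaf. The helper name `leaf_encode` and the list of new observations are assumptions made for illustration.

```python
# Subset of the encoding table above: MS_SubClass level -> leaf prediction.
leaf_lookup = {
    "One_Story_1945_and_Older": 99353.69,
    "Split_Foyer": 144435.77,
    "One_Story_PUD_1946_and_Newer": 190646.00,
    "Two_Story_1946_and_Newer": 239364.29,
}


def leaf_encode(values, lookup):
    """Replace each categorical label with the prediction of its leaf."""
    return [lookup[v] for v in values]


# Hypothetical new observations to encode.
new_houses = ["Split_Foyer", "Two_Story_1946_and_Newer", "Split_Foyer"]
encoded = leaf_encode(new_houses, leaf_lookup)
```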
Instead, let's see what happens if we choose a categorical target. We use the same MS_SubClass predictor, but with the categorical variable Lot_Shape as the target.
| leaf | MS_SubClass |
|---|---|
| leaf1 | One_Story_1946_and_Newer_All_Styles |
| leaf1 | One_and_Half_Story_Finished_All_Ages |
| leaf1 | Split_Foyer |
| leaf1 | Two_Story_PUD_1946_and_Newer |
| leaf1 | One_Story_1945_and_Older |
| leaf1 | Duplex_All_Styles_and_Ages |
| leaf1 | Two_Family_conversion_All_Styles_and_Ages |
| leaf1 | One_and_Half_Story_Unfinished_All_Ages |
| leaf1 | Two_Story_1945_and_Older |
| leaf1 | Two_and_Half_Story_All_Ages |
| leaf1 | One_Story_with_Finished_Attic_All_Ages |
| leaf1 | PUD_Multilevel_Split_Level_Foyer |
| leaf1 | One_and_Half_Story_PUD_All_Ages |
| leaf2 | Two_Story_1946_and_Newer |
| leaf2 | One_Story_PUD_1946_and_Newer |
| leaf2 | Split_or_Multilevel |
We now have a mapping that takes 16 levels and compresses them into 2 levels. We note two insights for the categorical target case. Firstly, the number of unique levels in the encoding can't exceed the number of levels in the target, because it is not possible to predict a level that doesn't exist for the target. Secondly, the encoding will have the same number of levels as the predictor or fewer. We saw above that it is possible to produce fewer. To produce the same number of levels, we would need a target with the same or more levels than the predictor, and each predictor level would have to map to a different target level.
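Both bounds can be checked on a toy example. The sketch below is a deliberate simplification, not a real tree fit: it assumes a fully grown tree, so each predictor level is simply predicted as its majority target class. The function name `majority_class_encoding` and the data are made up for illustration.

```python
from collections import Counter


def majority_class_encoding(levels, target):
    """Map each predictor level to its most common target class.

    Hypothetical stand-in for the predictions of a fully grown
    classification tree on one categorical predictor.
    """
    counts = {}
    for lvl, y in zip(levels, target):
        counts.setdefault(lvl, Counter())[y] += 1
    return {lvl: c.most_common(1)[0][0] for lvl, c in counts.items()}


# Toy data: a predictor with 4 levels and a target with 2 levels.
toy_levels = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"]
toy_target = [
    "Regular", "Regular", "Irregular",    # a -> Regular
    "Irregular", "Irregular", "Regular",  # b -> Irregular
    "Regular", "Regular", "Regular",      # c -> Regular
    "Irregular", "Irregular",             # d -> Irregular
]

class_encoding = majority_class_encoding(toy_levels, toy_target)
n_encoded = len(set(class_encoding.values()))

# The encoding can have no more levels than the target or the predictor.
upper_bound = min(len(set(toy_levels)), len(set(toy_target)))
```

Here the four predictor levels collapse to two encoded levels, which is the most a two-level target allows.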
Since we are fitting a tree, its hyperparameters can be tuned, as the size and shape of the tree will affect the encoding. You will be fitting a different tree for each of the categorical variables you are encoding, and they likely won't have the same optimal tree size. Here you have to make a choice: either meticulously tune each tree in the broader scope of the model, or use decent defaults. The latter choice is likely the best one.
Lastly, this method doesn't work with unseen levels as talked about in Chapter 17, as decision trees generally don't have a way to handle unseen levels.
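Seen as a lookup table, the problem is easy to show: a level that was absent during fitting has no entry. The sketch below illustrates the failure and one possible workaround, returning a missing value so that a later imputation step can deal with it. The level name `Three_Story_Hypothetical`, both helper functions, and the fallback choice are assumptions made for illustration, not something the method itself prescribes.

```python
import math

# Two rows from a fitted encoding table: level -> leaf prediction.
leaf_lookup = {
    "Split_Foyer": 144435.77,
    "Two_Story_1946_and_Newer": 239364.29,
}


def leaf_encode_strict(level, lookup):
    """Plain lookup: raises KeyError for a level not seen during fitting."""
    return lookup[level]


def leaf_encode_with_fallback(level, lookup, fallback=math.nan):
    """Assumed workaround: return a missing value for unseen levels."""
    return lookup.get(level, fallback)


# A made-up level that was not present when the tree was fit.
unseen = "Three_Story_Hypothetical"

try:
    leaf_encode_strict(unseen, leaf_lookup)
    strict_failed = False
except KeyError:
    strict_failed = True

fallback_value = leaf_encode_with_fallback(unseen, leaf_lookup)
```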