26 Leaf Encoding
26.1 Leaf Encoding
Leaf encoding, also called decision tree encoding, is a method where a single decision tree is fit using a target, typically the outcome, and a single categorical variable as the predictor. The encoding is then done by using the predictions of the tree to replace the categorical labels.
This method works in both classification and regression settings, but the encodings serve different purposes. In a classification setting, we replace a categorical predictor with another categorical predictor with fewer levels. In a regression setting, the categorical predictor is replaced with a numeric variable. In some ways, this feels much like the target encoding explored in Chapter 23.
Suppose we use leaf encoding on the `MS_SubClass` predictor of the `ames` data set, using the numeric target `Sale_Price`. A possible fitted tree on that data would yield the following encoding table.

| leaf | MS_SubClass |
|---|---|
| 99353.69 | One_Story_1945_and_Older |
| 99353.69 | One_and_Half_Story_Unfinished_All_Ages |
| 99353.69 | PUD_Multilevel_Split_Level_Foyer |
| 144435.77 | One_and_Half_Story_Finished_All_Ages |
| 144435.77 | Split_Foyer |
| 144435.77 | Two_Story_PUD_1946_and_Newer |
| 144435.77 | Split_or_Multilevel |
| 144435.77 | Duplex_All_Styles_and_Ages |
| 144435.77 | Two_Family_conversion_All_Styles_and_Ages |
| 144435.77 | Two_Story_1945_and_Older |
| 144435.77 | One_Story_with_Finished_Attic_All_Ages |
| 144435.77 | One_and_Half_Story_PUD_All_Ages |
| 190646.00 | One_Story_1946_and_Newer_All_Styles |
| 190646.00 | One_Story_PUD_1946_and_Newer |
| 190646.00 | Two_and_Half_Story_All_Ages |
| 239364.29 | Two_Story_1946_and_Newer |

This table has 4 different values, meaning that the tree has 4 different leaves. Prediction now happens by using this lookup table.
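To make the mechanics concrete, here is a minimal sketch of how such an encoding table could be built by hand. This is an illustration, not the exact procedure used to produce the table above: it assumes the feazdata `ames` data, ordinal-encodes the labels first (scikit-learn trees need numeric input), and caps the tree at `max_depth=2`, a value chosen purely because it allows at most 4 leaves.

```python
# A minimal sketch of leaf encoding for a numeric target (assumptions:
# feazdata's ames data, max_depth=2 chosen purely for illustration).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor
from feazdata import ames

# scikit-learn trees need numeric input, so ordinal-encode the labels first.
X = OrdinalEncoder().fit_transform(ames[["MS_SubClass"]])
y = ames["Sale_Price"]

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# The tree's predictions become the encoding: one numeric value per leaf.
lookup = (
    pd.DataFrame({"MS_SubClass": ames["MS_SubClass"], "leaf": tree.predict(X)})
    .drop_duplicates()
    .sort_values("leaf")
)
print(lookup)
```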
Instead, let's see what happens if we choose a categorical target, using the same `MS_SubClass` predictor but with the categorical variable `Lot_Shape` as the target.
| leaf | MS_SubClass |
|---|---|
| leaf1 | One_Story_1946_and_Newer_All_Styles |
| leaf1 | One_and_Half_Story_Finished_All_Ages |
| leaf1 | Split_Foyer |
| leaf1 | Two_Story_PUD_1946_and_Newer |
| leaf1 | One_Story_1945_and_Older |
| leaf1 | Duplex_All_Styles_and_Ages |
| leaf1 | Two_Family_conversion_All_Styles_and_Ages |
| leaf1 | One_and_Half_Story_Unfinished_All_Ages |
| leaf1 | Two_Story_1945_and_Older |
| leaf1 | Two_and_Half_Story_All_Ages |
| leaf1 | One_Story_with_Finished_Attic_All_Ages |
| leaf1 | PUD_Multilevel_Split_Level_Foyer |
| leaf1 | One_and_Half_Story_PUD_All_Ages |
| leaf2 | Two_Story_1946_and_Newer |
| leaf2 | One_Story_PUD_1946_and_Newer |
| leaf2 | Split_or_Multilevel |
And we now have a mapping that takes 16 levels and compresses them into 2 levels. We note two insights for the categorical target case. Firstly, the number of unique levels can't exceed the number of levels in the target, because it is not possible to predict a level that doesn't exist in the target. Secondly, you will produce the same number of levels or fewer in your leaf. We saw earlier that it is possible to produce fewer. To produce the same number of levels, we would need a target with at least as many levels as the predictor, and each predictor level would have to map to a different target level.
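A similar sketch works for the categorical-target case, again assuming the feazdata `ames` data and an illustrative `max_depth=2`. The predicted class labels play the role of `leaf1` and `leaf2` above, which is why the number of resulting levels is bounded by the number of classes in the target.

```python
# A sketch of leaf encoding with a categorical target (assumptions:
# feazdata's ames data, max_depth=2 chosen for illustration).
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from feazdata import ames

X = OrdinalEncoder().fit_transform(ames[["MS_SubClass"]])
y = ames["Lot_Shape"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each MS_SubClass level maps to one predicted Lot_Shape class, so the
# encoding can never have more levels than the target has classes.
mapping = dict(zip(ames["MS_SubClass"], tree.predict(X)))
print(sorted(set(mapping.values())))
```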
Since we are fitting a tree, there is an opportunity for hyperparameter tuning, as the size and shape of the tree will affect the encoding. You will be fitting a different tree for each categorical variable you are encoding, and they likely won't have the same optimal tree size. Here you have to make a choice: either meticulously tune each tree in the broader scope of the model, or use decent defaults. The latter choice is likely the best one.
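As a rough illustration of how this tuning choice plays out, the loop below (reusing the hand-rolled setup from earlier, with a few arbitrary depth values) counts the distinct encoded values at each depth; deeper trees produce more leaves and therefore a finer-grained encoding.

```python
# Sketch: how tree depth affects the granularity of the encoding.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor
from feazdata import ames

X = OrdinalEncoder().fit_transform(ames[["MS_SubClass"]])
y = ames["Sale_Price"]

for depth in [1, 2, 3, 4]:
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    n_values = len(np.unique(tree.predict(X)))
    print(f"max_depth={depth}: {n_values} distinct encoded values")
```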
Lastly, this method doesn't work with unseen levels, as talked about in Chapter 17, since decision trees generally don't have a way to handle unseen levels.
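To see the failure mode concretely, the hand-rolled approach sketched above already breaks at the ordinal-encoding step when a level unseen during fitting shows up at transform time. `A_Brand_New_Level` below is a hypothetical level made up for this example.

```python
# Sketch: an unseen level has no leaf to map to, so encoding fails.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()  # handle_unknown='error' is the default
enc.fit(pd.DataFrame({"MS_SubClass": ["One_Story_1945_and_Older"]}))

try:
    # 'A_Brand_New_Level' is a hypothetical, unseen level.
    enc.transform(pd.DataFrame({"MS_SubClass": ["A_Brand_New_Level"]}))
except ValueError as err:
    print(err)
```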
26.2 Pros and Cons
26.2.1 Pros
- Produces a single column.
26.2.2 Cons
- Doesn't handle unseen levels.
- Can be unstable, due to using a decision tree.
- It may be overly simplistic.
26.3 R Examples
Has not yet been implemented.
See https://github.com/EmilHvitfeldt/feature-engineering-az/issues/40 for progress.
26.4 Python Examples
We are using the `ames` data set for examples. {feature_engine} provides the `DecisionTreeEncoder()` that we can use.
from feazdata import ames
from sklearn.compose import ColumnTransformer
from feature_engine.encoding import DecisionTreeEncoder

ct = ColumnTransformer(
    [('treeEncoding', DecisionTreeEncoder(), ['MS_SubClass'])],
    remainder="passthrough")

ct.fit(ames, y=ames[["Sale_Price"]].values.flatten())
ColumnTransformer(remainder='passthrough',
                  transformers=[('treeEncoding', DecisionTreeEncoder(),
                                 ['MS_SubClass'])])
ct.transform(ames).filter(regex="treeEncoding.*")
      treeEncoding__MS_SubClass
0                    187355.694
1                    187355.694
2                    187355.694
3                    187355.694
4                    239364.285
...                         ...
2925                 168009.364
2926                 187355.694
2927                 138618.386
2928                 187355.694
2929                 239364.285

[2930 rows x 1 columns]