25 Leave One Out Encoding
25.1 Leave One Out Encoding
Leave One Out Encoding is a variation on target encoding. Where target encoding takes the mean of the target over all rows within each level, leave one out encoding excludes the value of the current row from that mean.
One of the main downsides to this approach is that it needs the target, which is most often the outcome and as such not available for the test data set. It therefore cannot do the row-wise adjustment there and behaves exactly like target encoding on the test data set.
What this does in practice is shift the influence of outliers within each level away from the whole group and onto the outlier itself. Consider a level that has the target values 100, 10, 6, 5, 3, 8. The target encoded value would be 22, while the leave one out values differ from row to row, with the largest difference at the outlier 100.

values | target | leave one out |
---|---|---|
100 | 22 | 6.4 |
10 | 22 | 24.4 |
6 | 22 | 25.2 |
5 | 22 | 25.4 |
3 | 22 | 25.8 |
8 | 22 | 24.8 |
Thus outliers influence target encoding differently than they influence leave one out encoding. Which type of influence is better is up to you, the practitioner, to determine based on your data and modeling problem.
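To make the arithmetic concrete, here is a small sketch in plain Python that reproduces the table above. The six values are the hypothetical target values from the example, not real data.

values = [100, 10, 6, 5, 3, 8]

# Target encoding: every row in the level gets the same mean.
target_encoded = sum(values) / len(values)
print(target_encoded)  # 22.0

# Leave one out encoding: each row gets the mean of the remaining rows.
for v in values:
    loo = (sum(values) - v) / (len(values) - 1)
    print(v, loo)
# 100 6.4
# 10 24.4
# 6 25.2
# 5 25.4
# 3 25.8
# 8 24.8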
25.2 Pros and Cons
25.2.1 Pros
- Doesn't hide the effect of outliers as much as target encoding does.
- Can deal with categorical variables with many levels.
- Can deal with unseen levels in a sensible way (see the sketch after this list).
25.2.2 Cons
- Only differs meaningfully from target encoding on the training data set.
- Can be prone to overfitting.
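To illustrate the unseen levels point, here is a small sketch of one common fallback: give a level that was never seen during training the overall mean of the target. The data frame and column names below are made up for illustration; they are not part of the ames data.

import pandas as pd

train = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B", "B"],
    "price":        [100, 10,  6,   5,   3]})
test = pd.DataFrame({"neighborhood": ["A", "C"]})  # "C" was never seen

level_means = train.groupby("neighborhood")["price"].mean()
global_mean = train["price"].mean()

# Seen levels get their training mean, unseen levels fall back to the global mean.
encoded = test["neighborhood"].map(level_means).fillna(global_mean)
print(encoded)
# 0    38.666667   (mean of A: (100 + 10 + 6) / 3)
# 1    24.800000   (global mean: (100 + 10 + 6 + 5 + 3) / 5)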
25.3 R Examples
Has not yet been implemented.
See https://github.com/EmilHvitfeldt/feature-engineering-az/issues/40 for progress.
25.4 Python Examples
We are using the ames data set for examples. {category_encoders} provides the LeaveOneOutEncoder() method we can use. For this to work, we need to remember to specify an outcome when we fit().
from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.leave_one_out import LeaveOneOutEncoder

# Leave one out encode Neighborhood, pass every other column through unchanged.
ct = ColumnTransformer(
    [('loo', LeaveOneOutEncoder(), ['Neighborhood'])],
    remainder="passthrough")

# The outcome must be supplied so the encoder can compute the per-level means.
ct.fit(ames, y=ames[["Sale_Price"]].values.flatten())
ColumnTransformer(remainder='passthrough', transformers=[('loo', LeaveOneOutEncoder(), ['Neighborhood'])])
ct.transform(ames).filter(regex="loo.*")
loo__Neighborhood
0 145097.350
1 145097.350
2 145097.350
3 145097.350
4 190646.576
... ...
2925 162226.632
2926 162226.632
2927 162226.632
2928 162226.632
2929 162226.632
[2930 rows x 1 columns]
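Once the transformer is fitted, it can also be applied to new data where the outcome is not available. As noted earlier, without the target there is no row-wise adjustment, so every row in a given neighborhood receives the same stored mean, exactly as target encoding would produce. A minimal sketch, pretending the first few rows of ames are new data:

# Pretend the first 5 rows are new observations with an unknown Sale_Price.
# ct.transform() is not given the outcome, so no leave one out adjustment
# happens; each neighborhood is mapped to its stored training mean.
new_data = ames.head(5)
ct.transform(new_data).filter(regex="loo.*")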