28 Catboost Encoding

28.1 Catboost Encoding

Catboost encoding, also known as ordered target encoding, is an extension of target encoding as seen in Chapter 23. It was first proposed as part of CatBoost (Prokhorenkova et al. 2019).

In regular target encoding, we can calculate the encoding all at once for each level of the predictor we are working with. Ordered target encoding, as the name suggests, imposes an ordering on the observations, and the target statistics are then calculated only over the previous observations, in the hope that this will reduce target leakage.

This is one of the few feature engineering methods where the row order matters. This means that the initial ordering of your data matters, unless the implementation you are working with can shuffle the rows for you.

The general formula used to calculate the encoding is as follows:

\[ \dfrac{currentCount + prior}{totalCount + 1} \]

Where \(currentCount\) is the number of times the target class has occurred for this predictor level, \(totalCount\) is the number of times the predictor level has occurred, and \(prior\) is some constant value, typically defaulted to \(0.05\).

There are a handful of variations of this formula, depending on the target type. See the CatBoost documentation for specifics.

Notice that the above formulation assumes a classification setting; regression is usually handled by quantizing the numeric target inside the application.

Below we have a worked example. We are using color as the categorical variable we are going to encode, target as the target, and "yes" as the target class we are looking for. The first row is the trivial case since it is the first occurrence of "red". It will thus have the value of prior, since currentCount and totalCount are both equal to 0. The next row is another "red", so we count how many previous occurrences of "red" we have, which is 1, and set totalCount to that value. Then we count how many times target is equal to "yes" in those instances, which is also 1, and we set currentCount to 1 as well. This gives us (1 + 0.05) / (1 + 1) = 0.525. The third row is another trivial case. The fourth row has totalCount = 1 and currentCount = 0, since the previous value of "green" didn't have a target of "yes". The following rows are calculated using the same principles.

| color | target | currentCount | totalCount | catboost |
|-------|--------|--------------|------------|----------|
| red   | yes    | 0            | 0          | 0.050    |
| red   | yes    | 1            | 1          | 0.525    |
| green | no     | 0            | 0          | 0.050    |
| green | yes    | 0            | 1          | 0.025    |
| red   | no     | 2            | 2          | 0.683    |
| green | no     | 1            | 2          | 0.350    |
| blue  | yes    | 0            | 0          | 0.050    |
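As a check on the arithmetic, here is a minimal sketch of ordered target encoding (the function name and data are illustrative, not taken from any library) that reproduces the catboost column of the table:

```python
from collections import defaultdict

def ordered_target_encode(levels, targets, target_class="yes", prior=0.05):
    """Encode each row using only the rows that came before it."""
    total = defaultdict(int)    # totalCount: prior occurrences of the level
    current = defaultdict(int)  # currentCount: occurrences with target_class
    encodings = []
    for level, target in zip(levels, targets):
        encodings.append((current[level] + prior) / (total[level] + 1))
        # update the running counts AFTER encoding the current row
        total[level] += 1
        if target == target_class:
            current[level] += 1
    return encodings

colors  = ["red", "red", "green", "green", "red", "green", "blue"]
targets = ["yes", "yes", "no",    "yes",   "no",  "no",    "yes"]
print([round(e, 3) for e in ordered_target_encode(colors, targets)])
# [0.05, 0.525, 0.05, 0.025, 0.683, 0.35, 0.05]
```

Note that each row is encoded before its own target is added to the counts, which is what prevents a row from leaking its own label into its encoding.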
One of the stated downsides of using this method outside of CatBoost itself is that, since the encoding depends on the order, the amount of information isn't uniformly spread over the observations. Instead, the first observations have low information and the last observations have high information. This is, of course, after shuffling, if that has taken place. This is not a problem inside CatBoost, as shuffling the order before applying the encoding is part of what makes stochastic gradient descent work for CatBoost. It is worth keeping this in mind when you are doing performance checks of your fitted models.
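To make that unevenness concrete, here is a small illustrative simulation (not CatBoost itself): a single level whose target alternates between "yes" and "no", so its true "yes" rate is 50%. The first encoding is pure prior, while late encodings settle near 0.5.

```python
prior = 0.05
targets = ["yes", "no"] * 50  # one level, alternating target, true rate 0.5
encodings, current, total = [], 0, 0
for target in targets:
    encodings.append((current + prior) / (total + 1))
    total += 1
    current += (target == "yes")

# the first row carries no information (it is just the prior),
# while the last row sits close to the true 50% rate
print(round(encodings[0], 2), round(encodings[-1], 2))
```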
The way we apply this method to new observations, such as those in the testing set, is to pretend that each row has been appended to the training set, apply the encoding, and then remove it again. Rinse and repeat for the remaining observations. That is to say, we use the currentCount and totalCount values from the training set, but we do not update them.
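A sketch of that test-time behavior, under the same illustrative setup as before (the counts below are assumed to be the final per-level counts after the seven training rows in the table): each new row reads the stored counts but never writes to them, and an unseen level falls back to the prior.

```python
prior = 0.05
# final counts after the seven training rows in the worked example
total   = {"red": 3, "green": 3, "blue": 1}   # totalCount per level
current = {"red": 2, "green": 1, "blue": 1}   # currentCount per level

def encode_new(level):
    # read-only lookup: the counts are NOT updated for new observations
    return (current.get(level, 0) + prior) / (total.get(level, 0) + 1)

print(round(encode_new("red"), 3))     # uses the frozen training counts
print(round(encode_new("purple"), 3))  # unseen level: falls back to the prior
```

Because the counts stay frozen, every test-set occurrence of a level receives the same encoding, unlike during training.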
28.2 Pros and Cons
28.2.1 Pros
- Can deal with categorical variables with many levels
- Can deal with unseen levels in a sensible way
28.2.2 Cons
- Uneven effect for different observations
28.3 R Examples
Has not yet been implemented.
See https://github.com/EmilHvitfeldt/feature-engineering-az/issues/40 for progress.
28.4 Python Examples
We are using the ames data set for examples. {category_encoders} provides the CatBoostEncoder() method we can use.
```python
from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.cat_boost import CatBoostEncoder

ct = ColumnTransformer(
    [('catboost', CatBoostEncoder(), ['MS_Zoning'])],
    remainder="passthrough")

ct.fit(ames, y=ames[["Sale_Price"]].values.flatten())
```
```
ColumnTransformer(remainder='passthrough',
                  transformers=[('catboost', CatBoostEncoder(), ['MS_Zoning'])])
```
```python
ct.transform(ames).filter(regex="catboost.*")
```
catboost__MS_Zoning
0 191278.640
1 138004.645
2 191278.640
3 191278.640
4 191278.640
... ...
2925 191278.640
2926 191278.640
2927 191278.640
2928 191278.640
2929 191278.640
[2930 rows x 1 columns]