34  Summary Encoding

You can repeat quantile encoding (Chapter 33) with several different quantiles to extract more information, e.g. the 0.25, 0.5, and 0.75 quantiles. This is called summary encoding.

One of the downsides of quantile encoding is that you need to pick, or tune, a good quantile. Summary encoding circumvents this issue by calculating several quantiles at the same time.
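To make the idea concrete, here is a minimal sketch of summary encoding written with plain pandas. The toy data, the column names `zone` and `price`, and the `zone_25`-style output names are assumptions made for illustration. Library implementations may also apply smoothing or handle unseen levels, which this sketch leaves out.

```python
import pandas as pd

# Toy data (made-up values): a categorical predictor and a numeric outcome.
df = pd.DataFrame({
    "zone": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "price": [100, 120, 140, 160, 200, 220, 260, 300],
})

quantiles = [0.25, 0.5, 0.75]

# One row per category level, one column per quantile of the outcome.
summary = df.groupby("zone")["price"].quantile(quantiles).unstack()
summary.columns = [f"zone_{int(round(q * 100))}" for q in summary.columns]

# Replace the categorical column with its quantile columns.
encoded = df.join(summary, on="zone").drop(columns="zone")
print(encoded)
```

Each level of `zone` is replaced by three numeric columns, so a single categorical predictor turns into as many features as there are quantiles.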

34.2 Pros and Cons

34.2.1 Pros

  • Less tuning than quantile encoding

34.2.2 Cons

  • More computationally expensive than quantile encoding
  • Chance of producing correlated or redundant features
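The redundancy concern can be checked directly by looking at the correlation between the encoded columns. The sketch below uses made-up data and a hypothetical helper, `summary_encode()`, that is not part of any library. It only illustrates the kind of check you might run.

```python
import pandas as pd

def summary_encode(df, col, target, quantiles):
    """Hypothetical helper: per-level quantiles of `target`, joined back by row."""
    table = df.groupby(col)[target].quantile(list(quantiles)).unstack()
    table.columns = [f"{col}_{int(round(q * 100))}" for q in table.columns]
    return df[[col]].join(table, on=col).drop(columns=col)

# Toy data (made-up values) with three category levels.
df = pd.DataFrame({
    "zone": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "price": [100, 120, 140, 160, 200, 220, 260, 300, 150, 155, 250, 400],
})

encoded = summary_encode(df, "zone", "price", (0.25, 0.5, 0.75))

# Pairwise correlation between the quantile columns.
corr = encoded.corr()
print(corr.round(2))
```

When the levels mostly differ in location rather than spread, neighbouring quantile columns tend to move together, so a downstream filter on highly correlated predictors may be worth considering.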

34.3 R Examples

Not yet implemented

1 + 1
[1] 2

34.4 Python Examples

We are using the ames data set for examples. {category_encoders} provides the SummaryEncoder() class, which we can use.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.quantile_encoder import SummaryEncoder

ct = ColumnTransformer(
    [('summary', SummaryEncoder(), ['MS_Zoning'])], 
    remainder="passthrough")

ct.fit(ames, y=ames["Sale_Price"])
ColumnTransformer(remainder='passthrough',
                  transformers=[('summary', SummaryEncoder(), ['MS_Zoning'])])
ct.transform(ames)
      summary__MS_Zoning_25  ...  remainder__Latitude
0                137496.482  ...               42.054
1                111612.500  ...               42.053
2                137496.482  ...               42.053
3                137496.482  ...               42.051
4                137496.482  ...               42.061
...                     ...  ...                  ...
2925             137496.482  ...               41.989
2926             137496.482  ...               41.988
2927             137496.482  ...               41.987
2928             137496.482  ...               41.991
2929             137496.482  ...               41.989

[2930 rows x 75 columns]