33  Quantile Encoding

Quantile encoding (Mougan et al. 2021) is a reimagined version of target encoding and M-estimator encoding. It uses quantiles instead of means and borrows the M regularization from M-estimator encoding.

Whereas target encoding uses the mean as its aggregation function, quantile encoding can use any quantile. Most of what we know about target encoding also holds for quantile encoding. The differences come from how quantiles differ from means. Quantiles are generally more robust to outliers than the mean, provided the chosen quantile is not close to 0 or 1. Quantile encoding inherits this robustness.
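To make the robustness claim concrete, here is a minimal sketch using only the Python standard library. The prices are made-up numbers chosen for illustration, not values from any data set used in this chapter.

```python
from statistics import mean, median

# Hypothetical target values for one category (illustrative numbers only)
prices = [100, 110, 120, 130, 140]

# The same category after a single extreme observation is added
prices_with_outlier = prices + [1000]

# How far each aggregation moves because of the one outlier
mean_shift = mean(prices_with_outlier) - mean(prices)
median_shift = median(prices_with_outlier) - median(prices)

print(f"mean moves by   {mean_shift:.2f}")
print(f"median moves by {median_shift:.2f}")
```

The mean is pulled far toward the outlier, while the median barely moves. A quantile close to 1, such as the maximum, would not enjoy this protection.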

It is suggested that quantile encoding be paired with M-estimator style regularization, since quantiles calculated on small groups are unreliable.

The following formula is used to calculate the quantile encodings.

\[ QE_i = \dfrac{q(category_i) \cdot n_i + q(whole) \cdot M}{n_i + M} \]

\(QE_i\) is the encoding value for the \(i\)’th category. \(q(category_i)\) is the quantile of the outcome values within the \(i\)’th category, and \(q(whole)\) is the same quantile calculated on the whole data set. \(n_i\) is the number of observations in the \(i\)’th category, and \(M\) is the hyperparameter that controls the amount of regularization.
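The formula above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the {category_encoders} implementation: the helper names `quantile()` and `quantile_encode()` are hypothetical, the quantile uses linear interpolation between order statistics (one of several common conventions), and the data are made up.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics (assumed convention)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def quantile_encode(categories, targets, q=0.5, m=1.0):
    """Return {category: QE_i} following the regularized quantile formula."""
    # Collect the target values observed in each category
    groups = {}
    for cat, y in zip(categories, targets):
        groups.setdefault(cat, []).append(y)

    # q(whole): the quantile of the target over the whole data set
    q_whole = quantile(targets, q)

    # QE_i = (q(category_i) * n_i + q(whole) * M) / (n_i + M)
    return {
        cat: (quantile(ys, q) * len(ys) + q_whole * m) / (len(ys) + m)
        for cat, ys in groups.items()
    }


# Hypothetical data: category "b" has only one, fairly extreme, observation
categories = ["a", "a", "a", "a", "b"]
targets = [10, 20, 30, 40, 100]

encoding = quantile_encode(categories, targets, q=0.5, m=1.0)
print(encoding)
```

With the median and \(M = 1\), the single-observation category "b" is pulled from its own median of 100 toward the overall median of 30, while the larger category "a" stays close to its own median of 25.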

In essence, we have two hyperparameters for this style of encoding. One is \(M\), which we very much have to tune, and the other is the quantile of choice. We could set the quantile to a specific value, such as 0.5 for the median, but tuning it is likely to give better results. This is again a trade-off between computational time and performance.
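To see why \(M\) needs tuning, the sketch below applies the formula to a single small category for a few values of \(M\). The function name `shrink()` and all numbers are hypothetical and only serve to show the direction of the effect.

```python
def shrink(q_category, n, q_whole, m):
    """Regularized encoding: (q(category) * n + q(whole) * M) / (n + M)."""
    return (q_category * n + q_whole * m) / (n + m)


# Hypothetical small category: 2 observations with a median of 100,
# while the median of the whole data set is 30
m_values = [0, 1, 10, 100]
encodings = [shrink(q_category=100, n=2, q_whole=30, m=m) for m in m_values]

for m, value in zip(m_values, encodings):
    print(f"M = {m:>3}: encoding = {value:.2f}")
```

With \(M = 0\) the encoding is just the category's own quantile, and as \(M\) grows the encoding moves toward the quantile of the whole data set. Too little regularization trusts noisy small groups, and too much erases the differences between categories.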

33.2 Pros and Cons

33.2.1 Pros

  • less sensitive to outliers than target encoding

33.2.2 Cons

  • has hyperparameters in need of tuning

33.3 R Examples

Has not yet been implemented.

See https://github.com/EmilHvitfeldt/feature-engineering-az/issues/40 for progress.

33.4 Python Examples

We are using the ames data set for examples. {category_encoders} provides the QuantileEncoder() method we can use.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.quantile_encoder import QuantileEncoder

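# Quantile-encode MS_Zoning and pass all other columns through unchanged
ct = ColumnTransformer(
    [('quantile', QuantileEncoder(), ['MS_Zoning'])],
    remainder="passthrough")

# The encoder is supervised, so the outcome Sale_Price is supplied as y
ct.fit(ames, y=ames[["Sale_Price"]].values.flatten())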
ColumnTransformer(remainder='passthrough',
                  transformers=[('quantile', QuantileEncoder(), ['MS_Zoning'])])
ct.transform(ames)
      quantile__MS_Zoning  ... remainder__Latitude
0              171994.723  ...              42.054
1              140714.286  ...              42.053
2              171994.723  ...              42.053
3              171994.723  ...              42.051
4              171994.723  ...              42.061
...                   ...  ...                 ...
2925           171994.723  ...              41.989
2926           171994.723  ...              41.988
2927           171994.723  ...              41.987
2928           171994.723  ...              41.991
2929           171994.723  ...              41.989

[2930 rows x 74 columns]