31  M-Estimator Encoding

The M-estimator encoding method is another variation of Target Encoding. This page explains how M-estimator encoding differs from target encoding, so you are encouraged to read that chapter first.

The idea behind M-estimator encoding is the same as for the other target encoding methods, but it uses a different mean: the M-estimator, a statistical estimator that is less influenced by extreme values in the target.

We use the following formula to calculate the effect of each level.

\[ M_i = \dfrac{\text{count}(category_i) \cdot \text{mean}(category_i) + M \cdot \text{mean}(target)}{\text{count}(category_i) + M} \]

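Here \(\text{count}(category_i)\) is the number of observations in level \(i\), \(\text{mean}(category_i)\) is the mean of the target within that level, and \(\text{mean}(target)\) is the overall mean of the target. Note that the formula contains a hyperparameter \(M\). This value has to be tuned, and tuning it carelessly invites data leakage.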
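To make the formula above concrete, here is a minimal pure-Python sketch that applies it to a toy data set. The function name `m_estimate_encode` and the toy values are hypothetical and chosen only for illustration; this is not the implementation used by any particular library.

```python
from collections import defaultdict


def m_estimate_encode(categories, target, m=1.0):
    """Map each level to its M-estimate, following the formula above.

    Hypothetical helper for illustration only.
    """
    global_mean = sum(target) / len(target)

    # Per-level sum and count; note that count * mean equals the sum.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(categories, target):
        sums[level] += y
        counts[level] += 1

    return {
        level: (sums[level] + m * global_mean) / (counts[level] + m)
        for level in counts
    }


# Toy data: level "a" has three observations, level "b" has only one.
toy_categories = ["a", "a", "a", "b"]
toy_target = [10.0, 20.0, 30.0, 100.0]

encoding = m_estimate_encode(toy_categories, toy_target, m=1.0)
```

In this toy data the overall target mean is 40. With \(M = 1\), level "a" (mean 20) is encoded as 25 and level "b" (mean 100) is encoded as 70, so the level with a single observation moves further toward the overall mean. Larger values of \(M\) move both encodings closer to 40.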

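How much each level's mean is pulled toward the overall target mean depends entirely on \(M\). With \(M = 0\) the formula reduces to the plain level mean, so no shrinkage takes place and you run into the issues associated with a lack of shrinkage.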

31.2 Pros and Cons

31.2.1 Pros

  • Robust to extreme values in target

31.2.2 Cons

  • Has to be tuned

31.3 R Examples

Has not yet been implemented.

See https://github.com/EmilHvitfeldt/feature-engineering-az/issues/40 for progress.

31.4 Python Examples

We are using the ames data set for examples. {category_encoders} provides the MEstimateEncoder() class, which we can use.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.m_estimate import MEstimateEncoder

# Encode MS_Zoning with the M-estimate and pass all other columns through
ct = ColumnTransformer(
    [('mestimate', MEstimateEncoder(), ['MS_Zoning'])],
    remainder="passthrough")

# The encoder is supervised, so the target is supplied as y
ct.fit(ames, y=ames[["Sale_Price"]].values.flatten())
ColumnTransformer(remainder='passthrough',
                  transformers=[('mestimate', MEstimateEncoder(),
                                 ['MS_Zoning'])])
ct.transform(ames)
      mestimate__MS_Zoning  ... remainder__Latitude
0               191278.640  ...              42.054
1               138004.645  ...              42.053
2               191278.640  ...              42.053
3               191278.640  ...              42.051
4               191278.640  ...              42.061
...                    ...  ...                 ...
2925            191278.640  ...              41.989
2926            191278.640  ...              41.988
2927            191278.640  ...              41.987
2928            191278.640  ...              41.991
2929            191278.640  ...              41.989

[2930 rows x 74 columns]