5  Yeo-Johnson

You have likely heard a lot of talk about having normally distributed predictors. In practice, few methods actually make this assumption, and having a fairly symmetric, non-skewed predictor is often enough. Linear Discriminant Analysis assumes Gaussian data, and that is about it (TODO add a reference here). Still, it is worthwhile to have more symmetric predictors, and this is where the Yeo-Johnson transformation comes into play.

This method is very similar to the Box-Cox method in Chapter 4, except it doesn’t have the restriction that the variable \(x\) needs to be positive.

It works by using maximum likelihood estimation to estimate a transformation parameter \(\lambda\) in the following equation, chosen so that the normality of \(x^*\) is maximized:

\[ x^* = \left\{ \begin{array}{ll} \dfrac{(x + 1) ^ \lambda - 1}{\lambda} & \lambda \neq 0, x \geq 0 \\ \log(x + 1) & \lambda = 0, x \geq 0 \\ - \dfrac{(-x + 1) ^ {2 - \lambda} - 1}{2 - \lambda} & \lambda \neq 2, x < 0 \\ - \log(-x + 1) & \lambda = 2, x < 0 \end{array} \right. \]

It is worth noting again that what we are optimizing over is the value of \(\lambda\). When used on the predictors, this is also a trained preprocessing method: we estimate \(\lambda\) on the training data set, then use that estimate to apply the transformation to both the training and test data sets, to avoid data leakage.

If the values of \(x\) are strictly positive, the Yeo-Johnson transformation is the same as the Box-Cox transformation of \(x + 1\). If the values of \(x\) are strictly negative, it is the Box-Cox transformation of \(-x + 1\) with the power \(2 - \lambda\), with the sign flipped. The interpretation of \(\lambda\) isn’t as easy as it is for the Box-Cox method.
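To make the piecewise formula concrete, here is a minimal base R sketch of the transformation for a fixed \(\lambda\); the function name yeo_johnson() and the example values are made up for illustration, and in practice \(\lambda\) is estimated for you by the preprocessing step.

# A minimal sketch of the Yeo-Johnson formula for a fixed lambda
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0

  # lambda != 0, x >= 0: ((x + 1)^lambda - 1) / lambda
  # lambda == 0, x >= 0: log(x + 1)
  if (lambda != 0) {
    out[pos] <- ((x[pos] + 1) ^ lambda - 1) / lambda
  } else {
    out[pos] <- log(x[pos] + 1)
  }

  # lambda != 2, x < 0: -((-x + 1)^(2 - lambda) - 1) / (2 - lambda)
  # lambda == 2, x < 0: -log(-x + 1)
  if (lambda != 2) {
    out[!pos] <- -(((-x[!pos] + 1) ^ (2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log(-x[!pos] + 1)
  }

  out
}

# For the non-negative values this matches Box-Cox applied to x + 1
yeo_johnson(c(-5, -1, 0, 1, 5), lambda = 0.5)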

Let us see some examples of Yeo-Johnson at work. Below are three different simulated distributions, before and after they have been transformed with Yeo-Johnson.

Figure: Before and After Yeo-Johnson

The original distributions have some left or right skewness, and the transformed columns look better in the sense that they are less skewed and fairly symmetric around the center. Are they perfectly normal? No, but these transformations might still be beneficial. We also notice that the method works even when there are negative values.

The Yeo-Johnson method isn’t magic; it will only give you something more normally distributed if applying the above formula can actually make the distribution more normal.

Figure: Before and After Yeo-Johnson

The first distribution here is uniformly random. The transformed version ends up slightly more skewed than the original because this method is not intended for this type of data. We see similar results with the bi-modal distributions.
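If you want to recreate this kind of comparison yourself, a sketch along the following lines works; the simulated columns, the seed, and the object names are made up for illustration, and the exact shapes you get will depend on the random draw.

library(recipes)

set.seed(1234)
sim_data <- data.frame(
  right_skewed = rlnorm(1000),
  uniform      = runif(1000),
  bimodal      = c(rnorm(500, mean = -3), rnorm(500, mean = 3))
)

# Estimate lambda for each column and apply the transformation
sim_rec <- recipe(~ ., data = sim_data) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  prep()

sim_transformed <- bake(sim_rec, new_data = NULL)

# Compare the distributions before and after, e.g. with histograms
hist(sim_data$right_skewed)
hist(sim_transformed$right_skewed)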

5.2 Pros and Cons

5.2.1 Pros

  • More flexible than individually chosen power transformations such as logarithms and square roots
  • Can handle negative values

5.2.2 Cons

  • Isn’t a universal fix

5.3 R Examples

We will be using the ames data set for these examples.

library(recipes)
library(modeldata)
data("ames")

ames |>
  select(Lot_Area, Wood_Deck_SF, Sale_Price)
# A tibble: 2,930 × 3
   Lot_Area Wood_Deck_SF Sale_Price
      <int>        <int>      <int>
 1    31770          210     215000
 2    11622          140     105000
 3    14267          393     172000
 4    11160            0     244000
 5    13830          212     189900
 6     9978          360     195500
 7     4920            0     213500
 8     5005            0     191500
 9     5389          237     236500
10     7500          140     189000
# ℹ 2,920 more rows

{recipes} provides step_YeoJohnson() to perform Yeo-Johnson transformations; the value of \(\lambda\) is estimated when the recipe is prepped and then used to apply the transformation.

yeojohnson_rec <- recipe(Sale_Price ~ Lot_Area, data = ames) |>
  step_YeoJohnson(Lot_Area) |>
  prep()

yeojohnson_rec |>
  bake(new_data = NULL)
# A tibble: 2,930 × 2
   Lot_Area Sale_Price
      <dbl>      <int>
 1     21.8     215000
 2     18.2     105000
 3     18.9     172000
 4     18.1     244000
 5     18.8     189900
 6     17.7     195500
 7     15.5     213500
 8     15.5     191500
 9     15.8     236500
10     16.8     189000
# ℹ 2,920 more rows

We can also pull out the value of the estimated \(\lambda\) by using the tidy() method on the recipe step.

yeojohnson_rec |>
  tidy(1)
# A tibble: 1 × 3
  terms    value id              
  <chr>    <dbl> <chr>           
1 Lot_Area 0.129 YeoJohnson_3gJXR
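Because \(\lambda\) should be estimated on the training data only, a typical workflow preps the recipe on the training set and then bakes both splits with that same estimate. The sketch below uses {rsample} for the split; the object names are only for illustration.

library(rsample)

set.seed(1234)
ames_split <- initial_split(ames)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# lambda is estimated on the training set only
yeojohnson_split_rec <- recipe(Sale_Price ~ Lot_Area, data = ames_train) |>
  step_YeoJohnson(Lot_Area) |>
  prep()

# The same estimated lambda is then applied to both splits
bake(yeojohnson_split_rec, new_data = ames_train)
bake(yeojohnson_split_rec, new_data = ames_test)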

5.4 Python Examples

We are using the ames data set for these examples. {feature_engine} provides the YeoJohnsonTransformer() that we can use.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from feature_engine.transformation import YeoJohnsonTransformer

ct = ColumnTransformer(
    [('yeojohnson', YeoJohnsonTransformer(), ['Lot_Area'])], 
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('yeojohnson', YeoJohnsonTransformer(),
                                 ['Lot_Area'])])
ct.transform(ames)
      yeojohnson__Lot_Area  ... remainder__Latitude
0                   21.823  ...              42.054
1                   18.218  ...              42.053
2                   18.915  ...              42.053
3                   18.082  ...              42.051
4                   18.808  ...              42.061
...                    ...  ...                 ...
2925                16.969  ...              41.989
2926                17.332  ...              41.988
2927                17.861  ...              41.987
2928                17.721  ...              41.991
2929                17.593  ...              41.989

[2930 rows x 74 columns]