10  Robust Scaling

Robust scaling is a scaling method that typically removes the median and divides by the interquartile range, as shown in Equation 10.1.

\[ X_{scaled} = \dfrac{X - \text{median}(X)}{\text{Q3}(X) - \text{Q1}(X)} \tag{10.1}\]

This is the most common formulation of the method. It is a learned transformation: we use the training data to derive the values of \(\text{Q3}(X)\), \(\text{Q1}(X)\), and \(\text{median}(X)\), and these values are then used to transform new data. You are not bound to use Q1 and Q3; any quantiles can be used, and most software implementations let you modify the range. It is typically recommended that you pick a symmetric range like \([0.1, 0.9]\) or \([0.3, 0.7]\) unless you have a good reason why high or low observations should be excluded.
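To make the "learned transformation" idea concrete, here is a minimal sketch in plain Python, not the implementation used by any particular package. The helper names `fit_robust` and `apply_robust` are hypothetical, and the quantiles are assumed to use linear interpolation (`method="inclusive"` in the standard library), which may differ slightly from what other software does.

```python
from statistics import median, quantiles


def fit_robust(train, n=4):
    """Learn the median and the outer cut points from the training data.

    n=4 yields Q1 and Q3 (range [0.25, 0.75]); n=10 yields [0.1, 0.9].
    """
    cuts = quantiles(train, n=n, method="inclusive")
    return {"median": median(train), "lower": cuts[0], "higher": cuts[-1]}


def apply_robust(values, params):
    """Scale values with statistics that were learned earlier (Equation 10.1)."""
    spread = params["higher"] - params["lower"]
    return [(v - params["median"]) / spread for v in values]


train = [1, 2, 3, 4, 5, 6, 7, 8, 9]
new_data = [5, 10, 100]

# The statistics come from the training data only ...
params = fit_robust(train)
# ... and are reused unchanged when new data arrives.
scaled_new = apply_robust(new_data, params)
print(params)
print(scaled_new)
```

Passing `n=10` to `fit_robust()` would switch to the symmetric \([0.1, 0.9]\) range without changing anything else.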

This method is normally showcased as a way to scale variables with outliers in them. That is true only insofar as the outliers fall outside the chosen quantile range. With the default range, the scaling statistics are determined by the middle 50% of the observations; values outside that range influence them only through their rank, not their magnitude. This is fine if you want to ignore the outliers. However, it is conventionally not a good idea to ignore outliers outright, so you might want to investigate them before you throw away the information they carry.
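The effect of an outlier on the scaling statistics can be illustrated with a small, made-up data set. This is only a sketch using the Python standard library; the numbers are invented for illustration and the `iqr()` helper is hypothetical.

```python
from statistics import mean, median, quantiles


def iqr(values):
    """Interquartile range with linearly interpolated quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Replace the largest value with an extreme one.
with_outlier = clean[:-1] + [900]

# The median and IQR are identical for both versions of the data ...
print(median(clean), iqr(clean))
print(median(with_outlier), iqr(with_outlier))

# ... while the mean is pulled strongly towards the outlier.
print(mean(clean), mean(with_outlier))

# The outlier itself is still present after scaling, and still extreme.
scaled_outlier = (900 - median(with_outlier)) / iqr(with_outlier)
print(scaled_outlier)
```

Note that the scaling leaves the outlier in the data; it only prevents the outlier from influencing the statistics used to scale everything else.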

10.2 Pros and Cons

10.2.1 Pros

  • The scaling statistics aren’t affected by outliers that fall outside the quantile range
  • The transformation can easily be reversed, which makes it easier to interpret results on the original scale
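Reversing the transformation amounts to multiplying by the quantile range and adding the median back. A minimal round-trip sketch, assuming made-up values and linearly interpolated quartiles from the Python standard library:

```python
from statistics import median, quantiles

values = [2.0, 4.0, 6.0, 8.0, 10.0]

q1, _, q3 = quantiles(values, n=4, method="inclusive")
center, spread = median(values), q3 - q1

# Forward: subtract the median, divide by Q3 - Q1.
scaled = [(v - center) / spread for v in values]

# Reverse: multiply by Q3 - Q1, add the median back.
restored = [s * spread + center for s in scaled]

print(scaled)
print(restored)
```

The only requirement is that the learned median and quantiles are kept around after fitting.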

10.2.2 Cons

  • Completely ignores the part of the data outside the quantile range
  • Doesn’t work with near-zero variance data, as Q3(x) - Q1(x) = 0 yields a division by zero
  • Cannot be used with sparse data, as subtracting the median does not preserve sparsity
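The last two cons can be seen on small, made-up vectors. The sketch below uses a hypothetical `robust_scale()` helper built on the Python standard library. Raising an error when the range is zero is just one possible design choice, not necessarily what any given package does.

```python
from statistics import median, quantiles


def robust_scale(values):
    """Scale by median and Q3 - Q1, refusing to divide by zero."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    if spread == 0:
        raise ValueError("Q3 - Q1 is zero; robust scaling is undefined")
    center = median(values)
    return [(v - center) / spread for v in values]


# Near-zero variance: Q1 and Q3 are both 0, so the range is 0.
mostly_constant = [0, 0, 0, 0, 0, 0, 0, 5, 12]
try:
    robust_scale(mostly_constant)
    zero_spread = False
except ValueError:
    zero_spread = True
print(zero_spread)

# Sparsity: the median is 2, so the original zeros become non-zero.
sparse_like = [0, 0, 0, 1, 2, 3, 4, 5, 6]
scaled = robust_scale(sparse_like)
zeros_before = sparse_like.count(0)
zeros_after = scaled.count(0)
print(zeros_before, zeros_after)
```

Zeros survive the transformation only when the median itself is zero, which is why sparsity cannot be relied on in general.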

10.3 R Examples

We will be using the ames data set for these examples.

# remotes::install_github("emilhvitfeldt/extrasteps")
library(recipes)
library(extrasteps)
library(modeldata)
data("ames")

ames |>
  select(Sale_Price, Lot_Area, Wood_Deck_SF, Mas_Vnr_Area)
# A tibble: 2,930 × 4
   Sale_Price Lot_Area Wood_Deck_SF Mas_Vnr_Area
        <int>    <int>        <int>        <dbl>
 1     215000    31770          210          112
 2     105000    11622          140            0
 3     172000    14267          393          108
 4     244000    11160            0            0
 5     189900    13830          212            0
 6     195500     9978          360           20
 7     213500     4920            0            0
 8     191500     5005            0            0
 9     236500     5389          237            0
10     189000     7500          140            0
# ℹ 2,920 more rows

We will be using the step_robust() step for this, and it can be found in the extrasteps extension package.

maxabs_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_robust(all_numeric_predictors()) |>
  prep()

maxabs_rec |>
  bake(new_data = NULL, Sale_Price, Lot_Area, Wood_Deck_SF, Mas_Vnr_Area)
# A tibble: 2,930 × 4
   Sale_Price Lot_Area Wood_Deck_SF Mas_Vnr_Area
        <int>    <dbl>        <dbl>        <dbl>
 1     215000    5.43         1.25         0.688
 2     105000    0.531        0.833        0    
 3     172000    1.17         2.34         0.664
 4     244000    0.419        0            0    
 5     189900    1.07         1.26         0    
 6     195500    0.132        2.14         0.123
 7     213500   -1.10         0            0    
 8     191500   -1.08         0            0    
 9     236500   -0.984        1.41         0    
10     189000   -0.471        0.833        0    
# ℹ 2,920 more rows

We can also pull out the lower quantile, median, and higher quantile that were learned for each variable using tidy().

maxabs_rec |>
  tidy(1)
# A tibble: 99 × 4
   terms          statistic  value id          
   <chr>          <chr>      <dbl> <chr>       
 1 Lot_Frontage   lower        43  robust_Bp5vK
 2 Lot_Frontage   median       63  robust_Bp5vK
 3 Lot_Frontage   higher       78  robust_Bp5vK
 4 Lot_Area       lower      7440. robust_Bp5vK
 5 Lot_Area       median     9436. robust_Bp5vK
 6 Lot_Area       higher    11555. robust_Bp5vK
 7 Year_Built     lower      1954  robust_Bp5vK
 8 Year_Built     median     1973  robust_Bp5vK
 9 Year_Built     higher     2001  robust_Bp5vK
10 Year_Remod_Add lower      1965  robust_Bp5vK
# ℹ 89 more rows
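As a rough sanity check, the first scaled Lot_Area value can be reproduced by hand from these statistics. The sketch below plugs in the values as printed above; they are rounded for display, so the result is only expected to agree with the baked output to a couple of decimals.

```python
# Statistics for Lot_Area as printed by tidy() (rounded for display).
lower, med, higher = 7440, 9436, 11555

# Lot_Area of the first row in the ames data shown earlier.
lot_area = 31770

# Equation 10.1 with the learned values plugged in.
scaled = (lot_area - med) / (higher - lower)
print(round(scaled, 2))  # compare with 5.43 in the baked output
```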

We can also change the default range to allow more of the distribution to affect the calculations. This is done using the range argument.

maxabs_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_robust(all_numeric_predictors(), range = c(0.1, 0.9)) |>
  prep()

maxabs_rec |>
  bake(new_data = NULL, Sale_Price, Lot_Area, Wood_Deck_SF, Mas_Vnr_Area)
# A tibble: 2,930 × 4
   Sale_Price Lot_Area Wood_Deck_SF Mas_Vnr_Area
        <int>    <dbl>        <dbl>        <dbl>
 1     215000   2.35          0.820       0.350 
 2     105000   0.230         0.547       0     
 3     172000   0.509         1.53        0.337 
 4     244000   0.181         0           0     
 5     189900   0.463         0.828       0     
 6     195500   0.0570        1.41        0.0625
 7     213500  -0.475         0           0     
 8     191500  -0.467         0           0     
 9     236500  -0.426         0.925       0     
10     189000  -0.204         0.547       0     
# ℹ 2,920 more rows

When we pull out the ranges, we see that they are wider.

maxabs_rec |>
  tidy(1)
# A tibble: 99 × 4
   terms          statistic  value id          
   <chr>          <chr>      <dbl> <chr>       
 1 Lot_Frontage   lower         0  robust_RUieL
 2 Lot_Frontage   median       63  robust_RUieL
 3 Lot_Frontage   higher       91  robust_RUieL
 4 Lot_Area       lower      4800  robust_RUieL
 5 Lot_Area       median     9436. robust_RUieL
 6 Lot_Area       higher    14299. robust_RUieL
 7 Year_Built     lower      1925. robust_RUieL
 8 Year_Built     median     1973  robust_RUieL
 9 Year_Built     higher     2006  robust_RUieL
10 Year_Remod_Add lower      1950  robust_RUieL
# ℹ 89 more rows

10.4 Python Examples

We are using the ames data set for these examples. {sklearn} provides the RobustScaler() transformer, which we can use for this.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler

ct = ColumnTransformer(
    [('robust', RobustScaler(), ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF',  'Mas_Vnr_Area'])], 
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('robust', RobustScaler(),
                                 ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF',
                                  'Mas_Vnr_Area'])])
ct.transform(ames)
      robust__Sale_Price  ...  remainder__Latitude
0                  0.655  ...               42.054
1                 -0.655  ...               42.053
2                  0.143  ...               42.053
3                  1.000  ...               42.051
4                  0.356  ...               42.061
...                  ...  ...                  ...
2925              -0.208  ...               41.989
2926              -0.345  ...               41.988
2927              -0.333  ...               41.987
2928               0.119  ...               41.991
2929               0.333  ...               41.989

[2930 rows x 74 columns]

We can also change the default range, which {sklearn} expresses in percentiles as (25.0, 75.0), to allow more of the distribution to affect the calculations. This is done using the quantile_range argument.

ct = ColumnTransformer(
    [('robust', RobustScaler(quantile_range=(10.0, 90.0)), ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF',  'Mas_Vnr_Area'])], 
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('robust',
                                 RobustScaler(quantile_range=(10.0, 90.0)),
                                 ['Sale_Price', 'Lot_Area', 'Wood_Deck_SF',
                                  'Mas_Vnr_Area'])])
ct.transform(ames)
      robust__Sale_Price  ...  remainder__Latitude
0                  0.313  ...               42.054
1                 -0.313  ...               42.053
2                  0.068  ...               42.053
3                  0.478  ...               42.051
4                  0.170  ...               42.061
...                  ...  ...                  ...
2925              -0.100  ...               41.989
2926              -0.165  ...               41.988
2927              -0.159  ...               41.987
2928               0.057  ...               41.991
2929               0.159  ...               41.989

[2930 rows x 74 columns]