6  Percentile Scaling

Percentile scaling (also sometimes called rank scaling or quantile scaling) is a method that applies a non-linear transformation to our data, replacing each value with its percentile in the training data.

Note

The words percentile and quantile describe the same thing; the difference is the representation. Percentiles are reported as percentages, such as 10%, 20%, and 30%, while quantiles are reported as decimals, such as 0.1, 0.2, and 0.3. This chapter uses the two words interchangeably.

TODO

Add equation

This does a couple of things for us. It naturally constrains the transformed data to the range \([0, 1]\), and it handles outlier values nicely in the sense that they don’t change the transformation much. Moreover, if the testing distribution is close to the training distribution, then the transformed data will be approximately uniformly distributed between 0 and 1.
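
One common way to write the transformation, assuming it is defined through the empirical CDF of the training values \(x_1, \dots, x_n\), is

\[
\tilde{x} = \hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(x_i \leq x)
\]

where \(\mathbf{1}(\cdot)\) is the indicator function. Implementations typically work from a fixed grid of training percentiles rather than evaluating this step function directly.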

TODO

add a color-coded pair of distributions, where the colors map to the previous locations of the data

6.2 Pros and Cons

6.2.1 Pros

  • Transformation isn’t affected much by outliers

6.2.2 Cons

  • Doesn’t allow for an exact reverse transformation (see the sketch after this list)

  • Isn’t ideal if the training data doesn’t have many unique values
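
To make the first con concrete, here is a minimal sketch in base R (using made-up numbers, not the internals of any particular implementation) of why the transformation cannot be exactly reversed: the empirical CDF is a step function, so distinct new values can end up with the same percentile.

train_values <- c(1, 2, 4, 8, 16)
percentile <- ecdf(train_values)

# 4.1 and 7.9 both fall between the training values 4 and 8,
# so they receive the exact same percentile
percentile(c(4.1, 7.9))
[1] 0.6 0.6

Since both inputs map to 0.6, no inverse mapping can tell them apart. Real implementations work from a finer stored grid of percentiles, but ties in the training data and values outside the training range lose information in the same way.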

6.3 R Examples

We will be using the ames data set for these examples.

library(recipes)
library(modeldata)
data("ames")

ames |>
  select(Lot_Area, Wood_Deck_SF, Sale_Price)
# A tibble: 2,930 × 3
   Lot_Area Wood_Deck_SF Sale_Price
      <int>        <int>      <int>
 1    31770          210     215000
 2    11622          140     105000
 3    14267          393     172000
 4    11160            0     244000
 5    13830          212     189900
 6     9978          360     195500
 7     4920            0     213500
 8     5005            0     191500
 9     5389          237     236500
10     7500          140     189000
# ℹ 2,920 more rows

The {recipes} step to do this transformation is step_percentile(). It defaults to calculating 100 percentiles and uses those to transform the data.

percentile_rec <- recipe(Sale_Price ~ Lot_Area, data = ames) |>
  step_percentile(Lot_Area) |>
  prep()

percentile_rec |>
  bake(new_data = NULL)
# A tibble: 2,930 × 2
   Lot_Area Sale_Price
      <dbl>      <int>
 1    0.989     215000
 2    0.756     105000
 3    0.898     172000
 4    0.717     244000
 5    0.883     189900
 6    0.580     195500
 7    0.104     213500
 8    0.106     191500
 9    0.120     236500
10    0.259     189000
# ℹ 2,920 more rows

We can use the tidy() method to pull out the specific values for each percentile.

percentile_rec |>
  tidy(1)
# A tibble: 99 × 4
   terms    value percentile id              
   <chr>    <dbl>      <dbl> <chr>           
 1 Lot_Area 1300           0 percentile_Bp5vK
 2 Lot_Area 1680           1 percentile_Bp5vK
 3 Lot_Area 2040.          2 percentile_Bp5vK
 4 Lot_Area 2362.          3 percentile_Bp5vK
 5 Lot_Area 2779.          4 percentile_Bp5vK
 6 Lot_Area 3188.          5 percentile_Bp5vK
 7 Lot_Area 3674.          6 percentile_Bp5vK
 8 Lot_Area 3901.          7 percentile_Bp5vK
 9 Lot_Area 4122.          8 percentile_Bp5vK
10 Lot_Area 4435           9 percentile_Bp5vK
# ℹ 89 more rows

You can change the granularity by using the options argument. In this example, we are calculating percentiles at the 501 evenly spaced points between 0 and 1 (both inclusive) given by (0:500)/500.

percentile500_rec <- recipe(Sale_Price ~ Lot_Area, data = ames) |>
  step_percentile(Lot_Area, options = list(probs = (0:500)/500)) |>
  prep()

percentile500_rec |>
  bake(new_data = NULL)
# A tibble: 2,930 × 2
   Lot_Area Sale_Price
      <dbl>      <int>
 1    0.989     215000
 2    0.755     105000
 3    0.899     172000
 4    0.717     244000
 5    0.884     189900
 6    0.580     195500
 7    0.103     213500
 8    0.106     191500
 9    0.118     236500
10    0.254     189000
# ℹ 2,920 more rows

And we can see the more precise numbers.

percentile500_rec |>
  tidy(1)
# A tibble: 457 × 4
   terms    value percentile id              
   <chr>    <dbl>      <dbl> <chr>           
 1 Lot_Area 1300         0   percentile_RUieL
 2 Lot_Area 1487.        0.2 percentile_RUieL
 3 Lot_Area 1531.        0.4 percentile_RUieL
 4 Lot_Area 1605.        0.6 percentile_RUieL
 5 Lot_Area 1680         0.8 percentile_RUieL
 6 Lot_Area 1879.        1.4 percentile_RUieL
 7 Lot_Area 1890         1.6 percentile_RUieL
 8 Lot_Area 1946.        1.8 percentile_RUieL
 9 Lot_Area 2040.        2   percentile_RUieL
10 Lot_Area 2136.        2.2 percentile_RUieL
# ℹ 447 more rows

Notice how there are only 457 values in this output. This happens because some percentiles have been collapsed to save space: if the values for the 10.4 and 10.6 percentiles are the same, we just store the 10.6 value.
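
To get a rough sense of where this collapsing comes from, we can count how many of the 501 requested quantiles of Lot_Area are distinct. This is a sketch using quantile() directly, not the exact bookkeeping step_percentile() performs, so the count need not match the 457 rows above exactly.

probs <- (0:500) / 500
quants <- quantile(ames$Lot_Area, probs = probs)

# duplicated quantile values are the ones that can be collapsed
sum(!duplicated(quants))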

6.4 Python Examples

We are using the ames data set for these examples. {sklearn} provides the QuantileTransformer() class, which we can use for this transformation. The n_quantiles argument changes how many quantiles are used.

from feazdata import ames
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer

ct = ColumnTransformer(
    [('Quantile', QuantileTransformer(n_quantiles = 500), ['Lot_Area'])], 
    remainder="passthrough")

ct.fit(ames)
ColumnTransformer(remainder='passthrough',
                  transformers=[('Quantile',
                                 QuantileTransformer(n_quantiles=500),
                                 ['Lot_Area'])])
ct.transform(ames)
      Quantile__Lot_Area  ... remainder__Latitude
0                  0.989  ...              42.054
1                  0.755  ...              42.053
2                  0.899  ...              42.053
3                  0.717  ...              42.051
4                  0.884  ...              42.061
...                  ...  ...                 ...
2925               0.300  ...              41.989
2926               0.427  ...              41.988
2927               0.639  ...              41.987
2928               0.586  ...              41.991
2929               0.536  ...              41.989

[2930 rows x 74 columns]