93  Down-Sampling

Downsampling is one of the conceptually simplest ways to handle imbalanced data. Like the other methods in this section, it is a supervised method: it needs the outcome to know which class each observation belongs to. This also means that it can only be applied to the training data set. See also up-sampling, which is the opposite action.

The algorithm is quite simple. Tally the number of observations within each class, keeping track of which observations belong to which class. The class with the fewest observations is denoted the minority class. The remaining classes are denoted the majority classes. The observations of each majority class are then sampled without replacement to reduce the number of observations in that class, typically until it matches the number of observations in the minority class.
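The steps above can be sketched in a few lines of base R. This is a minimal illustration, not the {themis} implementation; the helper name downsample_to_minority() and the toy data are made up for the example.

```r
# Hypothetical helper: downsample every class to the size of the smallest class.
downsample_to_minority <- function(data, outcome) {
  # Row indices grouped by class (unused factor levels are dropped)
  idx_by_class <- split(seq_len(nrow(data)), data[[outcome]], drop = TRUE)

  # The minority class sets the target size for every class
  target <- min(lengths(idx_by_class))

  # Sample `target` rows per class without replacement.
  # sample.int() on positions avoids sample()'s surprise on length-1 vectors.
  keep <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = target)]
  })

  data[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

set.seed(1234)
toy <- data.frame(
  x = rnorm(100),
  class = factor(rep(c("a", "b", "c"), times = c(70, 25, 5)))
)

toy_down <- downsample_to_minority(toy, "class")
table(toy_down$class)
```

Each class in toy_down ends up with 5 rows, the size of the minority class "c".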

One could also modify this to use a different threshold, say 300%. Every majority class with more than 300% of the number of observations of the minority class would then be downsampled until it has at most 300% of that number.
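As a rough sketch of the threshold variant, the base R function below caps each class at a multiple of the minority class size. The name downsample_with_ratio() and its ratio argument are hypothetical and only serve this example; ratio = 3 corresponds to the 300% threshold.

```r
# Hypothetical helper: cap each class at `ratio` times the minority class size.
downsample_with_ratio <- function(data, outcome, ratio = 1) {
  idx_by_class <- split(seq_len(nrow(data)), data[[outcome]], drop = TRUE)

  # Largest number of rows any class is allowed to keep
  cap <- floor(min(lengths(idx_by_class)) * ratio)

  keep <- lapply(idx_by_class, function(idx) {
    # Classes at or below the cap are left untouched
    if (length(idx) <= cap) {
      return(idx)
    }
    idx[sample.int(length(idx), size = cap)]
  })

  data[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

set.seed(1234)
toy <- data.frame(
  class = factor(rep(c("a", "b", "c"), times = c(70, 25, 5)))
)

toy_down <- downsample_with_ratio(toy, "class", ratio = 3)
table(toy_down$class)
```

With a minority class of 5 rows and ratio = 3, classes "a" and "b" are reduced to 15 rows each, while "c" keeps its 5 rows.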

When using downsampling we are deleting data completely, and there is a definite risk that we throw away signal when we do this. The loss becomes more drastic the sharper the imbalance between the majority and minority classes. With a 99-1 split, balancing the classes would remove 98% of the data set, which is rarely a wise choice.
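To make the arithmetic concrete, here is a small sketch for the two-class case, assuming the majority class is downsampled all the way to the size of the minority class. The function name fraction_removed() is made up for this illustration.

```r
# Fraction of rows discarded when a two-class data set is fully balanced.
# After downsampling, both classes have the minority's share of the original
# rows, so 2 * minority_share is kept and the rest is removed.
fraction_removed <- function(minority_share) {
  1 - 2 * minority_share
}

minority_share <- c(0.40, 0.25, 0.10, 0.01)
data.frame(
  minority_share = minority_share,
  removed = fraction_removed(minority_share)
)
```

A 60-40 split loses 20% of the rows, a 90-10 split loses 80%, and the 99-1 split loses 98%.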

93.2 Pros and Cons

93.2.1 Pros

  • Computationally fast and simple.

93.2.2 Cons

  • Can cause problems when the imbalance is large, since we end up removing almost all of the data.
  • Solves tasks that could be better handled with other methods.

93.3 R Examples

We will be using the ames data set for these examples.

library(recipes)
library(themis)
library(modeldata)

ames
# A tibble: 2,930 × 74
   MS_SubClass            MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
 * <fct>                  <fct>            <dbl>    <int> <fct>  <fct> <fct>    
 1 One_Story_1946_and_Ne… Resident…          141    31770 Pave   No_A… Slightly…
 2 One_Story_1946_and_Ne… Resident…           80    11622 Pave   No_A… Regular  
 3 One_Story_1946_and_Ne… Resident…           81    14267 Pave   No_A… Slightly…
 4 One_Story_1946_and_Ne… Resident…           93    11160 Pave   No_A… Regular  
 5 Two_Story_1946_and_Ne… Resident…           74    13830 Pave   No_A… Slightly…
 6 Two_Story_1946_and_Ne… Resident…           78     9978 Pave   No_A… Slightly…
 7 One_Story_PUD_1946_an… Resident…           41     4920 Pave   No_A… Regular  
 8 One_Story_PUD_1946_an… Resident…           43     5005 Pave   No_A… Slightly…
 9 One_Story_PUD_1946_an… Resident…           39     5389 Pave   No_A… Slightly…
10 Two_Story_1946_and_Ne… Resident…           60     7500 Pave   No_A… Regular  
# ℹ 2,920 more rows
# ℹ 67 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#   Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#   Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#   Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, …

{themis} provides step_downsample(), which implements down-sampling.

ames |>
  count(MS_Zoning)
# A tibble: 7 × 2
  MS_Zoning                        n
  <fct>                        <int>
1 Floating_Village_Residential   139
2 Residential_High_Density        27
3 Residential_Low_Density       2273
4 Residential_Medium_Density     462
5 A_agr                            2
6 C_all                           25
7 I_all                            2

# Downsample every class of MS_Zoning to the size of the smallest class
downsample_rec <- recipe(MS_Zoning ~ ., data = ames) |>
  step_downsample(MS_Zoning)

# prep() estimates the recipe; bake(new_data = NULL) returns the processed training data
downsample_res <- downsample_rec |>
  prep() |>
  bake(new_data = NULL)

downsample_res
# A tibble: 14 × 74
   MS_SubClass         Lot_Frontage Lot_Area Street Alley Lot_Shape Land_Contour
   <fct>                      <dbl>    <int> <fct>  <fct> <fct>     <fct>       
 1 One_Story_1946_and…           73     7321 Pave   Paved Slightly… Lvl         
 2 Two_Story_PUD_1946…           30     3182 Pave   Paved Regular   Lvl         
 3 One_Story_1945_and…           70     4270 Pave   No_A… Regular   Bnk         
 4 One_Story_PUD_1946…           33     4113 Pave   No_A… Slightly… Lvl         
 5 Split_Foyer                   54     7244 Pave   No_A… Regular   Lvl         
 6 One_Story_1946_and…            0    36500 Pave   No_A… Slightly… Low         
 7 One_Story_1946_and…           70    12702 Pave   No_A… Regular   Lvl         
 8 One_and_Half_Story…           51     6120 Pave   No_A… Regular   Lvl         
 9 One_Story_1946_and…           80    14584 Pave   No_A… Regular   Low         
10 One_Story_1946_and…          125    31250 Pave   No_A… Regular   Lvl         
11 One_Story_1945_and…          120    18000 Grvl   No_A… Regular   Low         
12 Two_Story_1945_and…            0     6449 Pave   No_A… Slightly… Lvl         
13 One_Story_1945_and…          109    21780 Grvl   No_A… Regular   Lvl         
14 Two_Story_1945_and…            0    56600 Pave   No_A… Slightly… Low         
# ℹ 67 more variables: Utilities <fct>, Lot_Config <fct>, Land_Slope <fct>,
#   Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>,
#   House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#   Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
#   Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, …

downsample_res |>
  count(MS_Zoning)
# A tibble: 7 × 2
  MS_Zoning                        n
  <fct>                        <int>
1 Floating_Village_Residential     2
2 Residential_High_Density         2
3 Residential_Low_Density          2
4 Residential_Medium_Density       2
5 A_agr                            2
6 C_all                            2
7 I_all                            2

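The under_ratio argument lets us downsample only partially. Each class is reduced to at most under_ratio times the number of observations in the minority class, so with under_ratio = 10 and a minority class of 2 observations, the larger classes are capped at 20.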

# Cap each class at 10 times the size of the smallest class
downsample_rec <- recipe(MS_Zoning ~ ., data = ames) |>
  step_downsample(MS_Zoning, under_ratio = 10)

downsample_rec |>
  prep() |>
  bake(new_data = NULL) |>
  count(MS_Zoning)
# A tibble: 7 × 2
  MS_Zoning                        n
  <fct>                        <int>
1 Floating_Village_Residential    20
2 Residential_High_Density        20
3 Residential_Low_Density         20
4 Residential_Medium_Density      20
5 A_agr                            2
6 C_all                           20
7 I_all                            2
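Since downsampling should only touch the training data, it is worth checking what happens when the prepped recipe is applied to other data. The sketch below relies on step_downsample() having skip = TRUE as its default, so the step runs for bake(new_data = NULL) but is skipped when new data is supplied. Passing ames back in as "new" data is only for illustration, and the object names are made up for this example.

```r
library(recipes)
library(themis)
library(modeldata)

# Prep the recipe once on the training data
downsample_prepped <- recipe(MS_Zoning ~ ., data = ames) |>
  step_downsample(MS_Zoning) |>
  prep()

# Training data: the downsampling step is applied
rows_training <- nrow(bake(downsample_prepped, new_data = NULL))

# New data: the step is skipped, so no rows are dropped
rows_new <- nrow(bake(downsample_prepped, new_data = ames))

c(training = rows_training, new = rows_new)
```

The second count should match nrow(ames), which shows that data passed at prediction time is left untouched.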

93.4 Python Examples