87 Up-Sampling

87.1 Up-Sampling

Up-sampling is conceptually one of the simplest ways to handle imbalanced data. Like all other methods in this section, it is a supervised method, since it requires knowledge of the outcome. This also means that it can only be applied to the training data set, where the outcome is known. See also down-sampling, which is the opposite action.

The algorithm is quite simple. Tally the number of observations within each class, and keep track of which observations belong to each class. The class with the most observations is the majority class; the remaining classes are the minority classes. The observations of each minority class are then sampled with replacement to increase the number of observations in that class, typically until it matches the number of observations in the majority class.
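The steps above can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used by any package. The helper name `upsample_to_majority()` and the toy data are made up for this example.

```r
# Hypothetical helper: up-sample every class to the size of the majority class
upsample_to_majority <- function(data, outcome) {
  # Keep track of which rows belong to each class
  rows_by_class <- split(seq_len(nrow(data)), data[[outcome]], drop = TRUE)

  # The majority class determines the target size
  target <- max(lengths(rows_by_class))

  all_rows <- lapply(rows_by_class, function(rows) {
    n_extra <- target - length(rows)
    # Sample with replacement from the rows of this class only
    extra <- rows[sample.int(length(rows), n_extra, replace = TRUE)]
    # Original rows are always kept; the draws are added on top
    c(rows, extra)
  })

  data[unlist(all_rows, use.names = FALSE), , drop = FALSE]
}

set.seed(1234)
toy <- data.frame(
  class = factor(rep(c("a", "b", "c"), times = c(90, 8, 2))),
  x = seq_len(100)
)

toy_up <- upsample_to_majority(toy, "class")
table(toy_up$class)
```

Every original row is kept, and each minority class is padded with random duplicates until all three classes have 90 rows.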

One could also modify this to use a different threshold, say 80%. Every minority class with fewer than 80% of the observations of the majority class would then be up-sampled until it has at least 80% as many observations as the majority class.
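To make the threshold variant concrete, the sketch below computes the target size of each class for a given ratio. The helper `upsample_targets()` and the class counts are hypothetical, and rounding the target down is an assumption made for this example.

```r
# Hypothetical helper: target number of observations per class
upsample_targets <- function(counts, ratio = 1) {
  # Smallest size a class may have after up-sampling (rounded down by assumption)
  minimum <- floor(max(counts) * ratio)
  # Classes already at or above the minimum are left untouched
  pmax(counts, minimum)
}

counts <- c(a = 1000, b = 900, c = 500, d = 50)
targets <- upsample_targets(counts, ratio = 0.8)
targets
```

With a threshold of 80%, classes `c` and `d` are raised to 800 observations, while `a` and `b` are left unchanged because they already meet the threshold.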

This action can drastically increase the number of observations, depending on the number of minority classes and how small they are relative to the majority class. With full up-sampling, a 90-10 split would result in an 80% increase in data, and a 90-5-5 split would result in a 170% increase. No data is deleted by this method, as we are only adding rows that are duplicates of existing rows.
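The growth figures can be checked with a little arithmetic: after full up-sampling, every class has as many rows as the majority class. The helper `upsample_growth()` below is a hypothetical name used only for this illustration.

```r
# Hypothetical helper: relative increase in rows after full up-sampling
upsample_growth <- function(counts) {
  n_before <- sum(counts)
  # Every class ends up the size of the largest class
  n_after <- length(counts) * max(counts)
  (n_after - n_before) / n_before
}

growth_two   <- upsample_growth(c(90, 10))    # 90-10 split
growth_three <- upsample_growth(c(90, 5, 5))  # 90-5-5 split

# MS_Zoning class counts from the ames data set used in the R examples
growth_ames <- upsample_growth(c(139, 27, 2273, 462, 2, 25, 2))

c(growth_two, growth_three, growth_ames)
```

The first two values reproduce the 80% and 170% increases, and the ames zoning counts give an increase of roughly 443%, going from 2,930 to 15,911 rows.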

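You could think of this as a stochastic version of case weights without the compactness: a row that is duplicated several times acts like a row with a larger weight, but every copy is stored as its own row.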
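The connection to case weights can be seen by counting how often each minority row appears after up-sampling. The sketch below uses a made-up 90-10 split. The weight of 9 for minority rows is one natural choice for this illustration, not a prescribed one.

```r
set.seed(1234)
class <- rep(c("major", "minor"), times = c(90, 10))
minor_rows <- which(class == "minor")

# Up-sample the minority class to 90 rows: 10 originals plus 80 random draws
extra <- minor_rows[sample.int(length(minor_rows), 80, replace = TRUE)]
upsampled_minor <- c(minor_rows, extra)

# Number of copies of each minority row: a random, integer-valued "weight"
copies <- tabulate(match(upsampled_minor, minor_rows), nbins = length(minor_rows))
copies

# Deterministic alternative: one weight per original row, no extra rows stored
case_weight <- ifelse(class == "minor", 90 / 10, 1)
```

The copy counts vary from row to row but average 9, which is the weight every minority row would get under the case-weight approach. The case-weight version needs 100 rows plus a weight column, while the up-sampled data needs 180 rows.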

87.2 Pros and Cons

87.2.1 Pros

Computationally fast and simple.

87.2.2 Cons

Can cause memory and storage issues with large data sets.
Addresses a problem that can often be handled better by other methods.

87.3 R Examples

We will be using the ames data set for these examples.

library(recipes)
library(themis)
library(modeldata)

ames

# A tibble: 2,930 × 74
   MS_SubClass            MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
 * <fct>                  <fct>            <dbl>    <int> <fct>  <fct> <fct>    
 1 One_Story_1946_and_Ne… Resident…          141    31770 Pave   No_A… Slightly…
 2 One_Story_1946_and_Ne… Resident…           80    11622 Pave   No_A… Regular  
 3 One_Story_1946_and_Ne… Resident…           81    14267 Pave   No_A… Slightly…
 4 One_Story_1946_and_Ne… Resident…           93    11160 Pave   No_A… Regular  
 5 Two_Story_1946_and_Ne… Resident…           74    13830 Pave   No_A… Slightly…
 6 Two_Story_1946_and_Ne… Resident…           78     9978 Pave   No_A… Slightly…
 7 One_Story_PUD_1946_an… Resident…           41     4920 Pave   No_A… Regular  
 8 One_Story_PUD_1946_an… Resident…           43     5005 Pave   No_A… Slightly…
 9 One_Story_PUD_1946_an… Resident…           39     5389 Pave   No_A… Slightly…
10 Two_Story_1946_and_Ne… Resident…           60     7500 Pave   No_A… Regular  
# ℹ 2,920 more rows
# ℹ 67 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#   Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#   Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#   Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, …

The {themis} package provides step_upsample(), which implements up-sampling.

ames |>
  count(MS_Zoning)

# A tibble: 7 × 2
  MS_Zoning                        n
  <fct>                        <int>
1 Floating_Village_Residential   139
2 Residential_High_Density        27
3 Residential_Low_Density       2273
4 Residential_Medium_Density     462
5 A_agr                            2
6 C_all                           25
7 I_all                            2

upsample_rec <- recipe(MS_Zoning ~ ., data = ames) |>
  step_upsample(MS_Zoning)

upsample_res <- upsample_rec |>
  prep() |>
  bake(new_data = NULL) 

upsample_res

# A tibble: 15,911 × 74
   MS_SubClass         Lot_Frontage Lot_Area Street Alley Lot_Shape Land_Contour
   <fct>                      <dbl>    <int> <fct>  <fct> <fct>     <fct>       
 1 One_Story_1946_and…           73     7321 Pave   Paved Slightly… Lvl         
 2 Two_Story_PUD_1946…           30     3182 Pave   Paved Regular   Lvl         
 3 One_Story_1946_and…           75     9000 Pave   No_A… Regular   Lvl         
 4 One_Story_PUD_1946…            0     4765 Pave   No_A… Slightly… Lvl         
 5 Two_Story_PUD_1946…            0     5105 Pave   No_A… Moderate… Lvl         
 6 Two_Story_PUD_1946…           30     3180 Pave   Paved Regular   Lvl         
 7 Two_Story_1946_and…           85    10800 Pave   No_A… Regular   Lvl         
 8 Two_Story_PUD_1946…           35     4017 Pave   Paved Slightly… Lvl         
 9 Two_Story_PUD_1946…           24     2280 Pave   Paved Regular   Lvl         
10 Two_Story_PUD_1946…            0     2651 Pave   No_A… Regular   Lvl         
# ℹ 15,901 more rows
# ℹ 67 more variables: Utilities <fct>, Lot_Config <fct>, Land_Slope <fct>,
#   Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>,
#   House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#   Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, …

upsample_res |>
  count(MS_Zoning)

# A tibble: 7 × 2
  MS_Zoning                        n
  <fct>                        <int>
1 Floating_Village_Residential  2273
2 Residential_High_Density      2273
3 Residential_Low_Density       2273
4 Residential_Medium_Density    2273
5 A_agr                         2273
6 C_all                         2273
7 I_all                         2273

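step_upsample() includes the over_ratio argument, which lets you up-sample only partially. With over_ratio = 0.5, each minority class is up-sampled to about half the size of the majority class.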

upsample_rec <- recipe(MS_Zoning ~ ., data = ames) |>
  step_upsample(MS_Zoning, over_ratio = 0.5)

upsample_rec |>
  prep() |>
  bake(new_data = NULL) |>
  count(MS_Zoning)

# A tibble: 7 × 2
  MS_Zoning                        n
  <fct>                        <int>
1 Floating_Village_Residential  1136
2 Residential_High_Density      1136
3 Residential_Low_Density       2273
4 Residential_Medium_Density    1136
5 A_agr                         1136
6 C_all                         1136
7 I_all                         1136
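Since up-sampling should only affect the training data set, it can be useful to check what happens when the prepped recipe is applied to other data. The sketch below assumes that step_upsample() uses its default skip = TRUE, which to my understanding means the step is ignored by bake() for anything other than the training data. Here ames simply stands in for new data.

```r
library(recipes)
library(themis)
library(modeldata)

upsample_prepped <- recipe(MS_Zoning ~ ., data = ames) |>
  step_upsample(MS_Zoning) |>
  prep()

# Training data: the up-sampled rows are returned
n_train <- nrow(bake(upsample_prepped, new_data = NULL))

# New data: the step is skipped, so no rows are added
n_new <- nrow(bake(upsample_prepped, new_data = ames))

c(training = n_train, new = n_new)
```

Under that assumption, the training data comes back with the up-sampled 15,911 rows, while the data passed as new_data keeps its original 2,930 rows.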
