86 Imbalanced Overview

86.1 Imbalanced Overview

Whether to keep this chapter in this book or not was considered, as the methods in this section are borderline feature engineering methods.

The potential problem with imbalanced data can most easily be seen in a classification setting. Propose we want to predict if an incoming email is spam or not. Furthermore, we assume that the spam rate is low and around 1% of the incoming emails. If you are not careful, you can easily end up with a model that predicts “not spam” all the time, since it will be correct 99% of the times. This is a common scenario, and it happens all the time. One culprit could be that there isn’t enough information in the minority class to be able to distinguish it from the majority class. And that is okay, not all modeling problems are easy. But your model will do its best anyway.

There are several different ways to handle imbalanced data, we will list all the ways in broad strokes, and then cover the methods that we could count as feature engineering.

using the right performance metrics
weights
loss functions
Calibrating the predictions
Sampling the data, can be done naively and with more advanced methods

The example above showcases why accuracy as a metric isn’t a good choice when the classes are not uniformly represented. So you can look at other metrics, such as precision, recall, ROC-AUC, or Brier score. You will need to know what metric works best for your project.

Another way we can handle this is by adding weights, either to the observations directly, or in the modeling framework as class weights. Giving your minority class high enough weight forces your model to consider them.

Related to the last point, some methods allow you the user to pass in custom objective functions, this can also be beneficial.

Even if your model performs badly by default, your classification model might still have good separation, just not around the 50% cut-off point. Changing the threshold is another way you can overcome an imbalanced data set.

Lastly, and the ways that will be covered in this book is a sampling of the data. There are several different methods that we will cover in this book. These methods cluster somehow, so for some groups we only explain the general idea.

We can split these methods into two groups, under-sampling methods and over-sampling methods. In the under-sampling method, we are removing observations and in the over-sampling we are adding observations. Adding observations is usually done by basing the new observations on the existing observations, exactly or by interpolation.

Over-sampling methods we will cover are:

Under-sampling methods we will cover are:

Some methods do these methods together. We won’t consider those methods by themselves and instead let you know that you can do over-sampling followed by under-sampling if you choose.

TODO: I think we would be able to add some kind of list, regarding what the under sampling methods tries to do. Some tries to remove hard to classify observations, other try to remove easy to classify observations.