86 Imbalanced Overview
86.1 Imbalanced Overview
Whether to keep this chapter in this book or not was considered, as the methods in this section are borderline feature engineering methods.
The potential problem with imbalanced data can most easily be seen in a classification setting. Propose we want to predict if an incoming email is spam or not. Furthermore, we assume that the spam rate is low and around 1% of the incoming emails. If you are not careful, you can easily end up with a model that predicts βnot spamβ all the time, since it will be correct 99% of the times. This is a common scenario, and it happens all the time. One culprit could be that there isnβt enough information in the minority class to be able to distinguish it from the majority class. And that is okay, not all modeling problems are easy. But your model will do its best anyway.
There are several different ways to handle imbalanced data, we will list all the ways in broad strokes, and then cover the methods that we could count as feature engineering.
- using the right performance metrics
- weights
- loss functions
- Calibrating the predictions
- Sampling the data, can be done naively and with more advanced methods
The example above showcases why accuracy as a metric isnβt a good choice when the classes are not uniformly represented. So you can look at other metrics, such as precision, recall, ROC-AUC, or Brier score. You will need to know what metric works best for your project.
Another way we can handle this is by adding weights, either to the observations directly, or in the modeling framework as class weights. Giving your minority class high enough weight forces your model to consider them.
Related to the last point, some methods allow you the user to pass in custom objective functions, this can also be beneficial.
Even if your model performs badly by default, your classification model might still have good separation, just not around the 50% cut-off point. Changing the threshold is another way you can overcome an imbalanced data set.
Lastly, and the ways that will be covered in this book is a sampling of the data. There are several different methods that we will cover in this book. These methods cluster somehow, so for some groups we only explain the general idea.
We can split these methods into two groups, under-sampling methods and over-sampling methods. In the under-sampling method, we are removing observations and in the over-sampling we are adding observations. Adding observations is usually done by basing the new observations on the existing observations, exactly or by interpolation.
Over-sampling methods we will cover are:
Under-sampling methods we will cover are:
- Down-sampling
- NearMiss
- Tomek Links
- Condensed Nearest Neighbor
- Edited Nearest Neighbor
- Instance Hardness Threshold
- One-Sided Selection
Some methods do these methods together. We wonβt consider those methods by themselves and instead let you know that you can do over-sampling followed by under-sampling if you choose.