1 Numeric Overview

1.1 Numeric Overview

Data can come in all shapes and sizes, but the most common one is the numeric variable. These are values such as age, height, deficit, price, and quantity and we call these quantitative variables. They are plentiful in most data sets and immediately usable in all statistical and machine learning models. That being said, this does not imply that certain transformations wouldn’t improve model performance.

The following chapters will focus on different methods that follow 1-to-1 or 1-to-more transformations. These methods will mostly be applied one at a time to each variable. Methods that take many variables and return fewer variables such as dimensionality reduction methods will be covered in the Too Many variables section.

When working with a single variable at a time, there are several problems we can encounter. Understanding the problem and how each method addresses it is key to getting the most out of numeric variables.

The four main types of problems we deal with when working with individual numeric variables are:

Distributional problems
Scaling issues
Non-linear effect
Outliers

1.2 Distributional problems

TODO: add more explanation to this section

logarithm
sqrt
BoxCox
Yeo-Johnson
Percentile

1.3 Scaling issues

The topic of feature scaling is important and used widely in all of machine learning. This chapter will go over what feature scaling is and why we want to use it. The following chapters will each go over a different method of feature scaling.

Note

There is some disagreement about the naming of these topics. These types of methods are called feature scaling and scaling in different fields. This book will call this general class of methods feature scaling and will make notes for each specific method and what other names they go by.

In this book, we will define feature scaling as an operation that modifies variables using multiplication and addition. While broadly defined, the methods typically reduce to the following form:

\[ X_{scaled} = \dfrac{X - a}{b} \tag{1.1}\]

The main difference between the methods is how \(a\) and \(b\) are calculated. These are learned transformation methods. We use the training data to derive the right values of \(a\) and \(b\), and then these values are used to perform the transformations when applied to new data. The different methods might differ on what property is desired for the transformed variables, same range or same spread, but they never change the distribution itself. The power transformations we saw in the Box-Cox and Yeo-Johnson chapters, distort the transformations, where these feature scalings essentially perform a “zooming” effect.

Table 1.1: All feature scaling methods

Method	Definition
Centering	\(X_{scaled} = X - \text{mean}(X)\)
Scaling	\(X_{scaled} = \dfrac{X}{\text{sd}(X)}\)
Max-Abs	\(X_{scaled} = \dfrac{X}{\text{max}(\text{abs}(X))}\)
Normalization	\(X_{scaled} = \dfrac{X - \text{mean}(X)}{\text{sd}(X)}\)
Min-Max	\(X_{scaled} = \dfrac{X - \text{min}(X)}{\text{max}(X) - \text{min}(X)}\)
Robust	\(X_{scaled} = \dfrac{X - \text{median}(X)}{\text{Q3}(X) - \text{Q1}(X)}\)

We see here that all the methods in Table 1.1 follow Equation 1.1. Sometimes \(a\) and \(b\) take a value of 0, which is perfectly fine. Centering and scaling when used together is equal to normalization. They are kept separate in the table since they are sometimes used independently. Centering, scaling, and normalization will all be discussed in the Normalization chapter.

There are two main reasons why we want to perform feature scaling. Firstly, many different types of models take the magnitude of the variables into account when fitting the models, so having variables on different scales can be disadvantageous because some variables have high priorities. In turn, we get that the other variables have low priority. Models that work using Euclidean distances like KNN models are affected by this change. Regularized models such as lasso and ridge regression also need to be scaled since the regularization depends on the magnitude of the estimates. Secondly, some algorithms converge much faster when all the variables are on the same scale. These types of models produce the same fit, just at a slower pace than if you don’t scale the variables. Any algorithms using Gradient Descent fit into this category.

TODO

Have a KNN diagram show why this is important List which types of models need feature scaling. Should be a 2 column list. Left=name, right=comment %in% c(no effect, different fit, slow down)

1.4 Non-linear effect

binning
splines
polynomial

TODO

Show different distributions, and how well the different methods do at dealing with them

1.5 Outliers

TODO

Fill in some text here, and list issues

add chapters that can handle outliers

1.6 Other

There are any number of transformations we can apply to numeric data, other functions include:

hyperbolic
Relu
inverse
inverse logit
logit