66 Too Many Overview
66.1 Too Many Overview
There will be times when you are given a lot of variables in your data set. This by itself may not be a problem, but it can come with drawbacks. Many models will have slower fit times at best, and worse performance at worst. Not all of your variables are likely to be beneficial to your model. They could be uninformative, correlated, or contain redundant information. We look at ways to deal with correlated features elsewhere in this book, but there will also be methods here that accomplish similar goals.
Suppose the same variable is included twice in your model. Both copies cannot be used in your model at the same time. Once one is included, the other becomes irrelevant. In essence, these two variables are perfectly correlated, so we need to deal with this type of problem as well.
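As a minimal sketch of this idea, we can look for pairs of columns that carry the same information by checking for (near-)perfect correlation. The data frame and column names below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 182, 165, 190],
    "height_mm": [1700, 1820, 1650, 1900],  # same information, different scale
    "weight_kg": [70, 62, 80, 75],
})

corr = df.corr().abs()
# Inspect the upper triangle only so each pair is checked once.
to_drop = [
    col for i, col in enumerate(corr.columns)
    if any(corr.iloc[j, i] > 0.999 for j in range(i))
]
print(to_drop)  # ['height_mm'] -- one column of the pair is redundant
```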
The overarching names for these types of methods are dimensionality reduction and feature selection, and we will cover most of these methods here.
66.2 Non-zero Variance filtering
These types of methods are quite simple: we remove variables that take only a small number of distinct values. If a variable is always equal to 1, it doesn't contain any information and we should remove it. If a variable is almost always the same, we might also want to remove it. We look at these methods in the Non-zero Variance filtering chapter.
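A minimal sketch of this kind of filtering, using scikit-learn's VarianceThreshold; the data frame and the column names are illustrative.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "always_one": [1, 1, 1, 1, 1],        # constant: carries no information
    "almost_constant": [0, 0, 0, 0, 1],   # nearly constant
    "useful": [3.1, 2.7, 5.9, 4.2, 1.8],  # varies across rows
})

selector = VarianceThreshold()  # default threshold 0.0: drop constant columns
selector.fit(df)

print(df.columns[selector.get_support()].tolist())
# ['almost_constant', 'useful'] -- a stricter threshold (e.g. 0.2) would
# also drop the nearly constant column
```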
TODO: Figure out where to best write about de-duplication
66.3 Dimensionality reduction
The bulk of the chapter will be in this category. This book categorizes dimensionality reduction methods as methods where a calculation is done on several features, with the same number or fewer features being returned. Remember that we only look at methods that can be used in predictive settings, hence we won't be talking about t-distributed stochastic neighbor embedding (t-SNE).
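As a minimal sketch of dimensionality reduction in a predictive setting, the following projects many features down to a few components with PCA; the synthetic data and the number of components are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 rows, 10 features

# Scale first, then project the 10 features down to 3 components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=3))
X_reduced = reducer.fit_transform(X)

print(X_reduced.shape)  # (100, 3)
```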
66.4 Feature selection
Feature selection, on the other hand, determines which variables to keep or remove, and then you act on that information. This can be done in a couple of different ways. Filter-based approaches give each feature a score or rank, and you then use this information to select variables. There are many different ways to produce these rankings, and many will be covered in the chapter.
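A minimal sketch of a filter-based approach: score every feature against the outcome, then keep the top-scoring ones. This uses scikit-learn's SelectKBest with an ANOVA F-score; the simulated data set and the choice of k are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 best features
X_selected = selector.fit_transform(X, y)

print(selector.scores_.round(1))  # one score per original feature
print(X_selected.shape)           # (200, 5)
```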
Wrapper-based approaches iteratively look at subsets of predictors to try to find the best set. Methods in this category include Forward Selection, Backward Elimination, Recursive Feature Elimination, and the use of genetic algorithms. These methods revolve around fitting your model with a set of predictors, then, depending on the method, removing or adding other predictors. We look at the performance of the new set of predictors and pick the one that meets some criterion. This is repeated for some number of iterations or until a stopping criterion is reached. Their main downside is that they tend to add a lot of computational overhead, as you need to fit your model many times.
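A minimal sketch of a wrapper-based approach using Recursive Feature Elimination, which repeatedly fits a model and drops the weakest predictors; the model choice, data, and counts here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Drop 2 features per round until only 5 remain, refitting each time.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=2)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the predictors that were kept
print(rfe.ranking_)  # 1 for kept features; higher numbers were dropped earlier
```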
Lastly, we have embedded methods. These are not so much methods as they are a collection of models that have feature selection built into them. You cannot add an embedded method to your pipeline if you have issues with too many predictors; you instead swap your model for one that handles many predictors better. These models include, but are not limited to, any model with regularization, decision trees, and boosted trees.
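A minimal sketch of an embedded method: an L1-regularized (lasso) regression performs feature selection as part of fitting. The simulated data and the penalty strength are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

model = Lasso(alpha=1.0)
model.fit(X, y)

# Coefficients pushed exactly to zero are effectively dropped predictors.
n_kept = (model.coef_ != 0).sum()
print(n_kept, "of", X.shape[1], "features kept")
```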
Only filter-based approaches are truly considered feature engineering and will be covered here.