66 Too Many Overview
66.1 Too Many Overview
There will be times when you are given a lot of variables in your data set. This by itself may not be a problem, but it can come with drawbacks. Many models will have slower fit times at best, and worse performance at worst. Not all of your variables are likely to be beneficial to your model. They could be uninformative, correlated, or contain redundant information. We look at ways to deal with correlated features elsewhere in this book, but there will also be methods here that accomplish similar goals.
Suppose the same variable is included twice in your model. Both copies cannot be used in your model at the same time. Once one is included, the other becomes irrelevant. In essence, these two variables are perfectly correlated, so we need to deal with this type of problem as well.
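As a minimal sketch of this idea, we can look for pairs of columns that carry the same information by checking for (near-)perfect correlation. The data frame and column names below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 182, 165, 190],
    "height_mm": [1700, 1820, 1650, 1900],  # same information, different scale
    "weight_kg": [70, 62, 80, 75],
})

corr = df.corr().abs()
# Inspect the upper triangle only so each pair is checked once.
to_drop = [
    col for i, col in enumerate(corr.columns)
    if any(corr.iloc[j, i] > 0.999 for j in range(i))
]
print(to_drop)  # ['height_mm'] -- one column of the pair is redundant
```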
The overarching names for these types of methods are dimensionality reduction and feature selection, and we will cover most of these methods here.
66.2 Non-zero Variance filtering
These types of methods are quite simple: we remove variables that take only a small number of distinct values. If a variable is always equal to 1, it doesn't contain any information and we should remove it. If a variable is almost always the same, we might also want to remove it. We look at these methods in the Non-zero Variance filtering chapter.
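A minimal sketch of this kind of filtering, using scikit-learn's VarianceThreshold; the data frame and the column names are illustrative.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "always_one": [1, 1, 1, 1, 1],        # constant: carries no information
    "almost_constant": [0, 0, 0, 0, 1],   # nearly constant
    "useful": [3.1, 2.7, 5.9, 4.2, 1.8],  # varies across rows
})

selector = VarianceThreshold()  # default threshold 0.0: drop constant columns
selector.fit(df)

print(df.columns[selector.get_support()].tolist())
# ['almost_constant', 'useful'] -- a stricter threshold (e.g. 0.2) would
# also drop the nearly constant column
```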
TODO: Figure out where to best write about de-duplication
66.3 Dimensionality reduction
The bulk of the chapter will be in this category. This book categorizes dimensionality reduction methods as methods where a calculation is done on several features, with the same number or fewer features being returned. Remember that we only look at methods that can be used in predictive settings, hence we won't be talking about t-distributed stochastic neighbor embedding (t-SNE).
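As a minimal sketch of dimensionality reduction in a predictive setting, the following projects many features down to a few components with PCA; the synthetic data and the number of components are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 rows, 10 features

# Scale first, then project the 10 features down to 3 components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=3))
X_reduced = reducer.fit_transform(X)

print(X_reduced.shape)  # (100, 3)
```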
66.4 Feature selection
Feature selection, on the other hand, determines which variables to keep or remove, and then you act on that information. This can be done in a couple of different ways. Filter-based approaches give each feature a score or rank, and you then use this information to select variables. There are many different ways to produce these rankings, and many will be covered in the chapter.
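A minimal sketch of a filter-based approach: score every feature against the outcome, then keep the top-scoring ones. This uses scikit-learn's SelectKBest with an ANOVA F-score; the simulated data set and the choice of k are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 best features
X_selected = selector.fit_transform(X, y)

print(selector.scores_.round(1))  # one score per original feature
print(X_selected.shape)           # (200, 5)
```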
Wrapper-based approaches iteratively look at subsets of predictors to try to find the best set. Methods in this category include Forward Selection, Backward Elimination, Recursive Feature Elimination, and the use of genetic algorithms. These methods revolve around fitting your model with a set of predictors, then, depending on the method, removing or adding other predictors. We look at the performance of the new set of predictors and pick the one that meets some criterion. This is repeated for some number of iterations or until a stopping criterion is reached. Their main downside is that they tend to add a lot of computational overhead, as you need to fit your model many times.
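A minimal sketch of a wrapper-based approach using Recursive Feature Elimination, which repeatedly fits a model and drops the weakest predictors; the model choice, data, and counts here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Drop 2 features per round until only 5 remain, refitting each time.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=2)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the predictors that were kept
print(rfe.ranking_)  # 1 for kept features; higher numbers were dropped earlier
```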
Lastly, we have embedded methods. These are not so much methods as they are a collection of models that have feature selection built into them. You cannot add an embedded method to your pipeline if you have issues with too many predictors; you instead swap your model for one that handles many predictors better. These models include, but are not limited to, any model with regularization, decision trees, and boosted trees.
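A minimal sketch of an embedded method: an L1-regularized (lasso) regression performs feature selection as part of fitting. The simulated data and the penalty strength are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

model = Lasso(alpha=1.0)
model.fit(X, y)

# Coefficients pushed exactly to zero are effectively dropped predictors.
n_kept = (model.coef_ != 0).sum()
print(n_kept, "of", X.shape[1], "features kept")
```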
Only filter-based approaches are truly considered feature engineering and will be covered here.