We are looking to remove correlated features. By correlated features, we typically mean Pearson correlation, but the methods in this chapter don't require Pearson correlation specifically, just that some correlation method is used. Next, we will look at how we can remove the offending variables. We will show an iterative approach and a clustering approach. Both are learned methods, since which variables to keep is determined from the training data.
I always find it useful to look at a correlation matrix first.
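One sketch of how such a plot can be made, assuming the corrr package (rearrange() orders the variables so correlated ones sit together; any correlation-matrix plot will do):

```r
library(corrr)
library(dplyr)
library(modeldata)
data(ames)

ames |>
  select(where(is.numeric)) |>
  correlate(quiet = TRUE) |>
  rearrange() |>  # order variables so correlated ones sit together
  rplot()
```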
Figure 82.1: Some clusters of correlated features.
Looking at the chart above, we see some clusters of correlated features. One way to perform our filtering is to find all pairs of predictors with a correlation over a certain threshold and remove one variable from each pair. Below is a table of the 10 most correlated pairs.
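A sketch of how such a table can be computed with corrr (shave() blanks out the upper triangle so each pair appears only once):

```r
library(corrr)
library(dplyr)
library(modeldata)
data(ames)

ames |>
  select(where(is.numeric)) |>
  correlate(quiet = TRUE) |>
  shave() |>                # keep each pair only once
  stretch(na.rm = TRUE) |>  # long format with columns x, y, r
  arrange(desc(abs(r))) |>
  slice_head(n = 10)
```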
Table 82.1: Some predictors appear multiple times in this table.

| x             | y             | r     |
|---------------|---------------|-------|
| Garage_Cars   | Garage_Area   | 0.890 |
| Gr_Liv_Area   | TotRms_AbvGrd | 0.808 |
| Total_Bsmt_SF | First_Flr_SF  | 0.800 |
| Gr_Liv_Area   | Sale_Price    | 0.707 |
| Bedroom_AbvGr | TotRms_AbvGrd | 0.673 |
| Second_Flr_SF | Gr_Liv_Area   | 0.655 |
| Garage_Cars   | Sale_Price    | 0.648 |
| Garage_Area   | Sale_Price    | 0.640 |
| Total_Bsmt_SF | Sale_Price    | 0.633 |
| Gr_Liv_Area   | Full_Bath     | 0.630 |
One way to do the filtering is to pick a threshold and repeatedly remove one of the variables in the most correlated pair until no pair remains with a correlation over the threshold. This method has a minimal computational footprint, as it only needs to calculate the correlations once at the beginning. The threshold is likely to need tuning, as we can't say for sure in advance what a good threshold is. With the removal of variables, there is always a chance that we are removing signal rather than noise, and this becomes increasingly likely as we remove more and more predictors.
Stated as an algorithm:

1. Calculate the correlation matrix of the predictors.
2. Find the pair of predictors with the highest absolute correlation.
3. If that correlation is at or below the threshold, stop.
4. Otherwise, remove one of the two predictors from the matrix and go back to step 2.
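A minimal sketch of this procedure in R, using a hypothetical correlation_filter() helper (this is not the implementation of any particular package; dropping the variable with the higher mean absolute correlation is just one reasonable choice of which member of the pair to remove):

```r
correlation_filter <- function(x, threshold = 0.9) {
  cors <- abs(cor(x))
  diag(cors) <- 0

  # repeatedly drop one variable from the most correlated pair
  while (max(cors) >= threshold) {
    pair <- which(cors == max(cors), arr.ind = TRUE)[1, ]
    # drop the member with the higher mean absolute correlation overall
    drop <- pair[which.max(rowMeans(cors)[pair])]
    cors <- cors[-drop, -drop, drop = FALSE]
  }

  # names of the predictors that are kept
  colnames(cors)
}
```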
If we look at the table above, we notice that some of the variables occur together. One such example is Gr_Liv_Area, TotRms_AbvGrd, and Bedroom_AbvGr: TotRms_AbvGrd is highly correlated with both of the others, and all three describe the size of the living area. It would be neat if we could deal with such a group of variables at the same time.
TODO
Add a good graph showing this effect.
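A rough sketch of code that could produce such a graph, assuming corrr's network_plot(), which places correlated predictors close together and only draws edges for correlations above min_cor (the cutoff of 0.5 is arbitrary here):

```r
library(corrr)
library(dplyr)
library(modeldata)
data(ames)

ames |>
  select(where(is.numeric)) |>
  correlate(quiet = TRUE) |>
  network_plot(min_cor = 0.5)
```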
What we could do is take the correlation matrix and apply a clustering model to it, then use the clustering model to lump together the groups of highly correlated predictors. Within each cluster, one predictor is chosen to be kept. The clusters should ideally be chosen such that uncorrelated predictors end up alone in their own cluster. This method can work better with the global structure of the data, but it requires fitting and tuning another model.
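A minimal sketch of this idea, assuming hierarchical clustering with 1 - |correlation| as the distance and a cut height that would need tuning (this is one way to get clusters out of a correlation matrix, not the only one):

```r
library(modeldata)
data(ames)

# distance based on correlation: highly correlated predictors are "close"
ames_numeric <- ames[, sapply(ames, is.numeric)]
cor_mat <- cor(ames_numeric)
dist_mat <- as.dist(1 - abs(cor_mat))

# hierarchical clustering, cut at a height that would need tuning
hc <- hclust(dist_mat, method = "average")
clusters <- cutree(hc, h = 0.2)  # h = 0.2 roughly keeps |cor| > 0.8 together

# keep one predictor per cluster
keep <- names(clusters)[!duplicated(clusters)]
```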
82.2 Pros and Cons
82.2.1 Pros
- Computationally simple and fast
- Easily explainable. "Predictors were removed"
- Will lead to a faster and simpler model
82.2.2 Cons
- Can be hard to justify: "Why was this predictor kept instead of that one?"
- Will lead to loss of signal and performance, with the hope that this loss is kept minimal
82.3 R Examples
We will use the ames data set from {modeldata} in this example. The {recipes} step step_corr() performs the simple correlation filter described at the beginning of this chapter.
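A minimal sketch of its use (the 0.8 threshold is an arbitrary choice here and would normally be tuned):

```r
library(recipes)
library(modeldata)
data(ames)

rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_corr(all_numeric_predictors(), threshold = 0.8)

# prep() estimates which predictors to drop using the training data
rec_prepped <- prep(rec)

# tidy() lists the predictors the filter removed
tidy(rec_prepped, number = 1)
```

The threshold argument corresponds to the cutoff discussed earlier in this chapter.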