dow | weather | dow_weather |
---|---|---|
monday | sunny | monday_sunny |
tuesday | rainy | tuesday_rainy |
wednesday | cloudy | wednesday_cloudy |
thursday | sunny | thursday_sunny |
friday | sunny | friday_sunny |
36 Categorical Combination
36.1 Categorical Combination
When you have multiple categorical variables, it might be the case that you want to see whether certain levels co-occur. For the variables day_of_week
and weather
, finding "rainy Monday"
s can be hard for some model types.
Categorical combination does what the name suggests. We are combining categorical variables one to one. We would multiply two numeric variables much the same way. The following table shows how we would combine a day of the week and weather variable.
We noticed a couple of things during this interaction. The first thing is that the number of unique levels goes up multiplicatively, so the newly created variable will include many levels. Depending on what the original variables encode, you will get impossible combinations or rarely occurring levels. This can be handled partly by Collapsing Categories.
It is not guaranteed that the combination feature is going to be better than the individual features before. It can be beneficial to keep both the individual features as long as the combination feature. In the above example, you can imagine that Fridays and rainy Mondays are high signals. If that is the case, then keeping all variables would be the best decision.
This method is not confined to just 2 variables, you could combine as many variables as you want. But the problems described with increased number of levels will be exacerbated.
This method will be hard to automatically employ and will thus need to be done manually. Despite this, it can provide a good boost in performance when done right.
36.2 Pros and Cons
36.2.1 Pros
- Easy to perform
- Produces easily explanatory features
36.2.2 Cons
- Canβt be done automatically
- Produces categorical variables with many levels
36.3 R Examples
Has not yet been implemented.
See https://github.com/EmilHvitfeldt/feature-engineering-az/issues/40 for progress.
36.4 Python Examples
Iβm not aware of a good way to do this in a scikit-learn way. Please file an issue on github if you know of a good way.
See https://github.com/EmilHvitfeldt/feature-engineering-az/issues/40 for progress.