15 Categorical Overview
15.1 Categorical Overview
One of the most common types of data you can encounter is categorical variables. These variables take non-numeric values and are things like `names`, `animal`, `housetype`, `zipcode`, `mood`, and so on. We call these qualitative variables, and you will have to find a way to deal with them, as they are plentiful in most data sets you will encounter. Unlike numerical variables, however, you will have to transform these variables into numeric ones, as most models only work with numeric variables. The main exception to this is tree-based models such as decision trees, random forests, and boosted trees. This is, of course, a theoretical point, as some implementations of tree-based models don't support categorical variables.
The above description is a small but useful simplification. A categorical variable can take a known or an unknown number of unique values. `day_of_the_week` and `zipcode` are examples of variables with a fixed, known number of values. Even if our data only contains Sundays, Mondays, and Thursdays, we know that there are 7 different possible options. On the other hand, there are plenty of categorical variables where the levels are realistically unknown, such as `company_name`, `street_name`, and `food_item`. This distinction matters, as it can be used to inform the pre-processing that will be done.
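For a variable where the full set of levels is known up front, it can pay to record all of them, even the ones absent from the data at hand. A minimal sketch in Python with pandas (the data here is made up for illustration):

```python
import pandas as pd

# Observed data only contains three of the seven possible weekdays.
observed = ["Sunday", "Monday", "Thursday", "Monday"]

all_days = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Declaring every possible level keeps the unobserved days as valid
# (empty) levels instead of discovering the levels from the data.
day_of_the_week = pd.Categorical(observed, categories=all_days)

print(day_of_the_week.categories)  # all 7 days, not just the 3 observed
```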
Another nuance is that some categorical variables have an inherent order to them. The variable `size` with the values `"small"`, `"medium"`, and `"large"` clearly has an ordering: `"small" < "medium" < "large"`. This is unlike the theoretical `car_color` with the values `"blue"`, `"red"`, and `"black"`, which doesn't have a natural ordering. When a variable does have an ordering, we can use that information. This is something that doesn't have to be dealt with, but the added information can be useful.
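Tools can record this ordering on the variable itself. A small sketch with pandas, reusing the `size` example from above:

```python
import pandas as pd

sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,  # records "small" < "medium" < "large"
)

# Because the ordering is known, comparisons become meaningful.
print(sizes.min(), sizes.max())  # small large
```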
Lastly, we have the encoding these categorical variables come with. The same variable `household_pet` can be encoded as `["cat", "cat", "dog", "cat", "hamster"]` or as `[1, 1, 2, 1, 3]`. The latter is (hopefully) accompanied by a data dictionary saying `[1 = "cat", 2 = "dog", 3 = "hamster"]`. These variables contain the exact same information, but the encoding is vastly different, and if you are not careful to treat `household_pet` as a categorical variable, the model will believe that `"hamster" - "cat" = "dog"`.
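A small sketch of guarding against this in pandas: decode the integer codes with the data dictionary and store the result as a categorical column, so nothing downstream treats the codes as quantities.

```python
import pandas as pd

# Integer-coded column plus its data dictionary.
codes = pd.Series([1, 1, 2, 1, 3], name="household_pet")
dictionary = {1: "cat", 2: "dog", 3: "hamster"}

# Decode and mark as categorical so no model downstream
# mistakes the codes for quantities.
household_pet = codes.map(dictionary).astype("category")

print(household_pet)
```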
The chapters in this section can be put into 2 categories:
15.2 Categorical to Categorical
These methods take a categorical variable and improve it, whether that means cleaning levels, collapsing levels, or making sure new levels are handled correctly. These tasks are not always needed, depending on the method you are using, but they are generally helpful to apply. One method that would have been located here, if it weren't for the fact that it has a whole section to itself, is dealing with missing values.
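As an illustration of the kind of work involved, here is a small sketch that collapses rare levels into an `"other"` bucket and sends unseen levels there as well (the helper names and the threshold are made up):

```python
import pandas as pd

def collapse_rare_levels(train: pd.Series, min_count: int = 5) -> set:
    """Return the set of levels frequent enough to keep."""
    counts = train.value_counts()
    return set(counts[counts >= min_count].index)

def apply_levels(column: pd.Series, kept: set) -> pd.Series:
    # Rare levels and levels never seen during training both
    # fall into the same "other" bucket.
    return column.where(column.isin(kept), "other")

train = pd.Series(["cat"] * 10 + ["dog"] * 8 + ["hamster"])
new = pd.Series(["cat", "hamster", "ferret"])  # "ferret" is unseen

kept = collapse_rare_levels(train)
print(apply_levels(new, kept).tolist())  # ['cat', 'other', 'other']
```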
15.3 Categorical to Numerical
The vast majority of the chapters in this category concern methods that take a categorical variable and produce one or more numerical variables suitable for modeling. There are quite a lot of different methods; they all have upsides and downsides, and they will all be explored in the remaining chapters.
These methods can further be categorized. This categorization allows us to better compare and contrast the methods to see how they are similar and different. The vast majority of methods are applied one column at a time. The notable exception is the combination multi-dummy extraction method.
The one-column-at-a-time methods can be put into 3 groups: dummy-based, target-based, and other.
The methods that are variations of Dummy Encoding are:
The methods that are variations of Target Encoding are:
- Leave One Out Encoding
- Leaf Encoding
- GLMM Encoding
- CatBoost Encoding
- James-Stein Encoding
- M-Estimator Encoding
- Quantile Encoding
- Summary Encoding
The remaining methods don't neatly slot into the above categories:
What you might notice is that many of these methods boil down to a left join when applied to new data. The difference is how you calculate the table, which we could consider an embedding table.
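To make this concrete, here is a sketch using plain target encoding, which encodes each level as the mean outcome observed for it; any of the methods above would change only how the table is calculated (the data and column names are invented):

```python
import pandas as pd

train = pd.DataFrame({
    "household_pet": ["cat", "cat", "dog", "cat", "hamster"],
    "outcome": [3.0, 5.0, 10.0, 4.0, 7.0],
})

# "Fitting" the method means computing the embedding table.
# For target encoding this is the per-level mean of the outcome.
table = (
    train.groupby("household_pet")["outcome"]
    .mean()
    .rename("household_pet_encoded")
    .reset_index()
)

# "Applying" the method to new data is just a left join.
new = pd.DataFrame({"household_pet": ["dog", "cat", "ferret"]})
encoded = new.merge(table, on="household_pet", how="left")

print(encoded)  # "ferret" was never seen, so it gets a missing value
```

Note that the unseen level comes back as a missing value, which is exactly the kind of problem the categorical-to-categorical methods above help prevent.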
This idea gives rise to another way to deal with categorical variables: manually creating enrichment tables and then using them with the categorical variables. One could imagine having a `city` predictor. It could be effectively encoded by the methods seen in this section, but we might be able to add more signal by enriching it with characteristics of the city, like `population`, `n_hospitals`, and `country`. This can be seen as a simple version of the relational methods seen in Chapter 136.
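A sketch of what such a hand-built enrichment table could look like; applying it is the same left join as before (the cities and their characteristics are invented):

```python
import pandas as pd

# Hand-built enrichment table; the numbers here are invented.
cities = pd.DataFrame({
    "city": ["Springfield", "Shelbyville", "Ogdenville"],
    "population": [60_000, 45_000, 12_000],
    "n_hospitals": [2, 1, 0],
    "country": ["USA", "USA", "USA"],
})

data = pd.DataFrame({"city": ["Shelbyville", "Springfield"]})

# Same pattern as before: applying the enrichment is a left join.
enriched = data.merge(cities, on="city", how="left")
print(enriched)
```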