15  Categorical Overview

One of the most common types of data you can encounter is categorical variables. These variables take non-numeric values and are things like; names, animal, housetype zipcode, mood, and so on. We call these qualitative variables and you will have to find a way to deal with them as they are plentiful in most data sets you will encounter. You will, however, unlike numerical variables, have to transform these variables into numeric variables as most models only work with numeric variables. The main exception to this is tree-based models such as decision trees, random forests, and boosted trees. This is of cause a theoretical fact, as some implementations of tree-based models don’t support categorical variables.

The above description is a small but useful simplification. A categorical variable can take a known or an unknown number of unique values. day_of_the_week and zipcode are examples of variables with a fixed known number of values. Even if our data only contains Sundays, Mondays, and Thursdays we know that there are 7 different possible options. On the other hand, there are plenty of categorical variables where the levels are realistically unknown, such as company_name, street_name, and food_item. This distinction matters as it can be used to inform the pre-processing that will be done.

Another nuance is some categorical variables have an inherent order to them. So the variables size with the values "small", "medium", and "large", clearly have an ordering to then "small" < "medium" < "large". This is unlike the theoretical car_color with the values β€œblue”,β€œred”, andβ€œblack”`, which doesn’t have a natural ordering. Depending on whether a variable has an ordering, we can use that information. This is something that doesn’t have to be dealt with, but the added information can be useful.

Lastly, we have the encoding these categorical variables have. The same variable household_pet can be encoded as ["cat", "cat", "dog", "cat", "hamster"] or as [1, 1, 2, 1, 3]. The latter (hopefully) is accompanied by a data dictionary saying [1 = "cat", 2 = "dog", 3 = "hamster"]. These variables contain the exact same information, but the encoding is vastly different, and if you are not careful to treat household_pet as a categorical variable the model believes that "hamster" - "cat" = "dog".

TODO

find an example of the above data encoding in the wild

TODO

add a section about data formats that preserve levels

The chapters in this section can be put into 2 categories:

15.2 Categorical to Categorical

These methods take a categorical variable and improve them. Whether it means cleaning levels, collapsing levels, or making sure it handles new levels correctly. These Tasks as not always needed depending on the method you are using but they are generally helpful to apply. One method that would have been located here if it wasn’t for the fact that it has a whole section by itself is dealing with missing values.

15.3 Categorical to Numerical

The vast majority of the chapters in these chapters concern methods that take a categorical variable and produce one or more numerical variables suitable for modeling. There are quite a lot of different methods, all have upsides and downsides and they will all be explored in the remaining chapters.

These methods can further be categorized. This categorization allows us to better compare and contrast the methods to see how they are similar and different. The vast majority of methods are applied one column at a time. The notable exception is the combination multi-dummy extraction method.

The one-column-at-a-time methods can be put into 3 groups; dummy-based, target-based, and other.

The methods that are variations of Dummy Encoding are:

The methods that are variations of Target Encoding are:

The remaining methods don’t neatly slot into the above categories:

What you might notice is that many of these methods boil down to a left join when applied to new data. The difference is how you calculate the table, which we could consider an embedding table.

This idea gives rise to another way to deal with categorical variables. This is done by manually creating enrichment tables and then using them with categorical variables. One could imagine having a city predictor. It could be effectively encoded by the methods seen in this section. But we might be able to add more signals by enriching with characteristics of the city, like population, n_hospitals, and country It can be seen as a simple version of relational methods as seen in Chapter 136.