15 Categorical Overview
15.1 Categorical Overview
One of the most common types of data you can encounter is categorical variables. These variables take non-numeric values and are things like; names
, animal
, housetype
zipcode
, mood
, and so on. We call these qualitative variables and you will have to find a way to deal with them as they are plentiful in most data sets you will encounter. You will, however, unlike numerical variables in Chapter 1, have to transform these variables into numeric variables as most models only work with numeric variables. The main exception to this is tree-based models such as decision trees, random forests, and boosted trees. This is of cause a theoretical fact, as some implementations of tree-based models donβt support categorical variables.
The above description is a small but useful simplification. A categorical variable can take a known or an unknown number of unique values. day_of_the_week
and zipcode
are examples of variables with a fixed known number of values. Even if our data only contains Sundays, Mondays, and Thursdays we know that there are 7 different possible options. On the other hand, there are plenty of categorical variables where the levels are realistically unknown, such as company_name
, street_name
, and food_item
. This distinction matters as it can be used to inform the pre-processing that will be done.
Another nuance is some categorical variables have an inherent order to them. So the variables size
with the values "small"
, "medium"
, and "large"
, clearly have an ordering to then "small" < "medium" < "large"
. This is unlike the theoretical car_color
with the values βblueβ,
βredβ, and
βblackβ`, which doesnβt have a natural ordering. Depending on whether a variable has an ordering, we can use that information. This is something that doesnβt have to be dealt with, but the added information can be useful.
Lastly, we have the encoding these categorical variables have. The same variable household_pet
can be encoded as ["cat", "cat", "dog", "cat", "hamster"]
or as [1, 1, 2, 1, 3]
. The latter (hopefully) is accompanied by a data dictionary saying [1 = "cat", 2 = "dog", 3 = "hamster"]
. These variables contain the exact same information, but the encoding is vastly different, and if you are not careful to treat household_pet
as a categorical variable the model believes that "hamster" - "cat" = "dog"
.
find an example of the above data encoding in the wild
add a section about data formats that preserve levels
The chapters in this section can be put into 2 categories:
15.2 Categorical to Categorical
These methods take a categorical variable and improve them. Whether it means cleaning levels, collapsing levels, or making sure it handles new levels correctly. These Tasks as not always needed depending on the method you are using but they are generally helpful to apply. One method that would have been located here if it wasnβt for the fact that it has a whole section by itself is dealing with missing values as seen in Chapter 42.
15.3 Categorical to Numerical
The vast majority of the chapters in these chapters concern methods that take a categorical variable and produce one or more numerical variables suitable for modeling. There are quite a lot of different methods, all have upsides and downsides and they will all be explored in the remaining chapters.
further categorize these methods
- multiple columns
- single columns
- boils down to left-join