first | second | third |
---|---|---|
english | danish | |
danish | english | german |
spanish | | |
spanish | english | |
37 Multi-Dummy Encoding
37.1 Multi-Dummy Encoding
This chapter will cover what I like to call multi-dummy encoding. It applies when you are able to extract multiple entries from one categorical variable or combine entries from multiple categorical variables.
I find that this is best explained by example. We will start with combining entries from multiple variables. Imagine that you see this subset of columns in your data, perhaps denoting some individual's language proficiencies. How would you encode it?
One could apply dummy variables to each of them individually, but that could potentially create a lot of columns, as you would expect a near-equal number of levels produced from each variable. You could apply Target Encoding to the variables, but that too feels insufficient. What these methods don't take into account is that there is some shared information between these variables that isn't being picked up.
We can use the shared information to our advantage. The levels used in these categorical variables are likely the same, so we can treat the columns as one combined set. In practice, this means that we count over all selected columns and add dummies or counts accordingly.
danish | english | german | spanish |
---|---|---|---|
1 | 1 | 0 | 0 |
1 | 1 | 1 | 0 |
0 | 0 | 0 | 1 |
0 | 1 | 0 | 1 |
This style of transformation often provides zero-one indicators rather than counts purely because of the construction of the data. But that doesn't mean that counts can't happen. Make sure that the implementation you are using matches the expectations you have for the data.
This data configuration often contains missing values, represented above by empty cells. Make sure the tools you are using understand that designation.
One thing that is lost in this method is the potential ordering of the variables. The above example has a natural ordering, indicated by the names of the variables. Depending on how important you think the ordering is, one could add a weighting scheme like the one below.
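The combined counting described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `multi_dummy`, the `as_counts` flag, and the use of `None` for missing cells are all assumptions made for the example.

```python
from collections import Counter

# Rows mirror the first/second/third language columns; None marks a missing cell.
rows = [
    ("english", "danish", None),
    ("danish", "english", "german"),
    ("spanish", None, None),
    ("spanish", "english", None),
]


def multi_dummy(rows, as_counts=False):
    """Encode several categorical columns against one shared set of levels.

    Hypothetical helper for illustration only.
    """
    # Collect levels across all columns at once, skipping missing values.
    levels = sorted({value for row in rows for value in row if value is not None})
    encoded = []
    for row in rows:
        counts = Counter(value for value in row if value is not None)
        if as_counts:
            # Keep how many times each level appears in the row.
            encoded.append([counts[level] for level in levels])
        else:
            # Collapse counts to zero-one indicators.
            encoded.append([int(counts[level] > 0) for level in levels])
    return levels, encoded


levels, encoded = multi_dummy(rows)
print(levels)
for row in encoded:
    print(row)
```

Setting `as_counts=True` returns counts instead of indicators, which only differs from the default when a level is repeated within a row.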
danish | english | german | spanish |
---|---|---|---|
0.5 | 1.0 | 0.00 | 0 |
1.0 | 0.5 | 0.25 | 0 |
0.0 | 0.0 | 0.00 | 1 |
0.0 | 0.5 | 0.00 | 1 |
You need to remember what 0 means here: the level is absent from the row. The weighting scheme should take that into account, so every level that is present needs a weight above 0. One could opt to use a linear weight, but make sure that the first column has the highest value, instead of starting at 1 and going up.
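One possible weighting scheme, sketched under assumptions: each column position is given a weight that halves with every step (1, 0.5, 0.25, ...). The function name `weighted_multi_dummy`, the `decay` parameter, and the choice to keep the largest weight when a level is repeated are illustrative choices, not something prescribed by the method.

```python
# Rows mirror the first/second/third language columns; None marks a missing cell.
rows = [
    ("english", "danish", None),
    ("danish", "english", "german"),
    ("spanish", None, None),
    ("spanish", "english", None),
]


def weighted_multi_dummy(rows, decay=0.5):
    """Weight each level by the position of the column it appears in.

    Hypothetical helper: position 0 gets weight 1, and each later
    position is multiplied by `decay`, so earlier columns weigh more.
    """
    levels = sorted({value for row in rows for value in row if value is not None})
    encoded = []
    for row in rows:
        # 0.0 is reserved for "level not present in this row".
        weights = dict.fromkeys(levels, 0.0)
        for position, value in enumerate(row):
            if value is None:
                continue
            # If a level repeats, keep its highest (earliest) weight.
            weights[value] = max(weights[value], decay ** position)
        encoded.append([weights[level] for level in levels])
    return levels, encoded


levels, weighted = weighted_multi_dummy(rows)
print(levels)
for row in weighted:
    print(row)
```

Because present levels always receive a strictly positive weight, 0 keeps its meaning of "absent" regardless of how many columns are combined.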
Next, we look at extracting multiple entries. This can be seen as a convenient shorthand for text extraction, as is discussed in Chapter 47. I find that this pattern emerges enough by itself that it is worth denoting it as its own method.
The above example could instead be structured like so:
languages |
---|
english, danish |
danish, english, german |
spanish |
spanish, english |
We can pull out the same counts as before, using regular expressions. One could pull out the entries in two main ways: by splitting or by extraction. We could either split by `,` or extract sequences of characters with `[a-z]*`.
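Both approaches can be sketched with Python's standard `re` module. This is a minimal illustration with assumed variable names. Two patterns differ slightly from the ones in the text: the split pattern is `,\s*` so that the space after each comma is dropped, and the extraction pattern is `[a-z]+` rather than `[a-z]*`, because `re.findall` also returns empty matches for a pattern that can match zero characters.

```python
import re

languages = [
    "english, danish",
    "danish, english, german",
    "spanish",
    "spanish, english",
]

# Option 1: split each string on the separator (comma plus optional whitespace).
split_entries = [re.split(r",\s*", text) for text in languages]

# Option 2: extract every run of lowercase letters.
extracted_entries = [re.findall(r"[a-z]+", text) for text in languages]

# Either result can be turned into zero-one indicators over the shared levels.
levels = sorted({entry for entries in split_entries for entry in entries})
dummies = [[int(level in entries) for level in levels] for entries in split_entries]

print(levels)
for row in dummies:
    print(row)
```

On this data the two options give identical entries. They would diverge for levels that contain spaces or uppercase letters, which is where the choice between splitting and extraction starts to matter.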
With splitting, you can sometimes extract a lot of signal if you have a carefully crafted regular expression. Consider the following list of materials and mediums for a number of art pieces.
medium |
---|
Oil paint on canvas |
Painted steel and salt |
Etching on paper |
Lithograph on paper |
At first glance, they appear quite unstructured, but by splitting on `(, )|( and )|( on )` you get the following mediums and techniques: `[canvas, etching, lithograph, oil paint, painted steel, paper, salt]`.
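The split on the art mediums can be reproduced with a short Python sketch. One detail is specific to Python and is an adaptation rather than part of the original pattern: `re.split` returns the contents of capturing groups alongside the pieces, so the sketch uses a single non-capturing group instead of `(, )|( and )|( on )`. Lowercasing the pieces is also an assumption, made so that the levels match the list above.

```python
import re

mediums = [
    "Oil paint on canvas",
    "Painted steel and salt",
    "Etching on paper",
    "Lithograph on paper",
]

# Non-capturing group: with capturing groups, re.split would also return
# the matched separators (or None) as items in the result.
separator = re.compile(r"(?:, | and | on )")

# Split each description and lowercase the pieces.
tokens = [[part.lower() for part in separator.split(text)] for text in mediums]

# The distinct pieces become the levels to create dummies for.
levels = sorted({part for parts in tokens for part in parts})
print(levels)
```

From here, the indicator columns can be built the same way as for the language example, with one column per entry in `levels`.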
37.2 Pros and Cons
37.2.1 Pros
- Can provide increased signal
37.2.2 Cons
- Less commonly occurring
- Requires a careful eye and hand tuning
37.3 R Examples
- https://recipes.tidymodels.org/reference/step_dummy_multi_choice.html
- https://recipes.tidymodels.org/reference/step_dummy_extract.html
Wait for adoption data set
37.4 Python Examples
I'm not aware of a good way to do this in a scikit-learn way. Please file an issue on GitHub if you know of a good way.