16 Cleaning

When it comes to cleaning categorical data, there is a whole host of problems you can encounter. We won't be able to describe how to deal with every single type of problem. Instead, we will go over a class of common ones. In an ideal world you wouldn't have these problems, but that isn't the world we are living in. Mistakes happen for a multitude of reasons, and it is your job to deal with them.

TODO

Figure out how to properly reference https://twitter.com/camfassett/status/1575578362927452160

Look at the following vector of spellings of St Albans, the city in England.

[1] "St Albans"  "St. Albans" "St  Alban"  "st albans"  "St. Alban" 
[6] "St Albans " "st Albans" 

They all refer to the same city; they just don't agree on the spelling. If we didn't perform any cleaning, most of the following methods would assume that these spellings are completely different words and treat them as such. Depending on the method, this could vastly change the output and dilute any information you have in your data.

This will of course depend on whether these spelling differences matter in your model. Only you, the practitioner, will know the answer to that question, which is why you need to be in touch with the people generating your data. In the above situation, one could imagine that the data comes from two sources: one using a drop-down menu to select the city, and one where the city is typed in manually on a tablet. The values coming from those two sources will differ considerably. In a case such as that, we would want all these spellings to be fixed up.

Domain knowledge is important. As in spell-checking, we need to be careful not to over-correct the data by accidentally collapsing two different items into one.

Back to the example at hand. The first thing we notice is a difference in capitalization. Since we are presumably working with a variable of city names, capitalization shouldn't matter. A common trick is to turn everything either upper-case or lower-case. I prefer lower-casing, as I find the results easier to read.
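A minimal sketch of this step in base R, continuing with the spellings vector from above:

```r
# Fold everything to lower-case; drop the duplicates this creates
spellings <- unique(tolower(spellings))
spellings
```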

[1] "st albans"  "st. albans" "st  alban"  "st. alban"  "st albans "

By turning everything into lower-case, we were able to remove two of the errors. Next, we see that some of these spellings include periods. For this example, I'm going to decide that they are safe to remove as well.
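In base R, one way to do this is with a fixed-pattern gsub():

```r
# Remove periods; again drop any duplicates this creates
spellings <- unique(gsub(".", "", spellings, fixed = TRUE))
spellings
```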

[1] "st albans"  "st  alban"  "st alban"   "st albans "

This removes yet another of our problems. The next problem is horrible because it is so easy to overlook. Notice how the second spelling has a double space between "st" and "alban"? At a glance, that can be hard to see. Likewise, the last spelling has a trailing space. Let us fix this as well.
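A sketch of this cleanup in base R; stringr::str_squish() would accomplish both steps in a single call:

```r
# Collapse runs of spaces to a single space, then trim the ends
spellings <- unique(trimws(gsub(" +", " ", spellings)))
spellings
```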

[1] "st albans" "st alban" 

Now we are left with two values, which we would have to deal with manually. But even if we have to write out the mapping by hand, it is a lot easier to do so for 2 different spellings than for 7.
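A hedged sketch of such a manual mapping, using a named lookup vector. The decision that both values map to "st albans" is a judgment call, not something the data can tell us:

```r
# Manual mapping from the remaining spellings to one canonical value
mapping <- c("st albans" = "st albans", "st alban" = "st albans")
spellings <- unname(mapping[spellings])
unique(spellings)
```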

Another problem that you may or may not run into depends on the resilience of your modeling package. Some implementations are rather fragile when it comes to the column names of your data: non-ASCII characters, punctuation, and even spaces can cause errors. At this point in our journey we are almost there, and we can replace the spaces with underscores to be left with st_albans.
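For example:

```r
# Replace the remaining spaces with underscores
gsub(" ", "_", "st albans", fixed = TRUE)
```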

One problem we didn't see in the above example is what happens when you are working with accented characters. Consider the German word "schön." The o with an umlaut (two dots over it) is a fairly simple character, but it can be represented in a couple of different ways. We can use the single character \U00f6 to represent the letter with an umlaut. Alternatively, we can use two characters: one for the o, followed by the combining character \U0308 to denote the two dots over the previous character. As you can imagine, this can happen with many words if you have free-form data from a wide array of backgrounds. The method needed to fix this problem is called Unicode normalization.
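A minimal sketch using the stringi package, which implements several normalization forms; NFC (the composed form) is used here:

```r
library(stringi)

composed   <- "sch\u00f6n"   # single pre-composed character for the umlaut
decomposed <- "scho\u0308n"  # "o" followed by a combining umlaut

# The two strings print identically but do not compare equal
composed == decomposed

# After NFC normalization they do
stri_trans_nfc(composed) == stri_trans_nfc(decomposed)
```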

TODO

Pull nuggets from https://twitter.com/Emil_Hvitfeldt/status/1583466133977382912

This chapter was not meant to scare you. What I hope you take away from it is that it is very important to look at your data carefully, especially categorical variables, which can vary in so many unexpected ways. Nevertheless, using the above-described techniques, combined with some subject matter expertise, should get you quite a long way.

16.2 R Examples

TODO

find a good data set for these examples

Use janitor
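As a sketch of what this could look like, with a made-up data frame raw_df and hypothetical column names:

```r
library(janitor)

# Hypothetical data frame with messy column names
raw_df <- data.frame(
  `City Name`  = c("St Albans", "st. albans"),
  `Visit Date` = c("2020-01-01", "2020-01-02"),
  check.names = FALSE
)

# clean_names() standardizes the column names to snake_case
names(clean_names(raw_df))
```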

textrecipes::step_clean_levels()
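A hedged sketch of how step_clean_levels() could slot into a recipe; the data frame cities_df is made up for illustration:

```r
library(recipes)
library(textrecipes)

# Hypothetical data with messy factor levels
cities_df <- data.frame(
  city = factor(c("St Albans", "Detroit, MI", "New York")),
  y = c(1, 2, 3)
)

# step_clean_levels() cleans the levels of nominal variables
rec <- recipe(y ~ city, data = cities_df) |>
  step_clean_levels(city)

bake(prep(rec), new_data = NULL)
```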

16.3 Python Examples