49 Text Cleaning

49.1 Text Cleaning

When working with unstructured data such as free-form text, you will likely need to do some cleaning. As with basically every method in this book, not all the steps in this chapter should be applied, and it is up to you, the practitioner, to evaluate what is best for your data. It is also important that you formalize what these steps should be, so that you can apply them consistently to new data that comes in.

I’ll split this chapter into 2 major categories. It doesn’t mean that they are the only ones, but it helps to keep things clear in our minds. These two categories are removing things and transforming things.

Removing things should be quite self-explanatory. This group of actions can be quite lengthy depending on how clean your data is when you get it. HTML tags, Markdown, and other artifacts are prime candidates for removal as they carry less information about the words written and more about how they are displayed. You can also get malformed text for different reasons, with the most common being speech-to-text and optical character recognition (OCR). If you get this kind of malformed text, you should try your best to have that part of the process improved, to hopefully avoid the malformed text altogether, but the escape hatch would be to remove it. Sometimes people will remove all non-alphanumeric values from the text. This is a quick and crude method of getting something decent and easy to work with. It is also popular because it can be done using a small regular expression. The crudeness becomes a problem because it removes indiscriminately, such that periods, question marks, emojis, and non-Latin characters are all removed.
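As a minimal sketch of the removal steps described above, the following base R code strips HTML tags and then applies the crude "keep only alphanumerics" approach. The example string and both regular expressions are assumptions made for illustration; real HTML is better handled by a proper parser.

```r
# Hypothetical input: HTML tags, a question mark, and an accented letter
raw_text <- "<p>Is the <b>caf\u00e9</b> open today?</p>"

# Remove anything that looks like a tag: "<", one or more non-">" characters, ">"
no_tags <- gsub("<[^>]+>", "", raw_text, perl = TRUE)
no_tags

# Crude approach: drop everything that is not an ASCII letter, digit, or space
crude <- gsub("[^A-Za-z0-9 ]", "", no_tags, perl = TRUE)
crude
```

Note how the crude step silently drops both the question mark and the "é", which is the indiscriminate behaviour described above.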

The crudeness of the last approach is what inspires some of the transformation actions we can take. If we are going to count different parts of the text for analysis, then we might benefit from some bucketing. We could replace all emojis with EMOJI to denote that an emoji was present. The same could be done with emails, hashtags, user names, dates, datetimes, and URLs. Doing this will rob us of the information about which emojis were used, but it lets us count how many times and where they are used instead.
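The bucketing idea can be sketched with a few regular expressions in base R. The example text, the patterns, and the placeholder names URL, HASHTAG, and USERNAME are all simplified assumptions chosen for illustration; production patterns usually need to be more careful.

```r
# Hypothetical social-media style text
text <- "Thanks @anna, details at https://example.com/post #rstats"

# Simplified patterns, assumed for illustration only
url_pattern <- "https?://[^[:space:]]+"
hashtag_pattern <- "#[[:alnum:]_]+"
user_pattern <- "@[[:alnum:]_]+"

# Count hashtags before they are replaced
n_hashtags <- lengths(regmatches(text, gregexpr(hashtag_pattern, text)))

# Replace URLs first, since a URL may itself contain "#" or "@"
bucketed <- gsub(url_pattern, "URL", text)
bucketed <- gsub(hashtag_pattern, "HASHTAG", bucketed)
bucketed <- gsub(user_pattern, "USERNAME", bucketed)
bucketed
```

The order of the replacements matters here: running the hashtag pattern before the URL pattern could break up a URL that contains a fragment such as `#section`.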

# Before
"I replied to example@email.com on 2020-05-01"

# After
"I replied to EMAIL on DATE"

Another case of bucketing is lower-casing all the text. By making all the characters lowercase we lose some information. In English, we lose proper nouns and the starts of sentences, and in German, we lose all nouns. We do this when we work under the assumption that the difference between Today and today isn’t meaningful to our analysis.

One thing that can happen when we lowercase is that some words we don’t want to combine are combined; it and IT make a good example. Replacing IT with information technology before lowercasing is one way to solve this issue. As you can see, there are a lot of different ways to improve our data quality, and spending some time at this part of the process tends to give better results as it affects every part of the rest of the pipeline.
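A small sketch of the it/IT problem in base R follows. The example sentences are made up, and the word-boundary pattern is only one possible way to pick out the upper-case abbreviation.

```r
# Hypothetical sentences where "IT" and "it" mean different things
text <- c("Today IT fixed it.", "today it works.")

# Plain lowercasing merges "IT" and "it"
lowered <- tolower(text)
lowered

# Expand the standalone upper-case word IT first, then lowercase
expanded <- gsub("\\bIT\\b", "information technology", text, perl = TRUE)
cleaned <- tolower(expanded)
cleaned
```

Because `gsub()` is case-sensitive by default, the expansion has to happen before the call to `tolower()`; afterwards the two words can no longer be told apart.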

The choices you make on the actions above will depend on where the text is supposed to end up. If we want to use the data as part of a supervised machine learning model, then we will likely be counting tokens, and keeping periods will not be helpful. If we want text generation, we need the periods, as the model will try to replicate the text it has seen.

When working with text data you will inevitably run into encodings and their associated irregularities.

text encoding

Computers work with ones and zeroes. To store text, we got creative and started assigning each character a different number, since numbers already have a good representation in memory. Common encodings include ASCII, which covers 128 different symbols (7 bits), and UTF-8, which can encode 1,112,064 different code points.
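To make the character-to-number mapping concrete, here is a short base R sketch. It only uses `utf8ToInt()`, `intToUtf8()`, and `charToRaw()`; the variable names are chosen for illustration.

```r
# The letter "A" is stored as the number 65, which fits in 7 bits (ASCII)
a_code <- utf8ToInt("A")
a_code

# Going the other way: from number back to character
intToUtf8(65)

# "ö" has code point 246 (hex f6), outside the ASCII range of 0 to 127
o_umlaut <- "\u00f6"
o_code <- utf8ToInt(o_umlaut)
o_code

# In UTF-8 that single character is stored as two bytes: c3 b6
o_bytes <- charToRaw(o_umlaut)
o_bytes
```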

What this means in practice is that you can run into issues when data is read in with the wrong encoding, or when multiple pieces of text read the same but are constructed from different characters. You are less and less likely to run into encoding confusion as more and more text is encoded with UTF-8. The first issue appears as glitched text. If you see that, you can use one of the many encoding detection methods available online to try to narrow down how to read the text correctly.
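The glitched text mentioned above can be reproduced on purpose. The following base R sketch takes the UTF-8 bytes of a word and declares them once as UTF-8 and once as latin1; the choice of latin1 as the "wrong" encoding is an assumption made for illustration.

```r
# The UTF-8 bytes of "schön": 73 63 68 c3 b6 6e
bytes <- charToRaw("sch\u00f6n")

# Same bytes, declared as UTF-8 (correct)
as_utf8 <- rawToChar(bytes)
Encoding(as_utf8) <- "UTF-8"

# Same bytes, declared as latin1 (wrong)
as_latin1 <- rawToChar(bytes)
Encoding(as_latin1) <- "latin1"

# Read correctly, the word has 5 characters
as_utf8
nchar(as_utf8)

# Read as latin1, the two bytes of "ö" become two separate characters
glitched <- enc2utf8(as_latin1)
glitched
nchar(glitched)
```

The bytes never change in this example; only the declared encoding does, which is why fixing the declared encoding at read time is usually enough to repair this kind of glitch.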

The other issue is much harder to detect unless you know what you are looking for. Consider the two strings "schön" and "schön". Are they the same? The reasonable reader would say yes, but the computer disagrees. If you look at the Unicode codes, you see that they are different: \u0073\u0063\u0068\u00f6\u006e and \u0073\u0063\u0068\u006f\u0308\u006e. The difference comes in the o with the umlaut. In the first string, 00f6 is a “latin small letter o with diaeresis”, whereas in the second string, 006f is a “latin small letter o” and 0308 is a “combining diaeresis”. This means the first string uses one symbol to represent "ö" and the second string uses two symbols to represent it, by combining an "o" with an umlaut symbol. Yet they appear identical to the naked eye.
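The difference between the two strings can be inspected directly in base R. This sketch writes both strings with explicit Unicode escapes so that the two forms cannot be confused; the variable names are illustrative.

```r
composed <- "sch\u00f6n"     # one code point for "ö"
decomposed <- "scho\u0308n"  # "o" followed by a combining diaeresis

# They print the same but do not compare as equal
same <- composed == decomposed
same

# The code points, shown as four-digit hex values
hex_composed <- sprintf("%04x", utf8ToInt(composed))
hex_decomposed <- sprintf("%04x", utf8ToInt(decomposed))
hex_composed
hex_decomposed

# The number of code points differs as well
nchar(composed)
nchar(decomposed)
```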

This is an issue because the computer considers these strings to be different, so in counting operations they are counted separately. Luckily there are solutions to this problem, and the method is known as Unicode normalization. How these methods work could fill a chapter by itself and is not covered here, but the methods themselves are easy to find and use on your data. Unless you are sure that you won’t have these issues, or you are in a big performance crunch, it is recommended to use Unicode normalization at all times.
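As one possible way to apply Unicode normalization outside of a recipe, the sketch below uses `stri_trans_nfc()` and `stri_trans_nfd()` from the {stringi} package. Using {stringi} here is an assumption of this example (it must be installed separately), and NFC is just one of several normalization forms.

```r
# Assumes the stringi package is installed
library(stringi)

composed <- "sch\u00f6n"     # one code point for "ö"
decomposed <- "scho\u0308n"  # "o" followed by a combining diaeresis

# NFC composes a base letter and combining mark into one code point where possible
nfc <- stri_trans_nfc(c(composed, decomposed))
nfc[1] == nfc[2]

# After normalization, both strings fall into the same group when counting
table(nfc)

# NFD goes the other way and decomposes both strings
nfd <- stri_trans_nfd(c(composed, decomposed))
nchar(nfd)
```

Either form makes the two strings comparable; what matters is that the same form is applied consistently to all text, including new data.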

49.2 Pros and Cons

49.2.1 Pros

When applied correctly, can lead to boosts in insights into the data

49.2.2 Cons

Can be a quite manual process which will likely be domain specific

49.3 R Examples

TODO

Find a data set that isn’t clean

We will use the step_text_normalization() function from the {textrecipes} package to perform Unicode normalization, which defaults to the NFC normalization form.

library(textrecipes)

Loading required package: recipes

Loading required package: dplyr


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


Attaching package: 'recipes'

The following object is masked from 'package:stats':

    step

sample_data <- tibble(text = c("sch\U00f6n", "scho\U0308n"))

sample_data |>
  count(text)

# A tibble: 2 × 2
  text      n
  <chr> <int>
1 schön     1
2 schön     1

rec <- recipe(~., data = sample_data) |>
  step_text_normalization(text) |>
  prep()

rec |>
  bake(new_data = NULL) |>
  count(text)

# A tibble: 1 × 2
  text      n
  <fct> <int>
1 schön     2