51  Stemming

When we tokenize text, we end up with many distinct tokens. More importantly, some of those tokens feel very similar. Take the tokens "house", "houses", and "housing". Whether those tokens should count as the same depends on your task. If you think they should, then consider applying stemming.

In its simplest form, stemming is the task of modifying tokens so that related tokens collapse into a single form. Most often this is done by modifying the ends of words. In practice, it is implemented with rule-based algorithms or regular expressions.
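As a minimal sketch of the regular-expression approach, the snippet below strips one suffix from the end of each token. The rule `(ing|es|e|s)$` and the function name `regex_stem` are made up for illustration; they are not a published stemmer and will over-stem many words.

```python
import re

# Hypothetical suffix rule, chosen only so that the three example
# tokens collapse; a real stemmer uses many more carefully ordered rules.
SUFFIX_RULE = re.compile(r"(ing|es|e|s)$")


def regex_stem(token: str) -> str:
    """Lowercase the token and strip one matching suffix from its end."""
    return SUFFIX_RULE.sub("", token.lower())


tokens = ["house", "houses", "housing"]
stems = [regex_stem(t) for t in tokens]
print(stems)  # ['hous', 'hous', 'hous']
```

All three tokens collapse to the same stem, "hous", which is not a real word. The same rule also turns "is" into "i", so even a tiny rule set needs inspection.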

Caution

Stemming doesn’t work well with BPE tokenizers, as those tokenizers already split words into smaller parts. Changing those representations is unlikely to have a positive effect, since any useful pattern would already have been picked up by the tokenizer itself.

The simplest English stemmer that has some useful applications is the S-Stemmer. This stemmer works by removing trailing s’s from tokens, which mainly turns plural forms of words into their singular form. It turns "houses" into "house", "Happiness" into "Happine", and "is" into "i". We don’t mind that the results aren’t real words. What is important is that we are reducing the number of distinct tokens in a hopefully useful way. If we took Jane Austen’s 6 published novels, tokenized them, and applied the S-Stemmer, we would go from around 14520 to 12860 distinct tokens, a drop of 11.4%.
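The behavior described above can be sketched in a few lines of Python. This is the simplified "strip every trailing s" version used in this chapter, not a full published rule set, and the toy vocabulary is made up for illustration.

```python
def s_stem(token: str) -> str:
    """Remove every trailing lowercase "s" from a token."""
    return token.rstrip("s")


examples = ["houses", "Happiness", "is"]
print([s_stem(t) for t in examples])  # ['house', 'Happine', 'i']

# Toy vocabulary (an assumption for this sketch) to show how the
# number of distinct tokens shrinks after stemming.
vocab = {"house", "houses", "cat", "cats", "is", "happiness"}
stemmed_vocab = {s_stem(t) for t in vocab}
drop = 1 - len(stemmed_vocab) / len(vocab)
print(len(vocab), len(stemmed_vocab), f"{drop:.1%}")  # 6 4 33.3%
```

The same calculation gives the figure quoted for the novels: 1 - 12860 / 14520 is roughly 0.114, or 11.4%.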

We see above that stemmers do produce some failures, such as the "is" to "i" example. It is up to us as practitioners to decide whether the uninformative collapses are worth it. A simple script can list all the collapsed tokens so that we can judge whether we can tolerate them.
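One way such a script could look is sketched below. It groups the original tokens by their stem and keeps only the stems that more than one token maps to. The helper name `collapsed_groups`, the stand-in `s_stem`, and the token list are all assumptions made for this example.

```python
from collections import defaultdict


def s_stem(token: str) -> str:
    """Stand-in stemmer: remove every trailing lowercase "s"."""
    return token.rstrip("s")


def collapsed_groups(tokens, stemmer):
    """Map each stem to the sorted distinct tokens that collapse into it.

    Only stems shared by two or more distinct tokens are returned.
    """
    groups = defaultdict(set)
    for token in tokens:
        groups[stemmer(token)].add(token)
    return {
        stem: sorted(members)
        for stem, members in groups.items()
        if len(members) > 1
    }


tokens = ["house", "houses", "is", "i", "cat", "cats", "dog"]
report = collapsed_groups(tokens, s_stem)
for stem, members in sorted(report.items()):
    print(f"{stem!r} <- {members}")
# 'cat' <- ['cat', 'cats']
# 'house' <- ['house', 'houses']
# 'i' <- ['i', 'is']
```

Reading through the report makes the questionable merges, such as "i" and "is", easy to spot. Any other stemmer can be passed in place of `s_stem`.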

A well-used stemmer is the Porter Stemmer (Porter 1980). Porter himself released the implementation in the Snowball framework. The stemmer has implementations in almost any language you would need. The English (Porter2) stemming algorithm is fully described on the website. Since it is a little more advanced than the S-Stemmer, there is little advantage in copying it into these pages. As of the time of writing, the last update to the algorithm happened in October 2023.

Note

You need a language-specific stemmer, as the features of the language influence the best way to stem. The idea that a stemmer needs to modify the end of a word comes from the fact that English words are stemmed by changing the end, but this is not a universal rule.

Lemmatization is a method similar to stemming, with a few differences. The main similarity is that both methods aim to find a shorter representation of the token. Stemming does this by working on each token in isolation, applying a fixed set of rules. Lemmatization, on the other hand, is linguistics-based and operates on the word in its context. Using part of speech and word meaning, it finds the lemma of each word. A good lemmatizer would change "said" to "say" and "came" to "come". These methods are orders of magnitude slower than stemming, but they can be useful when time isn’t a limiting factor, as the results tend to be better. One of the better software implementations of lemmatization is spaCy (Honnibal et al. 2020).

51.2 Pros and Cons

51.2.1 Pros

  • Can be a fast and easy way to reduce the number of tokens

51.2.2 Cons

  • Requires thorough inspection to avoid wrongful collapses

51.3 R Examples

51.4 Python Examples