56 TF-IDF
56.1 TF-IDF
Term frequency-inverse document frequency (TF-IDF) is the next iteration of the term frequency representation we explored in Chapter 55. As the name implies, it is what you get when you take the term frequencies and multiply them by the inverse document frequency.
\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \]
Conceptually, we start by creating the term frequency matrix we created in Chapter 55. Then we look at each term/token. We calculate the inverse document frequency by dividing the total number of documents by the number of documents that token appears in, and then taking the logarithm of that ratio. There will thus be one IDF value for each token, and each column of the term frequency matrix is multiplied by the corresponding token's IDF value.
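To make the calculation concrete, here is a minimal sketch in base R; the three-document corpus and the whitespace tokenization are purely illustrative.

```r
docs <- c(
  "cats chase mice",
  "dogs chase cats",
  "mice eat cheese"
)

tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# Term frequency matrix: one row per document, one column per token.
tf <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))

# IDF: log of (number of documents / number of documents containing the token).
n_docs <- nrow(tf)
doc_freq <- colSums(tf > 0)
idf <- log(n_docs / doc_freq)

# Multiply each column of the TF matrix by that token's IDF value.
tf_idf <- sweep(tf, 2, idf, `*`)
tf_idf
```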
You can think of IDF as a dampening value. If a given token appears in 1 out of 100 documents, its IDF is 4.61; if it appears in 1 out of 10 documents, the value is 2.3; if the token appears in 90% of the documents, the IDF is 0.11; and if the token appears in every document, the IDF is 0. We reward tokens that appear in a few documents and punish tokens that appear in every document.
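Assuming the natural logarithm, these values can be checked directly:

```r
log(100 / 1)   # appears in 1 of 100 documents:   ~4.61
log(10 / 1)    # appears in 1 of 10 documents:    ~2.30
log(100 / 90)  # appears in 90% of the documents: ~0.11
log(1 / 1)     # appears in every document:        0
```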
It is important to note that TF-IDF is a trained method, so we need to save the IDF values in order to apply them to new incoming observations.
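As a sketch of what this means in practice, the IDF values below are estimated from a training term frequency matrix and then reused unchanged on new documents; the matrices and helper function names are made up for illustration.

```r
# Two tiny term frequency matrices sharing the same vocabulary (illustrative).
vocab <- c("cats", "chase", "mice")
train_tf <- matrix(c(1, 1, 1,
                     2, 1, 0), nrow = 2, byrow = TRUE,
                   dimnames = list(NULL, vocab))
new_tf   <- matrix(c(0, 1, 1), nrow = 1,
                   dimnames = list(NULL, vocab))

# Estimate the IDF values from the training documents only.
fit_idf <- function(tf) log(nrow(tf) / colSums(tf > 0))

# Apply previously estimated IDF values to any term frequency matrix.
apply_tf_idf <- function(tf, idf) sweep(tf, 2, idf, `*`)

idf <- fit_idf(train_tf)                    # learned once, on the training set
train_tf_idf <- apply_tf_idf(train_tf, idf)
new_tf_idf   <- apply_tf_idf(new_tf, idf)   # the same IDF values are reused
```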
Above we explained the main idea behind the TF-IDF calculation. In practice, there are a couple of modifications and options one might take. Smoothing of the IDF values is commonly done as a default to avoid division-by-zero issues.
clarify which smoothing method I will cover
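As one possible sketch (the specific smoothing covered here is still to be decided), a widely used variant adds one to both the number of documents and the document frequencies, and adds one to the result so that no token is down-weighted all the way to zero:

```r
# One common smoothed IDF (an assumption about which variant is meant here):
# adding one to both counts acts as if a document containing every token had
# been seen, so the denominator can never be zero.
smooth_idf <- function(tf) {
  log((1 + nrow(tf)) / (1 + colSums(tf > 0))) + 1
}
```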
We could also use a probabilistic inverse document frequency. We calculate this by making the numerator the number of documents that do not contain the token (the total number of documents minus the number of documents it occurs in), instead of the total number of documents.
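A minimal sketch of the probabilistic variant, using the same matrix layout as above:

```r
# Probabilistic IDF: the numerator counts the documents that do NOT contain
# the token. A token present in every document gives log(0) = -Inf, so in
# practice this is usually combined with some form of smoothing.
prob_idf <- function(tf) {
  n_docs <- nrow(tf)
  doc_freq <- colSums(tf > 0)
  log((n_docs - doc_freq) / doc_freq)
}
```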
add a graph of how these weighting scales differ
The idea of TF-IDF was first proposed in 1972 (Sparck Jones 1972). While it works well in practice, it wasn't based on much theory. Efforts have been made to ground the IDF calculation in theory (Robertson 2004), but they haven't changed how it is used.
56.2 Pros and Cons
56.2.1 Pros
- Can lead to performance gains if the IDF weighting captures relevant words
56.2.2 Cons
- Harder to interpret than plain term frequencies