71  Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) is a method quite similar to Principal Component Analysis (PCA). PCA creates a transformation that maximizes the variance of the resulting variables while keeping them uncorrelated. NMF, on the other hand, is a decomposition where the loadings are required to be non-negative.

TODO

Add a diagram showing the matrix decomposition.
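
Until a diagram is in place, here is a minimal sketch of the decomposition in equation form, using standard NMF notation rather than anything specific to this chapter: a non-negative $n \times p$ data matrix $X$ is approximated by the product of a non-negative $n \times k$ score matrix $W$ and a non-negative $k \times p$ loading matrix $H$, where $k$ is the number of components,

$$
X \approx W H, \qquad W \ge 0, \quad H \ge 0,
$$

with $W$ and $H$ typically chosen to minimize the reconstruction error $\lVert X - W H \rVert_F^2$ subject to these constraints.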

All the factor loadings are calculated at once, in contrast to PCA, where each component is extracted one by one. This difference has several consequences. There is no ordering or ranking of the resulting features: in PCA, the first output feature will be the same (up to a sign) whether you ask for 1 PC or 10 PCs. This is not the case for NMF, where the signal is spread across all the output features, and the amount of signal contained in each feature tends to decrease as the number of features grows. The number of components is thus a hyperparameter you want to tune, aiming for a value that pulls useful signals out of the data without packing too much information into too few components.
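
As a rough sketch of what that tuning could look like with tidymodels, the number of components can be marked with tune() and resolved by resampling, using the step_nnmf_sparse() step from {recipes} that appears in the R examples below. The data set dat, the outcome y, the model, and the grid values here are placeholders, not taken from this chapter:

library(tidymodels)

# treat the number of NMF components as a tuning parameter
nmf_tune_rec <- recipe(y ~ ., data = dat) |>
  step_nnmf_sparse(all_numeric_predictors(), num_comp = tune())

nmf_wflow <- workflow() |>
  add_recipe(nmf_tune_rec) |>
  add_model(linear_reg())

nmf_res <- tune_grid(
  nmf_wflow,
  resamples = vfold_cv(dat, v = 5),
  grid = tibble(num_comp = c(2, 4, 8, 16, 32))
)

# pick the number of components that gives the best performance
show_best(nmf_res, metric = "rmse")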

Below is an example where we apply NMF to the MNIST data set, asking for 4 components. Each pixel is treated as a predictor, and the colors show the factor loadings for each component, visually indicating which parts of the image each component is activated by.
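
The exact code used to produce these figures isn't shown here. A hedged sketch of the data preparation, assuming the images come from keras::dataset_mnist(), might look like the following: each 28x28 image is flattened so that every pixel becomes one of 784 predictors, and the pixel intensities are already non-negative.

library(keras)

mnist <- dataset_mnist()
x_train <- mnist$train$x                         # 60000 x 28 x 28 array of pixel intensities
x_flat <- array_reshape(x_train, c(60000, 784))  # one row per image, one column per pixel

The flattened matrix can then be passed to an NMF implementation, and each component's 784 loadings reshaped back into a 28x28 grid for plotting.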


[Figure: NMF applied to all of MNIST with 4 components. Faceted tile chart, one panel per component; in order, the components roughly resemble the digits 9, 0, 7, and 3, with the 7 mostly capturing the diagonal stroke.]

If we instead increase the number of components to 9, we see that each component picks up more, and different, shapes than before.

[Figure: NMF applied to all of MNIST with 9 components. Faceted tile chart, one panel per component; the digits emerging from each component are a little harder to pinpoint.]

This is also a good time to note that many implementations of NMF depend on the random initialization, so refitting the same 9-component factorization with a different seed gives a noticeably different result.

[Figure: NMF applied to all of MNIST with 9 components and a different random initialization. Faceted tile chart, one panel per component; the components capture similar shapes to the previous run, just split up differently between components.]

Note that the resulting features can be correlated with each other, since nothing in the method requires them to be uncorrelated.

[Figure 71.1: Non-zero correlation between all features. Correlation chart of the NMF features; all pairwise correlations appear to be non-zero, and most are positive.]

Computing the decomposition is harder and more computationally expensive than PCA, so approximate, iterative implementations are often used to mitigate this. Those algorithms can get stuck in local optima and are not guaranteed to find the global optimum. This means the solution is not unique, and you may need to run the factorization multiple times with different seeds to find a better fit.
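
As a minimal sketch of what such reruns could look like with the {recipes} step used later in this chapter, assuming its seed argument and with dat as a placeholder non-negative numeric data frame:

library(recipes)

# fit the same factorization twice with different random initializations
fit_1 <- recipe(~ ., data = dat) |>
  step_nnmf_sparse(all_numeric_predictors(), num_comp = 4, seed = 1) |>
  prep()

fit_2 <- recipe(~ ., data = dat) |>
  step_nnmf_sparse(all_numeric_predictors(), num_comp = 4, seed = 2) |>
  prep()

# the derived features will generally differ between the two runs
all.equal(bake(fit_1, new_data = NULL), bake(fit_2, new_data = NULL))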

There are a couple of restrictions on the data to which NMF can be applied: it only accepts numeric data, with no missing values and no negative values. Requiring no missing values isn't that bad of a restriction, as it is shared with most other dimensionality reduction methods. The non-negativity requirement can be more impactful, but for many use cases it isn't a big downside, since non-negative data such as counts and measurements is quite common in many fields.

Since all the data is required to be non-negative and all the loading values are non-negative, we get quite nice interpretability, as the different features can't cancel each other out. Furthermore, some implementations can produce sparse loadings, making interpretation even easier. The main downside of turning on sparsity is that we need to find the right amount of sparsity to avoid a drop in performance.
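
In {recipes}, the step_nnmf_sparse() step used in the examples below exposes a penalty argument for this purpose, and like num_comp it can be given tune() and resolved by resampling. A minimal sketch, with dat again a placeholder data frame:

recipe(~ ., data = dat) |>
  step_nnmf_sparse(
    all_numeric_predictors(),
    num_comp = 5,
    penalty = 0.01  # larger values push more loadings toward exactly zero
  )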

71.2 Pros and Cons

71.2.1 Pros

  • More interpretable results

71.2.2 Cons

  • Data must be non-negative
  • Computationally expensive
  • Training depends on the seed

71.3 R Examples

We will be using the ames data set for these examples.

library(recipes)
library(modeldata)
library(dplyr)

ames_num <- ames |>
  select(where(is.numeric))

{recipes} provides step_nnmf_sparse(), which performs sparse NMF.

nmf_rec <- recipe(~ ., data = ames_num) |>
  step_normalize(all_numeric_predictors()) |>
  step_nnmf_sparse(all_numeric_predictors(), num_comp = 5)

nmf_rec |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()
Rows: 2,930
Columns: 5
$ NNMF1 <dbl> 0.50223943, -0.43176151, -0.10637692, 0.75264746, 0.11416976, 0.…
$ NNMF2 <dbl> -0.24278764, -0.80251541, -0.23614126, 0.33140894, 0.32912945, 0…
$ NNMF3 <dbl> 0.021457403, 1.224688164, -0.294556573, 0.002563484, -0.19606363…
$ NNMF4 <dbl> 0.26255964, -0.10950662, 14.45899374, 0.01378767, -0.03785448, -…
$ NNMF5 <dbl> -0.07046248, -0.06248286, -0.06310062, -0.07464162, -0.09602224,…
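
As a small follow-up, and echoing Figure 71.1 from earlier, we can check directly that the derived features are not forced to be uncorrelated (a usage sketch; the output is not shown here):

nmf_rec |>
  prep() |>
  bake(new_data = NULL) |>
  cor() |>
  round(2)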

71.4 Python Examples