76  Uniform Manifold Approximation and Projection

Uniform Manifold Approximation and Projection (UMAP) is another method that takes high-dimensional data and embeds it into a lower-dimensional space. It belongs to the same category of methods as autoencoders and ISOMAP, in that it allows for non-linear compression of the data. UMAP became popular in part because of how fast it runs, and because many people find the visualizations it produces appealing.

UMAP is an iterative, stochastic method. It performs the following steps:

  1. Calculate similarity scores between all pairs of points based on the original data set
  2. Use spectral embedding to place the points in a low-dimensional space
  3. Randomly sample a pair of points based on their similarity scores
  4. Flip a coin to decide which of the pair of points to focus on
  5. Randomly pick a non-neighbor point to move away from
  6. Move the selected point towards its neighbor and away from its non-neighbor
  7. Repeat steps 3-6
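The optimization loop in steps 3-6 can be sketched in a few lines. The following is a toy illustration, not the real algorithm: the similarity kernel, initialization, and update rule are all simplified stand-ins (real UMAP uses a per-point adaptive kernel, a spectral initialization, and gradient-based updates), but the sampling-and-nudging structure is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original high-dimensional data: 20 points in 10 dimensions.
X = rng.normal(size=(20, 10))

# Step 1 (simplified): similarity scores between all pairs of points,
# here a plain Gaussian kernel on Euclidean distances.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
sim = np.exp(-(d ** 2) / d.mean() ** 2)
np.fill_diagonal(sim, 0.0)  # a point is not paired with itself

# Step 2 (simplified): starting positions in 2 dimensions
# (random here; real UMAP uses a spectral embedding).
Y = rng.normal(scale=0.1, size=(20, 2))

lr = 0.05
probs = sim.ravel() / sim.sum()
for _ in range(2000):
    # Step 3: sample a pair of points proportional to their similarity.
    i, j = np.unravel_index(rng.choice(sim.size, p=probs), sim.shape)
    # Step 4: flip a coin to decide which point of the pair to move.
    if rng.random() < 0.5:
        i, j = j, i
    # Step 5: pick a random third point to treat as a non-neighbor.
    k = rng.choice([m for m in range(20) if m not in (i, j)])
    # Step 6: move i towards its neighbor j and away from non-neighbor k;
    # the repulsion fades with distance so the embedding stays bounded.
    diff = Y[i] - Y[k]
    Y[i] += lr * (Y[j] - Y[i])
    Y[i] += lr * diff / (1.0 + diff @ diff)
```

After enough iterations, points with high similarity in `X` end up near each other in `Y`, which is all the embedding promises to deliver.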

Beyond arguments such as n_epochs, which determines how long to run the algorithm for, and n_components, which determines the number of components to return, there are 3 main hyperparameters: n_neighbors, min_dist, and metric. n_neighbors determines how many points are considered neighbors; a point counts as its own neighbor. Lower values lead to a more local view of the data. min_dist determines how close points are allowed to be to each other in the low-dimensional space. metric determines how distances are calculated in the input data: euclidean, manhattan, jaccard, etc. There are many more arguments for UMAP; it is quite a flexible method that has been used in many fields.

One of the consequences of the algorithm is that only local distances matter in the embedding. That is, points that are close together in the embedding are also close together in the original data, but the components themselves don't carry any actionable information beyond their ability to separate points.

While UMAP is very popular, you need to be very careful when using it. The flexibility of UMAP is both its greatest strength and greatest weakness in feature engineering. Take a look at the figure below:


Figure: Faceted scatter plots of two UMAP components, with one facet per value of n_neighbors: 2, 3, 5, and 10. With 2 neighbors the points form around 5 tight clusters; with 3 neighbors they form one large, evenly distributed blob; with 5 neighbors they are distributed in a circular shape; with 10 neighbors they are distributed in a ring with few points in the middle.

Here we have 4 different applications of UMAP with different values of n_neighbors. Depending on the value, we either see clusters, structure, or no structure at all. This figure used the exact same data in all 4 applications, with the original data being normally distributed.

For this reason, you are discouraged from using small values of n_neighbors: you might fool yourself into finding relationships in your data that aren't there. This is even more important for feature engineering, where it can be harder to visually validate the results in higher dimensions.

UMAP is used more to extract separation between clusters in the data, and to preserve local structure, than to generate components that in and of themselves contain valuable information. This is one of the reasons why UMAP is so popular in clustering projects.

76.2 Pros and Cons

76.2.1 Pros

  • Fairly fast

76.2.2 Cons

  • Very little explainability
  • Has a lot of hyperparameters

76.3 R Examples

We will be using the ames data set for these examples.

library(recipes)
library(modeldata)
library(embed)
library(dplyr)

ames_num <- ames |>
  select(where(is.numeric))

{embed} provides step_umap(), which is the standard way to perform UMAP.

umap_rec <- recipe(~ ., data = ames_num) |>
  step_normalize(all_numeric_predictors()) |>
  step_umap(all_numeric_predictors(), num_comp = 5)

umap_rec |>
  prep() |>
  bake(new_data = NULL) |>
  glimpse()
Rows: 2,930
Columns: 5
$ UMAP1 <dbl> -0.43740699, -2.45369506, -2.92550945, 0.08030959, 1.12957335, 1…
$ UMAP2 <dbl> 0.4267507, -1.6785691, -1.4495940, 1.0937077, 2.6854937, 2.63307…
$ UMAP3 <dbl> 0.06459869, 1.45441735, 2.00685763, -0.44243026, -1.05284834, -1…
$ UMAP4 <dbl> -1.578369260, 0.139115691, -0.001287348, -2.074141502, 2.9047153…
$ UMAP5 <dbl> 0.403661758, 0.813153148, 0.216809466, 0.318011522, -1.913434148…

76.4 Python Examples

WIP