102  Colors

Color is an interesting attribute that appears simple on the surface but can be analyzed in many different ways. One could treat colors like categorical features, but doing so ignores the inherent structure and connections that come with them.

Color names are strongly tied to language and culture, so it is imperative that we keep this in mind when creating mappings between color names and numerical representations.

TODO

Find good references for color and history.

Color names can also be quite ambiguous. You see this most prominently when buying paint, with each store or chain having its own, sometimes humorous, names for each shade they can produce. These names try to help customers distinguish between small differences in shades. Color names can also be used quite broadly. The color “green” could in one context mean an exact hue, and in another context refer to all the colors seen in a forest. The latter is akin to categorical collapse. All of this is to say that we should think carefully about our data, so that we can better extract the signal if it is there.

Assuming we want a precise mapping, we can construct a table of color names and their corresponding numerical representations. When working with computers, a common way to represent colors is with hex codes, which use a six-digit hexadecimal number to represent a color. They are written like #a8662b, with the first 2 digits representing how red the color is, the next 2 digits how green it is, and the last 2 digits how blue it is. This gives us 16^6 = 16,777,216 unique colors, which isn’t enough to specify all possible colors but is good enough for our use cases.
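To make the structure of a hex code concrete, here is a minimal sketch in base R that splits a hex code into its three two-digit pairs and converts each pair from hexadecimal to an integer between 0 and 255. The helper name `hex_to_rgb` is made up for this illustration and is not part of any package.

```r
# Hypothetical helper: split a single hex code into red, green, and blue values
hex_to_rgb <- function(hexcode) {
  # Drop the leading "#" so only the six hexadecimal digits remain
  digits <- sub("^#", "", hexcode)

  # Each pair of digits is a base-16 number between 0 and 255
  c(
    red = strtoi(substr(digits, 1, 2), base = 16L),
    green = strtoi(substr(digits, 3, 4), base = 16L),
    blue = strtoi(substr(digits, 5, 6), base = 16L)
  )
}

hex_to_rgb("#a8662b")

# Number of distinct hex codes: six digits with 16 choices each,
# or equivalently three channels with 256 levels each
16^6 == 256^3
```

For the example above, the pairs a8, 66, and 2b correspond to 168, 102, and 43. Base R's `grDevices::col2rgb()` performs the same conversion and also accepts color names.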

TODO

Find examples of color lists, maybe even create data-base.

library(prismatic)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.6          ✔ recipes      1.1.0.9000
✔ dials        1.3.0          ✔ rsample      1.2.1     
✔ dplyr        1.1.4          ✔ tibble       3.2.1     
✔ ggplot2      3.5.1          ✔ tidyr        1.3.1     
✔ infer        1.0.7          ✔ tune         1.2.1     
✔ modeldata    1.4.0          ✔ workflows    1.1.4     
✔ parsnip      1.2.1          ✔ workflowsets 1.1.0     
✔ purrr        1.0.2          ✔ yardstick    1.3.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
color_names <- c(
    "black",
    "brown",
    "gray",
    "white",
    "orange",
    "tan",
    "blue",
    "pink",
    "yellow",
    "chocolate"
 )

tibble(name = color_names) |>
  mutate(hexcode = stringr::str_sub(color(name), 1, 7)) |>
  mutate(
 red = prismatic::clr_extract_red(hexcode),
 green = prismatic::clr_extract_green(hexcode),
 blue = prismatic::clr_extract_blue(hexcode)
 )
# A tibble: 10 × 5
   name      hexcode   red green  blue
   <chr>     <chr>   <int> <int> <int>
 1 black     #000000     0     0     0
 2 brown     #A52A2A   165    42    42
 3 gray      #BEBEBE   190   190   190
 4 white     #FFFFFF   255   255   255
 5 orange    #FFA500   255   165     0
 6 tan       #D2B48C   210   180   140
 7 blue      #0000FF     0     0   255
 8 pink      #FFC0CB   255   192   203
 9 yellow    #FFFF00   255   255     0
10 chocolate #D2691E   210   105    30

Through these hex codes, we already have some numeric representations that we can use for modeling. However, they may not be the most effective representation, depending on what question we are trying to answer. This is where the idea of color spaces comes in. The one we have worked with so far is the RGB space, which is easy to use and understand but doesn’t translate well to notions that we typically care about, like “How dark is this color?”. Another color space that may handle such questions better is the HSL color space. It describes a color using 3 values: hue (think rainbow), which takes values between 0 and 360; saturation, which you can think of as its colorfulness relative to its own brightness, on a scale from 0 to 100; and lightness, which tells you how bright it is compared to pure white, on a scale from 0 to 100.

tibble(name = color_names) |>
  mutate(hexcode = stringr::str_sub(color(name), 1, 7)) |>
  mutate(
 hue = prismatic::clr_extract_hue(hexcode),
 saturation = prismatic::clr_extract_saturation(hexcode),
 lightness = prismatic::clr_extract_lightness(hexcode)
 )
# A tibble: 10 × 5
   name      hexcode   hue saturation lightness
   <chr>     <chr>   <dbl>      <dbl>     <dbl>
 1 black     #000000   0          0         0  
 2 brown     #A52A2A   0         59.4      40.6
 3 gray      #BEBEBE   0          0        74.5
 4 white     #FFFFFF   0          0       100  
 5 orange    #FFA500  38.8      100        50  
 6 tan       #D2B48C  34.3       43.7      68.6
 7 blue      #0000FF 240        100        50  
 8 pink      #FFC0CB 350.       100        87.6
 9 yellow    #FFFF00  60        100        50  
10 chocolate #D2691E  25         75        47.1
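To see how a question like “How dark is this color?” maps onto these values, the lightness column can be reproduced by hand. The sketch below assumes the standard HSL definition, where lightness is the midpoint between the largest and smallest of the three RGB channels, rescaled to run from 0 to 100. The function name `hsl_lightness` is made up for this illustration.

```r
# Hypothetical helper: HSL lightness (0 to 100) computed directly from hex codes
hsl_lightness <- function(hexcode) {
  # col2rgb() returns a 3 x n matrix with rows red, green, blue on a 0-255 scale
  rgb <- grDevices::col2rgb(hexcode)

  # Largest and smallest channel for each color
  channel_max <- apply(rgb, 2, max)
  channel_min <- apply(rgb, 2, min)

  # Midpoint of the two, rescaled from 0-255 to 0-100
  unname((channel_max + channel_min) / 2 / 255 * 100)
}

# brown, gray, and chocolate from the table above
round(hsl_lightness(c("#A52A2A", "#BEBEBE", "#D2691E")), 1)
```

These three values come out as 40.6, 74.5, and 47.1, matching the lightness column in the table.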

Viewing these colors in this color space allows us to create different features. We can now with relative ease say whether a color is close to blue by checking whether its hue is sufficiently close to 240. This could be expanded to any color on the hue wheel. We can likewise ask straightforward questions about saturation and lightness.
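One detail to keep in mind when building such a feature is that hue is an angle, so 350 and 10 are only 20 degrees apart. A minimal sketch of a "closeness to a reference hue" feature in base R could look like the following. The function name `hue_distance` and the 30 degree cutoff are assumptions made for illustration, not recommended values.

```r
# Hypothetical helper: smallest angle, in degrees, between a hue and a reference hue
hue_distance <- function(hue, reference) {
  diff <- abs(hue - reference) %% 360
  # Going around the wheel the other way may be shorter
  pmin(diff, 360 - diff)
}

# Hues from the table above (pink is shown rounded to 350)
hues <- c(orange = 38.8, tan = 34.3, blue = 240, pink = 350, yellow = 60, chocolate = 25)

# Illustrative cutoff: flag colors whose hue is within 30 degrees of blue (240)
hue_distance(hues, reference = 240) <= 30

# The wheel wraps around, so a hue of 350 is 10 degrees away from red (0)
hue_distance(350, reference = 0)
```

Black, gray, and white are left out here on purpose. They are reported with a hue of 0 in the table even though they have no meaningful hue, so a hue-based feature would likely need to be combined with a check on saturation.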

TODO

This section would benefit from illustrations of the color spaces

Imagine you wanted a feature that says “How close is this measured color to my reference color?”. For that, you would need what is called a perceptually uniform color space. These color spaces try to make Euclidean distances make sense, so that equal distances correspond to roughly equal perceived differences; examples include CIELAB and Oklab. The downside of these spaces is that the individual axes are less directly interpretable than hue, saturation, and lightness.

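# Perceptual distance (CIEDE2000) between the colors in `x` and one reference color `y`
how <- function(x, y) {
  # Convert both inputs to matrices of RGB values
  x <- prismatic::color(x) |>
    farver::decode_colour()

  y <- prismatic::color(y) |>
    farver::decode_colour()

  # Compare every row of `x` with `y`; keep the single resulting column as a vector
  farver::compare_colour(x, y, 'rgb', method = 'cie2000')[, 1]
}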

tibble(name = color_names) |>
  mutate(hexcode = stringr::str_sub(color(name), 1, 7)) |>
  mutate(
 red = how(hexcode, "red"),
 green = how(hexcode, "green"),
 blue = how(hexcode, "blue"),
 orange = how(hexcode, "orange")
 )
# A tibble: 10 × 6
   name      hexcode   red green  blue orange
   <chr>     <chr>   <dbl> <dbl> <dbl>  <dbl>
 1 black     #000000  50.4  87.9  39.7   69.9
 2 brown     #A52A2A  18.9  86.3  41.7   45.9
 3 gray      #BEBEBE  36.8  33.2  54.1   28.9
 4 white     #FFFFFF  45.8  33.3  64.2   33.1
 5 orange    #FFA500  33.8  48.4  78.1    0  
 6 tan       #D2B48C  34.4  36.6  66.1   17.1
 7 blue      #0000FF  52.9  83.2   0     78.1
 8 pink      #FFC0CB  34.9  63.8  56.8   35.8
 9 yellow    #FFFF00  64.3  23.4 103.    29.3
10 chocolate #D2691E  15.5  64.4  58.6   20.3

These are by no means all we can do with colors as predictors, but they might spark some helpful creativity.

102.2 Pros and Cons

102.2.1 Pros

  • Using color spaces to create features can have a significant impact

102.2.2 Cons

  • Creating the mappings between color words and their numerical representations can be challenging

102.3 R Examples

Wait for steps to do this.

library(tidymodels)
library(animalshelter)
longbeach |>
  mutate(primary_color = stringr::str_remove(primary_color, " .*")) |>
  count(primary_color, sort = TRUE) |>
  print(n = 20)

102.4 Python Examples