How I got run over by dplyr::case_when

no

Recently I had an issue using the SQL adopted function dplyr::case_when in case it could be of help to someone. Spoiler: just read the documentation and run meaningful checks on your data.

The issue

The case_when function is just another great tool. As the IF statements in SPSS or replace in Stata, it is useful to generate a new column in a data frame which relies on the combination of other variables avoiding a verbose nest of ifelse. A quick example:

## Example
## New column mpg_group in which mpg is split in three groups
## low (x < 15); med (15 <= x < 25); high (x >= 25)

mtcars_cw <- mtcars %>% 
  mutate(mpg_grp = case_when(
    mpg < 15               ~ "Low",
    mpg >= 15 & mpg < 25  ~ "Medium",
    mpg >= 25             ~ "High")
    )

## Check
mtcars_cw %>% 
  group_by(mpg_grp) %>% 
  summarise(min_mpg = min(mpg),
            max_mpg = max(mpg)
            )
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
##   mpg_grp min_mpg max_mpg
##   <chr>     <dbl>   <dbl>
## 1 High       26      33.9
## 2 Low        10.4    14.7
## 3 Medium     15      24.4

Now let’s get into a slightly more complicated exercise: create a column mpg_cyl that takes Four cylinders if cyl == 4 and the mpg groups Low or Medium otherwise. Note that:

  1. I omit the High category because all the cars with mpg > 25 also have four cylinders.
  2. However, not all the cars with four cylinders have mpg > 25.
## Intuition (not always a good friend) says that the computation is top-bottom
## So at the last step all the cyl == 4 will be replaced by "Four cylinders"
## regardless their mpg

mtcars_cw <- mtcars %>% 
  mutate(mpg_cyl = case_when(
    mpg < 15               ~ "Low", # Evaluation order: first...
    mpg >= 15 & mpg <= 25  ~ "Medium", # ...second...
    cyl == 4               ~ "Four cylinders") # ...and third
    )

## Check: wrong
table(mtcars_cw$mpg_cyl, mtcars_cw$cyl)
##                 
##                  4 6 8
##   Four cylinders 6 0 0
##   Low            0 0 5
##   Medium         5 7 9

The five cars with four cylinders and mpg < 25 are set to Medium instead of Four cylinders.

Why does this happen?

The ?case_when documentation states that:

Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general.

In that case the syntax it shuld work since the last statement evaluated is cyl == 4 ~ "Four cylinders". However, and this is a HUGE however, if the values have been replaced once, they won’t be replaced again. This explains why the cyl == 4 ~ "Four cylinders" only affects the cases that have mpg > 25.

The right answer

## Expresions are evaluated in order top-bottom BUT
## the case will only be replaced once

mtcars_cw <- mtcars %>% 
  mutate(mpg_cyl = case_when(
    cyl == 4               ~ "Four cylinders", # Evaluation order: first...
    mpg < 15               ~ "Low", # ...then from the cases left...
    mpg >= 15 & mpg <= 25  ~ "Medium") # ...and third from th cases left.
    )

## Check: right
table(mtcars_cw$mpg_cyl, mtcars_cw$cyl)
##                 
##                   4  6  8
##   Four cylinders 11  0  0
##   Low             0  0  5
##   Medium          0  7  9

How to avoid…

Reading the documentation is always a good start but getting your data tested after each transformation is even more important.