How I got run over by dplyr::case_when
Recently I had an issue using the SQL
adopted function dplyr::case_when
in case it could be of help to someone. Spoiler: just read the documentation and run meaningful checks on your data.
The issue
The case_when
function is just another great tool. As the IF
statements in SPSS or replace
in Stata, it is useful to generate a new column in a data frame which relies on the combination of other variables avoiding a verbose nest of ifelse
. A quick example:
## Example
## New column mpg_group in which mpg is split in three groups
## low (x < 15); med (15 <= x < 25); high (x >= 25)
mtcars_cw <- mtcars %>%
mutate(mpg_grp = case_when(
mpg < 15 ~ "Low",
mpg >= 15 & mpg < 25 ~ "Medium",
mpg >= 25 ~ "High")
)
## Check
mtcars_cw %>%
group_by(mpg_grp) %>%
summarise(min_mpg = min(mpg),
max_mpg = max(mpg)
)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
## mpg_grp min_mpg max_mpg
## <chr> <dbl> <dbl>
## 1 High 26 33.9
## 2 Low 10.4 14.7
## 3 Medium 15 24.4
Now let’s get into a slightly more complicated exercise: create a column mpg_cyl
that takes Four cylinders
if cyl == 4
and the mpg groups Low
or Medium
otherwise. Note that:
- I omit the
High
category because all the cars withmpg > 25
also have four cylinders. - However, not all the cars with four cylinders have
mpg > 25
.
## Intuition (not always a good friend) says that the computation is top-bottom
## So at the last step all the cyl == 4 will be replaced by "Four cylinders"
## regardless their mpg
mtcars_cw <- mtcars %>%
mutate(mpg_cyl = case_when(
mpg < 15 ~ "Low", # Evaluation order: first...
mpg >= 15 & mpg <= 25 ~ "Medium", # ...second...
cyl == 4 ~ "Four cylinders") # ...and third
)
## Check: wrong
table(mtcars_cw$mpg_cyl, mtcars_cw$cyl)
##
## 4 6 8
## Four cylinders 6 0 0
## Low 0 0 5
## Medium 5 7 9
The five cars with four cylinders and mpg < 25
are set to Medium
instead of Four cylinders
.
Why does this happen?
The ?case_when
documentation states that:
Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general.
In that case the syntax it shuld work since the last statement evaluated is cyl == 4 ~ "Four cylinders"
. However, and this is a HUGE however, if the values have been replaced once, they won’t be replaced again. This explains why the cyl == 4 ~ "Four cylinders"
only affects the cases that have mpg > 25
.
The right answer
## Expresions are evaluated in order top-bottom BUT
## the case will only be replaced once
mtcars_cw <- mtcars %>%
mutate(mpg_cyl = case_when(
cyl == 4 ~ "Four cylinders", # Evaluation order: first...
mpg < 15 ~ "Low", # ...then from the cases left...
mpg >= 15 & mpg <= 25 ~ "Medium") # ...and third from th cases left.
)
## Check: right
table(mtcars_cw$mpg_cyl, mtcars_cw$cyl)
##
## 4 6 8
## Four cylinders 11 0 0
## Low 0 0 5
## Medium 0 7 9
How to avoid…
Reading the documentation is always a good start but getting your data tested after each transformation is even more important.