Recently I had an issue using the SQL adopted function dplyr::case_when in case it could be of help to someone. Spoiler: just read the documentation and run meaningful checks on your data.
The issue
The case_when function is just another great tool. As the IF statements in SPSS or replace in Stata, it is useful to generate a new column in a data frame which relies on the combination of other variables avoiding a verbose nest of ifelse. A quick example:
## Example## New column mpg_group in which mpg is split in three groups## low (x < 15); med (15 <= x < 25); high (x >= 25)mtcars_cw <- mtcars %>%mutate(mpg_grp =case_when( mpg <15~"Low", mpg >=15& mpg <25~"Medium", mpg >=25~"High") )## Checkmtcars_cw %>%group_by(mpg_grp) %>%summarise(min_mpg =min(mpg),max_mpg =max(mpg) )
# A tibble: 3 x 3
mpg_grp min_mpg max_mpg
<chr> <dbl> <dbl>
1 High 26 33.9
2 Low 10.4 14.7
3 Medium 15 24.4
Now let’s get into a slightly more complicated exercise: create a column mpg_cyl that takes Four cylinders if cyl == 4 and the mpg groups Low or Medium otherwise. Note that:
I omit the High category because all the cars with mpg > 25 also have four cylinders.
However, not all the cars with four cylinders have mpg > 25.
## Intuition (not always a good friend) says that the computation is top-bottom## So at the last step all the cyl == 4 will be replaced by "Four cylinders"## regardless their mpgmtcars_cw <- mtcars %>%mutate(mpg_cyl =case_when( mpg <15~"Low", # Evaluation order: first... mpg >=15& mpg <=25~"Medium", # ...second... cyl ==4~"Four cylinders") # ...and third )## Check: wrongtable(mtcars_cw$mpg_cyl, mtcars_cw$cyl)
4 6 8
Four cylinders 6 0 0
Low 0 0 5
Medium 5 7 9
The five cars with four cylinders and mpg < 25 are set to Medium instead of Four cylinders.
Why does this happen?
The ?case_when documentation states that:
Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general.
In that case the syntax it shuld work since the last statement evaluated is cyl == 4 ~ "Four cylinders". However, and this is a HUGE however, if the values have been replaced once, they won’t be replaced again. This explains why the cyl == 4 ~ "Four cylinders" only affects the cases that have mpg > 25.
The right answer
## Expresions are evaluated in order top-bottom BUT## the case will only be replaced oncemtcars_cw <- mtcars %>%mutate(mpg_cyl =case_when( cyl ==4~"Four cylinders", # Evaluation order: first... mpg <15~"Low", # ...then from the cases left... mpg >=15& mpg <=25~"Medium") # ...and third from th cases left. )## Check: righttable(mtcars_cw$mpg_cyl, mtcars_cw$cyl)
4 6 8
Four cylinders 11 0 0
Low 0 0 5
Medium 0 7 9
How to avoid…
Reading the documentation is always a good start but getting your data tested after each transformation is even more important.