## R: calculate time difference between specific events

Question

I have the following dataset:

``````df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
complete_order = c(rep(c(1,0,0,1), 2)),
order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')))

library(lubridate)
df\$order_date = as_date(df\$order_date)

user_id complete_order order_date
1              1 2015-01-28
1              0 2015-01-31
1              0 2015-02-08
1              1 2015-02-23
2              1 2015-01-25
2              0 2015-01-28
2              0 2015-02-06
2              1 2015-02-21
``````

I'm trying to calculate the difference in days between only completed orders for each user. The desirable outcome would look like this:

``````user_id complete_order order_date complete_order_time_diff
<fctr>         <fctr>     <date>              <time>
1              1    2015-01-28             NA days
1              0    2015-01-31              3 days
1              0    2015-02-08             11 days
1              1    2015-02-23             26 days
2              1    2015-01-25             NA days
2              0    2015-01-28              3 days
2              0    2015-02-06             12 days
2              1    2015-02-21             27 days
``````

when I try this solution:

``````library(dplyr)

df %>%
group_by(user_id) %>%
mutate(complete_order_time_diff = order_date[complete_order==1]-lag(order_date[complete_order==1))
``````

it returns the error:

`Error: incompatible size (3), expecting 4 (the group size) or 1`

Any help with this will be great, thank you!

Show source

## Answers to R: calculate time difference between specific events ( 3 )

1. I think you can add a `filter` function in place of the subsetting with `order_date[complete_order == 1]` and make sure the `order_date` (and other variables) are the correct data types by adding `stringsAsFactors = F` to `data.frame()`):

``````df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
complete_order = c(rep(c(1,1,0,1), 2)),
order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')),
stringsAsFactors = F)

df\$order_date <- lubridate::ymd(df\$order_date)

df %>%
group_by(user_id) %>%
filter(complete_order == 1) %>%
mutate(complete_order_time_diff = order_date - lag(order_date))
``````

This returns the time until the next complete order (and `NA` if there is not one):

``````  user_id complete_order order_date complete_order_time_diff
<chr>          <chr>     <date>                   <time>
1       1              1 2015-01-28                  NA days
2       1              1 2015-01-31                   3 days
3       1              1 2015-02-23                  23 days
4       2              1 2015-01-25                  NA days
5       2              1 2015-01-28                   3 days
6       2              1 2015-02-21                  24 days
``````
2. try this

``````library(dplyr)

df %>% group_by(user_id, complete_order) %>%
mutate(c1 = order_date - lag(order_date)) %>%
group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>% ungroup %>%
mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>%
select(-c(c1, c2))
``````

# Update

for multiple cancelled orders

`````` df %>% mutate(c3=cumsum( complete_order != "0")) %>% group_by(user_id, complete_order) %>%
mutate(c1 = order_date - lag(order_date)) %>%
group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>%
mutate(c2=as.numeric(c2)) %>% group_by(user_id, c3) %>%
mutate(c2=cumsum(ifelse(complete_order==1, 0, c2))) %>% ungroup %>%
mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>%
select(-c(c1, c2, c3))
``````

# logic

`c3` is an `id` every time there is an order (i.e. `complete_order not 0`) to increment by 1.

`c1` calculates the day difference bu `user_id` (but for non complete orders the result is wrong)

`c2` fixes this inconsistency of `c1` with respect to non complete orders.

hope this clears things.

I would suggest you work with combinations of `group_by()` and `mutate(cumsum())` to better understand the results of having more than one grouped variable.

3. It seems that you're looking for the distance of each order from the last completed one. Having a binary vector, `x`, `c(NA, cummax(x * seq_along(x))[-length(x)])` gives the indices of the last "1" seen before each element. Then, subtracting each element of "order_date" from the "order_date" at that respective index gives the desired output. E.g.

``````set.seed(1453); x = sample(0:1, 10, TRUE)
set.seed(1821); y = sample(5, 10, TRUE)
cbind(x, y,
last_x = c(NA, cummax(x * seq_along(x))[-length(x)]),
y_diff = y - y[c(NA, cummax(x * seq_along(x))[-length(x)])])
#      x y last_x y_diff
# [1,] 1 3     NA     NA
# [2,] 0 3      1      0
# [3,] 1 5      1      2
# [4,] 0 1      3     -4
# [5,] 0 3      3     -2
# [6,] 1 5      3      0
# [7,] 1 1      6     -4
# [8,] 0 3      7      2
# [9,] 0 4      7      3
#[10,] 1 5      7      4
``````

On your data, first format `df` for convenience:

``````df\$order_date = as.Date(df\$order_date)
df\$complete_order = df\$complete_order == "1"  # lose the 'factor'
``````

And, then, either apply the above approach after a `group_by`:

``````library(dplyr)
df %>% group_by(user_id) %>%
mutate(time_diff = order_date -
order_date[c(NA, cummax(complete_order * seq_along(complete_order))[-length(complete_order)])])
``````

, or, perhaps give a try on operations that avoid grouping (assuming ordered "user_id") after accounting for the indices where "user_id" changes:

``````# save variables to vectors and keep a "logical" of when "id" changes
id = df\$user_id
id_change = c(TRUE, id[-1] != id[-length(id)])

compl = df\$complete_order
dord = df\$order_date

# accounting for changes in "id", locate last completed order
i = c(NA, cummax((compl | id_change) * seq_along(compl))[-length(compl)])
is.na(i) = id_change

dord - dord[i]
#Time differences in days
#[1] NA  3 11 26 NA  3 12 27
``````