## R: calculate time difference between specific events

Question

I have the following dataset:

```
df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
complete_order = c(rep(c(1,0,0,1), 2)),
order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')))
library(lubridate)
df$order_date = as_date(df$order_date)
user_id complete_order order_date
1 1 2015-01-28
1 0 2015-01-31
1 0 2015-02-08
1 1 2015-02-23
2 1 2015-01-25
2 0 2015-01-28
2 0 2015-02-06
2 1 2015-02-21
```

I'm trying to calculate the difference in days between only completed orders for each user. The desirable outcome would look like this:

```
user_id complete_order order_date complete_order_time_diff
<fctr> <fctr> <date> <time>
1 1 2015-01-28 NA days
1 0 2015-01-31 3 days
1 0 2015-02-08 11 days
1 1 2015-02-23 26 days
2 1 2015-01-25 NA days
2 0 2015-01-28 3 days
2 0 2015-02-06 12 days
2 1 2015-02-21 27 days
```

when I try this solution:

```
library(dplyr)
df %>%
group_by(user_id) %>%
mutate(complete_order_time_diff = order_date[complete_order==1]-lag(order_date[complete_order==1))
```

it returns the error:

`Error: incompatible size (3), expecting 4 (the group size) or 1`

Any help with this will be great, thank you!

Show source

## Answers ( 3 )

I think you can add a

`filter`

function in place of the subsetting with`order_date[complete_order == 1]`

and make sure the`order_date`

(and other variables) are the correct data types by adding`stringsAsFactors = F`

to`data.frame()`

):This returns the time until the next complete order (and

`NA`

if there is not one):try this

## Update

for multiple cancelled orders

## logic

`c3`

is an`id`

every time there is an order (i.e.`complete_order not 0`

) to increment by 1.`c1`

calculates the day difference bu`user_id`

(but for non complete orders the result is wrong)`c2`

fixes this inconsistency of`c1`

with respect to non complete orders.hope this clears things.

I would suggest you work with combinations of

`group_by()`

and`mutate(cumsum())`

to better understand the results of having more than one grouped variable.It seems that you're looking for the distance of each order from the last completed one. Having a binary vector,

`x`

,`c(NA, cummax(x * seq_along(x))[-length(x)])`

gives the indices of the last "1" seen before each element. Then, subtracting each element of "order_date" from the "order_date" at that respective index gives the desired output. E.g.On your data, first format

`df`

for convenience:And, then, either apply the above approach after a

`group_by`

:, or, perhaps give a try on operations that avoid grouping (assuming ordered "user_id") after accounting for the indices where "user_id" changes: