R: calculate time difference between specific events

Question

I have the following dataset:

df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
                  complete_order = c(rep(c(1,0,0,1), 2)),
                  order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')))  

library(lubridate)
df$order_date = as_date(df$order_date)

user_id complete_order order_date
      1              1 2015-01-28
      1              0 2015-01-31
      1              0 2015-02-08
      1              1 2015-02-23
      2              1 2015-01-25
      2              0 2015-01-28
      2              0 2015-02-06
      2              1 2015-02-21

I'm trying to calculate the difference in days between only completed orders for each user. The desirable outcome would look like this:

user_id complete_order order_date complete_order_time_diff
<fctr>         <fctr>     <date>              <time>
   1              1    2015-01-28             NA days
   1              0    2015-01-31              3 days
   1              0    2015-02-08             11 days
   1              1    2015-02-23             26 days
   2              1    2015-01-25             NA days
   2              0    2015-01-28              3 days
   2              0    2015-02-06             12 days
   2              1    2015-02-21             27 days

when I try this solution:

library(dplyr)

df %>% 
group_by(user_id) %>%
mutate(complete_order_time_diff = order_date[complete_order==1]-lag(order_date[complete_order==1))

it returns the error:

Error: incompatible size (3), expecting 4 (the group size) or 1

Any help with this will be great, thank you!


Show source
| date   | R   | dplyr   | lubridate   2017-01-06 13:01 3 Answers

Answers ( 3 )

  1. 2017-01-06 13:01

    I think you can add a filter function in place of the subsetting with order_date[complete_order == 1] and make sure the order_date (and other variables) are the correct data types by adding stringsAsFactors = F to data.frame()):

    df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
                          complete_order = c(rep(c(1,1,0,1), 2)),
                          order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')),
                    stringsAsFactors = F)  
    
    df$order_date <- lubridate::ymd(df$order_date)
    
    df %>% 
        group_by(user_id) %>% 
        filter(complete_order == 1) %>% 
        mutate(complete_order_time_diff = order_date - lag(order_date))
    

    This returns the time until the next complete order (and NA if there is not one):

      user_id complete_order order_date complete_order_time_diff
        <chr>          <chr>     <date>                   <time>
    1       1              1 2015-01-28                  NA days
    2       1              1 2015-01-31                   3 days
    3       1              1 2015-02-23                  23 days
    4       2              1 2015-01-25                  NA days
    5       2              1 2015-01-28                   3 days
    6       2              1 2015-02-21                  24 days
    
  2. 2017-01-06 13:01

    try this

    library(dplyr)
    
    df %>% group_by(user_id, complete_order) %>% 
       mutate(c1 = order_date - lag(order_date)) %>% 
       group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>% ungroup %>% 
       mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>% 
       select(-c(c1, c2))
    

    Update

    for multiple cancelled orders

     df %>% mutate(c3=cumsum( complete_order != "0")) %>% group_by(user_id, complete_order) %>% 
      mutate(c1 = order_date - lag(order_date)) %>% 
      group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>% 
      mutate(c2=as.numeric(c2)) %>% group_by(user_id, c3) %>% 
      mutate(c2=cumsum(ifelse(complete_order==1, 0, c2))) %>% ungroup %>% 
      mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>% 
      select(-c(c1, c2, c3))
    

    logic

    c3 is an id every time there is an order (i.e. complete_order not 0) to increment by 1.

    c1 calculates the day difference bu user_id (but for non complete orders the result is wrong)

    c2 fixes this inconsistency of c1 with respect to non complete orders.

    hope this clears things.

    I would suggest you work with combinations of group_by() and mutate(cumsum()) to better understand the results of having more than one grouped variable.

  3. 2017-01-07 15:01

    It seems that you're looking for the distance of each order from the last completed one. Having a binary vector, x, c(NA, cummax(x * seq_along(x))[-length(x)]) gives the indices of the last "1" seen before each element. Then, subtracting each element of "order_date" from the "order_date" at that respective index gives the desired output. E.g.

    set.seed(1453); x = sample(0:1, 10, TRUE)
    set.seed(1821); y = sample(5, 10, TRUE)
    cbind(x, y, 
          last_x = c(NA, cummax(x * seq_along(x))[-length(x)]), 
          y_diff = y - y[c(NA, cummax(x * seq_along(x))[-length(x)])])
    #      x y last_x y_diff
    # [1,] 1 3     NA     NA
    # [2,] 0 3      1      0
    # [3,] 1 5      1      2
    # [4,] 0 1      3     -4
    # [5,] 0 3      3     -2
    # [6,] 1 5      3      0
    # [7,] 1 1      6     -4
    # [8,] 0 3      7      2
    # [9,] 0 4      7      3
    #[10,] 1 5      7      4
    

    On your data, first format df for convenience:

    df$order_date = as.Date(df$order_date)
    df$complete_order = df$complete_order == "1"  # lose the 'factor'
    

    And, then, either apply the above approach after a group_by:

    library(dplyr)
    df %>% group_by(user_id) %>% 
       mutate(time_diff = order_date - 
    order_date[c(NA, cummax(complete_order * seq_along(complete_order))[-length(complete_order)])])
    

    , or, perhaps give a try on operations that avoid grouping (assuming ordered "user_id") after accounting for the indices where "user_id" changes:

    # save variables to vectors and keep a "logical" of when "id" changes
    id = df$user_id
    id_change = c(TRUE, id[-1] != id[-length(id)])
    
    compl = df$complete_order
    dord = df$order_date
    
    # accounting for changes in "id", locate last completed order
    i = c(NA, cummax((compl | id_change) * seq_along(compl))[-length(compl)])
    is.na(i) = id_change
    
    dord - dord[i]
    #Time differences in days
    #[1] NA  3 11 26 NA  3 12 27
    
◀ Go back