Subset data by inverse pairs

Question

The following data.frame should be subset by inverse pairs and a few conditions:

> foo
   ID Day  Period            Start              End
1  11   1 morning     Central Park Alphabet Village
2  11   1 morning     Central Park Alphabet Village
3  11   1 evening Alphabet Village        Grammercy
4  54   1 morning     Union Square        Chinatown
5  67   1 morning          Midtown           Harlem
6  67   1 morning           Harlem          Midtown
7  69   1 morning       Greenpoint Prospect Heights
8  54   1 evening        Chinatown     Union Square
9  77   1 morning       Park Slope     Williamsburg
10 73   1 evening     Williamsburg       Park Slope
11 88   2 morning        Grammercy     Battery Park
12 88   2 morning     Battery Park             SoHo
13 88   2 evening     Battery Park        Grammercy
14 69   2 evening Prospect Heights       Greenpoint
15 88   2 evening        Grammercy     Battery Park

For example, a Start and End station inverse pair have to fall on the same Day, have the same ID while the first must occur in the morning and the second in the evening. *EDIT: It should be noted that only one Start-End can be used to pair with an End-Start. That is, once a pair is formed, the original Start-End can no longer be used to form another pair. For example record 15 cannot be paired with record 13 because 13 is already "taken".

The output of the subset would always be an even number. In this case it would be:

   ID Day  Period        Start          End
3  54   1 morning Union Square    Chinatown
7  54   1 evening    Chinatown Union Square
10 88   2 morning    Grammercy Battery Park
11 88   2 evening Battery Park    Grammercy

I am not sure if a subset() function should be used along with a for loop or how to structure the loop. It should say something like - if start and end are equal to end and start of the following row and ID = ID , Day = Day and the Period of the first record = "morning" while the second record = "evening"

I think the code should start with something like this: if(foo[i-1,"start"] == foo[i,"end"]) & (foo[i-1,"end"] == foo[i,"start"]) but am not sure. The idea is to keep all inverse pairs that meet those conditions. Any guidance and explanation of steps to take would be appreciated.

The sample data:

> dput(foo)
structure(list(ID = c(11L, 11L, 11L, 54L, 67L, 67L, 69L, 54L, 
77L, 73L, 88L, 88L, 88L, 69L, 88L), Day = c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), Period = structure(c(2L, 
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("evening", 
"morning"), class = "factor"), Start = structure(c(3L, 3L, 1L, 
11L, 8L, 7L, 6L, 4L, 9L, 12L, 5L, 2L, 2L, 10L, 5L), .Label = c("Alphabet Village", 
"Battery Park", "Central Park", "Chinatown", "Grammercy", "Greenpoint", 
"Harlem", "Midtown", "Park Slope", "Prospect Heights", "Union Square", 
"Williamsburg"), class = "factor"), End = structure(c(1L, 1L, 
4L, 3L, 6L, 7L, 9L, 11L, 12L, 8L, 2L, 10L, 4L, 5L, 2L), .Label = c("Alphabet Village", 
"Battery Park", "Chinatown", "Grammercy", "Greenpoint", "Harlem", 
"Midtown", "Park Slope", "Prospect Heights", "SoHo", "Union Square", 
"Williamsburg"), class = "factor")), .Names = c("ID", "Day", 
"Period", "Start", "End"), class = "data.frame", row.names = c(NA, 
-15L))

Show source
| R   | loops   | subset   2017-01-07 05:01 3 Answers

Answers ( 3 )

  1. 2017-01-07 05:01
    library('data.table')
    setDT(foo)
    
    # copy foo data and switch column names for Start and End
    foo2 <- copy(foo)
    colnames(foo2)[which(colnames(foo2) %in% c('Start', 'End'))] <- c('End', 'Start')
    
    # assign unique id, so that it can be used to track the foo data and exclude foo2 data in the later step
    foo[, my_id := 1:.N]
    foo2[, my_id := seq(nrow(foo)+1, nrow(foo)+.N)]
    
    #combine foo and foo2 data
    foo2 <- rbind(foo, foo2)
    
    # group by ID, Day, Start and End and get the number of rows in it, 
    # then exclude foo2 data by subsetting with my_id and N > 1. 
    # Then find distinct period on a given day and exclude it if only
    # morning or only evening is present. 
    # Then group by 'ID', 'Day', 'Start' and 'End' and get the first occurrence by `.SD[1]`.
    
    foo2[, 
         j = .(.N, Period, my_id), 
         by = c('ID', 'Day', 'Start', 'End')][i = N > 1 & my_id <= nrow(foo), 
                                              j = .SD][, j = N1 := length(unique(Period)), 
                                                         by = c('ID', 'Day')][i = N1 == 2,
                                                                              j = .SD[1], 
                                                                              by = c('ID', 'Day', 'Start', 'End')][,j = .(ID, Day, Period, Start, End)]
    
    
    # Output
    #    ID Day  Period        Start          End
    # 1: 54   1 morning Union Square    Chinatown
    # 2: 54   1 evening    Chinatown Union Square
    # 3: 88   2 morning    Grammercy Battery Park
    # 4: 88   2 evening Battery Park    Grammercy
    
  2. 2017-01-07 06:01

    After grouping by 'ID', 'Day', filter the 'Period' where the unique element count is greater than 1 (ndistinct), then change the factor columns to character and do a filter which matches the condition in OP's post

     library(dplyr)
     foo %>%
         group_by(ID, Day) %>%
         filter(n_distinct(Period)>1) %>% 
         mutate(Start = as.character(Start), End = as.character(End)) %>%
         filter(Start[1]==End[n()] & Start[n()] == End[1]) 
     #    ID   Day  Period        Start          End
     #  (int) (int)  (fctr)        (chr)        (chr)
     #1    54     1 morning Union Square    Chinatown
     #2    54     1 evening    Chinatown Union Square
     #3    88     2 morning    Grammercy Battery Park
     #4    88     2 evening Battery Park    Grammercy
    

    In dplyr versions 0.5.0 and above, we can use mutate_if

    foo %>%
       group_by(ID, Day) %>% 
       filter(n_distinct(Period)>1)  %>% 
       mutate_if(is.factor, as.character) %>%
       filter(Start[1]==End[n()] & Start[n()] == End[1]) 
    #     ID   Day  Period        Start          End
    #   <int> <int>   <chr>        <chr>        <chr>
    #1    54     1 morning Union Square    Chinatown
    #2    54     1 evening    Chinatown Union Square
    #3    88     2 morning    Grammercy Battery Park
    #4    88     2 evening Battery Park    Grammercy
    
  3. 2017-01-07 06:01

    In SQL, you would use the self join with union query. Consider the same approach in base R by splitting morning and evening subsets, then merging them on ID, Day, and Start, End (reverse pairings), finally rbind then back together splitting corresponding columns:

    mdf <- setNames(df[df$Period=='morning',], paste0(colnames(df), "_m"))
    edf <- setNames(df[df$Period=='evening',], paste0(colnames(df), "_e"))
    
    rbind(setNames(merge(mdf, edf,
                        by.x=c("ID_m", "Day_m", "Start_m", "End_m"), 
                        by.y=c("ID_e", "Day_e", "End_e", "Start_e"))[colnames(mdf)], colnames(df)),
          setNames(merge(mdf, edf,
                         by.x=c("ID_m", "Day_m","Start_m", "End_m"), 
                         by.y=c("ID_e", "Day_e", "End_e", "Start_e"))[c("ID_m", "Day_m", "Period_e", "End_m", "Start_m")], colnames(df)))
    
    #   ID Day  Period        Start          End
    # 1 54   1 morning Union Square    Chinatown
    # 2 88   2 morning    Grammercy Battery Park
    # 3 54   1 evening    Chinatown Union Square
    # 4 88   2 evening Battery Park    Grammercy
    

    SQL counterpart (works in MS Access returning exact same output)

    SELECT t1.*
    FROM
       (SELECT m.ID, m.Day, m.Period, m.[Start], m.[End]
        FROM RDataSet AS m
        WHERE (((m.Period)='morning'))) As t1
    INNER JOIN
       (SELECT e.ID, e.Day, e.Period, e.[Start], e.[End]
        FROM RDataSet AS e
        WHERE (((e.Period)='evening'))) As t2
    ON t1.ID = t2.ID AND t1.Day = t2.Day AND t1.[Start] = t2.[End] AND t1.[End] = t2.[Start]
    
    UNION
    
    SELECT t2.*
    FROM
       (SELECT m.ID, m.Day, m.Period, m.[Start], m.[End]
        FROM RDataSet AS m
        WHERE (((m.Period)='morning'))) As t1
    INNER JOIN
       (SELECT e.ID, e.Day, e.Period, e.[Start], e.[End]
        FROM RDataSet AS e
        WHERE (((e.Period)='evening'))) As t2
    ON t1.ID = t2.ID AND t1.Day = t2.Day AND t1.[Start] = t2.[End] AND t1.[End] = t2.[Start]
    
◀ Go back