## Subset data by inverse pairs

Question

The following `data.frame` should be subset by inverse pairs and a few conditions:

``````> foo
ID Day  Period            Start              End
1  11   1 morning     Central Park Alphabet Village
2  11   1 morning     Central Park Alphabet Village
3  11   1 evening Alphabet Village        Grammercy
4  54   1 morning     Union Square        Chinatown
5  67   1 morning          Midtown           Harlem
6  67   1 morning           Harlem          Midtown
7  69   1 morning       Greenpoint Prospect Heights
8  54   1 evening        Chinatown     Union Square
9  77   1 morning       Park Slope     Williamsburg
10 73   1 evening     Williamsburg       Park Slope
11 88   2 morning        Grammercy     Battery Park
12 88   2 morning     Battery Park             SoHo
13 88   2 evening     Battery Park        Grammercy
14 69   2 evening Prospect Heights       Greenpoint
15 88   2 evening        Grammercy     Battery Park
``````

For example, a `Start` and `End` station inverse pair have to fall on the same `Day`, have the same `ID` while the first must occur in the morning and the second in the evening. *EDIT: It should be noted that only one Start-End can be used to pair with an End-Start. That is, once a pair is formed, the original Start-End can no longer be used to form another pair. For example record `15` cannot be paired with record `13` because `13` is already "taken".

The output of the subset would always be an even number. In this case it would be:

``````   ID Day  Period        Start          End
3  54   1 morning Union Square    Chinatown
7  54   1 evening    Chinatown Union Square
10 88   2 morning    Grammercy Battery Park
11 88   2 evening Battery Park    Grammercy
``````

I am not sure if a `subset()` function should be used along with a for loop or how to structure the loop. It should say something like - if `start` and `end` are equal to `end` and `start` of the following row and `ID` = `ID` , `Day` = `Day` and the `Period` of the first record = "morning" while the second record = "evening"

I think the code should start with something like this: `if(foo[i-1,"start"] == foo[i,"end"]) & (foo[i-1,"end"] == foo[i,"start"])` but am not sure. The idea is to keep all inverse pairs that meet those conditions. Any guidance and explanation of steps to take would be appreciated.

The sample data:

``````> dput(foo)
structure(list(ID = c(11L, 11L, 11L, 54L, 67L, 67L, 69L, 54L,
77L, 73L, 88L, 88L, 88L, 69L, 88L), Day = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), Period = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("evening",
"morning"), class = "factor"), Start = structure(c(3L, 3L, 1L,
11L, 8L, 7L, 6L, 4L, 9L, 12L, 5L, 2L, 2L, 10L, 5L), .Label = c("Alphabet Village",
"Battery Park", "Central Park", "Chinatown", "Grammercy", "Greenpoint",
"Harlem", "Midtown", "Park Slope", "Prospect Heights", "Union Square",
"Williamsburg"), class = "factor"), End = structure(c(1L, 1L,
4L, 3L, 6L, 7L, 9L, 11L, 12L, 8L, 2L, 10L, 4L, 5L, 2L), .Label = c("Alphabet Village",
"Battery Park", "Chinatown", "Grammercy", "Greenpoint", "Harlem",
"Midtown", "Park Slope", "Prospect Heights", "SoHo", "Union Square",
"Williamsburg"), class = "factor")), .Names = c("ID", "Day",
"Period", "Start", "End"), class = "data.frame", row.names = c(NA,
-15L))
``````

Show source

## Answers to Subset data by inverse pairs ( 3 )

1. ``````library('data.table')
setDT(foo)

# copy foo data and switch column names for Start and End
foo2 <- copy(foo)
colnames(foo2)[which(colnames(foo2) %in% c('Start', 'End'))] <- c('End', 'Start')

# assign unique id, so that it can be used to track the foo data and exclude foo2 data in the later step
foo[, my_id := 1:.N]
foo2[, my_id := seq(nrow(foo)+1, nrow(foo)+.N)]

#combine foo and foo2 data
foo2 <- rbind(foo, foo2)

# group by ID, Day, Start and End and get the number of rows in it,
# then exclude foo2 data by subsetting with my_id and N > 1.
# Then find distinct period on a given day and exclude it if only
# morning or only evening is present.
# Then group by 'ID', 'Day', 'Start' and 'End' and get the first occurrence by `.SD[1]`.

foo2[,
j = .(.N, Period, my_id),
by = c('ID', 'Day', 'Start', 'End')][i = N > 1 & my_id <= nrow(foo),
j = .SD][, j = N1 := length(unique(Period)),
by = c('ID', 'Day')][i = N1 == 2,
j = .SD[1],
by = c('ID', 'Day', 'Start', 'End')][,j = .(ID, Day, Period, Start, End)]

# Output
#    ID Day  Period        Start          End
# 1: 54   1 morning Union Square    Chinatown
# 2: 54   1 evening    Chinatown Union Square
# 3: 88   2 morning    Grammercy Battery Park
# 4: 88   2 evening Battery Park    Grammercy
``````
2. After grouping by 'ID', 'Day', `filter` the 'Period' where the `unique` element count is greater than 1 (`ndistinct`), then change the `factor` columns to `character` and do a `filter` which matches the condition in OP's post

`````` library(dplyr)
foo %>%
group_by(ID, Day) %>%
filter(n_distinct(Period)>1) %>%
mutate(Start = as.character(Start), End = as.character(End)) %>%
filter(Start[1]==End[n()] & Start[n()] == End[1])
#    ID   Day  Period        Start          End
#  (int) (int)  (fctr)        (chr)        (chr)
#1    54     1 morning Union Square    Chinatown
#2    54     1 evening    Chinatown Union Square
#3    88     2 morning    Grammercy Battery Park
#4    88     2 evening Battery Park    Grammercy
``````

In `dplyr` versions 0.5.0 and above, we can use `mutate_if`

``````foo %>%
group_by(ID, Day) %>%
filter(n_distinct(Period)>1)  %>%
mutate_if(is.factor, as.character) %>%
filter(Start[1]==End[n()] & Start[n()] == End[1])
#     ID   Day  Period        Start          End
#   <int> <int>   <chr>        <chr>        <chr>
#1    54     1 morning Union Square    Chinatown
#2    54     1 evening    Chinatown Union Square
#3    88     2 morning    Grammercy Battery Park
#4    88     2 evening Battery Park    Grammercy
``````
3. In SQL, you would use the self join with union query. Consider the same approach in base R by splitting morning and evening subsets, then merging them on ID, Day, and Start, End (reverse pairings), finally `rbind` then back together splitting corresponding columns:

``````mdf <- setNames(df[df\$Period=='morning',], paste0(colnames(df), "_m"))
edf <- setNames(df[df\$Period=='evening',], paste0(colnames(df), "_e"))

rbind(setNames(merge(mdf, edf,
by.x=c("ID_m", "Day_m", "Start_m", "End_m"),
by.y=c("ID_e", "Day_e", "End_e", "Start_e"))[colnames(mdf)], colnames(df)),
setNames(merge(mdf, edf,
by.x=c("ID_m", "Day_m","Start_m", "End_m"),
by.y=c("ID_e", "Day_e", "End_e", "Start_e"))[c("ID_m", "Day_m", "Period_e", "End_m", "Start_m")], colnames(df)))

#   ID Day  Period        Start          End
# 1 54   1 morning Union Square    Chinatown
# 2 88   2 morning    Grammercy Battery Park
# 3 54   1 evening    Chinatown Union Square
# 4 88   2 evening Battery Park    Grammercy
``````

SQL counterpart (works in MS Access returning exact same output)

``````SELECT t1.*
FROM
(SELECT m.ID, m.Day, m.Period, m.[Start], m.[End]
FROM RDataSet AS m
WHERE (((m.Period)='morning'))) As t1
INNER JOIN
(SELECT e.ID, e.Day, e.Period, e.[Start], e.[End]
FROM RDataSet AS e
WHERE (((e.Period)='evening'))) As t2
ON t1.ID = t2.ID AND t1.Day = t2.Day AND t1.[Start] = t2.[End] AND t1.[End] = t2.[Start]

UNION

SELECT t2.*
FROM
(SELECT m.ID, m.Day, m.Period, m.[Start], m.[End]
FROM RDataSet AS m
WHERE (((m.Period)='morning'))) As t1
INNER JOIN
(SELECT e.ID, e.Day, e.Period, e.[Start], e.[End]
FROM RDataSet AS e
WHERE (((e.Period)='evening'))) As t2
ON t1.ID = t2.ID AND t1.Day = t2.Day AND t1.[Start] = t2.[End] AND t1.[End] = t2.[Start]
``````