## Partial Match of text in R

Question

I have a data set containing ids and their corresponding phrases. An id can have phrases of 2 or 3 words. Within one id, I want to match each 2-word phrase against the 3-word phrases. If a 3-word phrase contains a 2-word phrase, keep the 2-word phrase and delete the 3-word phrase.

Data:

``````id         text
11    XYX not working
11    cant find anything
11    wont let go
11    wont let open
11    not working
11    let open
12    no music store
12    no sound store
12    not playing
12    not printing
12    no music
13    paper issue
13    charger issue
14    no issue found
``````

Example: for id 11, 'let open' matches 'wont let open', so delete 'wont let open' and retain 'let open'. Likewise, 'not working' matches 'XYX not working', so retain 'not working' and delete 'XYX not working'. Phrases that are not matched should also be retained. The matching should be done within each id wherever it has both 2-word and 3-word phrases.

Expected output:

``````id          text
11    cant find anything
11    wont let go
11    not working
11    let open
12    no sound store
12    not playing
12    not printing
12    no music
13    paper issue
13    charger issue
14    no issue found
``````

Thank you in advance!

2017-01-06 17:01 3 Answers

## Answers to Partial Match of text in R ( 3 )

1. One idea is to create a custom function and apply it to each `id` group:

``````library(dplyr)
library(stringi)

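fun1 <- function(x){
  ind <- character(0)  # nothing to drop unless a match is found below
  if(length(x) > 1) {
    # all (3-word, 2-word) phrase pairs within the group
    m1 <- expand.grid(x[stri_count_words(x) == 3], x[stri_count_words(x) == 2])
    # 3-word phrases that share both words with some 2-word phrase
    ind <- unique(m1[apply(m1, 1, function(i)length(Reduce(`intersect`, stri_extract_all_words(i)))) == 2,1])
  }
  return(as.character(ind))
}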

df %>%
  group_by(id) %>%
  filter(!text %in% fun1(text))

#Source: local data frame [11 x 2]
#Groups: id [4]

#      id               text
#   <int>              <chr>
#1     11        not working
#2     11           let open
#3     11 cant find anything
#4     11        wont let go
#5     12        not playing
#6     12       not printing
#7     12           no music
#8     12     no sound store
#9     13        paper issue
#10    13      charger issue
#11    14     no issue found
``````
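To see what the intersection test in `fun1` is doing, here is a small walk-through on the phrases of id 11 only. It is a sketch rather than a replacement for the function: the names `three`, `two`, `pairs` and `to_drop` are made up for illustration, and it assumes `stringi` is installed.

```r
library(stringi)

# Phrases for id 11, taken from the question
x <- c("XYX not working", "cant find anything", "wont let go",
       "wont let open", "not working", "let open")

three <- x[stri_count_words(x) == 3]
two   <- x[stri_count_words(x) == 2]

# Every (3-word, 2-word) pair; keep the phrases as character, not factor
pairs <- expand.grid(three = three, two = two, stringsAsFactors = FALSE)

# Number of distinct words each pair has in common
pairs$shared <- mapply(
  function(a, b) length(intersect(stri_extract_all_words(a)[[1]],
                                  stri_extract_all_words(b)[[1]])),
  pairs$three, pairs$two
)

# 3-word phrases that contain both words of some 2-word phrase
to_drop <- unique(pairs$three[pairs$shared == 2])
to_drop
```

Note that `intersect` ignores word order and adjacency, so a 3-word phrase such as 'working XYX not' would also count as a match for 'not working'. Whether that is acceptable depends on the data.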
2. Here is a solution using the `tidyverse` family of packages:

``````library(stringr)
library(tidyverse)

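# TRUE for phrases that do not contain any other phrase of the same group;
# despite the name, these are the rows that filter() keeps
is_long_phrase <- function(x) {
  map_lgl(x, ~ !any(str_detect(.x, setdiff(x, .x))))
}

data %>%
  group_by(id) %>%
  filter(is_long_phrase(text)) %>%
  ungroup()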
``````
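The filter in answer 2 relies on `str_detect` being vectorised over its pattern argument. A minimal sketch on the phrases of id 12 shows this in isolation. It assumes `stringr` is installed, wraps the patterns in `fixed()` so they are compared literally rather than as regular expressions (an addition, not part of the answer), and uses `vapply` instead of `map_lgl` only to keep the dependencies small.

```r
library(stringr)

# Phrases for id 12, taken from the question
phrases <- c("no music store", "no sound store", "not playing",
             "not printing", "no music")

# One string tested against several patterns gives one result per pattern:
# FALSE for the first pattern, TRUE for the second
str_detect("no music store", fixed(c("no sound store", "no music")))

# Keep a phrase only if it contains none of the other phrases in the group
keep <- vapply(phrases, function(p) {
  others <- setdiff(phrases, p)
  !any(str_detect(p, fixed(others)))
}, logical(1), USE.NAMES = FALSE)

phrases[keep]
```

This approach never counts words: any phrase that contains another phrase of the same id is dropped, and the match is on substrings, so 'no music' would also be found inside 'no musical store'. On the sample data it gives the expected result.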
3. Try this:

``````# the data
df <- read.csv(text='id,text
11,XYX not working
11,cant find anything
11,wont let go
11,wont let open
11,not working
11,let open
12,no music store
12,no sound store
12,not playing
12,not printing
12,no music
13,paper issue
13,charger issue
14,no issue found', header=TRUE, stringsAsFactors=FALSE)

# the code
df$words <- lengths(strsplit(df$text, split='\\s+')) # words in text
df.idlst <- split(df, df$id)
Vgrepl <- Vectorize(grepl, 'pattern', SIMPLIFY = TRUE)
df$del <- unlist(lapply(df.idlst, function(df) sapply(1:nrow(df), function(i) ifelse(df[i,]$words == 3, any(Vgrepl(df[df$words==2,]$text, df[i,]$text)), FALSE))))
df[!df$del,][1:2] # df[row,]$del == TRUE => the row has to be deleted

# the output
id               text
2  11 cant find anything
3  11        wont let go
5  11        not working
6  11           let open
8  12     no sound store
9  12        not playing
10 12       not printing
11 12           no music
12 13        paper issue
13 13      charger issue
14 14     no issue found
``````
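For comparison, here is a dependency-free sketch of the same rule in base R. It is not taken from any of the answers: `drop_in_group` is a hypothetical helper, and it assumes words are separated by single spaces. It only drops a 3-word phrase when a 2-word phrase of the same id appears in it as a run of whole words, and it uses `split`/`unsplit` so the rows do not have to be sorted by id.

```r
df <- data.frame(
  id = c(rep(11, 6), rep(12, 5), 13, 13, 14),
  text = c("XYX not working", "cant find anything", "wont let go",
           "wont let open", "not working", "let open",
           "no music store", "no sound store", "not playing",
           "not printing", "no music",
           "paper issue", "charger issue", "no issue found"),
  stringsAsFactors = FALSE
)

# TRUE for 3-word phrases that contain a 2-word phrase of the same group
# as consecutive whole words (hypothetical helper, single-space assumption)
drop_in_group <- function(txt) {
  n_words <- lengths(strsplit(txt, "\\s+"))
  two <- txt[n_words == 2]
  vapply(seq_along(txt), function(i) {
    n_words[i] == 3 &&
      any(vapply(two, function(p) {
        # pad with spaces so that only whole words can match
        grepl(paste0(" ", p, " "), paste0(" ", txt[i], " "), fixed = TRUE)
      }, logical(1)))
  }, logical(1))
}

# Apply per id and put the flags back in the original row order
del <- unsplit(lapply(split(df$text, df$id), drop_in_group), df$id)
result <- df[!del, ]
result
```

On the sample data this keeps the same 11 rows as the expected output in the question.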