Partial Match of text in R

Question

I have a data set containing ids and their corresponding phrases. Each id can have phrases of 2 or 3 words. Within one id, I want to match each 2-word phrase against the 3-word phrases. If a 2-word phrase matches part of a 3-word phrase, keep the 2-word phrase and delete the 3-word phrase.

 Data:
          id         text
          11    XYX not working
          11    cant find anything
          11    wont let go
          11    wont let open
          11    not working
          11    let open
          12    no music store
          12    no sound store
          12    not playing
          12    not printing
          12    no music
          13    paper issue
          13    charger issue
          14    no issue found

Example: in id 11, 'let open' matches 'wont let open', so delete 'wont let open' and retain 'let open'. Likewise, 'not working' matches 'XYX not working', so retain 'not working' and delete 'XYX not working'. Phrases that are not matched should also be retained. This matching should be applied within every id that has both 2-word and 3-word phrases.
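
To make the rule concrete, here is a minimal base R sketch for a single id. It assumes that "matches" means the 2-word phrase occurs as literal text inside the 3-word phrase; the variable names (`phrases`, `two`, `three`, `drop`, `kept`) are only for illustration.

```r
# Phrases for id 11, copied from the data above
phrases <- c("XYX not working", "cant find anything", "wont let go",
             "wont let open", "not working", "let open")

# Number of whitespace-separated words in each phrase
n_words <- lengths(strsplit(phrases, "\\s+"))
two   <- phrases[n_words == 2]
three <- phrases[n_words == 3]

# A 3-word phrase is dropped when any 2-word phrase occurs inside it
# (fixed = TRUE treats the 2-word phrase as literal text, not as a regex)
drop <- three[vapply(three, function(p) {
  any(vapply(two, grepl, logical(1), x = p, fixed = TRUE))
}, logical(1))]

kept <- phrases[!phrases %in% drop]
print(kept)
```

For id 11 this leaves 'cant find anything', 'wont let go', 'not working' and 'let open', which is what the expected output below shows for that id.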

 Expected output:

          id          text
          11    cant find anything
          11    wont let go
          11    not working
          11    let open
          12    no sound store
          12    not playing
          12    not printing
          12    no music
          13    paper issue
          13    charger issue
          14    no issue found

Thank you in advance!


| R   | grep   2017-01-06 17:01 3 Answers

Answers ( 3 )

  1. 2017-01-06 18:01

    One idea is to create a custom function that returns the 3-word phrases to drop, and apply it to each id group:

    library(dplyr)
    library(stringi)
    
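    fun1 <- function(x){
      # default: nothing to drop (also covers ids with a single phrase)
      ind <- character(0)
      if(length(x) > 1) {
        # all pairs of (3-word phrase, 2-word phrase) within the id
        m1 <- expand.grid(x[stri_count_words(x) == 3], x[stri_count_words(x) == 2])
        # 3-word phrases that share both words with some 2-word phrase
        ind <- unique(m1[apply(m1, 1, function(i)length(Reduce(`intersect`, stri_extract_all_words(i)))) == 2,1])
      }
      return(as.character(ind))
    }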
    
    df %>% 
      group_by(id) %>% 
      filter(!text %in% fun1(text))
    
    #Source: local data frame [11 x 2]
    #Groups: id [4]
    
    #      id               text
    #   <int>              <chr>
    #1     11        not working
    #2     11           let open
    #3     11 cant find anything
    #4     11        wont let go
    #5     12        not playing
    #6     12       not printing
    #7     12           no music
    #8     12     no sound store
    #9     13        paper issue
    #10    13      charger issue
    #11    14     no issue found
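
    Note that the `intersect`-based test above compares sets of words, so it ignores word order and adjacency. The sketch below contrasts it with a literal substring test. It uses base R `strsplit()` as a stand-in for `stri_extract_all_words()`, and the helper names (`words`, `shares_both_words`, `contains_phrase`) as well as the phrase 'let wont open' are hypothetical, chosen only to show the difference.

```r
# Split a phrase into its whitespace-separated words
words <- function(p) strsplit(p, "\\s+")[[1]]

# Word-set test: both words of the short phrase appear somewhere in the long one
shares_both_words <- function(long, short) {
  length(intersect(words(long), words(short))) == 2
}

# Substring test: the short phrase appears as contiguous literal text
contains_phrase <- function(long, short) grepl(short, long, fixed = TRUE)

short <- "let open"

# Same words, adjacent and in order: both tests agree
print(shares_both_words("wont let open", short))  # TRUE
print(contains_phrase("wont let open", short))    # TRUE

# Same words, but separated: only the word-set test matches
print(shares_both_words("let wont open", short))  # TRUE
print(contains_phrase("let wont open", short))    # FALSE
```

    For the data in the question both tests give the same result; which one is appropriate depends on whether the two words must appear together.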
    
  2. 2017-01-06 19:01

    Here is a solution using the tidyverse family of packages:

    library(stringr)
    library(tidyverse)
    
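    # TRUE for phrases to keep: those that do not contain any other phrase of the group
    # (despite the name, TRUE marks the rows that filter() retains)
    is_long_phrase <- function(x) {
      map_lgl(x, ~ !any(str_detect(.x, setdiff(x, .x))))
    }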
    
    data %>%  # `data` is the data frame from the question (called `df` in the other answers)
      group_by(id) %>% 
      filter(is_long_phrase(text)) %>% 
      ungroup()
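
    To see what this filter keeps, the same test can be sketched in base R for one id. This is an approximation, not the code above: `grepl(fixed = TRUE)` stands in for `str_detect()` and matches literal text rather than a regular expression, and the function name `keeps` and the extra phrase 'music' are hypothetical.

```r
# TRUE for phrases that contain none of the other phrases in the same group
keeps <- function(x) {
  vapply(x, function(p) {
    others <- setdiff(x, p)
    !any(vapply(others, grepl, logical(1), x = p, fixed = TRUE))
  }, logical(1), USE.NAMES = FALSE)
}

# Phrases for id 12, copied from the question
id12 <- c("no music store", "no sound store", "not playing",
          "not printing", "no music")
print(id12[keeps(id12)])

# Hypothetical extra 1-word phrase: "no music" is now dropped as well,
# because the test does not look at word counts
with_extra <- c(id12, "music")
print(with_extra[keeps(with_extra)])
```

    On the question's data this drops only 'no music store' for id 12. The second call shows that the test is broader than the 2-word versus 3-word rule: any phrase containing another phrase of the same id is removed.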
    
  3. 2017-01-06 22:01

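    A base R approach: count the words in each phrase, then, within each id, flag every 3-word phrase that contains one of that id's 2-word phrases: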

    # the data
    df <- read.csv(text='id,text
                     11,XYX not working
                     11,cant find anything
                     11,wont let go
                     11,wont let open
                     11,not working
                     11,let open
                     12,no music store
                     12,no sound store
                     12,not playing
                     12,not printing
                     12,no music
                     13,paper issue
                     13,charger issue
                     14,no issue found', header=TRUE, stringsAsFactors=FALSE)
    
    # the code
    df$words <- lengths(strsplit(df$text, split='\\s+')) # number of words in each phrase
    df.idlst <- split(df, df$id) # one data frame per id (the unlist() below assumes df is sorted by id)
    Vgrepl <- Vectorize(grepl, 'pattern', SIMPLIFY = TRUE)
    df$del <- unlist(lapply(df.idlst, function(df) sapply(1:nrow(df), function(i) ifelse(df[i,]$words == 3, any(Vgrepl(df[df$words==2,]$text, df[i,]$text)), FALSE))))
    df[!df$del,][1:2] # df[row,]$del == TRUE => the row has to be deleted
    
    # the output
       id               text
    2  11 cant find anything
    3  11        wont let go
    5  11        not working
    6  11           let open
    8  12     no sound store
    9  12        not playing
    10 12       not printing
    11 12           no music
    12 13        paper issue
    13 13      charger issue
    14 14     no issue found
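
    The nested one-liner above can be hard to follow, so here is a step-by-step sketch of the same idea. It is a variant under stated assumptions, not the code above: it matches literal text with `grepl(fixed = TRUE)`, uses `unsplit()` so that the data frame does not have to be sorted by id, and the names `flag_long`, `del` and `result` are hypothetical.

```r
# The data from the question
df <- data.frame(
  id = c(11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 13, 13, 14),
  text = c("XYX not working", "cant find anything", "wont let go",
           "wont let open", "not working", "let open",
           "no music store", "no sound store", "not playing",
           "not printing", "no music", "paper issue",
           "charger issue", "no issue found"),
  stringsAsFactors = FALSE
)

# TRUE for each 3-word phrase that contains a 2-word phrase of the same group
flag_long <- function(txt) {
  n <- lengths(strsplit(txt, "\\s+"))
  two <- txt[n == 2]
  vapply(seq_along(txt), function(i) {
    n[i] == 3 && any(vapply(two, grepl, logical(1), x = txt[i], fixed = TRUE))
  }, logical(1))
}

# Apply flag_long() per id, then put the flags back in the original row order
del <- unsplit(lapply(split(df$text, df$id), flag_long), df$id)

result <- df[!del, ]
print(result)
```

    On the question's data this removes 'XYX not working', 'wont let open' and 'no music store', leaving the 11 rows of the expected output.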
    