How to keep only information inside a complex string in R?

Question

I want to extract a substring from a more complex string. I think I can use a regex to keep only the part I need. Basically, I want to keep only the text between the \" and \" in Function=\"SMAD5\". I also want to keep the entries where it is empty: Function=\"\"

df=structure(1:6, .Label = c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";", 
"ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";", 
"ID=Gfo_R000003;Source=ENSGALT00000028134;Function=\"C1QL4\";", 
"ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";", "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";", 
"ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"), class = "factor")

This should look like this:

"SMAD5"
"CENPA"
"C1QL4"
NA
NA
NA

So far, this is what I was able to do:

gsub('.*Function=\"',"",df)

[1] "SMAD5\";" "CENPA\";" "C1QL4\";" "\";"      "\";"      "\";"     

But I'm stuck with a bunch of \";". How can I remove them with one line?

I tried this:

gsub('.*Function=\"' & '.\"*',"",df)

But it's giving me this error:

Error in ".*Function=\"" & ".\"*" : 
  operations are possible only for numeric, logical or complex types

| R   | regex   | split   | gsub   2017-01-06 20:01 3 Answers

Answers ( 3 )

  1. 2017-01-06 20:01

    You may use

    gsub(".*Function=\"([^\"]*).*","\\1",df)
    

    Details:

    • .* - any 0+ chars as many as possible up to the last...
    • Function=\" - a Function=" substring
    • ([^\"]*) - capturing group 1 matching 0+ chars other than a "
    • .* - and the rest of the string.

    The \1 is a backreference that inserts the contents of Group 1 into the result.
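
    Note that this call returns "" rather than NA for the empty entries. A minimal sketch of the extra step, assuming the data is held in a plain character vector (the name x and the three sample rows taken from the question are illustrative); it also shows how the two patterns from the question could be combined with `|` instead of `&`:

```r
# Three rows from the question, as a character vector (use as.character(df) for a factor)
x <- c(
  "ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
  "ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";",
  "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";"
)

# Keep only the text captured between the quotes after Function=
res <- sub(".*Function=\"([^\"]*).*", "\\1", x)

# Empty captures come back as ""; turn them into NA as the question asks
res[res == ""] <- NA_character_
print(res)

# `&` is a logical operator, hence the error in the question. Inside a regex,
# alternatives are joined with `|`: strip the prefix or the trailing `";`
alt <- gsub(".*Function=\"|\";$", "", x)
print(alt)
```

    The alternation variant still leaves "" for empty entries, so the same NA replacement would be needed afterwards.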

  2. 2017-01-06 21:01

    With stringr we can capture groups too:

    library(stringr)
    matches <- str_match(df, ".*\"(.*)\".*")[,2]
    ifelse(matches=='', NA, matches)
    # [1] "SMAD5" "CENPA" "C1QL4" NA      NA      NA     
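
    If adding a package is not an option, the same capture-group idea can probably be sketched in base R with regexec() and regmatches(). The pattern here is anchored on Function= so that other quoted fields would not interfere. The vector name x and the third row (which has no Function field) are hypothetical and only there to show the no-match case:

```r
x <- c(
  "ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
  "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";",
  "ID=Gfo_R000007;Source=ENSTGUT00000000000;"   # hypothetical row without Function
)

# regexec() records the full match and each capture group;
# regmatches() extracts them as one character vector per input string
hits <- regmatches(x, regexec("Function=\"([^\"]*)\"", x))

# Element 1 is the full match, element 2 is the capture group;
# strings without a match yield character(0), mapped to NA here
fun <- vapply(hits,
              function(h) if (length(h) >= 2) h[2] else NA_character_,
              character(1))

# Treat an empty capture the same as a missing one
fun[!is.na(fun) & fun == ""] <- NA_character_
print(fun)
```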
    
  3. 2017-01-06 22:01

    The regular expression can be constructed more readably using rebus.

    library(rebus)
    rx <- 'Function="' %R% 
      capture(zero_or_more(negated_char_class('"')))
    

    Then matching works as described by Wiktor and sandipan.

    library(stringr)   # provides str_match()
    library(stringi)   # provides stri_match_first_regex()
    str_match(df, rx)
    stri_match_first_regex(df, rx)
    
    gsub(any_char(0, Inf) %R% rx %R% any_char(0, Inf), REF1, df)
    