How to keep only information inside a complex string in R?


I want to extract a substring from a more complex string, and I think a regex can do that. Basically, I want to keep only the text between the \" and \" in Function=\"SMAD5\". I also want to keep the empty strings, as in Function=\"\".

df=structure(1:6, .Label = c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";", 
"ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";", "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";", 
"ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"), class = "factor")

This should look like this:


So far, this is what I was able to do:


[1] "SMAD5\";" "CENPA\";" "C1QL4\";" "\";"      "\";"      "\";"     

But I'm stuck with a bunch of trailing \"; pieces. How can I remove them in one line?

I tried this:

gsub('.*Function=\"' & '.\"*',"",test)

But it's giving me this error:

Error in ".*Function=\"" & ".\"*" : 
  operations are possible only for numeric, logical or complex types

| R   | regex   | split   | gsub   2017-01-06 20:01 3 Answers

Answers ( 3 )

  1. 2017-01-06 20:01

    You may use




    • .* - any 0+ chars as many as possible up to the last...
    • Function=\" - a Function=" substring
    • ([^\"]*) - capturing group 1 matching 0+ chars other than a "
    • .* - and the rest of the string.

    The \1 is the backreference restoring the contents of the Group 1 in the result.
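    Putting the pieces above together, the call could look like the sketch below. It is reconstructed from the pattern breakdown, not copied from the original snippet, and the vector x is an assumed stand-in holding the four strings shown in the question. Since sub() coerces its input with as.character(), passing the factor df directly should behave the same way.

```r
# Assumed sample input: the four strings visible in the question,
# stored as a plain character vector
x <- c(
  "ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
  "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";",
  "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";",
  "ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"
)

# The pattern consumes the whole string; the replacement keeps only
# capture group 1, i.e. the text after Function=" up to the next "
res <- sub(".*Function=\"([^\"]*).*", "\\1", x)
print(res)
# [1] "SMAD5" ""      ""      ""
```

    Empty values stay as empty strings because the capture group is allowed to match zero characters.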

  2. 2017-01-06 21:01

    With stringr we can capture groups too:

    library(stringr)
    matches <- str_match(df, ".*\"(.*)\".*")[,2]
    ifelse(matches=='', NA, matches)
    # [1] "SMAD5" "CENPA" "C1QL4" NA      NA      NA     

  3. 2017-01-06 22:01

    The regular expression can be constructed more readably using rebus.

    library(rebus)
    rx <- 'Function="' %R% capture(zero_or_more(negated_char_class('"')))

    Then matching is as mentioned by Wiktor and sandipan.

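    library(stringr)  # str_match
    library(stringi)  # stri_match_first_regex
    rx <- 'Function="' %R% capture(zero_or_more(negated_char_class('"')))
    str_match(df, rx)
    stri_match_first_regex(df, rx)
    # REF1 is the rebus backreference to capture group 1
    gsub(any_char(0, Inf) %R% rx %R% any_char(0, Inf), REF1, df)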
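
    For readers without rebus installed, the same matching step can be sketched in base R. This assumes the rebus expression above corresponds to the plain pattern Function="([^"]*), which is how it reads from its parts. The names x, rx_plain and out are illustrative, and x holds only the four strings shown in the question.

```r
# Assumed sample input: the four strings visible in the question
x <- c(
  "ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
  "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";",
  "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";",
  "ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"
)

# Literal prefix followed by a captured run of non-quote characters
rx_plain <- "Function=\"([^\"]*)"

# regexec() records the positions of the full match and each group;
# regmatches() turns them into one character vector per input string
m <- regmatches(x, regexec(rx_plain, x))

# Element 1 is the full match, element 2 is capture group 1
out <- vapply(m, function(g) g[2], character(1))
print(out)
# [1] "SMAD5" ""      ""      ""
```

    Unlike the gsub() variant, this does not need the pattern to span the whole string, because only the captured group is pulled out.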