Writing a function in R that replaces a string based on the first letter that appears in it


I have a data frame with several variables like this:

land_unit<-c("0.5ha", "hactares", "ha", "ha", "acre", "3ha", 
              "lima", "limas", "acre", "cunny", "6 cunnies")

I want to write a function that will tidy this data for me, as I have many variables in my data frame with a similar format. The function should replace each element based on the first letter that appears in the string. For example, if the first letter to appear in the string is "h", I want the whole string replaced by "ha"; if "l", then "lima"; if "a", then "acre"; and if "c", then "kani".

I have searched widely but cannot find an answer, although I suspect there is a relatively simple solution. Perhaps using regex?

Any help would be greatly appreciated.

| function   | R   | string   2017-01-04 07:01 2 Answers

Answers ( 2 )

  1. 2017-01-04 07:01

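    Based on the description, maybe this helps. We use gsubfn (from the gsubfn package) with a pattern that matches zero or more non-letter characters ([^A-Za-z]*) from the start of the string (^), followed by a single lowercase letter captured as a group (([a-z])), followed by any remaining characters (.*). The whole match is then replaced by the value found in a named key/value list, using the captured letter as the key. Because the capture group is [a-z], a string whose first letter is uppercase does not match and is left unchanged.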

    library(gsubfn)
    gsubfn("^[^A-Za-z]*([a-z]).*", list(h = "ha", l="lima", a = "acre", c = "kani"), land_unit)
    #[1] "ha"   "ha"   "ha"   "ha"   "acre" "ha"   "lima" "lima" "acre" "kani" "kani"
  2. 2017-01-04 07:01

    This should also work. Here the replacements live in a lookup table rather than in the call itself, which keeps the data separate from the code:

    land_unit<-c("0.5ha", "hactares", "ha", "ha", "acre", "3ha", 
                 "lima", "limas", "acre", "cunny", "6 cunnies")
    # define a lookup table, decouple the data
    lookup_table <- data.frame(first.letter=c('h', 'l', 'a', 'c'), 
                               replace.str=c('ha', 'lima', 'acre', 'kani'), 
                               stringsAsFactors = FALSE) 
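    # extract the first letter of each string (str_match is from stringr) and find its row in the lookup table
    library(stringr)
    matches <- match(str_match(land_unit, "^[^[:alpha:]]*([[:alpha:]]).*")[,2], lookup_table[,1])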
    # replace from lookup table
    ifelse(!is.na(matches), lookup_table[matches,2], land_unit) 
    # [1] "ha"   "ha"   "ha"   "ha"   "acre" "ha"   "lima" "lima" "acre" "kani" "kani"
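
Since the question asks for a reusable function to run over many variables, the same idea can be wrapped up in base R without extra packages. The following is only a minimal sketch: the names `unit_lookup` and `tidy_unit` are hypothetical, the replacement values are the ones given in the question, and it assumes that strings with no letter, or with a first letter missing from the lookup, should be left as they are.

```r
# Hypothetical lookup: first letter -> replacement (values taken from the question)
unit_lookup <- c(h = "ha", l = "lima", a = "acre", c = "kani")

# Replace each string by the lookup value for its first letter.
# Strings with no letter, or a first letter not in the lookup, are returned unchanged.
tidy_unit <- function(x, lookup = unit_lookup) {
  x <- as.character(x)
  # Keep only the first letter, lower-cased; gives "" when there is no letter
  first <- tolower(sub("^[^[:alpha:]]*([[:alpha:]]?).*$", "\\1", x))
  hit <- match(first, names(lookup))
  ifelse(is.na(hit), x, unname(lookup[hit]))
}

land_unit <- c("0.5ha", "hactares", "ha", "ha", "acre", "3ha",
               "lima", "limas", "acre", "cunny", "6 cunnies")
tidy_unit(land_unit)
# [1] "ha"   "ha"   "ha"   "ha"   "acre" "ha"   "lima" "lima" "acre" "kani" "kani"
```

To tidy several columns at once, something like `df[cols] <- lapply(df[cols], tidy_unit)` should work, where `df` is the data frame and `cols` is a character vector of the column names to clean (both placeholders here).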