Applying a function with multiple arguments over multiple paired variables in R

Question

I have a function like this, which I'm using to clean data, and it works correctly.

library(stringr)

my_fun <- function (x, y){
    # if x contains a number, use that number; otherwise keep y
    ifelse(str_detect(x, "-*\\d+\\.*\\d*"), 
        as.numeric(str_extract(x, "-*\\d+\\.*\\d*")),
        as.numeric(y))
}

It takes numbers that have been entered in the wrong column and reassigns them to the correct column. It is used as follows to clean the y variable:

df$y <- my_fun(df$x, df$y)
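
To make the behaviour described above concrete, here is a minimal sketch on made-up data. It uses a base R stand-in, `my_fun_base` (a hypothetical name, not the function above), with the same pattern so that it runs without stringr. It is only meant to match `my_fun` on non-missing input; rows where `x` is `NA` simply keep `y` here.

```r
# Base R stand-in for my_fun (assumption: same regex, no stringr required).
pattern <- "-*\\d+\\.*\\d*"

my_fun_base <- function(x, y) {
  x <- as.character(x)
  hit <- grepl(pattern, x, perl = TRUE)
  # regmatches() returns one string per element of x that matched.
  found <- regmatches(x, regexpr(pattern, x, perl = TRUE))
  # Start from y, then overwrite the rows where x held a number.
  out <- suppressWarnings(as.numeric(y))
  out[hit] <- as.numeric(found)
  out
}

# Hypothetical toy data: 10.5 was typed into x instead of y.
toy <- data.frame(x = c("note", "weight 10.5", "n/a"),
                  y = c("3", NA, "7"),
                  stringsAsFactors = FALSE)
toy$y <- my_fun_base(toy$x, toy$y)
toy$y
```

The second row picks up the number found in `x`, while the other rows keep their original `y` values converted to numeric.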

I have many columns/variables (more than 10) that are paired in the same format, something like this:

x_vars <- c("x_1", "x_2", "x_3", "x_4", "x_5", "x_6")
y_vars <- c("y_1", "y_2", "y_3", "y_4", "y_5", "y_6")

My question is: is there a way to apply this function across all the variables in my data set that need to be cleaned in the same way? I can easily do this with lapply in other cases where my data cleaning function has only one argument, but I am struggling in this case.

I have tried mapply but could not get it to work. This might be because I'm still quite a novice in R. Any advice would be much appreciated.


Tags: function | R | lapply | data-cleaning | mapply   2017-01-06 06:01   2 Answers

Answers ( 2 )

  1. 2017-01-06 06:01

    We can use mapply/Map. Pass 'x_vars' and 'y_vars' as arguments to Map, extract the corresponding columns by name, apply my_fun to each extracted pair of vectors, and assign the result back to the 'y_vars' columns in the original dataset

    df[y_vars] <- Map(function(x,y) my_fun(df[,x], df[,y]), x_vars, y_vars)
    

    Or this can also be written as

    df[y_vars] <- Map(my_fun, df[x_vars], df[y_vars])
    

    NOTE: Here, we are assuming that all the elements in 'x_vars' and 'y_vars' are columns in the original dataset. Using Map should also be faster and more efficient than reshaping the data to long format and then converting it back.
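
    The assumption in the note can be checked before calling Map. The sketch below is illustrative only: the two-pair data frame and the `pick_number` cleaner are made up and are simpler than my_fun. It shows that Map walks the two column lists in parallel, and that the result is assigned to the 'y_vars' columns by position.

```r
# Hypothetical two-pair data set, for illustration only.
df <- data.frame(x_1 = c("a", "2.5"), x_2 = c("-1", "c"),
                 y_1 = c("1", "9"),   y_2 = c("8", "4"),
                 stringsAsFactors = FALSE)
x_vars <- c("x_1", "x_2")
y_vars <- c("y_1", "y_2")

# Fail early if the pairing assumption does not hold.
stopifnot(length(x_vars) == length(y_vars),
          all(c(x_vars, y_vars) %in% names(df)))

# Stand-in two-argument cleaner (assumption: not the asker's my_fun).
# It uses x when x is a plain number and falls back to y otherwise.
pick_number <- function(x, y) {
  num <- suppressWarnings(as.numeric(x))
  ifelse(is.na(num), as.numeric(y), num)
}

# Map pairs the columns positionally: (x_1, y_1), then (x_2, y_2).
df[y_vars] <- Map(pick_number, df[x_vars], df[y_vars])
df
```

    The list returned by Map is named after the x columns, but the assignment to `df[y_vars]` goes by position, so the y column names are kept.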


    To provide a different approach, we can use melt from data.table

    library(data.table)
    dM <- melt(setDT(df), measure = list(x_vars, y_vars))[,
                   value3 := my_fun(value1, value2), variable]
    

    Then, again, we need to dcast it back to 'wide' format, so this approach requires more steps and is not as easy

    setnames(dcast(dM, rowid(variable)~variable, 
      value.var = c("value1", "value3"))[,variable := NULL][], c(x_vars, y_vars))[]
    

    data

    set.seed(24)
    df <- as.data.frame(matrix(sample(c(1:5, "something 10.5",
       "this -4.5", "what -5.2 value?"),
              12*10, replace=TRUE), ncol=12, dimnames = 
         list(NULL, c(x_vars, y_vars))), stringsAsFactors=FALSE)
    
  2. 2017-01-06 09:01

    Because I always think it's good to know how to do this stuff in base R, here are examples of how to use mapply() and lapply().

    ## first define the function and generate some data
    my_fun <- function (x, y){
        ifelse(stringr::str_detect(x, "-*\\d+\\.*\\d*"),
            as.numeric(stringr::str_extract(x, "-*\\d+\\.*\\d*")),
            as.numeric(y))
    }
    df <- data.frame(replicate(12, rnorm(3)))
    df[, sample(1:6, 3)] <- letters[1:3]
    ## not function of interest, but good mapply() example
    names(df) <- c(
                   mapply(paste0, rep("x_", 6), 1:6),
                   mapply(paste0, rep("y_", 6), 1:6))
    
    ## print data with problem variables (cols with letters)
    #df
    #         x_1 x_2 x_3 x_4        x_5        x_6       y_1
    #1 -0.2184993   a   a   a -0.1587070 0.37795630 0.6162796
    #2  0.8511775   b   b   b  0.5743287 0.15291219 1.0594502
    #3  0.8183208   c   c   c  1.8923812 0.07156925 0.8613535
    #         y_2        y_3        y_4       y_5        y_6
    #1  0.3240393 -1.1084067  0.5233168 0.3712705 -0.3911407
    #2  0.3044824 -0.2286032 -1.0019870 1.2156441  0.4010163
    #3 -1.0920677  1.3408504  1.3339865 0.3270800 -0.8416253
    
    
    
    ## if you wrote a for loop, it'd look like this maybe
    out <- vector("list", 6)
    for (i in seq_len(6)) {
        out[[i]] <- my_fun(df[, i], df[, i + 6])
    }
    
    ## same construction can be used with lapply
    dfy <- lapply(seq_len(6), function(i)
        my_fun(df[, 1:6][[i]],
               df[, 7:12][[i]]))
    ## df has 3 rows, so the matrix needs 3 rows (not 5)
    matrix(unlist(dfy), nrow(df), 6)
    #           [,1]       [,2]       [,3]       [,4]       [,5]
    #[1,] -0.2184993  0.3240393 -1.1084067  0.5233168 -0.1587070
    #[2,]  0.8511775  0.3044824 -0.2286032 -1.0019870  0.5743287
    #[3,]  0.8183208 -1.0920677  1.3408504  1.3339865  1.8923812
    #           [,6]
    #[1,] 0.37795630
    #[2,] 0.15291219
    #[3,] 0.07156925
    

    ## and mapply makes this even easier
    mapply(my_fun, df[, 1:6], df[, 7:12])
    #            x_1        x_2        x_3        x_4        x_5
    #[1,] -0.2184993  0.3240393 -1.1084067  0.5233168 -0.1587070
    #[2,]  0.8511775  0.3044824 -0.2286032 -1.0019870  0.5743287
    #[3,]  0.8183208 -1.0920677  1.3408504  1.3339865  1.8923812
    #            x_6
    #[1,] 0.37795630
    #[2,] 0.15291219
    #[3,] 0.07156925
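
    The matrix returned by mapply() is named after the x columns, so one more step is needed to write the cleaned values back into the y columns. The sketch below assumes made-up numeric data and a hypothetical stand-in cleaner, `prefer_x`, rather than my_fun. It also shows that mapply() with SIMPLIFY = FALSE returns the same list as Map().

```r
# Hypothetical numeric data with the same paired x/y layout.
df <- data.frame(x_1 = c(1, NA),  x_2 = c(NA, 4),
                 y_1 = c(10, 20), y_2 = c(30, 40))
x_cols <- 1:2
y_cols <- 3:4

# Stand-in cleaner (assumption): use x where present, otherwise y.
prefer_x <- function(x, y) ifelse(is.na(x), y, x)

res <- mapply(prefer_x, df[, x_cols], df[, y_cols])
colnames(res)  # columns are named after the x columns

# As a list instead of a matrix, which is what Map() returns.
as_list <- mapply(prefer_x, df[, x_cols], df[, y_cols], SIMPLIFY = FALSE)
same_as_map <- identical(as_list, Map(prefer_x, df[, x_cols], df[, y_cols]))

# Assignment is positional, so the y column names are kept.
df[, y_cols] <- res
df
```

    Keeping the result as a list can be safer when the cleaned columns might end up with different types, since a matrix forces every column to a single type.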
    