Monte Carlo estimate of p-value for F-statistic in ANOVA based on permutation method

Question

I am trying to do a permutation test for ANOVA on (y1,...,yN) with group identifiers g. I am supposed to be using (1)/(g-1) summation of (muhatj - muhat)^2 as the test statistic and muhatj is the jth group sample mean, and muhat=(1/g)summation muhatj.

## data
y <- c(6.59491, 6.564573, 6.696147, 6.321552, 6.588449, 6.853832, 
6.370895, 6.441823, 6.227591, 6.675492, 6.255462, 6.919716, 6.837458, 
6.41374, 6.543782, 6.562947, 6.570343, 6.993634, 6.666261, 7.082319, 
7.210933, 6.547977, 6.330553, 6.309289, 6.913492, 6.597188, 6.247285, 
6.644366, 6.534671, 6.885325, 6.577568, 6.499041, 6.827574, 6.198853, 
6.965038, 6.58837, 6.498529, 6.449476, 6.544842, 6.496817, 6.499526, 
6.709674, 6.946934, 6.23884, 6.517018, 6.206692, 6.491935, 6.039925, 
6.166948, 6.160605, 6.428338, 6.564948, 6.446658, 6.566979, 7.17546, 
6.45031, 6.612242, 6.559798, 6.568082, 6.44193, 6.295211, 6.446384, 
6.658321, 6.369639, 6.066747, 6.345537, 6.727513, 6.677873, 6.889841, 
6.724438, 6.379956, 6.380779, 6.50096, 6.676555, 6.463236, 6.239091, 
6.797642, 6.608025)

## group
g <- structure(c(2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 2L, 
3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 1L, 
2L), .Label = c("B1", "B2", "B3"), class = "factor")

This is what I have right now, but it is not working when I change it to test for the sample means instead of the F statistic. I'm pretty sure that I need to change T.obs and T.perm to something similar to by(y, g, mean) but I think there is more that I am missing.

n <- length(y) #sample size n
T.obs<- anova(lm(y ~ g))$F[1]   #Observed statistic 
n.perm <- 2000   # we will do 2000 permutations
T.perm <- rep(NA, n.perm)   #A vector to save permutated statistic
for(i in 1:n.perm) {
  y.perm <- sample(y, n, replace=F)   #permute data
  T.perm[i] <- anova(lm(y.perm ~ g))$F[1]   #Permuted statistic
  }
mean(T.perm >= T.obs)   #p-value

Show source
| R   | permutation   | anova   | simulation   | p-value   2016-10-31 05:10 1 Answers

Answers ( 1 )

  1. 2016-10-31 07:10

    I don't really know what you mean by "it is not working". As far as I can see, it works alright, except that it is slightly slow.

    set.seed(0)
    n <- length(y) #sample size n
    T.obs <- anova(lm(y ~ g))$F[1]   #Observed statistic 
    n.perm <- 2000   # we will do 2000 permutations
    T.perm <- rep(NA, n.perm)   #A vector to save permutated statistic
    for(i in 1:n.perm) {
      y.perm <- sample(y, n, replace=F)   #permute data
      T.perm[i] <- anova(lm(y.perm ~ g))$F[1]   #Permuted statistic
      }
    mean(T.perm >= T.obs)
    # [1] 0.4915
    

    This is fairly close to the theoretical value

    anova(lm(y ~ g))$Pr[1]
    # [1] 0.4823429
    

    So, yes, you are doing all correct!


    From the first paragraph of your question it sounds like we want to compute F-statistic ourselves, so the following function does this. There is a switch "use_lm". If set TRUE, it uses anova(lm(y ~ g)) as what is done in your original code. This function aims to make computation of F-statistic and p-value transparent. Also, manual computation is 15 times faster than calling lm and anova (which is an obvious thing...).

    fstat <- function (y, g, use_lm = FALSE) {
      if (!use_lm) {
        ## group mean (like we are fitting a linear model A: `y ~ g`)
        mu_g <- ave(y, g, FUN = mean)
        ## overall mean (like we are fitting a linear model B: `y ~ 1`)
        mu <- mean(y)
        ## RSS (residual sum of squares) for model A
        RSS_A <- drop(crossprod(y - mu_g))
        ## RSS (residual sum of squares) for model B
        RSS_B <- drop(crossprod(y - mu))
        ## increase of RSS from model A to model B
        RSS_inc <- RSS_B - RSS_A
        ## note, according to "partition of squares", we can also compute `RSS_inc` as
        ## RSS_inc <- drop(crossprod(mu_g - mu))
        ## `sigma2` (estimated residual variance) of model A
        sigma2 <- RSS_A / (length(y) - nlevels(g))
        ## F-statistic
        fstatistic <- ( RSS_inc / (nlevels(g) - 1) ) / sigma2
        ## p-value
        pval <- pf(fstatistic, nlevels(g) - 1, length(y) - nlevels(g), lower.tail = FALSE)
        ## retern
        return(c(F = fstatistic, pval = pval))
        }
      else {
        anovalm <- anova(lm(y ~ g))
        return(c(F = anovalm$F[1L], pval = anovalm$Pr[1L]))
        }
      }
    

    Let's first have a check on the validity of this function:

    F_obs <- fstat(y, g)
    #        F      pval 
    #0.7362340 0.4823429 
    
    F_obs <- fstat(y, g, TRUE)
    #        F      pval 
    #0.7362340 0.4823429
    

    Don't be surprised that it is insignificant. Your data does not really suggest significant group difference. Look at the boxplot:

    boxplot(y ~ g)    ## or use "factor" method of `plot` function: `plot(g, y)`
    

    enter image description here

    Now we move on to the permutation. We write another function perm for this purpose. It is actually pretty easy, because we have a nicely defined fstat. All we need to do is to use replicate to wrap up sample + fstat.

    lm is actually very slow:

    library(microbenchmark)
    microbenchmark(fstat(y, g), fstat(y, g, TRUE), times = 200)
    
    #Unit: microseconds
    #              expr     min      lq      mean  median      uq      max neval cld
    #       fstat(y, g)  228.44  235.32  272.1204  275.34  290.20   388.84   200  a 
    # fstat(y, g, TRUE) 4090.00 4136.72 4424.0470 4181.02 4450.12 16460.72   200   b
    

    so we write this function using f(..., use_lm = FALSE):

    perm <- function (y, g, n) replicate(n, fstat(sample(y), g)[[1L]])
    

    Now let's run it with n = 2000 (setting random seed for reproducibility):

    set.seed(0)
    F_perm <- perm(y, g, 2000)
    
    ## estimated p-value based on permutation
    mean(F_perm > F_obs[[1L]])
    # [1] 0.4915
    

    Note how close it is to the theoretical p-value:

    F_obs[[2L]]
    # [1] 0.4823429
    

    As you can see, the result agrees with your original code.

◀ Go back