## Monte Carlo estimate of p-value for F-statistic in ANOVA based on permutation method

Question

I am trying to do a permutation test for ANOVA on (y_{1},...,y_{N}) with group identifiers `g`

. I am supposed to be using (1)/(g-1) summation of (muhat_{j} - muhat)^2 as the test statistic and muhat_{j} is the jth group sample mean, and muhat=(1/g)summation muhat_{j}.

```
## data
y <- c(6.59491, 6.564573, 6.696147, 6.321552, 6.588449, 6.853832,
6.370895, 6.441823, 6.227591, 6.675492, 6.255462, 6.919716, 6.837458,
6.41374, 6.543782, 6.562947, 6.570343, 6.993634, 6.666261, 7.082319,
7.210933, 6.547977, 6.330553, 6.309289, 6.913492, 6.597188, 6.247285,
6.644366, 6.534671, 6.885325, 6.577568, 6.499041, 6.827574, 6.198853,
6.965038, 6.58837, 6.498529, 6.449476, 6.544842, 6.496817, 6.499526,
6.709674, 6.946934, 6.23884, 6.517018, 6.206692, 6.491935, 6.039925,
6.166948, 6.160605, 6.428338, 6.564948, 6.446658, 6.566979, 7.17546,
6.45031, 6.612242, 6.559798, 6.568082, 6.44193, 6.295211, 6.446384,
6.658321, 6.369639, 6.066747, 6.345537, 6.727513, 6.677873, 6.889841,
6.724438, 6.379956, 6.380779, 6.50096, 6.676555, 6.463236, 6.239091,
6.797642, 6.608025)
## group
g <- structure(c(2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 2L,
3L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 1L,
2L), .Label = c("B1", "B2", "B3"), class = "factor")
```

This is what I have right now, but it is not working when I change it to test for the sample means instead of the F statistic. I'm pretty sure that I need to change `T.obs`

and `T.perm`

to something similar to `by(y, g, mean)`

but I think there is more that I am missing.

```
n <- length(y) #sample size n
T.obs<- anova(lm(y ~ g))$F[1] #Observed statistic
n.perm <- 2000 # we will do 2000 permutations
T.perm <- rep(NA, n.perm) #A vector to save permutated statistic
for(i in 1:n.perm) {
y.perm <- sample(y, n, replace=F) #permute data
T.perm[i] <- anova(lm(y.perm ~ g))$F[1] #Permuted statistic
}
mean(T.perm >= T.obs) #p-value
```

Show source

## Answers ( 1 )

I don't really know what you mean by "

it is not working". As far as I can see, it works alright, except that it is slightly slow.This is fairly close to the theoretical value

So, yes, you are doing all correct!

From the first paragraph of your question it sounds like we want to compute F-statistic ourselves, so the following function does this. There is a switch

`"use_lm"`

. If set`TRUE`

, it uses`anova(lm(y ~ g))`

as what is done in your original code.This function aims to make computation of F-statistic and p-value transparent. Also, manual computation is 15 times faster than calling`lm`

and`anova`

(which is an obvious thing...).Let's first have a check on the validity of this function:

Don't be surprised that it is insignificant. Your data does not really suggest significant group difference. Look at the boxplot:

Now we move on to the permutation. We write another function

`perm`

for this purpose. It is actually pretty easy, because we have a nicely defined`fstat`

. All we need to do is to use`replicate`

to wrap up`sample`

+`fstat`

.`lm`

is actually very slow:so we write this function using

`f(..., use_lm = FALSE)`

:Now let's run it with

`n = 2000`

(setting random seed for reproducibility):Note how close it is to the theoretical p-value:

As you can see, the result agrees with your original code.