Genome 560 Spring, 2011
Statistics for Genomicists J. Felsenstein
Exercise #8
Remember the code I just gave you for bootstrapping. It takes
a vector of numbers, makes an array that has that repeated as
each column (a large number, say 1000) of times:
We will do a bootstrap sampling, 1000 times, from a sample of
gene expression values. We will read in gene expression values
for 200 loci, and then use bootstrapping to put a confidence
interval on the standard deviation of the locus-to-locus values.
We make up a matrix which has the data set in each column,
then we need to sample with replacement from each column. To do
this we make up a "wrapper" function, "samp" which is simply
a function of one argument which samples from a vector with
replacement. Then we use the "apply" function to apply "samp"
to each column. After that we use "apply" to take the standard
deviation of each of those bootstrap samples (one per column).
x <- read.table("genes1000",header=TRUE) # read in values for two individuals
a <- x[1:200,3] # take the first 100 genes from one of them
# as our data set. Why this way? Don't ask.
You might want to look at the histogram of a to see how normal it looks.
Then:
m <- length(a) # the data sample size
b <- rep(a, times=1000) # 1000 times, end-to-end
d <- matrix(b, m, 1000) # makes an array with a in each column
samp <- function(x) {return(sample(x, m, replace=TRUE))} # the wrapper
e <- apply(d, 2, samp) # do samp on each column
f <- apply(e, 2, sd) # in that case we wanted the means
We want to compute a confidence interval on the standard deviation. We now
have a vector (f) of 1000 estimates of the standard deviation. To get the
upper and lower 2.5% points, what we do is take the empirical percentiles of
that vector. Do this (remember "quantile"?).
Also make a histogram (with maybe 40 classes) of the distribution of estimates
of the standard deviation.
2. If you have time (if ...) you might want to take a sample x from
an exponential distribution with expectation 1, and another
sample y from an exponential distribution with expectation 2.
The function for drawing (say) 200 values from an exponential
distribution with expectation 1 is rexp(200, rate=1), for
expectation two rexp(200, rate=1/2)
Use a permutation test to test whether their means are different.
The code I gave to do this was (don't do this immediately!):
m <- length(x)
n <- length(y)
z <- c(x, y ) # concatenate them into one vector
w <- sample(z, m+n, replace=FALSE) # permutes that vector
v1 <- w[1:m] # the first m
v2 <- w[m+1:m+n] # the next n
mean(v1)-mean(v2) # difference of means
But of course you'll have to do this with something like the
above code that we used for bootstrapping. Namely, take your
two samples and make a matrix that has z down each column.
For this you need to use "rep" to make a vector with 200 x 1000
values in it, namely z repeated end-to-end 1000 times.
Then make that into a matrix as done above.
This time you can simply apply "sample" to each column (without making
a wrapper function, by doing
b <- apply(a, 2, sample) % where is the 400 x 1000 matrix
(this is possible because the default behavior of "sample" is to
sample without replacement a number of times equal to the length of
the vector which is being sampled from).
To compute the differences of means we will need a wrapper function,
fun <- function(x){mean(x[1:100])-mean(x[101:200])}
Then we apply it to the columns of the array b. Looking at the
quantiles of the resulting vector of sampled differences of means,
how can be test whether the difference of means is zero?