Genome 560 Spring, 2011 Statistics for Genomicists J. Felsenstein Exercise #8 Remember the code I just gave you for bootstrapping. It takes a vector of numbers, makes an array that has that repeated as each column (a large number, say 1000) of times: We will do a bootstrap sampling, 1000 times, from a sample of gene expression values. We will read in gene expression values for 200 loci, and then use bootstrapping to put a confidence interval on the standard deviation of the locus-to-locus values. We make up a matrix which has the data set in each column, then we need to sample with replacement from each column. To do this we make up a "wrapper" function, "samp" which is simply a function of one argument which samples from a vector with replacement. Then we use the "apply" function to apply "samp" to each column. After that we use "apply" to take the standard deviation of each of those bootstrap samples (one per column). x <- read.table("genes1000",header=TRUE) # read in values for two individuals a <- x[1:200,3] # take the first 100 genes from one of them # as our data set. Why this way? Don't ask. You might want to look at the histogram of a to see how normal it looks. Then: m <- length(a) # the data sample size b <- rep(a, times=1000) # 1000 times, end-to-end d <- matrix(b, m, 1000) # makes an array with a in each column samp <- function(x) {return(sample(x, m, replace=TRUE))} # the wrapper e <- apply(d, 2, samp) # do samp on each column f <- apply(e, 2, sd) # in that case we wanted the means We want to compute a confidence interval on the standard deviation. We now have a vector (f) of 1000 estimates of the standard deviation. To get the upper and lower 2.5% points, what we do is take the empirical percentiles of that vector. Do this (remember "quantile"?). Also make a histogram (with maybe 40 classes) of the distribution of estimates of the standard deviation. 2. If you have time (if ...) you might want to take a sample x from an exponential distribution with expectation 1, and another sample y from an exponential distribution with expectation 2. The function for drawing (say) 200 values from an exponential distribution with expectation 1 is rexp(200, rate=1), for expectation two rexp(200, rate=1/2) Use a permutation test to test whether their means are different. The code I gave to do this was (don't do this immediately!): m <- length(x) n <- length(y) z <- c(x, y ) # concatenate them into one vector w <- sample(z, m+n, replace=FALSE) # permutes that vector v1 <- w[1:m] # the first m v2 <- w[m+1:m+n] # the next n mean(v1)-mean(v2) # difference of means But of course you'll have to do this with something like the above code that we used for bootstrapping. Namely, take your two samples and make a matrix that has z down each column. For this you need to use "rep" to make a vector with 200 x 1000 values in it, namely z repeated end-to-end 1000 times. Then make that into a matrix as done above. This time you can simply apply "sample" to each column (without making a wrapper function, by doing b <- apply(a, 2, sample) % where is the 400 x 1000 matrix (this is possible because the default behavior of "sample" is to sample without replacement a number of times equal to the length of the vector which is being sampled from). To compute the differences of means we will need a wrapper function, fun <- function(x){mean(x[1:100])-mean(x[101:200])} Then we apply it to the columns of the array b. Looking at the quantiles of the resulting vector of sampled differences of means, how can be test whether the difference of means is zero?