Genome 560, Spring 2011
Statistics for Genomicists
J. Felsenstein

Exercise #9

We are going to do 5,194 tests and plot their P values. This is for the data set of Storey, Akey et al., the one called RMA_Filtered.txt. Then we're going to use Storey's method to compute False Discovery Rates.

The data consist of measurements of log gene expression at 5,194 loci in two populations (European and African). In each population they looked at 8 individuals and measured each of them twice. We are going to simplify things by pretending that these 16 values (for each population) are just independent samples. In a more complete analysis we would have to take the levels of individuals and replicate measurements into account by an analysis of variance (or some such). For this exercise we just use a t-test. The two populations each have 16 values: the first 16 columns and the next 16 columns.

First we have to read in the data frame, calling it (say) a:

   a <- read.table("RMA_Filtered.txt", header=TRUE)

(The header=TRUE makes sure the column headings are dealt with correctly.)

   b <- a[,2:33]

This copies the numbers into a 5194 by 32 matrix, with columns 1:32 of the new matrix being the old columns 2:33 (omitting the locus name).

Now we make up a function "fun" of one argument, which is needed so we can apply it to each row of that matrix. It returns the P value for the two-sample t-test:

   fun <- function(d) { return(t.test(d[1:16], d[17:32])$p.value) }

Do you see what it does to row i of matrix b? Try typing fun(b[10,]), or whatever row you want, just to make sure it works and gets the P value for that row.

Now we apply it to the rows of the matrix and get the P values in a vector:

   p <- apply(b, 1, fun)

That may take a minute or two to run. Now p is the vector of P values. To see what they look like, plot a histogram of the p's, using nclass=50 so you can see where the 0.05 cutoff is (in case you want to use that).
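The steps above can be checked end to end with a small self-contained sketch. Since RMA_Filtered.txt may not be at hand, this uses simulated data of the same shape (that substitution is an assumption of the sketch, not part of the exercise); with the real file you would build b from read.table as described instead.

```r
# Simulated stand-in for the RMA_Filtered.txt matrix (real data: 5194 loci
# by 32 columns; 200 loci here just to keep the check fast).
# Column layout matches the exercise: columns 1:16 are one population,
# columns 17:32 the other.
set.seed(1)
nloci <- 200
b <- matrix(rnorm(nloci * 32), nrow = nloci, ncol = 32)

# P value of the two-sample t-test comparing the two populations in one row
fun <- function(d) { return(t.test(d[1:16], d[17:32])$p.value) }

# Try it on a single row first, as the exercise suggests
fun(b[10, ])

# Apply it to every row; p is then a vector of nloci P values
p <- apply(b, 1, fun)

# Histogram with 50 classes, so the 0.05 region is visible
hist(p, nclass = 50)
```

Because these simulated rows are all drawn from the null hypothesis, the histogram should look roughly flat; with the real data you would expect a spike of small P values.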
Choose a significance level alpha (maybe 0.05, or 0.10, or whatever); call it alpha. Find out how many P values fall below it:

   tabulate(as.numeric(p < alpha))   # tells you how many there are

Look at a histogram of the P values. What would be a good cutoff for Storey's method? Does 0.5 look like a good one? Call the desired cutoff value F and set that variable. To do John Storey's method, find out how many P values are above F:

   tabulate(as.numeric(p > F))

Now you should be able to calculate the FDR for that cutoff alpha (just use the calculator capabilities of R). See the description of the calculation in the lecture, and do the proper projection to figure out, from the number above F (out of 5194), what number to expect below alpha that will represent false significance.

A reminder: first figure out how many are above some value such as 0.5. From that, figure out how many to expect below your cutoff (such as P = 0.1) that will actually be false. Take the rest that are below your cutoff: those may well be true. Now calculate the FDR from using that cutoff.

Try some different values of alpha to see what we estimate the FDR to be. You can use the same vector of p's for that. If we were trying to figure out which loci showed significant differences between the two populations, which value of alpha would we be happy with?

If you're adventurous, try to see if you can figure out how to tabulate how many of the ones that were below the cutoff actually were from the null hypothesis (you have all the information).
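The projection described above can be sketched in a few lines of R. This is a hedged outline, not the worked answer: the p vector is simulated here so the sketch is self-contained (with the real data you would use the p from the t-tests), and the variable names m0.hat, expected.false, and FDR are illustrative, not from the handout.

```r
# Simulated P values standing in for the 5194 t-test results:
# mostly uniform (null) with a minority piled up near zero (non-null).
set.seed(2)
p <- c(runif(4700), rbeta(494, 0.5, 10))

alpha <- 0.05   # significance cutoff
F     <- 0.5    # Storey cutoff

# How many P values lie above F (same count that
# tabulate(as.numeric(p > F)) reports)
n.above <- sum(p > F)

# Projection: null P values are uniform, so the count above F
# represents a fraction (1 - F) of all the null tests.
m0.hat <- n.above / (1 - F)      # estimated number of null tests

# Of those nulls, a fraction alpha is expected below the cutoff
expected.false <- m0.hat * alpha

# Everything below the cutoff, false and true together
n.sig <- sum(p < alpha)

# FDR estimate for this cutoff: expected false / total declared significant
FDR <- expected.false / n.sig
FDR
```

Rerunning the last few lines with different values of alpha (using the same p vector) shows how the estimated FDR changes with the cutoff, which is the comparison the exercise asks for.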