Genome 560 Spring, 2011
Statistics for Genomicists J. Felsenstein
Exercise #9
We are going to do 5,194 tests and plot their P values. This is for the
data set of Storey, Akey et al., the one called RMA_Filtered.txt.
Then we're going to use Storey's method to compute False Discovery Rates.
The data consists of measurements of log gene expression for 5,194 loci
for two populations (European and African). In each population they
looked at 8 individuals and measured each of them twice.
We are going to simplify things by pretending that these 16 values (for
each population) are just independent samples. In a more complete
analysis we would have to take the levels of individuals and replicate
measurements into account by an analysis of variance (or some such).
For this exercise we just use a t-test.
The two populations each have 16 values: the first 16 data columns for
one population, and the next 16 columns for the other.
First we have to read in the data frame calling it (say) a:
a <- read.table("RMA_Filtered.txt", header=TRUE )
(that makes sure the headings of the columns are dealt with correctly.)
b <- as.matrix(a[,2:33])
This copies the numbers out into a 5194 by 32 matrix (as.matrix makes it
a true numeric matrix), with columns 1:32 of the new matrix being the
old columns 2:33 (so omitting the locus name).
Now we make up a function "fun" of one argument, so that we can apply
it to each row of that matrix. It returns the P value for the
two-sample t-test:
fun <- function(d){return(t.test(d[1:16],d[17:32])$p.value)}
Do you see what it does to row i of matrix b? Try
typing "fun(b[10,])", or whatever row you want, just to make sure it
works and gets the P value for that row.
Now we apply it to the rows of the matrix and get the P values in a vector
p <- apply(b, 1, fun)
That may take a minute or two to run.
Now p is the vector of P values.
Now, to see what they look like, plot the histogram of the
p's, using nclass=50 so you can see where the 0.05 cutoff is
(in case you want to use that).
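The plotting step might look like this (a sketch: p would be your
vector of P values from the apply() call; here a simulated vector
stands in just so the snippet runs on its own, and the red line is
optional):

```r
# p is your vector of P values from apply(); simulated here for illustration
p <- runif(5194)
hist(p, nclass = 50, main = "P values for 5194 t-tests", xlab = "P value")
abline(v = 0.05, col = "red")   # mark the 0.05 cutoff
```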
Choose a significance level (maybe 0.05, or 0.10, or whatever) and
call it alpha.
Find out how many there are:
sum(p < alpha)     # tells you how many there are
Look at a histogram of P values. What would be a good
cutoff for Storey's method? Does 0.5 look like a good one?
Call the desired cutoff value F and set that variable. (Note that F is
also R's shorthand for FALSE; assigning to it works, but a name like
"cutoff" would be safer.)
To do John Storey's method, find out how many are above F:
sum(p > F)
Now you should be able to calculate the FDR for that cutoff alpha
(just use the calculator capabilities of R). See the description
of the calculation in the lecture and do the proper projection
to figure out, from the number above F (out of 5194), what number
to expect below alpha that will represent false significance.
A reminder: first figure out how many are above some value such
as 0.5. From that, figure out how many you expect to be below your
cutoff (such as P = 0.1) that are actually false. The rest that are
below your cutoff may well be true. Now calculate the FDR for that
cutoff.
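The calculation just described can be sketched like this. Toy simulated
P values stand in for your real p vector so the sketch runs on its own;
alpha and F are the cutoffs you chose above:

```r
# Toy P values: 900 null tests (uniform) plus 100 with a real effect
# (concentrated near zero). With real data, use your own p vector.
set.seed(1)
p <- c(runif(900), rbeta(100, 1, 20))

alpha <- 0.05   # significance cutoff
F <- 0.5        # Storey cutoff: P values above F should be almost all null

n <- length(p)
# Project from the count above F to the fraction of truly null tests:
pi0 <- sum(p > F) / (n * (1 - F))
expected.false <- pi0 * n * alpha   # nulls expected to land below alpha
observed <- sum(p < alpha)          # everything that landed below alpha
FDR <- expected.false / observed
```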
Try some different values of alpha to see what we estimate the
FDR to be. You can use the same vector of p's for that. If we were trying to
figure out which loci showed significant differences between the two
populations, which value of alpha would we be happy with? If you're
adventurous, try to see if you can figure out how to tabulate how many of the
ones that were below the cutoff actually were from the null hypothesis (you
have all the information).
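For trying several values of alpha at once, it is convenient to wrap
the estimate in a small function (a sketch: the function name, the
simulated P values, and the particular alphas tried are just for
illustration):

```r
set.seed(2)
p <- c(runif(900), rbeta(100, 1, 20))  # toy data: 900 nulls, 100 signals

# Estimated FDR at cutoff alpha, projecting pi0 from P values above F
est.fdr <- function(p, alpha, F = 0.5) {
  pi0 <- sum(p > F) / (length(p) * (1 - F))
  (pi0 * length(p) * alpha) / sum(p < alpha)
}

sapply(c(0.01, 0.05, 0.10), function(a) est.fdr(p, a))
```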