Genome 560 Spring 2011
Statistics for Genomicists J. Felsenstein
Homework no. 4
This should be turned in by the end of Thursday, June 2. Please do
this on time if you can.
(Fear not -- I have provided the R code this time.)
I have provided a downloadable subset of Josh Akey's gene expression
data. They used gene expression arrays on cell lines for 8
individuals per population from two populations, one European and one
African, and measured each gene's expression twice.
(Storey et al., American Journal of Human Genetics, 2007,
which can be downloaded from http://www.genomine.org/publications.html )
When I tried to do an ANOVA on their full data the R system ran out of
memory. So I have provided a 100-gene subset of their data, in a data
frame called genes100.txt which you can download at the course
web site. Read it in (let's say it is then called jd ):
jd <- read.table("genes100.txt",header=TRUE)
The variables are named genes, pops, indivs, and
x . With 2 populations, 8 individuals each, 2 measurements each
and 100 genes there are 3200 observations.
* Do a three-way analysis of variance on the values of the
measurement x. Note that individuals are limited to single
populations, so we have to be a bit subtle about which interactions
to allow: we allow main effects (G, P and I) and interactions GxP,
GxI but not PxI (as no individual lives in more than one population).
Likewise the three-way interaction GxPxI is meaningless and absent.
aa <- lm(x~factor(genes)+factor(pops)+factor(indivs)+factor(genes)*factor(pops)+factor(genes)*factor(pops)+factor(genes)*factor(indivs),data=jd)
(Actually, the linear model machinery is very smart: if you just
naively type
aa <- lm(x~factor(genes)*factor(pops)*factor(indivs),data=jd)
the lm function figures out that individuals do not exist in more than
one population, and it never tries to fit the interaction of pops with
indivs, or the three-way interaction. So you can type either, you get
the same result.)
The analysis of variance may still take some minutes on your machine.
When the above statement finishes, look at the resulting ANOVA table
by using the command:
anova(aa)
* Show me the ANOVA table, and interpret it. Each F value is the ratio
of two mean squares. Explain for each one which two that is (you
will be able to confirm that by doing the ratio yourself).
For each of these F tests, what question is it answering?
Explain why the degrees of freedom for each mean square are what they are.
* Plot the values of x as functions of other variables
by doing commands like:
boxplot(jd$x~factor(jd$pops),data=jd)}
Does this help? Do one of these for each of the three explanatory
variables (genes, pops, and indivs). What do they tell you and why?
* Does the result change much if you analyze the log of x?
The square root of x? These can be done by replacing
x by, say, log(x) in the lm( ... ) command.
Note -- if your machine has too little memory to run these
tests, you could subset the data by copying it: jd2 <- jd[1:1600,]
(which would get you 50 genes instead of 100) and that might be analyzable.