Genome 560 Spring, 2011
Statistics for Genomicists J. Felsenstein
Exercise #6
We are going to do three analyses of variance, two one-way and
one a two-way analysis of variance. We use the {\tt lm} command
which fits a (generalized) "linear model" and then the {\tt anova}
command which presents the analysis of variance table.
In each case see if you can figure out what the analysis of
variance table means, what tests what, and why the number of
degrees of freedom is what it is.
Case 1. I have made a data frame called "example6.txt" which you will find
on the web site. (It is artificial data). Download it, save it to
the folder (directory) in which you will be working for this exercise.
It has three columns. The first is the observation number.
The second is the group number (grp), the third is the observation (x).
These variable names are known to the program after the data frame
is read in.
Do a one-way ANOVA. You need to read the table using
ff <- read.table("example6.txt") # the usual command
# and then use "lm" and "anova" commands:
mm <- lm(x ~ factor(grp), data=ff) # checks effect of variable "grp"
# which is a "factor" (not a number)
anova(mm) # reports the ANOVA table
# Try to figure out the ANOVA table.
mm # If you just print "mm" what does it show?
Case 2. These are (real) measurements of adductor muscle scar
length on shells of mussels collected at a number of localities.
A one-way ANOVA tests whether those localities differ.
The numerical value of variable V1 is to be used as a "factor",
so that it simply indicates group membership and is not
a numerical value of interest.
gg <- read.table("mussels.txt") # read the data frame (and also
# be sure to take a look at it)
m2 <- lm(V2 ~ V1, data=gg) # fit the linear model
# It knows to use V1 as a "factor"
# but you could use V2 ~ factor(V1) too
anova(m2) # print the ANOVA table
m2 # also the estimated effects
Do the localities differ? What accounts for the degrees of freedom?
Do we need to have equal numbers of data points in each group?
Case 3. Now for a two-way table. Mannose phosphate
isomerase enzyme had its enzyme activity measured in
some fish of three genotypes and two sexes, in multiple
fish of the same species. (The genotypes ff, fs, and ss
refer to the two alleles at the locus, which in old-fashioned
protein electrophoresis were the "fast" and "slow" alleles
which refers to where they move on the gel and not to enzyme
activity). We are going to do a two-way analysis of variance
where there the "rows" might be the sexes and the "columns"
genotypes (so a 2 x 3 table) with multiple data points in
each cell.
hh <- read.table("mpi.txt") # again, take a look at it afterwards
m3 <- lm(V4 ~ V2*V3, data=hh) # fit the linear model
# Once again, it does not need factor(V2)
# or factor(V3) as it can see they are not
# numbers. Or you could use "factor", e.g.:
# m3 <- lm(V4 ~ factor(V2)+factor(V3)+factor(V2)*factor(V3), data=hh)
anova(m3) # do sexes, genotypes matter?
# what does an interaction mean?
# what accounts for the degrees of freedom?
Case 4. If you have time, here we computer-simulate a set of
data in a two-way table where we choose the variances of the various effects.
After that we want to make up our own 2-way table, with 5 rows and
4 columns, with 6 obervations in each cell. Think up an overall
expectation called mu and assign it a value. Then assign
a variance component for rows (vr), a number that you choose, and the variance
component for columns (vc) and the variance component for cells (vij)
as well as the error variance (ve). You get to decide on these
values -- maybe you want one or more of these to have variance 0.
Now just do the following to make the columns of the data table
and then the data frame itself (don't ask why):
col <- rep(as.vector(seq(1,4) %o% rep(1,5)),6) # setting up the row ...
row <- rep(as.vector(rep(1,4) %o% seq(1,5)),6) # ... and column indices
roweffects <- rnorm(5,0,sd=sqrt(vr)) # 5 row effects
coleffects <- rnorm(4,0,sd=sqrt(vc)) # 4 column effects
celleffects <- rnorm(20, 0, sd=sqrt(vij)) # 20 cell effects
Y <- rnorm(120,0,sd=sqrt(ve)) + roweffects[row] + coleffects[col] + celleffects
# A good test of your understanding of R: what is that doing?
ff <- data.frame(row, col, Y) # make a frame of these three
(if you get impatient, after setting the values of vr, vc. vij, and ve
and then just copy-and-paste these commands).
Now try out a two-way analysis of variance. The model
is written Y ~ factor(row)*factor(col)
(can you figure out how to write the commands "lm" and "anova" with this?)
which is a short-hand for a two-way design with effects
for "row" for "col" and for their interaction, as well
as an error term.
Does the analysis find significance where there should be and
not where there shouldn't? Can you figure out why each of the
variance components has the degrees of freedom that it does?