Genome 560 Spring, 2011
Statistics for Genomicists J. Felsenstein
Exercise #10
In 1970 I was in charge of a small department reading room that
subscribed to 35 journals. As prices had risen, I had to survey 14 faculty
members of the Genetics Department as to which journals they most wanted to
keep, with scores of 1-5, 1 being highest enthusiasm.
In file "journals.txt" you will find a data frame with the answers (I
hope I'm not violating people's privacy too much). Names of people and
abbreviations of journals are visible.
Let's do a Principal Components analysis. (I did a similar analysis and
presented it to the Department as a special Journal Club on "The
intellectual structure of the department").
# Read in the data frame
journals <- read.table("journals.txt", header=TRUE) # then take a look at it, too
# Compute row means (average score for each person)
pmeans <- rowSums(journals)/35 # as there were 35 journals
# get deviation of each score from the person's mean
# which is a special correction we make for this data set
# as we want to correct for varying degrees of overall enthusiasm of people
pdevs <- journals-pmeans # (it repeats the means as needed)
# now compute the mean for each column(each journal)
jmeans <- colSums(pdevs)/14
# Now get the deviation of each value from its column mean. (This is
# somewhat arcane but trust me).
jdevs <- t(t(pdevs)-jmeans)
# get the 35x35 matrix of covariances of journals
jcovs <- cov(jdevs) # it considers rows to be the variables
# get the Singular Value Decomposition of the covariance matrix
# which is an object containing the eigenvalues jsvd$d as well as
jsvd <- svd(jcovs) # left eigenvectors (u), right (v)
# now look at the eigenvalues for this: How many nonzero components are
# there? How much of the variation is from the first few principal components?
jsvd$d # they are in order by size (variance explained by each)
# the coordinates of the journals on PC1 and PC2
x <- sqrt(jsvd$d[1])*jsvd$u[,1] # PC1 for journals
y <- sqrt(jsvd$d[2])*jsvd$u[,2] # PC2 for journals
# Let's plot PC1 versus PC2 for the journals
# we want to plot the names instead of points, to aid interpretation
# we also make sure the aspect ratio is 1 so scales of PC1 and PC2 are
# comparable. So first we set up an empty plot with the right aspect ratio
plot(x,y,type="n",asp=1) # this opens the plot but plots nothing ("n")
text(x,y,labels=colnames(journals)) # plots the labels at those points
# Now we can get the principal components of the people too.
# We take the V matrix from the singular value decomposition and multiply
# it by the deviations (I know this isn't obvious!).
w <- jsvd$v[,1] %*% t(jdevs)
z <- jsvd$v[,2] %*% t(jdevs)
plot(w,z,type="n",asp=1) # this opens the plot but plots nothing ("n")
text(w,z,labels=rownames(journals)) # plots the journal labels at those points
Interpret the journals PC1 and PC2. Are microbial and molecular journals
on one end of some axis? Where are general science journals such
and Science, Nature, and PNAS? Where are evolutionary journals
such as Evolution, American Naturalist? Where are human genetics
journals such as American Journal of Human Genetics?
What about PC3? PC4? Does anything helpful emerge if you
plot (say) PC1 versus PC3?
Warning -- I am not entirely sure that the methods here are correct. There
is also a principal components package called princomp. You can
do:
pc <- princomp(t(journals))
and look at pc$sdevs and pc$loadings
You might also want to read in the gene expression array data set
"RMA_Filtered.txt" and try to do Principal Components on it.
In the plotting of PC1 versus PC2 for samples, do the European
and African samples separate into different clusters? (I haven't
yet tried this so I can only guess).