Computer exercise in R for Lecture 4
THE DATA AND THE TREE
I have provided two 100-species data sets in an R-compatible format. Both
are derived from the same simulation which evolves 10 landmarks (hence 20
coordinates) in a tree of 100 species, with size and shape changing. The
species are named A, AA, AB, B, BA, BB, ... with 100 species in all. The
names are in random order but correspond to the names at the tips of the tree.
They have been simulated evolving on a 100-species tree which itself was
generated by randomly splitting lineages (a "pure birth process" or "Yule
process").
I have provided the coordinates in two forms:
(1) the coordinates in their original form, correctly superposed, but as forms
that have size differences as well as shape differences, and
(2) the coordinates after using a Generalized Procrustes Analysis to superpose
them and rescale them to a common size.
These files are in your Catalyst area, and also linked to on the course
webpage. Their names are:
fishes1intree.txt The tree file, in Newick format
fishes1.txt The forms, in correct superposition
fishes1proc.txt The shapes after Procrustes superpositon
THE ANALYSIS
The exercise will be to read these in, do a phylogenetic comparative methods
analysis on them, using the "pic" command from the "ape" R package, and get
a covariance matrix of variation of the contrasts, and also one from the raw
data (the 100 species) without correction for the phylogeny. The eigenvectors
(particularly the first and second ones) may be compared.
Yes, I know Fred's telling you that principal components of a covariance
matrix are meaningless -- this is the opportunity to get Joe and Fred arguing
with each other.
THE PROGRAMS AND SOME DOCUMENTATION
There is some good online documentation on how to do phylogeny methods in R.
Here are links to some. They are far more detailed than we will need.
1. A CRAN Taks View of "Phylogenetics, especially comparative methods" by the
knowledgeable Brian O'Meara:
https://cran.r-project.org/web/views/Phylogenetics.html
2. The documentation file for the "ape" package which we will use:
https://cran.r-project.org/web/packages/ape/ape.pdf
3. Marguerite Butler, Brian O'Meara, and Jason Pienaar's introduction to
"Comparative Methods and Data Analysis in R":
http://www2.hawaii.edu/~mbutler/Rquickstart/Rcomparative.pdf
SOME HINTS
1. Load the "ape" phylogeny package into your R environment with the
command
library(ape)
2. Read in the tree with the "ape" command
fishes1tree <- read.tree(file = "fishes1intree.txt")
3. A data set can be read in using the R command
fishes1data <- read.table("fishes1.txt")
(or the corresponding command for the Procrustes-superposed data)
4. A matrix with 20 columns can be obtained from that by the R command
fishes1data2 <- data.matrix(fishes1data[,2:21])
which uses all but the first column of the data table.
5. The vector of names which the "pic" command needs can be obtained from
the first column of the data table:
fishes1names <- data.matrix(fishes1data[,1])
6. We can obtain the contrasts, using the tree, for a column of the data
matrix, say column 8, with the commands
X <- fishes1data2[,8]
and being careful to associate our vector of names with that vector:
names(X) <- fishes1names
Then the command to get the contrasts is simply
Xc <- pic(X, fishes1intree)
The contrasts are in "postorder" order, left to right and top to bottom
on the tree. In the array of contrasts values, for some reason they are
numbered 101 to 199. I'm not sure why "pic" does that.
7. You can do this on each of the 20 columns, and associate these contrasts
as columns of a 99 x 20 matrix. Here is how to put the first two together
to make a 99x2 matrix:
Zc <- pic(X, fishes1tree);
Y <- fishes1data2[,2];
names(Y) <- fishes1names;
Yc <- pic(Y, fishes1tree);
Zc <- cbind(Xc, Yc);
But how to bind all 20 columns of contrasts into a single matrix? You need
to write a loop to do that.
8. You can get a 99x20 matrix of contrasts this way. You can also just take
the 100x20 matrix of the raw data which is in "fishes1data2".
9. Now how to get the eigenvalues and eigenvectors? Explore the R commands
"cov" and "eigen", or else "svd"
10. You may find it instructive to plot two columns of the data matrix against
each other, and compare this to the plot of the same two columns of the
contrasts matrix. Are the P values in the regression the same?
11. If you want to plot the fishes along with arrows that indicate where a
particular principal component is on the form, you can use this function
(cut-and-paste it into your R environment):
plotformsandpc <- function(a, n, pc) {
indices <- sample(1:nrow(a),n,replace=FALSE); # forms to be plotted
xmin <- min(a[indices,seq(1,ncol(a)-1,2)]);
xmax <- max(a[indices,seq(1,ncol(a)-1,2)]);
ymin <- min(a[indices,seq(2,ncol(a),2)]);
ymax <- max(a[indices,seq(2,ncol(a),2)]);
first <- TRUE;
for (i in indices) { # plot each specimen
x <- a[i,seq(1,ncol(a)-1,2)];
y <- a[i,seq(2,ncol(a),2)];
if (first) { # plot the first one
plot(x,y,type="l",xlim=c(xmin,xmax),ylim=c(ymin,ymax),asp=1,col="gray",lwd=2);
first=FALSE;
}
else {
lines(x,y,col="gray",lwd=2); # add in the others
}
# close the outline by drawing the last line
lines(c(a[i,ncol(a)-1],a[i,1]),c(a[i,ncol(a)],a[i,2]),col="gray",lwd=2);
}
for (j in seq(1,ncol(a)-1,2)) { # plot pc at mean for each landmark
wherex <- mean(a[indices,j]);
wherey <- mean(a[indices,j+1]);
arrows(wherex-pc[j],wherey-pc[j+1],wherex+pc[j],wherey+pc[j+1],
col="red",length=0.1,lwd=4);
}
}
To use it, the three arguments are the 100x20 array of coordinates, the number
of randomly-chosen specimens to plot, and a vector of the 20 coefficients of a
principal component. The forms are plotted in gray, and the principal
component in red. To make the principal component larger or smaller, just
multiply the argumnent "pc" by a number such as 2 or 0.3.