Genome 560 Spring 2011
Statistics for Genomicists J. Felsenstein
Homework no. 1
This should be turned in by Thursday, May 12
Use R to answer these questions. I do not know (I am embarassed to say) how
give an R command to record an R session, so you may have to simply
cut-and-paste material from your R session into a text file and email it to me
(my email address is on my department faculty web page) The homework should be
turned in by email, not on paper. (On some systems that have a File menu for
R at the top of the screen there is a Save As command there that can save the
whole session, I am told).
(1) Using R, read in the "genes1000" data set, which has replicate measurements
of gene expression levels (log-transformed) from one individual for 1000
different genes. This can be done by the commands:
x <- read.table("genes1000")
(The result is a "data frame" that has three columns, the gene name, the first
measurement for that gene, and the second measurement for that gene. There is
also an extra row at the top of the table with the variable names. The values
in the table may be character-strings instead of numbers. On some Windows
system the file name may be "genes1000.txt")
(2) Make an array of all 1000 values for the first measurement by the command
a <- as.numeric(as.matrix(x[2:1001,2]))
(3) Use the "range", "quantile", "mean", "var", and "sd" command to obtain
these descriptive statistics. Report them. For quantiles I want the 2.5%,
5%, 25%, 50%, 75%, 95%, and 97.5% points. The lower 25% and the upper 25%
point in a Normal distribution will be -0.67448 and 0.67488 standard
deviations from the mean. Is this approximately true?
(4) Use the "hist" command with 50 classes to examine the distribution
of the points. (You may be able to save the histogram as either a JPG or
a PDF file and attach it with your email of the results).
(5) Now we want to examine means of different numbers of genes. We will use
the genes in the order in which they were given in the data file. To make the
1000 points of vector "a" into 500 means of consecutive pairs of points, the
(tricky) way is to reformat it as a 500-row by 2-column two-dimensional array
with the command
b2 <- matrix(a, 500, 2)
and then apply the function "mean" to this, computing the mean for each row:
d2 <- apply(b2,1,mean)
The result is a vector of 500 averages, each an average of two successive
genes. Compute the mean, variance, and standard deviation of this array
(which has means of pairs). See whether the lower and upper 25% points are
-0.67488 and 0.67488 standard deviations from the mean.
(6) Use commands analogous to these to do the same for 250 means of 4, and 100
means of 10. For each compute the mean, variance, and standard deviation.
Also look at the lower and upper 25% quantiles, and ask whether they are
-0.67488 and 0.67488 standard deviations from the mean. How do the variances
and standard deviations change with the number of genes being averaged? Does
this happen in the way predicted for independent samples from a distribution?
Do the resulting histograms look more normal as we average more values?
(7) Finally, do all this again, but first take the square function a*a of the
1000 data points. Using the histogram command, see whether the resulting
distributions are approximately normal. Report how the means, variances and
standard deviations change as we take averages (means) of 4 or of 10 points.
Are the upper 2.5% points 1.95996 standard deviations above the mean? Do the
distributions look more normal the more values are averaged?