Genome 560 Spring 2011 Statistics for Genomicists J. Felsenstein Homework no. 1 This should be turned in by Thursday, May 12 Use R to answer these questions. I do not know (I am embarassed to say) how give an R command to record an R session, so you may have to simply cut-and-paste material from your R session into a text file and email it to me (my email address is on my department faculty web page) The homework should be turned in by email, not on paper. (On some systems that have a File menu for R at the top of the screen there is a Save As command there that can save the whole session, I am told). (1) Using R, read in the "genes1000" data set, which has replicate measurements of gene expression levels (log-transformed) from one individual for 1000 different genes. This can be done by the commands: x <- read.table("genes1000") (The result is a "data frame" that has three columns, the gene name, the first measurement for that gene, and the second measurement for that gene. There is also an extra row at the top of the table with the variable names. The values in the table may be character-strings instead of numbers. On some Windows system the file name may be "genes1000.txt") (2) Make an array of all 1000 values for the first measurement by the command a <- as.numeric(as.matrix(x[2:1001,2])) (3) Use the "range", "quantile", "mean", "var", and "sd" command to obtain these descriptive statistics. Report them. For quantiles I want the 2.5%, 5%, 25%, 50%, 75%, 95%, and 97.5% points. The lower 25% and the upper 25% point in a Normal distribution will be -0.67448 and 0.67488 standard deviations from the mean. Is this approximately true? (4) Use the "hist" command with 50 classes to examine the distribution of the points. (You may be able to save the histogram as either a JPG or a PDF file and attach it with your email of the results). (5) Now we want to examine means of different numbers of genes. We will use the genes in the order in which they were given in the data file. To make the 1000 points of vector "a" into 500 means of consecutive pairs of points, the (tricky) way is to reformat it as a 500-row by 2-column two-dimensional array with the command b2 <- matrix(a, 500, 2) and then apply the function "mean" to this, computing the mean for each row: d2 <- apply(b2,1,mean) The result is a vector of 500 averages, each an average of two successive genes. Compute the mean, variance, and standard deviation of this array (which has means of pairs). See whether the lower and upper 25% points are -0.67488 and 0.67488 standard deviations from the mean. (6) Use commands analogous to these to do the same for 250 means of 4, and 100 means of 10. For each compute the mean, variance, and standard deviation. Also look at the lower and upper 25% quantiles, and ask whether they are -0.67488 and 0.67488 standard deviations from the mean. How do the variances and standard deviations change with the number of genes being averaged? Does this happen in the way predicted for independent samples from a distribution? Do the resulting histograms look more normal as we average more values? (7) Finally, do all this again, but first take the square function a*a of the 1000 data points. Using the histogram command, see whether the resulting distributions are approximately normal. Report how the means, variances and standard deviations change as we take averages (means) of 4 or of 10 points. Are the upper 2.5% points 1.95996 standard deviations above the mean? Do the distributions look more normal the more values are averaged?