Genome 560                                        Spring 2011
Statistics for Genomicists                        J. Felsenstein

Homework no. 2

This should be turned in by Thursday, May 19

Use R to answer these questions. Cut-and-paste material from your R session into a text file and email it to me (my email address is on my department faculty web page). The homework should be turned in by email, not on paper.

1. An expensive private school also asks for donations. Here are (actual) data (we have anonymized the school) on how many of the parents in each of its graduating classes have donated to the school's fund drive. The school is K-12. These are from the same year (2009) so the 2009 class means current 12th graders, 2010 means current 11th graders, and so on.

   Year  Donated  Total Parents      Year  Donated  Total Parents
   ----  -------  -------------      ----  -------  -------------
   2009     35         51            2016     19         27
   2010     42         56            2017     22         31
   2011     39         70            2018     20         30
   2012     37         60            2019     17         32
   2013     38         53            2020     34         34
   2014     35         54            2021     28         32
   2015     32         53

   a. Do a chi-square analysis (be careful to set up the table correctly) to find out whether there is any sign of parent burnout -- are donations equally likely in all grades?

   b. Is this to be done as one-tailed or two-tailed? Why? (Does a low chi-square mean a departure from the expected proportions?)

   c. Also, think of some way to lump parts of the table to make the test focus more on the question at issue, and not waste effort on (say) detecting whether there are differences that do not represent a long term trend. Carry it out and describe the results. Note that you can sum (say) column 2 of rows 1 to 8 of a table by the R command

         sum(a[1:8,2])

   d. What is the effect on the chi-square test, on average, if some of the parents have two (or more) children in more than one grade, and thus are listed as donating (or not) in both of those grades on the basis of the same donation or the same non-donation?

   e. What will be the effect if some of the parents decide on whether or not to donate too late in the year to be included in the table, and have a different relationship between grade and whether they donate? Explain your thinking about that.

2. A "data frame" named RMA_Filtered.txt is available at the course web page for download. These are gene expression levels for 5194 genes for samples from 8 people of European ancestry and 8 people of African ancestry by Josh Akey and colleagues (Storey et al., American Journal of Human Genetics, 2007). Each individual's values were measured twice, so there are 16 columns for each population. Read in the data, and take values for the first two individuals (both are from the same population). Here's how you do that in R: make sure the file "RMA_Filtered.txt" with the data frame is in the folder from which you are working. Then do:

      genes <- read.table(file="RMA_Filtered.txt", header=TRUE)   # read the frame
      a <- as.vector(genes[1:5194,2],mode="numeric")              # make two vectors
      b <- as.vector(genes[1:5194,4],mode="numeric")              # call them a, b

   a. Look at the values of the expression levels, their differences, and also of their logarithms. (The differences of their logs, not the logs of the differences, as logs of negative numbers are immoral and illegal). Compare their cumulative distributions to a normal distribution (the functions qnorm and qqplot may be useful for this -- you may have to find out how). Are the values or their differences, or the logs of the values or the differences of the logs normally distributed?

   b. Do an unpaired t-test for the data, and for the logs of the data.

   c. Do a paired t-test for these too.

   d. Why are the results of the tests so different?
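
For question 1, the mechanics of building a table for a chi-square test and of lumping its rows can be sketched as follows. This is only an illustration under stated assumptions: the counts and the names donated, total, tab and lumped are made up for the example and are not the school's data, and nothing here is meant as the intended solution.

```r
# Hypothetical counts for three groups (NOT the school's data).
donated <- c(30, 25, 18)
total   <- c(50, 50, 45)

# The table handed to chisq.test should hold the two outcomes as columns:
# those who donated and those who did not. The totals are the row sums.
tab <- cbind(donated = donated, not_donated = total - donated)
rownames(tab) <- c("group1", "group2", "group3")

res <- chisq.test(tab)
res$statistic   # chi-square value
res$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value
res$expected    # expected counts if all groups donate at the same rate

# Lumping rows: add up the counts over a range of rows, in the same
# spirit as sum(a[1:8,2]). Here groups 1 and 2 are pooled.
lumped <- rbind(first_two = colSums(tab[1:2, ]), last = tab[3, ])

# For a 2 x 2 table chisq.test applies a continuity correction by default;
# correct = FALSE gives the uncorrected chi-square.
chisq.test(lumped, correct = FALSE)
```

With real data, the same pattern applies once the Donated and Total Parents columns have been entered as two vectors of equal length.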
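
For question 2, the R calls for normal quantile plots and for unpaired and paired t-tests can be sketched in the same hedged way. The vectors x and y below are simulated stand-ins, not the expression data, and the simulation settings (200 values, the noise levels, the seed) are arbitrary assumptions chosen only so that the code runs on its own.

```r
set.seed(1)

# Simulated stand-ins for two vectors of positive measurements
# (NOT the values from RMA_Filtered.txt).
level <- rexp(200, rate = 1/100) + 10
x <- level * exp(rnorm(200, sd = 0.05))
y <- level * exp(rnorm(200, mean = 0.02, sd = 0.05))

# Normal Q-Q plots: qqnorm plots sample quantiles against normal quantiles,
# and qqline adds a reference line through the quartiles.
qqnorm(x, main = "raw values")
qqline(x)
d <- log(x) - log(y)          # difference of logs, not log of differences
qqnorm(d, main = "difference of logs")
qqline(d)

# The same kind of plot built by hand from qnorm and qqplot.
n <- length(x)
qqplot(qnorm(ppoints(n)), x,
       xlab = "normal quantiles", ylab = "sample quantiles")

# Unpaired (Welch two-sample) and paired t-tests on the logs.
unpaired <- t.test(log(x), log(y))
paired   <- t.test(log(x), log(y), paired = TRUE)
unpaired$p.value
paired$p.value
```

To apply this to the homework data, x and y would be replaced by the vectors a and b read in above.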