Genome 560 Spring 2011
Statistics for Genomicists J. Felsenstein
Homework no. 2
This should be turned in by Thursday, May 19
Use R to answer these questions. Cut and paste material from your R session
into a text file and email it to me (my email address is on my department
faculty web page). The homework should be turned in by email, not on paper.
1. An expensive private school also asks for donations. Here are (actual)
data (we have anonymized the school) on how many of the parents in
each of its graduating classes have donated to the school's fund drive.
The school is K-12. These are from the same year (2009), so the 2009 class
means current 12th graders, 2010 means current 11th graders, and so on.
Year  Donated  Total Parents      Year  Donated  Total Parents
----  -------  -------------      ----  -------  -------------
2009     35         51            2016     19         27
2010     42         56            2017     22         31
2011     39         70            2018     20         30
2012     37         60            2019     17         32
2013     38         53            2020     34         34
2014     35         54            2021     28         32
2015     32         53
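One possible way to get these numbers into R is to type them in as vectors and
bind them into a data frame. This is only a sketch: the names donations, Year,
Donated and Total are my own choices, not anything required by the assignment.

```r
# Enter the table above by hand, one vector per column.
# The object and column names here are arbitrary choices for illustration.
donations <- data.frame(
  Year    = 2009:2021,
  Donated = c(35, 42, 39, 37, 38, 35, 32, 19, 22, 20, 17, 34, 28),
  Total   = c(51, 56, 70, 60, 53, 54, 53, 27, 31, 30, 32, 34, 32)
)

print(donations)                 # check the entries against the table

# Rows and columns can be picked out by position:
# rows 1 to 8 of column 2 (the Donated column), summed.
print(sum(donations[1:8, 2]))
```

Entering the data once and checking the printout against the table avoids
typing errors that would otherwise carry through every later calculation.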
a. Do a chi-square analysis (be careful to set up the table correctly)
to find out whether there is any sign of parent burnout -- are donations
equally likely in all grades?
b. Is this to be done as one-tailed or two-tailed? Why? (Does a
low chi-square mean a departure from the expected proportions?)
c. Also, think of some way to lump parts of the table to make the
test focus more on the question at issue, and not waste effort on
(say) detecting whether there are differences that do not represent
a long term trend. Carry it out and describe the results.
Note that you can sum (say) column 2 of rows 1 to 8 of a table by
the R command sum(a[1:8,2])
d. What is the effect on the chi-square test, on average, if some
of the parents have two (or more) children in different grades,
and thus are listed as donating (or not) in each of those grades on the
basis of the same donation or the same non-donation?
e. What will be the effect if some of the parents decide on
whether or not to donate too late in the year to be included in
the table, and have a different relationship between grade and
whether they donate? Explain your thinking about that.
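As a reminder of the mechanics only, here is a minimal sketch of a chi-square
test on a table of counts, and of lumping rows of that table, using R's
chisq.test function. The counts are made up for illustration and are not the
school's data; the names yes, n, tab and lumped are likewise hypothetical.

```r
# Made-up counts for four hypothetical groups (NOT the school's data).
yes <- c(30, 25, 20, 15)          # number answering "yes" in each group
n   <- c(40, 40, 40, 40)          # group sizes
tab <- cbind(Yes = yes, No = n - yes)
rownames(tab) <- paste0("Group", 1:4)

res <- chisq.test(tab)            # test of independence on the 4 x 2 table
print(res$expected)               # expected counts under independence
print(res)                        # statistic, degrees of freedom, p-value

# Lump groups 1-2 and groups 3-4 by summing rows of each column.
lumped <- rbind(
  First  = c(sum(tab[1:2, 1]), sum(tab[1:2, 2])),
  Second = c(sum(tab[3:4, 1]), sum(tab[3:4, 2]))
)
colnames(lumped) <- colnames(tab)

# For a 2 x 2 table chisq.test applies a continuity correction by default;
# correct = FALSE turns it off.
print(chisq.test(lumped, correct = FALSE))
```

How to lump the real table, and which rows belong together, is a choice you
need to make and justify yourself.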
2. A "data frame" named RMA_Filtered.txt is available at
the course web page for download. These are gene expression levels
for 5194 genes, for samples from 8 people of European ancestry and
8 people of African ancestry, collected by Josh Akey and colleagues (Storey
et al., American Journal of Human Genetics, 2007). Each individual's
values were measured twice, so there are 16 columns for each population.
Read in the data, and take values for the first two individuals (both are
from the same population). Here's how you do that in R: make sure the file
"RMA_Filtered.txt" with the data frame is in the folder from which you are
working. Then do:
genes <- read.table(file="RMA_Filtered.txt", header=TRUE) # read the frame
a <- as.vector(genes[1:5194,2], mode="numeric") # make two vectors
b <- as.vector(genes[1:5194,4], mode="numeric") # call them a, b
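If the indexing is unfamiliar, it may help to try the same commands on a tiny
table first. The sketch below uses an invented three-gene table; its column
names and numbers are hypothetical, and the layout of the real file may differ,
so check the real one with str(genes) or head(genes) after reading it in.

```r
# A made-up miniature table (invented names and values, for practice only).
toy_lines <- c(
  "Gene A1  A2  B1  B2",
  "g1   7.1 7.3 6.9 7.0",
  "g2   5.2 5.0 5.6 5.5",
  "g3   9.8 9.9 9.1 9.4"
)
toy <- read.table(text = toy_lines, header = TRUE)

str(toy)                                        # column names and types

# Same pattern as above: rows 1 to 3, then column 2 and column 4.
a <- as.vector(toy[1:3, 2], mode = "numeric")
b <- as.vector(toy[1:3, 4], mode = "numeric")
print(a)
print(b)
```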
a. Look at the values of the expression levels, their differences, their
logarithms, and the differences of their logarithms. (The differences of
their logs, not the logs of the differences, as logs of negative numbers
are immoral and illegal).
Compare their cumulative distributions to a normal distribution (the
functions qnorm and qqplot may be useful for this -- you may have to
find out how). Are the values or their differences, or the logs of the
values or the differences of the logs normally distributed?
b. Do an unpaired t-test for the data, and for the logs of the data.
c. Do a paired t-test for these too.
d. Why are the results of the tests so different?
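For the mechanics of the commands involved, here is a sketch on simulated
numbers rather than the real data. Everything in it (the sample size, the
names level, x and y, and the way the values are generated) is an assumption
made for illustration; it shows only how the functions are called, not what
you should find in the expression data.

```r
set.seed(1)                                   # make the simulation repeatable
n <- 200                                      # number of simulated "genes"
level <- rnorm(n, mean = 2, sd = 1)           # hypothetical log-scale levels
x <- exp(level + rnorm(n, sd = 0.1))          # simulated values, sample 1
y <- exp(level + 0.1 + rnorm(n, sd = 0.1))    # simulated values, sample 2

# Normal quantile-quantile plot of the logs, with a reference line.
qq <- qqnorm(log(x))
qqline(log(x))

# The same kind of comparison done "by hand": normal quantiles from qnorm
# plotted against the observed differences using qqplot.
qqplot(qnorm(ppoints(n)), x - y,
       xlab = "Normal quantiles", ylab = "Observed differences")

# Unpaired (Welch) and paired t-tests on the logs.
unpaired <- t.test(log(x), log(y))
paired   <- t.test(log(x), log(y), paired = TRUE)
print(unpaired)
print(paired)
```

The same calls work with a and b in place of x and y, with or without log().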