Genome 560 Spring 2011 Statistics for Genomicists J. Felsenstein Homework no. 3 This should be turned in by Tuesday, May 31 Use R to answer these questions. Cut-and-paste material from your R session into a text file and email it to me. The homework should be turned in by email, not on paper. 1. Here is a 10 x 2 contingency table, showing 100 people of different age ranges classified as to whether or not they are sick (this is actually not real data). We want to test whether the probability of disease increases linearly with age. Age Sick Not 1-10 4 96 11-20 3 97 21-30 6 94 31-40 7 93 41-50 8 92 51-60 17 83 61-70 12 88 71-80 18 82 81-90 14 86 91-100 20 80 You do not need to type this table in: it will be found in the "datasets" folder on the course web page as "health.txt" Suppose that we express the probability of being sick in each row as a linear function a + b*r where r is the number of the row (1, 2, 3, ...). (You can generate the row numbers using the "seq" command). The probability of the data in row r can be computed using the "dbinom" function. 1. Figure out how to use this to compute the log-likelihood for the whole table as a function of a and b. Find the values of a and b that maximize this log-likelihood (trial and error will work). (If you can write an R function, that will make life easier). 2. Also compute the log-likelihood for the best value of a when the slope b is held equal to zero. 3. Use these two log-likelihoods to carry out a likelihood ratio test of the hypothesis that b = 0. How many degrees of freedom does it have? 2. In lecture I gave an example of Bayesian inference of tossing a coin 11 times and getting 5 heads, with a prior probability on heads which is a truncated exponential distribution, where the exponential distribution had a mean of 1. For a similar data set, with the same prior, suppose that we toss 80 times and get 52 heads. Use a fine grid of values of p with values from 0.001 to 0.999 (we're avoiding 0 and 1 because of overflow/underflow issues). You can approximate integration by just summing up probabilities in intervals in the arrays. a. Compute the densities of the prior distribution, compute the likelihood curve, and the posterior distribution. Please don't send me them (too many numbers!) Instead, I. ... get the 95% credible interval for the prior and for the posterior. (The "credible interval" is just the name Bayesians use for the equivalent of a confidence interval). II. ... and also the mode (highest point) for the posterior III. ... and the maximum likelihood estimate too.