Genome 560 Spring 2011
Statistics for Genomicists J. Felsenstein
Homework no. 3
This should be turned in by Tuesday, May 31
Use R to answer these questions. Cut-and-paste material from your R session
into a text file and email it to me. The homework should be turned in by
email, not on paper.
1. Here is a 10 x 2 contingency table, showing 100 people of different
age ranges classified as to whether or not they are sick (this is actually
not real data). We want to test whether the probability of disease
increases linearly with age.
Age Sick Not
1-10 4 96
11-20 3 97
21-30 6 94
31-40 7 93
41-50 8 92
51-60 17 83
61-70 12 88
71-80 18 82
81-90 14 86
91-100 20 80
You do not need to type this table in: it will be found in the "datasets"
folder on the course web page as "health.txt"
Suppose that we express the probability of being sick in each row as a
linear function a + b*r where r is the number of the row (1, 2, 3, ...).
(You can generate the row numbers using the "seq" command).
The probability of the data in row r can be computed using the
"dbinom" function.
1. Figure out how to use this to compute the log-likelihood for the whole
table as a function of a and b. Find the values of a and b that
maximize this log-likelihood (trial and error will work). (If you can
write an R function, that will make life easier).
2. Also compute the log-likelihood for the best value of a when the
slope b is held equal to zero.
3. Use these two log-likelihoods to carry out a likelihood ratio test
of the hypothesis that b = 0. How many degrees of freedom does it have?
2. In lecture I gave an example of Bayesian inference of tossing a coin 11
times and getting 5 heads, with a prior probability on heads which is
a truncated exponential distribution, where the exponential distribution
had a mean of 1. For a similar data set, with the same prior, suppose
that we toss 80 times and get 52 heads.
Use a fine grid of values of p with values from 0.001
to 0.999 (we're avoiding 0 and 1 because of overflow/underflow issues).
You can approximate integration by just summing up probabilities in
intervals in the arrays.
a. Compute the densities of the prior distribution, compute the
likelihood curve, and the posterior distribution. Please don't
send me them (too many numbers!) Instead,
I. ... get the 95% credible interval for the prior and for the
posterior. (The "credible interval" is just the name Bayesians
use for the equivalent of a confidence interval).
II. ... and also the mode (highest point) for the posterior
III. ... and the maximum likelihood estimate too.