Genome 560 Spring, 2011
Statistics for Genomicists J. Felsenstein
`
Exercise 7
This exercise uses a population genetics data set as an example of
maximum likelihood estimation. This case can also be treated in
other ways, as we will note. It will also serve as some familiarization
with how you write functions of variables in R.
Suppose that there is a disease (let's say halitosis) which is partly
genetically determined. The genotype aa has a 40% chance of getting the
disease, and the other two possible genotypes, AA and Aa, each have a 10%
chance of getting the disease. We want to estimate the frequency of the A
allele, and put a rough confidence interval on it.
If the gene frequency of the A allele is p, and that of a is 1-p,
then the frequency of the disease is expected to be (if the genotypes
are in Hardy-Weinberg proportions as a result of random mating:
F = p^2 (0.1) + 2p(1-p) (0.1) + (1-p)^2 (0.4)
(where the notation "^2" means "squared"). Now suppose we observe
1000 indviduals and find that the 182 of them have the disease.
Using a binomial distribution, the probability that 182 out of 1000
have the disease is the binomial class probability for 182 out of 1000
when the probability of the event is F (which is a function of p).
This is
1000!
-------- F^(182) (1-F)^(818)
182! 818!
we want to maximize this by varying p. The first thing to notice
is that the factorial stuff in front doesn't matter -- it will be the
same for all values of F, and it will cancel out in any calculation
of the ratios of likelihoods. So we drop it.
Then the log-likelihood is 182 log(F) + 818 log(1-F)
We want to write a function LL, the log-likelihood for p.
For this we create two functions (just copy them but look at them too
and try to figure out what they do, as this is how you make your own
functions):
F <- function(p) {
return(0.1*(p^2+2*p*(1-p)) + 0.4*(1-p)^2)
}
LL <- function(f) {
return(182*log(f)+818*log(1-f))
}
Note -- more generally you put statements inside the parentheses that define
the body of the function, and these statements are separated by semicolons.
The last one should be return(...) which computes the value of the
value of the function. A function can also have multiple arguments such
as LL(a, b, x)
Enter these commands. Then
1. Calculate LL(F(p)) for some values of p.
2. To make life easier, using commands like seq( ... ) to make a vector
(called, say, x) of gene frequencies from just above zero to nearly 1 (don't
use 0 or 1 as the logarithms will explode, which is scary).
3. Make a vector of log-likelihood values by doing LL(F(x)).
4. Plot this against x
5. Repeat commands 2 and 3 to narrow down the search, using seq(...)
in appropriate ways. Look its syntax up using ?seq
6. A Likelihood Ratio Test will reject a suggested value of p if it
makes the log likelihood 1.92 smaller (i.e. twice the difference in
log-likelihoods is the significant chi-square for one degree of freedom,
3.841). Can you come up with a way to use seq(...) and the functions
LL and f to see which values of p are not going to be rejected?
This is an (approximate) likelihood-based confidence interval.
7. (Extra credit, so to speak) Suppose you could use the function LL
and a vector of values of f (not p). Could you find the values of
f that were in the confidence interval on f? Do those exactly
correspond to the values you found for p?