Genome 560 Spring, 2011 Statistics for Genomicists J. Felsenstein ` Exercise 7 This exercise uses a population genetics data set as an example of maximum likelihood estimation. This case can also be treated in other ways, as we will note. It will also serve as some familiarization with how you write functions of variables in R. Suppose that there is a disease (let's say halitosis) which is partly genetically determined. The genotype aa has a 40% chance of getting the disease, and the other two possible genotypes, AA and Aa, each have a 10% chance of getting the disease. We want to estimate the frequency of the A allele, and put a rough confidence interval on it. If the gene frequency of the A allele is p, and that of a is 1-p, then the frequency of the disease is expected to be (if the genotypes are in Hardy-Weinberg proportions as a result of random mating: F = p^2 (0.1) + 2p(1-p) (0.1) + (1-p)^2 (0.4) (where the notation "^2" means "squared"). Now suppose we observe 1000 indviduals and find that the 182 of them have the disease. Using a binomial distribution, the probability that 182 out of 1000 have the disease is the binomial class probability for 182 out of 1000 when the probability of the event is F (which is a function of p). This is 1000! -------- F^(182) (1-F)^(818) 182! 818! we want to maximize this by varying p. The first thing to notice is that the factorial stuff in front doesn't matter -- it will be the same for all values of F, and it will cancel out in any calculation of the ratios of likelihoods. So we drop it. Then the log-likelihood is 182 log(F) + 818 log(1-F) We want to write a function LL, the log-likelihood for p. For this we create two functions (just copy them but look at them too and try to figure out what they do, as this is how you make your own functions): F <- function(p) { return(0.1*(p^2+2*p*(1-p)) + 0.4*(1-p)^2) } LL <- function(f) { return(182*log(f)+818*log(1-f)) } Note -- more generally you put statements inside the parentheses that define the body of the function, and these statements are separated by semicolons. The last one should be return(...) which computes the value of the value of the function. A function can also have multiple arguments such as LL(a, b, x) Enter these commands. Then 1. Calculate LL(F(p)) for some values of p. 2. To make life easier, using commands like seq( ... ) to make a vector (called, say, x) of gene frequencies from just above zero to nearly 1 (don't use 0 or 1 as the logarithms will explode, which is scary). 3. Make a vector of log-likelihood values by doing LL(F(x)). 4. Plot this against x 5. Repeat commands 2 and 3 to narrow down the search, using seq(...) in appropriate ways. Look its syntax up using ?seq 6. A Likelihood Ratio Test will reject a suggested value of p if it makes the log likelihood 1.92 smaller (i.e. twice the difference in log-likelihoods is the significant chi-square for one degree of freedom, 3.841). Can you come up with a way to use seq(...) and the functions LL and f to see which values of p are not going to be rejected? This is an (approximate) likelihood-based confidence interval. 7. (Extra credit, so to speak) Suppose you could use the function LL and a vector of values of f (not p). Could you find the values of f that were in the confidence interval on f? Do those exactly correspond to the values you found for p?