Genome 560 Spring 2011
Statistics for Genomicists J. Felsenstein
R exercise 4
Chi-square etc.
In these exercises, look at each result. I've left those steps out.
# We're going to computer-simulate data from a 2 x 2 table that
# does not have the same proportions in all rows.
# Draw 300 draws from four outcomes if probabilities 0.1, 0.2, 0.4, 0.3
a <- sample(c(1,2,3,4), 300, replace=TRUE, prob=c(0.10,0.20,0.40,0.30))
# (this means that "sample" draws 300 times, with replacement, with values
# 1, 2, 3, or 4, and with probabilities 0.10, 0.20, 0.40, 0.30)
# This looks like this:
# 0.1 0.2
# 0.4 0.3
# so the ratios of columns are different in the two rows (1:2, 4:3)
b <- tabulate(a) # make it into a vector of counts
tb <- matrix(b, 2, 2) # then make this into a 2x2 table
chisq.test(tb) # do a 2x2 chi-square test
fisher.test(tb) # is it much different?
# what happens to the above if you do a smaller number of draws?
# You can do that without a lot of retyping by using your up-arrow key
# to scroll back through the commands, then use your left-arrow key to
# go back to a character and use the delete key to delete it and then type
# in what it should be.
# Now we're going to simulate data from a Poisson distribution
d <- rpois(200, 3.0) # 200 draws from a Poisson distribution with mean 3.0
ob <- tabulate(d+1) # count them up (have to add 1 as it ignores zeros!)
# (annoying: must add 1 to each so none are zero!)
q <- dpois(0:(length(ob)-2),3.5) # produce class probabilities for a
# Poisson distribution with one fewer class, with mean
# 3.5 so these are somewhat wrong probabilities as we
# know the expectation is really 3.0
q <- c(q, 1-sum(q)) # Add on the last class after with the rest of the prob.
chisq.test(ob,p=q) # Do a goodness-of-fit chi square test with probs q
ex <- 200*q # get the expected numbers
sum((ob-ex)^2/ex) # that's a way of computing the chi-square by yourself!
# what is it doing? Did it get the right number?
# can you try some other values of the mean of the Poisson (other than 4)
# and do the test for those? Use you up-arrow key to recall previous
# commands and edit them when doing this (there are just 3 commands to
# repeat). What turns out to be the minimum-chi-square estimate of the mean?
# can you also do a plot of the difference between ob and ex, versus
# the class number (which you can generate using a "seq( ... )" command)?
# this would be a reasonable way to diagnose where the lack of fit is.