Genome 560

Statistical Genomics (actually it should be called Statistics for Genomicists)

Spring, 2011

Table of contents:
What this course is and isn't
News about the course
A rough syllabus (to be improved)
Lecture PDFs
Audio recordings of lectures
Books
The R language
R in this course (and R exercises to be done in class)
Homework assignments
The lecturer


What this course is and isn't

I inherited this course, and I'm not entirely sure what it is supposed to be. Its title implies that it will tell you about advanced statistical methods in genomics. It won't. It will not tell you about:

and all sorts of other advanced genomic topics.

What it does seem to be is a statistics refresher, for genomics students who have had a “cookbook” statistics course. For those who have not had any statistics, it is a minimal introduction. But I do not think that it can be considered an adequate substitute for a real statistics course. Genome Sciences graduate students ought to have a full statistics course.

There are three levels of statistics course at most universities. Here is a slightly tongue-in-cheek summary of each of them:

  1. “Cookbook” courses where large numbers of unwilling students are pressured into learning statistics by rote, usually taught by a teacher who is not there voluntarily either. A recipe for disaster. The convoy moves at the speed of its slowest ship, so days are spent on what a histogram is. You can detect whether a course is a cookbook course by looking in the Index of its textbook. If neither “likelihood” nor “moment-generating function” is mentioned, that is a cookbook course.
  2. An upper-undergraduate statistics course that tries to explain some of the logic. This is probably the level you would want. Textbooks at this level do have index entries for “likelihood” but usually not for “moment-generating function”. One difficulty is that the examples used in the course are probably not similar to the problems we encounter in genomics, but may instead be motivated by other fields such as engineering or medical clinical trials. Here at UW this level of course is represented by Statistics / Math 390. The Statistics 340-341-342 sequence is at that level, but is a more complete mathematical statistics sequence (see below).
  3. A full graduate-level mathematical statistics course. The textbook index mentions both “likelihood” and “moment-generating function”. These sequences are generally full-year courses intended to get a budding mathematical statistician up to speed. Some of the methods taught are not methods of data analysis but methods of proving theorems (such as how to prove convergence of a quantity to a particular distribution). This may even leave you feeling wise and powerful but unable to analyze any actual data.

Here is what you need to do in addition to this course: take a real statistics course. Here is what you should not do after taking this course: tell your thesis committee “oh yes, I've had statistics!” If you do I will come after you and personally lash you with a wet noodle.


News about the course

News items are listed here with the newest ones last.


A rough syllabus (to be improved)

  1. (May 3) Probability. Stochastic processes (coins, phone calls, normals). Distributions (uniform, binomial, hypergeometric, geometric, exponential, Poisson, normal, lognormal).
  2. (May 5) Distributions, cont'd. Histograms, etc. Quantiles, distributions of functions (multiples, averages, sums, sums of squares, differences). Practice R.
  3. (May 10) Confidence intervals, t-test, experimental design, tests
  4. (May 12) Chi-squares, contingency tables
  5. (May 17) Regression, curve fitting
  6. (May 19) ANOVA, F-tests
  7. (May 24) Bayesian inference, likelihood
  8. (May 26) Jackknife, bootstrap, permutation tests, cross-validation
  9. (May 31) Multiple testing: Bonferroni and its modifications, FDR
  10. (June 2) Principal (no, not “principle”) components and SVD

Lecture PDFs

The lecture PDFs will be posted here. One from this year and others from last year are linked here now (the latter are marked here as "old").


Audio recordings of lectures

The lectures will be recorded and made available here as WMA and MP3 files. They will be recorded at medium quality, with each file about 10 MB in size. Each file is named for the date on which it was recorded, such as 20110503.WMA or 20110503.mp3.

May 3: 20110503.WMA, 20110503.mp3
May 5: 20110505.WMA, 20110505.mp3
May 10: 20110510.WMA, 20110510.mp3
May 12: 20110512.WMA, 20110512.mp3
May 17: 20110517.WMA, 20110517.mp3
May 19: 20110519.WMA, 20110519.mp3
May 24: 20110524.WMA, 20110524.mp3
May 26: 20110526.WMA, 20110526.mp3
May 31: 20110531.WMA, 20110531.mp3
June 2: 20110602.WMA, 20110602.mp3


Books

There is no textbook for the course. Josh Akey, in last year's web pages, lists some books and a number of on-line statistics texts available free on the web. They are

In fact, a whole bunch of on-line statistics textbooks will be found if you Google: "online statistics text"

Josh's 2008 course web pages are excellent, especially his lecture PDFs. Although the order of material is different, they are very much worth looking at. They are here


The R language

R is a free interactive computer environment (in old-fashioned terms, an "interpreter") that can be used for many purposes. It was originally designed by statisticians (R is a clone of a language called S, which is now commercial). It has many built-in statistics functions, which is why we will use it. (At the main CRAN-R project site there are links to many other analysis packages that can be loaded into R.)

R can be downloaded and installed on Windows, Mac OS X, or Linux machines (and some other types as well). It is available at the CRAN-R site here as executables and as source code, along with many other resources including a terse PDF introductory manual. When using this manual, skip over parts that go too deeply into stuff you don't yet understand, as there is valuable stuff after that. Come back to the skipped stuff later.

Here are some R resources that you may find helpful:

Is R great? For many things, yes. Is it good at everything? I would say that its array operations stand a good chance of driving the puzzled beginner absolutely bonkers, so no. In this it reminds me of a now-fortunately-defunct programming language called APL ("A Programming Language") which could do many things interactively, had fervent evangelists, was mostly about arrays, and drove me absolutely bonkers. I wrote this assessment two years ago. Since then I have learned more R, but I do not see any reason to change what I said.


R in this course (and R exercises to be done in class)

We will do an R exercise in each class session. Students are expected to bring a laptop with R loaded on it (kudos to this year's class for doing this successfully). I will have exercise sheets for each class, and we will try to do them. As I make them I will post them here.


Homework assignments

Homework assignments will be posted here on Thursdays, to be turned in by email to me by the following Thursday:


The lecturer

Who is this guy who is teaching the course, anyway?


this page maintained fitfully by Joe Felsenstein