Genome 560
Statistical Genomics (actually it should be called Statistics for Genomicists)
Spring, 2011
What this course is and isn't
I inherited this course, and I'm not entirely sure what it is supposed to
be. Its title implies that it will tell you about advanced statistical
methods in genomics. It won't. It will not tell you about:
 probabilities of sequence motifs
 distributions of restriction sites
 ways to test differences in base composition
 Markov chain Monte Carlo methods
 Hidden Markov Models
 EM algorithms
 Phylogeny inference methods
 coalescents
 statistical aspects of population genetics
 pedigree analysis
 tracts of identity by descent
 junctions approaches to inbreeding
 genome rearrangement statistics
and all sorts of other advanced genomic topics.
What it does seem to be is a statistics
refresher, for genomics students who have had a “cookbook”
statistics course. For those who have not had any statistics, it is a
minimal introduction. But I do not think that it can be considered
an adequate substitute for a real statistics course. Genome Sciences
graduate students ought to have a full statistics course.
There are three levels of statistics course at most universities. Here
is a slightly tongue-in-cheek summary of each of them:
 “Cookbook” courses where large numbers of unwilling students
are pressured into learning statistics by rote, usually taught by
a teacher who is not there voluntarily either. A recipe for disaster.
The convoy moves at the speed of its slowest ship, so days are spent
on what a histogram is. You can detect whether a course is a cookbook
course by looking in the Index of its textbook. If neither
“likelihood” nor “moment-generating function” is
mentioned, that is a cookbook course.
 An upper-undergraduate statistics course that tries to explain
some of the logic. This is probably the level you would want. The
textbooks do have index entries for “likelihood” but usually not for
“moment-generating function”. One difficulty is that
the examples used in the course are probably not similar to the problems we
encounter in genomics, but may instead be motivated by other fields such
as engineering or medical clinical trials.
Here at UW this level of course is represented by Statistics / Math 390.
The Statistics 340-341-342 sequence is at that level, but is a more complete
mathematical statistics sequence (see below).
 A full graduate-level mathematical statistics course. The textbook
index mentions both “likelihood” and “moment-generating
function”. These sequences are generally full-year courses intended to
get a budding mathematical statistician up to speed. Some of the methods
taught are not methods of data analysis but methods of proving theorems (such
as how to prove convergence of a quantity to a particular distribution). This
may even leave you feeling wise and powerful but unable to analyze any actual data.
Here is what you need to do in addition to this course: take a real
statistics course. Here is what you should not do after taking this
course: tell your thesis committee “oh yes, I've had statistics!”
If you do, I will come after you and personally lash you with a wet noodle.
News about the course
These news items are listed with the newest ones last.
 The course meets in S110 Foege Building, the seminar room on the
1st floor of Foege (its door is across from the white phone in the hall).
It meets at 9:00-10:20 Tuesdays and Thursdays, May 2 onwards.
 There is a course mailing list. Registered students are on it
automatically. For others, to join it or read past postings,
go to this link. It requires a UW login, and past postings can be read only once
you are a list member. Members can post to the mailing list. I will try to
keep anyone from abusing the list and sending spam or really off-topic stuff.
 The course will be graded on homework assignments assigned here
each Thursday, to be turned in the following Tuesday. They will make use
of the R package.
 The R statistical package will be used in homeworks. Students
should load it onto their own laptop and bring the laptop to class, as
we will do exercises, group learning, and group confusion in the last 20-30
minutes of each lecture. R is free and can be downloaded from the CRAN-R
web page here. A brief introduction document to R is available there as a PDF.
Here is another good set of pages that introduces R.
Two good quick example sheets were made up by Josh Akey two years ago. They are
An R tutorial and An R descriptive statistics tutorial.
They show you the commands to do some relevant things, but do
not show you what appears after you type in the commands. I recommend that
you try typing in these commands yourself; it will be quite helpful.
 The grades were submitted (on time). I have posted a histogram
of the grade point totals (out of 100) with some indication of which ones
got which grades. The histogram is a PDF that can be found here.
 Later today I hope to have the homework papers returned to the students'
mailboxes in Foege. If you do not have a mailbox there, please email me as
to where to send the homework papers.
A rough syllabus (to be improved)
 (May 3) Probability. Stochastic processes (coins, phone calls, normals)
Distributions (uniform, binomial, hypergeometric, geometric, exponential, Poisson, normal, lognormal)
 (May 5) Distributions, cont'd. Histograms, etc. Quantiles, distributions of
functions of (multiples, averages, sums, sums of squares, differences)
Practice R
 (May 10) Confidence intervals, t-test, experimental design, tests
 (May 12) Chi-squares, contingency tables
 (May 17) Regression, curve fitting
 (May 19) ANOVA, F-tests
 (May 24) Bayesian inference, likelihood
 (May 26) Jackknife, bootstrap, permutation tests, cross-validation
 (May 31) Multiple testing: Bonferroni, modifications of, FDR
 (June 2) Principal (no, not “principle”) components and SVD
Lecture PDFs
The lecture PDFs will be posted here.
One from this year and others from last year are linked here now (the latter
are marked here as "old").
Audio recordings of lectures
The lectures will be recorded and made available as WMA and as MP3 files
here. They will be recorded at medium quality, with files about 10 MB in size.
Each file has a name which is the date on which it was recorded, such as
20110503.WMA or 20110503.mp3.
Books
There is no textbook for the course. Josh Akey, in last year's
web pages, lists some books and a number of online statistics
texts available free on the web.
In fact, a whole bunch of online statistics textbooks will be
found if you Google: "online statistics text"
Josh's 2008 course web pages are excellent, especially his lecture
PDFs. Although the order of material is different, they are
very much worth looking at.
They are here.
The R language
R is a free interactive computer environment (in old-fashioned terms,
an "interpreter") that can be used for many purposes. It was originally
designed by statisticians (R is a clone of a language called S, which is
now commercial). It has many built-in statistics functions, which is
why we will use it. (At the main CRAN-R project site there are links to
many other analysis packages that can be loaded into R.)
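To give a feel for what "built-in statistics functions" means in practice, here is a minimal sketch using only functions that come with a standard R installation. The data are simulated and the seed, sample size, and variable names are arbitrary choices for illustration, not anything taken from the course materials.

```r
# Simulated data only: 20 draws from a normal distribution (not course data).
set.seed(560)                            # arbitrary seed so the run is repeatable
x <- rnorm(20, mean = 10, sd = 2)

xbar <- mean(x)                          # sample mean
s    <- sd(x)                            # sample standard deviation
qs   <- quantile(x, c(0.25, 0.5, 0.75))  # quartiles

# One-sample t-test of the null hypothesis that the true mean is 10
tt <- t.test(x, mu = 10)

print(c(mean = xbar, sd = s))
print(qs)
print(tt$conf.int)                       # 95% confidence interval for the mean
```

Typing these lines one at a time at the R prompt, and looking at what each one prints, is probably more instructive than running them all at once.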
R can be downloaded and installed on Windows, Mac OS X, or Linux machines
(and some other types as well). It is available at the CRAN-R site here as executables and source code, along with
many other resources including a terse PDF introductory manual. When
using this manual, skip over parts that go too deeply into stuff
you don't yet understand, as there is valuable stuff after that.
Come back to the skipped stuff later.
Here are some R resources that you may find helpful:
 Here is another good set of pages that introduces R.
 Christopher Green, of our Statistics Department, produced a good
primer on the use of R, aimed at statistics applications. It can be
downloaded as a PDF here.
 Emmanuel Paradis has made available online a good
introduction to R, R for Beginners, which is also available in a French version.
 This set of (PDF) slides by Thomas Lumley of our Biostatistics Department gives much
useful information about ways to do things in R.
 Two good quick example sheets were made up by Josh Akey two years ago. They are
An R tutorial and An R descriptive statistics tutorial.
They show you the commands to do some relevant things, but do
not show you what appears after you type in the commands. I recommend that
you try typing in these commands yourself; it will be quite helpful.
 In past years Brooks Miner and Sylvia Yang of the Biology Department
have given a workshop on use of R for biologists, toward the end of
Spring Quarter. I don't know whether they will do it again this year (I don't
think they will as they are busy).
Is R great? For many things, yes. Is it good at everything? I would
say that its array operations stand a good chance of driving the
puzzled beginner absolutely bonkers, so no. In this it
reminds me of a now-fortunately-defunct programming language called APL ("A
Programming Language")
which could do many things interactively, had fervent evangelists,
was mostly about arrays,
and drove me absolutely bonkers. I wrote this assessment two years
ago. Since then I have learned more R, but I do not see any reason to change
what I said.
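As an illustration of the kind of array behavior that can puzzle a beginner, here is a small sketch of two standard R behaviors: a one-row subset of a matrix silently turns into a plain vector, and a short vector is silently recycled when combined with a matrix. The matrix and the variable names are made up for this example, and these two quirks are my choice of examples, not necessarily the ones the assessment above had in mind.

```r
# A 2 x 3 matrix, filled column by column: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)

# Taking a single row silently drops the matrix structure
row1 <- m[1, ]
print(dim(row1))                 # NULL: row1 is now a plain vector

# drop = FALSE keeps the result a 1 x 3 matrix
row1m <- m[1, , drop = FALSE]
print(dim(row1m))

# Recycling: c(10, 100) is reused down each column, with no warning,
# so row 1 is multiplied by 10 and row 2 by 100
scaled <- m * c(10, 100)
print(scaled)
```

Neither behavior produces an error or a warning, which is why results like these can go unnoticed until something downstream breaks.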
R in this course (and R exercises to be done in class)
We will do an R exercise in each class session. Students are
expected to bring a laptop with R loaded on it (kudos to this year's
class for doing this successfully). I will have exercise sheets
for each class, and we will try to do them. As I make them I will
post them here.
Homework assignments
Homework assignments will be posted here on Thursdays, to be turned in by
email to me by the following Thursday:
 Homework 1 is available here. It is to be turned
in by the following Thursday, May 12.
 Homework 2 is available here. It is to be turned
in by the following Thursday, May 19.
 Homework 3 is available here. Because it
involves material from the May 24 lecture, it does not need to be turned in
until Thursday, June 2. The contingency table of data on sick people is
available here.
 Homework 4 is available here. The 100-gene data
set is available here. Due Saturday,
June 4.
The lecturer
Who is this guy who is teaching the course, anyway?
this page maintained fitfully by Joe Felsenstein