Genome 560 Statistics for Genomicists
Spring 2011 J. Felsenstein
Exercise 2
Let's try these operations. I won't show the results here -- you will
see them. I will comment, after some of the steps % like this
but you don't want to type those comments!
Note that all of these functions have "manual pages" that can be
seen if you type their name preceded by a "?" character: such as ?mean
You will need the data set "genes1000" in the folder from which you run R.
You can download it from the web page of the course.
1. Read in some data, looking at it and its descriptive statistics.
x <- read.table("genes1000") % read in a "data frame" of 1000 logged gene
% expression levels for genes in one individual
a <- as.numeric(as.matrix(x[2:1001,2])) % ... and put the second column into
% a vector (yes, this is obscure!)
a % print out the vector "a" of 1000 numbers. Big!
hist(a, nclass=40) % look at a histogram of 40 classes of "a"
% you can close that window or leave it open
mean(a) % now for some descriptive statistics: the mean
var(a) % ... and the variance
sd(a) % ... and the standard deviation
median(a) % ... and the median
quantile(a, 0.025) % ... and the lower 2.5% point
quantile(a, 0.975) % ... and the upper 2.5% point
summary(a) % a bunch of these (min, 25%, median, 75%, max)
boxplot(a) % John Tukey's "boxplot" of quantiles and outliers
% also called a "box-and-whiskers plot". It
% shows the minimum, 25% point, median, 75% point,
% and maximum. It uses some arbitrary rules to
% decide which points are too far up or down and
% should be shown as outliers (circles). If you
% don't want it to decide that, use this command:
boxplot(a,range=0) % ... see?
% you can close those windows if you wish, or not
aa <- ecdf(a) % the empirical cumulative distribution function
2. Some special ways of making vectors you may want
b <- sort(a) % that sorts "a" in ascending order
d <- seq(1,1000) % a vector of integers 1 to 1000 (Note: never use
% a variable named "c")
e <- seq(10,100,0.1) % a vector of numbers 10, 10.1, 10.2, ..., 99.1, 100.0
f <- c(2,4,3.3) % a vector of three numbers, 2, 4, 3.3
g <- rep(c(2,4,3.3), times=30) % a vector with those three repeated 30 times
3. Plotting commands
Plotting in R is fun (and a bit crazy-making too). Let's try some methods:
(This is plotting a single vector. Plotting two vectors against each other
is more useful, see below).
plot(a) % plots the 1000 points against the integers 1 ... 1000
% What is it? hard to look at, isn't it? Not much use.
plot(a, type="l") % connects them by lines instead. Still useless
b <- sort(a) % sort them in order
plot(b, type="l") % now we have a (sideways) cumulative distribution!
plot(b) % or this way ...
d <- as.numeric(as.matrix(x[2:1001,3])) % also make a vector of the replicate
% measurements which are in column
% three of the data table
plot(d,a,type="p") % and plot as points one against the other
% (with d as X and a as Y). OK, so suppose that ...
jpeg("myfile.jpg") % we want to plot into a file (in JPG graphics format)
% instead of "jpeg" we could have said "pdf" or even
% "postscript" but then we probably would want to call
% the file "myfile.pdf" or "myfile.ps"
plot(d,a,type="p") % so we do the plot again and it goes to that file
dev.off() % turn off that graphics device so you see plots again
% it will say "null device" but that's OK.
4. OK, if you want to stop, remember, you say: q()
and you say "y" (yes, save all these variables to be available next time)