Genome 560 Statistics for Genomicists Spring 2011 J. Felsenstein Exercise 2 Let's try these operations. I won't show the results here -- you will see them. I will comment, after some of the steps % like this but you don't want to type those comments! Note that all of these functions have "manual pages" that can be seen if you type their name preceded by a "?" character: such as ?mean You will need the data set "genes1000" in the folder from which you run R. You can download it from the web page of the course. 1. Read in some data, looking at it and its descriptive statistics. x <- read.table("genes1000") % read in a "data frame" of 1000 logged gene % expression levels for genes in one individual a <- as.numeric(as.matrix(x[2:1001,2])) % ... and put the second column into % a vector (yes, this is obscure!) a % print out the vector "a" of 1000 numbers. Big! hist(a, nclass=40) % look at a histogram of 40 classes of "a" % you can close that window or leave it open mean(a) % now for some descriptive statistics: the mean var(a) % ... and the variance sd(a) % ... and the standard deviation median(a) % ... and the median quantile(a, 0.025) % ... and the lower 2.5% point quantile(a, 0.975) % ... and the upper 2.5% point summary(a) % a bunch of these (min, 25%, median, 75%, max) boxplot(a) % John Tukey's "boxplot" of quantiles and outliers % also called a "box-and-whiskers plot". It % shows the minimum, 25% point, median, 75% point, % and maximum. It uses some arbitrary rules to % decide which points are too far up or down and % should be shown as outliers (circles). If you % don't want it to decide that, use this command: boxplot(a,range=0) % ... see? % you can close those windows if you wish, or not aa <- ecdf(a) % the empirical cumulative distribution function 2. Some special ways of making vectors you may want b <- sort(a) % that sorts "a" in ascending order d <- seq(1,1000) % a vector of integers 1 to 1000 (Note: never use % a variable named "c") e <- seq(10,100,0.1) % a vector of numbers 10, 10.1, 10.2, ..., 99.1, 100.0 f <- c(2,4,3.3) % a vector of three numbers, 2, 4, 3.3 g <- rep(c(2,4,3.3), times=30) % a vector with those three repeated 30 times 3. Plotting commands Plotting in R is fun (and a bit crazy-making too). Let's try some methods: (This is plotting a single vector. Plotting two vectors against each other is more useful, see below). plot(a) % plots the 1000 points against the integers 1 ... 1000 % What is it? hard to look at, isn't it? Not much use. plot(a, type="l") % connects them by lines instead. Still useless b <- sort(a) % sort them in order plot(b, type="l") % now we have a (sideways) cumulative distribution! plot(b) % or this way ... d <- as.numeric(as.matrix(x[2:1001,3])) % also make a vector of the replicate % measurements which are in column % three of the data table plot(d,a,type="p") % and plot as points one against the other % (with d as X and a as Y). OK, so suppose that ... jpeg("myfile.jpg") % we want to plot into a file (in JPG graphics format) % instead of "jpeg" we could have said "pdf" or even % "postscript" but then we probably would want to call % the file "myfile.pdf" or "myfile.ps" plot(d,a,type="p") % so we do the plot again and it goes to that file dev.off() % turn off that graphics device so you see plots again % it will say "null device" but that's OK. 4. OK, if you want to stop, remember, you say: q() and you say "y" (yes, save all these variables to be available next time)