The purpose of this assignment is to give you some experience with the UCSC Genome Browser, with whole-genome analyses, and with comparative sequence analysis. You are going to be looking for genomic regions that are extremely well conserved across the 33 placental mammals available in the latest UCSC Browser database.
I have arranged the assignment in steps below, so that you can go as far as your time and interest permit. You don't have to complete all the steps, but the more steps you complete, the higher your score will be.
The phyloP files are divided into blocks, each beginning with a header line that looks like "fixedStep chrom=chrY start=14818 step=1". The i-th number in the block that follows is the phyloP value for human coordinate start+i-1. For efficiency, I suggest that you process blocks as follows. Use a 1-dimensional array "sum" whose i-th entry is the sum of the first i phyloP scores in the current block, with sum[0] = 0. When you read the i-th score, you can add it to sum[i-1] to get sum[i], and, once i is at least 200, you can also compute the average phyloP score for the 200-bp region ending at this position by (sum[i] - sum[i-200])/200.
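As a rough illustration of the running-sum idea, here is a minimal Python sketch. It assumes the file has already been opened as an iterable of text lines, that every block uses step=1, and that each non-header line holds exactly one score. The function name window_averages and its "window" parameter are my own choices for illustration, not part of any UCSC tool.

```python
from typing import Iterable, Iterator, Tuple

WINDOW = 200  # window length taken from the (sum[i] - sum[i-200])/200 formula


def window_averages(
    lines: Iterable[str], window: int = WINDOW
) -> Iterator[Tuple[str, int, int, float]]:
    """Yield (chrom, first, last, average) for every full window in each block.

    Coordinates are 1-based and inclusive, following start+i-1 for the
    i-th score of a block. Assumes step=1 (not checked here).
    """
    chrom, start = None, 0
    sums = [0.0]  # sums[i] = sum of the first i scores in the current block
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Header such as: fixedStep chrom=chrY start=14818 step=1
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            start = int(fields["start"])
            sums = [0.0]  # a new block restarts the running sum
            continue
        sums.append(sums[-1] + float(line))
        i = len(sums) - 1  # number of scores read so far in this block
        if i >= window:
            average = (sums[i] - sums[i - window]) / window
            last = start + i - 1
            yield chrom, last - window + 1, last, average
```

To use it on a downloaded file, you could pass an open (possibly gzip-decompressed) file object and keep only the windows whose average exceeds whatever cutoff you settle on. The sketch stores the whole "sum" array for a block, as described above. If memory becomes a problem, keeping only the last 200 entries would work as well.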
The phyloP files are large (hundreds of megabytes each), and you have about 3 billion numbers to process over the whole genome. If it turns out for some reason that these are too big for your computer to handle, then just do chromosome 14 rather than the whole genome. There are instructions for how to download these big files efficiently at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/.
For each extremely conserved region you discover, report it as a pair (location, score),
where "location" gives the human coordinates in the form chr14:100213-100412 and "score" is the average phyloP score for this region. When you first output regions, you will want to check some of them in the UCSC Genome Browser, looking at the phyloP placental mammal track to ensure that it does indeed look high across your region. Be sure you choose the Human Feb. 2009 (GRCh37/hg19) Assembly in the browser, because your human coordinates won't be correct for earlier assemblies.
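For the report itself, something like the following small Python helper could turn a region into one output line. This is only a sketch: the name format_region, the tab separator, and the three decimal places are assumptions of mine, not requirements of the assignment.

```python
def format_region(chrom: str, first: int, last: int, score: float) -> str:
    """Return one 'location<TAB>score' line for a 1-based, inclusive region."""
    if last < first:
        raise ValueError("region end precedes region start")
    return f"{chrom}:{first}-{last}\t{score:.3f}"


def region_length(first: int, last: int) -> int:
    """Length in bases of an inclusive region, e.g. 100213-100412 is 200."""
    return last - first + 1
```

The length check is handy when comparing against the browser: a location such as chr14:100213-100412 pasted into the position box should span exactly 200 bases.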
Add one of the annotations "coding", "UTR", "intron", or "intergenic" to each of your conserved regions. In the first three cases, also provide the name of the human gene. In the intergenic case, give the distance to the closest human gene. If there are isoforms listed that would give contradictory annotations, list all possible annotations. For example, one of your regions might be labeled both "coding" and "intron" if it occurs in a coding exon that is alternatively spliced out.
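One possible way to automate the annotation is sketched below in Python. Everything about the gene model here is hypothetical: the Transcript class, its field names, and the convention of 1-based inclusive coordinates are assumptions for illustration, and you would have to fill such records from whatever gene table you download. The sketch also assumes protein-coding transcripts whose exons do not overlap one another, and it ignores any part of a region that sticks out past the end of a transcript.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transcript:
    """Hypothetical minimal isoform record, 1-based inclusive coordinates."""
    gene: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exons: List[Tuple[int, int]]


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    return a_start <= b_end and b_start <= a_end


def annotate(first: int, last: int, transcripts: List[Transcript]):
    """Return every (label, gene) that some isoform supports for the region.

    If no isoform overlaps the region, return [("intergenic", distance)],
    where distance is the gap to the nearest transcript (None if there are
    no transcripts at all).
    """
    labels = set()
    nearest = None
    for t in transcripts:
        if not overlaps(first, last, t.tx_start, t.tx_end):
            gap = t.tx_start - last if t.tx_start > last else first - t.tx_end
            if nearest is None or gap < nearest:
                nearest = gap
            continue
        covered = 0  # bases of the region that fall inside exons of t
        for ex_start, ex_end in t.exons:
            lo, hi = max(first, ex_start), min(last, ex_end)
            if lo > hi:
                continue
            covered += hi - lo + 1
            if overlaps(lo, hi, t.cds_start, t.cds_end):
                labels.add(("coding", t.gene))
            if lo < t.cds_start or hi > t.cds_end:
                labels.add(("UTR", t.gene))
        tx_lo, tx_hi = max(first, t.tx_start), min(last, t.tx_end)
        if covered < tx_hi - tx_lo + 1:
            labels.add(("intron", t.gene))
    if labels:
        return sorted(labels)
    return [("intergenic", nearest)]
```

Because labels are collected across all isoforms, a region lying in a coding exon that another isoform splices out comes back with both "coding" and "intron", which is the situation described above.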
Send your results to tompa AT cs.washington.edu. I would prefer to get your results as an Excel spreadsheet with one region per line, but it is fine to send them as a text file with one region per line.