Homework #2: Comparative Sequence Analysis

Genome 541: Introduction to Computational Molecular Biology
April 8, 2010
due: Sunday, April 18, 2010

The purpose of this assignment is to give you some experience with the UCSC Genome Browser, with whole-genome analyses, and with comparative sequence analysis. You are going to be looking for genomic regions that are extremely well conserved across the 33 placental mammals available in the latest UCSC Browser database.

I have arranged the assignment in steps below, so that you can go as far as your time and interest permit. You don't have to complete all the steps, but the more steps you complete, the higher your score will be.

  1. Download the phyloP placental mammal conservation scores. These are available in 24 files (one for each chromosome) of the form http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/placentalMammals/chr*.phyloP46way.placental.wigFix.gz, where you have to substitute the chromosome name for "*" in this filename. Write a program that reports all nonoverlapping regions of length 200 bp for which the average phyloP score is at least 2.35. You should treat any negative phyloP score as though it were 0 in computing this average (for reasons you can learn by reading the phyloP paper if you are curious).

    The phyloP files are divided into blocks that begin with a header line that looks like "fixedStep chrom=chrY start=14818 step=1". The i-th number in the block that follows is the phyloP value for human coordinate start+i-1. For efficiency, I suggest that you process blocks as follows. Use a 1-dimensional array "sum" whose i-th entry is the sum of the first i phyloP scores in the current block. When you read the i-th score, you can add it to sum[i-1] to get sum[i], and you can also compute the average phyloP score for the region ending at this position by (sum[i] - sum[i-200])/200.

    The phyloP files are large (hundreds of megabytes each), and you have about 3 billion numbers to process over the whole genome. If it turns out for some reason that these are too big for your computer to handle, then just do chromosome 14 rather than the whole genome. There are instructions for how to download these big files efficiently at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP46way/.

    For each extremely conserved region you discover, report it as a pair
    location score
    where "location" gives the human coordinates in the form chr14:100213-100412 and "score" is the average phyloP score for this region. When you first output regions, you will want to check some of them in the UCSC Genome Browser, looking at the phyloP placental mammal track to ensure that it does indeed look high across your region. Be sure you choose the Human Feb. 2009 (GRCh37/hg19) Assembly in the browser, because your human coordinates won't be correct for earlier assemblies.

  2. For each extremely conserved region output in step 1, write a program that determines whether it overlaps a human coding exon, whether it overlaps a UTR exon, whether it is contained within an intron, or whether it is contained in an intergenic region. You can determine these gene locations from the single file http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz, which provides human gene annotations. To understand the format of this file, go to the UCSC Table Browser, select
    clade: Mammal
    genome: Human
    assembly: Feb. 2009 (GRCh37/hg19)
    group: Genes and Gene Prediction Tracks
    track: UCSC Genes
    table: knownGene
    and click on "describe table schema".

    Add one of the annotations "coding", "UTR", "intron", or "intergenic" to each of your conserved regions. In the first three cases, also provide the name of the human gene. In the intergenic case, give the distance to the closest human gene. If there are isoforms listed that would give contradictory annotations, list all possible annotations. For example, one of your regions might be labeled both coding and intronic if it occurs in a coding exon that is alternatively spliced out.

  3. The regions that you have discovered are all exceptionally well conserved across nearly all 33 placental mammals. Based on the regions you have discovered, can you provide some plausible hypotheses for the biological function of such extreme conservation? For instance, I'm guessing that you will only find a few hundred exonic examples across the whole genome, so simply hypothesizing that coding regions must be well conserved would not explain why so few exons appear in your list.

Send your results to tompa AT cs.washington.edu. I would prefer to get your results as an Excel spreadsheet with one region per line, but it is fine to send it as a text file with one regions per line.