Department of Genome Sciences, University of Washington, Box 355065, Seattle, Washington 98195-5065
DNATREE is a computer program that simulates the branching of an evolutionary tree, using a model of random branching of lineages. It then evolves a DNA sequence along this tree, and displays the resulting sequences. The user can save the tree and the sequences if they want to use them in other programs. The user can control the expected rate of evolution of the sequences per unit time, and the transition/transversion ratio.
After the tree and the sequences are simulated, the program creates a randomly chosen tree for the same species, one unrelated in structure to the true tree. The user can then rearrange the tree by hand. The resulting trees are displayed as they are made, and for each one the program computes the smallest number of base substitutions that would be needed to evolve the previously simulated data on that tree. The user can then search by hand for the most parsimonious tree, and see whether that recovers the true tree. The number of sites compatible with each tree is also shown, so that the user can alternatively search for the maximum-compatibility estimate of the tree.
DNATREE was created by combining two simulation programs used by our laboratory, together with code from the PHYLIP program DNAMOVE, the interactive DNA parsimony program. The simulation programs are versions of the ones used in the extensive simulations of the behavior of phylogeny methods reported by Kuhner and Felsenstein (1994). Although visually and functionally it is much more primitive, DNATREE can carry out some of the simulation and rearrangement options present in Wayne and David Maddison's excellent program MacClade.
At our web site you will find a number of different kinds of executables for DNATREE, as well as the source code in C.
(If you see the contents of the file on your screen when you choose it, you will want to use the Save function of your browser to save a copy of it.)
You are free to give copies to friends (though you cannot resell it without our permission and the payment of royalties to us). You are also free to modify the source code (the same restrictions on resale apply). The source code is easily recompiled on Unix systems by issuing the command make. On Windows or PowerMac systems compilation is much harder owing to the outrageously incompatible "project" structures of the major C compilers. At this point you should imagine me launching into a tirade about how they assume that no one will ever want to compile the code on anyone else's compiler. (We will omit the details of the tirade here.)
Running DNATREE is fairly easy. For the Windows version and the older PowerMac and 68k Macintosh versions, simply double-click on the DNATREE icon.
The Windows version will also run in a "Command" tool window. It may first appear in a rather small and unreadable window. By clicking on the symbol in the upper-right corner of the window (the one with a single large box) you should be able to get DNATREE to take up the whole screen and be much more visible.
The remainder of this section concerns getting it to run on Unix and Linux. These instructions also apply to the Mac OS X executable. That executable can be run as a Unix program by using the Terminal tool which is in the Applications folder.
For these versions, make sure that the file has "execute permission". It will already have this if you made it by compiling it yourself; if not, all you need to do is issue the command
chmod +x dnatree.linux
(or whatever the name of the executable file is). This only has to be done once - the file will thereafter have execute permission and you need not do this any other time you log in. To run the executable once it has execute permission, just type the name of the file (such as dnatree.linux). You can simplify things by renaming the file to dnatree by a command such as
mv dnatree.linux dnatree
Again, that only needs to be done once. If you compiled the program yourself, the executable is already called dnatree.
To run the program, type its name. If your "path" does not include the current directory, you may instead have to type
./dnatree
where the "./" ensures that the executable comes from the current directory.
The menus. DNATREE has three menus that the user must use. The first one sets the parameters for simulating the branching of the tree. The second one sets the parameters for the evolution of the DNA along that tree, and the third one sets the parameters for the interactive search for the most parsimonious tree.
Simulating the branching of the tree. The first menu looks something like this:
DNATREE   Molecular evolution and phylogeny   version 1.2

Tree menu:
  S   Simulate a tree?                       Yes, simulate
  N   Number of species?                     10
  C   Number of sites?                       200
  T   Transition/transversion ratio?         2.000000
  0   Graphics type (IBM PC, ANSI, none)?    ANSI

Type character or type Y to accept these settings
Each of these lines tells of one option, shows its current setting, and gives you the opportunity to change it by typing a single letter (followed by pressing on the Enter key), S, N, C, T, or 0.
For the time being, simulating a tree is the only option provided, so the first menu item need not be changed. For the next two items, you can simulate trees of any size up to 100 species and DNA sequences of up to 2000 sites. For an initial run you might stick to the default values of 10 and 200. For the fourth item, the transition/transversion ratio, we have provided a default value that is typical of mammalian DNA, but you can choose another value if you want, anywhere from 0.5 up. The Graphics type option (0) is probably set correctly for your screen, but if the trees that are printed out have strange characters you will want to try other values for this option.
If you change any of the values, the menu will be redisplayed. Once you have decided that the values are correct and want to go ahead, select the Y (Yes) option. The program will then ask you for two numbers, the random number seed and the rate of evolution. The random number seed controls the random decisions that the program makes. It can be any number from 1 up to about 1000000000. It will be wise for you to remember the number you used.
The random number seed is used to start the sequence of random numbers that are generated. A property of pseudorandom numbers is that they are not predictable one from the other. Thus there are no rules as to what particular numbers lead to what particular kinds of results. If you choose a random number seed of (say) 38, you need not worry about whether you have somehow biased the results as compared to choosing (say) 692548236. But the sequence is deterministic. Thus if you do another run with the same parameter settings and the same random number seed, you should get the same results from DNATREE even on a different computer. This gives you the ability to duplicate any run that gives strange outcomes, if you happen to have written down the random number seed.
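The reproducibility described above is a general property of seeded pseudorandom generators. Here is a minimal Python sketch of the idea. It uses Python's own random module, not DNATREE's generator (which is not shown in this document), and simulate_bases is a hypothetical helper invented for the illustration.

```python
import random

def simulate_bases(seed, n=10):
    """Draw n random bases from a generator started at the given seed.

    Hypothetical helper: it only illustrates seeding, and does not
    reproduce the random number generator used inside DNATREE.
    """
    rng = random.Random(seed)  # private generator, seeded explicitly
    return "".join(rng.choice("ACGT") for _ in range(n))

# The same seed gives the same sequence of "random" decisions.
run1 = simulate_bases(38)
run2 = simulate_bases(38)
print(run1 == run2)  # True
```

Running it again with the seed 692548236 gives a different but equally repeatable string of bases.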
The rate of evolution is a positive number. It specifies what the expected number of substitutions per site will be per unit time. A unit of time is the time it takes, on average, for a lineage to split. Thus if you use the value 0.1, each site will accumulate approximately 0.1 substitutions along each interior branch in the tree (but this will vary with the exact length of the branch). The terminal branches (the ones that lead to tips) will accumulate fewer changes as they do not continue until they branch.
If you choose a small value (say 0.01) of the evolutionary rate, the sequences will be very similar and differ at few positions. There will scarcely be any sites that will have undergone two or more substitutions in the course of the evolution of these sequences. But if you choose a larger value such as 10.0, you will see so many changes that there will be little similarity between the sequences, and a lot of changes whose effects overlay each other. There will then be a lot of evolutionary noise that will obscure the relatedness of the sequences.
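To get a feeling for how the rate affects overlaid changes, one can treat the number of substitutions at a site along one branch as a Poisson variable whose mean is the rate times the branch length. That is an assumption made for this sketch (it is what a continuous-time substitution model implies), and hit_probabilities is a hypothetical helper, not part of DNATREE.

```python
import math

def hit_probabilities(rate, branch_length):
    """Return P(no change), P(one change), P(two or more changes) at one
    site on one branch, assuming a Poisson number of substitutions with
    mean rate * branch_length (an assumption of this sketch)."""
    m = rate * branch_length
    p0 = math.exp(-m)
    p1 = m * math.exp(-m)
    return p0, p1, 1.0 - p0 - p1

# Compare a low, a moderate, and a high rate on a branch one time unit long.
for rate in (0.01, 0.1, 10.0):
    p0, p1, p2 = hit_probabilities(rate, 1.0)
    print(f"rate {rate:5.2f}: none {p0:.4f}, one {p1:.4f}, two or more {p2:.4f}")
```

At a rate of 0.01 almost no site is hit twice on such a branch, while at 10.0 nearly every site is.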
The tree that is printed out will look something like this:
                                                       ,-1:D
  ,---------------------------------------------------19
  !                                                    `10:G
  !
  !                                            ,---------2:H
  !                                        ,--15
  !                               ,-------14   `---------6:E
-11                               !        !
  !                               !        `-------------5:F
  !      ,-----------------------13
  !      !                        !                   ,-----4:J
  !      !                        !                ,-18
  !      !                        `---------------17  `-----9:A
  `-----12                                         !
         !                                         `-----8:C
         !
         !                                     ,---------3:I
         `------------------------------------16
                                               `---------7:B
The horizontal branch lengths are roughly proportional to the evolutionary time that elapsed in the simulation. All the lineages evolved for the same amount of time, measured from the root of the tree at the left. There was a "molecular clock" in that all lineages existed for equal amounts of time, and all evolved at the same average rate. But note that in the above diagram, there seems to be a violation of that, as J and A stick out farther to the right than do the other species. This is because the line from node 17 to node 18 was actually very short, but it has been expanded a bit to make it possible to read the node numbers. Other small inequalities in the horizontal position of the tips can result from rounding off of branch lengths to the discrete number of columns in your page.
Important note: At the moment, DNATREE has no way of saving the original tree, except by writing it into a file in the standard parenthesis notation. This you may find hard to read, so if you want to compare your original tree to the parsimony estimate, I strongly recommend that you write the tree down by hand. You may miss it later on if you do not. On Macintosh systems you should be able to print a copy of the screen by using the Print command in the File menu.
You will be asked whether you want to write the tree out into a file. You need not do this, unless you want to take this tree and use it as input to some other program. So most people will want to answer N (No).
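For readers who have not met the parenthesis notation, here is a small Python sketch that writes the branching pattern of the example tree above in that form. The nested-tuple representation and the to_newick function are inventions of this sketch; the file that DNATREE writes may differ in detail (for instance, it may carry branch lengths), which this document does not show.

```python
def to_newick(node):
    """Write a tree, given as nested tuples of tip names, in parenthesis
    notation (branching pattern only, no branch lengths)."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"

# The example tree printed above, typed in by hand from the diagram.
tree = (("D", "G"), (((("H", "E"), "F"), (("J", "A"), "C")), ("I", "B")))
print(to_newick(tree) + ";")  # ((D,G),((((H,E),F),((J,A),C)),(I,B)));
```

Even for ten species the result is hard to take in at a glance, which is why writing down the diagram is recommended.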
Simulating the evolution of the DNA. Next you will see the first 60 sites of the DNA sequences whose evolution was simulated at an imaginary locus along that tree. The sequences look something like this:
Here are the sequences that were simulated along that tree.
A dot (.) means that the base is the same as in the first species.
Here are sites 1 through 60 out of a total of 200.
-----------------------------------------------------------------------------
A    CGTGGGACAA CGACCATCAA AGCCTCGCGG TCTCCCTCAA TCACTGGTGC CGGTTTCGCA
B    .......... .........C .......... .......... G.G......T A.....T...
C    .......... .........C .......... .......... G......... ..........
D    .......... .........C .......... .......... G.G....... A.....TA..
E    .......... .......T.C G......... .......... G......... A.........
F    .......... .........C .......... .......... G......... ..........
G    ....A.TT.. T.GTT....G ..T.AT...T .......T.C G.G.A.A.AT T.......G.
H    G......... .........C .......... .......... G.....A..T ....C.....
I    T.....T... .......G.C .......... ......G... G......... ..........
J    ....A.TTGG .AGTT....C ..T.A....T .......T.. G.G...A..T T.....A.G.
-----------------------------------------------------------------------------
The names (in this case A through J, but in larger cases they can have two letters) are the tips of the tree that you saw earlier. Each sequence is given as up to six blocks of 10 bases each. The dots (".") are to make the differences easier to detect. They mean "the same as in the first species". Thus at the first site in the above data, all species have "C" except species H and I which have G and T, respectively. If you find the dots hard to look at, you will in a moment get the opportunity to turn them off and show all the letters ACGT at each site (at which point you'll realize why we had the dots).
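The dot convention is easy to state in code. The following Python sketch (dot_difference is a hypothetical helper, not DNATREE's own display routine) replaces every base that matches the first species by a dot, using the first block of 10 sites of species A and H from the display above.

```python
def dot_difference(reference, other):
    """Show `other` with a dot wherever it has the same base as `reference`."""
    return "".join("." if a == b else b for a, b in zip(reference, other))

first = "CGTGGGACAA"                          # species A, sites 1 through 10
print(dot_difference(first, "CGTGGGACAA"))    # ..........
print(dot_difference(first, "GGTGGGACAA"))    # G.........
```

The second line is how species H appears: it differs from species A only at the first site.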
The statistical model of base change that this program uses is the Kimura 2-parameter model. This was introduced by the famous theoretical population geneticist Motoo Kimura (1980). It is the simplest possible model of DNA change that embodies unequal rates of transitions (A <-->G and C <-->T) and transversions (the other changes, in which one base is a purine and the other a pyrimidine). It will lead to (approximately) equal frequencies of the four bases.
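The standard formulas of the Kimura 2-parameter model can be sketched in Python as follows. Two things here are assumptions of the sketch rather than statements about DNATREE's code: that the transition/transversion ratio is the expected number of transitions per transversion, and that time is scaled so that one substitution is expected per site per unit of t. The function name k2p_probabilities is hypothetical.

```python
import math

def k2p_probabilities(t, ratio=2.0):
    """Return (P_same, P_transition, P_each_transversion) for one site
    after time t under the Kimura 2-parameter model.

    Assumptions of this sketch: total rate of one substitution per unit
    of t, and ratio = alpha / (2 * beta).
    """
    alpha = ratio / (ratio + 1.0)   # rate of the single possible transition
    beta = 0.5 / (ratio + 1.0)      # rate of each of the two transversions
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    p_transition = 0.25 + 0.25 * e1 - 0.5 * e2
    p_transversion = 0.25 - 0.25 * e1   # for each of the two transversions
    p_same = 1.0 - p_transition - 2.0 * p_transversion
    return p_same, p_transition, p_transversion

# After a very long time each of the four bases is about equally likely.
print(k2p_probabilities(0.1))
print(k2p_probabilities(1000.0))
```

Under this parameterization a ratio of 0.5 makes alpha equal to beta, so that all changes are equally likely, which would be consistent with 0.5 being the smallest value the menu accepts.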
Below the data display, you will see the menu:
Choose what to do next:
  N   See the next group of sites
  D   Turn off using dot-differencing
  Y   Finish looking at sequences
Type one of these letters, then press on Enter key
The N option shows you successive groups of 60 sites. The P option does the opposite, going back to previous groups of 60. Neither is available when one is near the relevant end of the sequences. The Y option is what you type when you are tired of looking at the sequences and want to move on to the next part of the run.
The program will ask you if you want to save the data in a file. You do not need to do this, unless you want to analyze the data in another program. This program keeps track of the data internally. Thus most users will want to answer N (No) at this point.
Searching for the most parsimonious tree.
After the data has been simulated, it is time to look at possible phylogenies and see how well they fit the data. The program now shows the third menu, which looks like this:
Interactive search for most parsimonious tree

Settings for this run:
  O   Outgroup root?               No, use as outgroup species 1
  T   Use Threshold parsimony?     No, use ordinary parsimony
  L   Number of lines on screen?   24

Are these settings correct? (type Y or the letter for one to change)
You will probably not want to change these settings. If you have more than 12 species, you will want to increase the number of lines on the screen, if your screen can display that many. The outgroup root at species 1 is also arbitrary, but that can be changed later.
If you type Y (Yes) and accept the values, the program then generates a phylogeny at random, and displays it. You are then able to rearrange this, watching the parsimony (or compatibility) scores of the tree, to try to find the most parsimonious (or maximum compatibility) tree by yourself. You can also check on individual sites in the sequences, to see how well they fit the tree. The tree will initially look something like this (appearances vary on different types of computers):
  ,--------------------------3:H
  !
  !  ,----------------------10:D
-12  !
  !  !              ,--------6:G        Steps: 319.00000
  !  !              !
  `-19  ,----------15     ,--9:I
     !  !           !  ,-18
     !  !           `-16  `--7:J        118 sites compatible
     !  !              !
     `-11              `-----2:F
        !
        !           ,--------8:B
        !           !
        `----------17     ,--5:C
                    !  ,-14
                    `-13  `--4:E
                       !
                       `-----1:A

NEXT? (Options: R # + - S . T U W O F C H ? X Q) (H or ? for Help)
The tree grows from left to right, and each node on the tree (each fork or tip) has a number, to allow us to refer to it. The names of the species are also available at the tips. In addition, the numbers of changes of state (steps) that are needed to evolve the data set on this tree are shown. The number of sites that are compatible with the tree (that perfectly fit it) is also shown. The tree that is shown is a random one, not a good estimate.
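The two scores can be illustrated with a short Python sketch of the well-known Fitch counting method for a single site. This is offered as one standard way to get these numbers, not as a description of DNATREE's own code; the functions fitch_steps and is_compatible and the nested-tuple trees are inventions of the sketch. The tree and the two site patterns are copied from the example displays earlier in this document and serve only as sample input.

```python
def fitch_steps(tree, states):
    """Minimum number of changes for one site on a rooted binary tree.

    tree   : a tip name (str) or a pair (left, right) of subtrees
    states : dict mapping tip name -> base observed at this site
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:
            return common        # the two subtrees can agree: no change
        steps += 1               # they cannot agree: one change is needed
        return left | right

    visit(tree)
    return steps

def is_compatible(tree, states):
    """A site fits the tree perfectly if it needs only one change fewer
    than the number of different bases it shows."""
    return fitch_steps(tree, states) == len(set(states.values())) - 1

tree = (("D", "G"), (((("H", "E"), "F"), (("J", "A"), "C")), ("I", "B")))
site1 = dict(zip("ABCDEFGHIJ", "CCCCCCCGTC"))   # site 1 of the example data
site5 = dict(zip("ABCDEFGHIJ", "GGGGGGAGGA"))   # site 5 of the example data

print(fitch_steps(tree, site1), is_compatible(tree, site1))  # 2 True
print(fitch_steps(tree, site5), is_compatible(tree, site5))  # 2 False
```

Summing fitch_steps over all sites would give a Steps total for a tree, and counting the sites for which is_compatible is true would give its number of compatible sites.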
You will probably want to rearrange this tree to find the most parsimonious (or alternatively, the most compatible) tree. Thus you will want to minimize the number of Steps, or maximize the number of sites compatible with the tree. The menu at the bottom of the tree shows a number of commands that you can issue. The most important of these are:
  ccccccccccccccccccccccccccc3:H        SITE 4
  c
  c  ttttttttttttttttttttttt10:D        a:A, c:C, g:G, t:T, .:?
 12  t
  .  t              ccccccccc6:G        Steps: 319.00000
  .  t              c
  ..19  ccccccccccc15     ccc9:I
     .  c           c  cc18
     .  c           cc16  ttt7:J        118 sites compatible
     .  c              c
     ..11              cccccc2:F
        t
        t           ttttttttt8:B
        t           t
        ttttttttttt17     ccc5:C
                    t  tt14
                    tt13  ttt4:E
                       t
                       tttttt1:A

NEXT? (Options: R # + - S . T U W O F C H ? X Q) (H or ? for Help)
Much of the code was written by me, Joe Felsenstein. Large parts of the tree-rearrangement code are by Andrew Keeffe, and the C conversion of that code was done by Akiko Fuseki. This program is used by me for teaching my Genome 453 and Genome 570 courses, and may be useful to others as a teaching tool.
Thanks to Tudor Ionescu for finding a bug and fixing it.
This program and its documentation is copyright to The University of Washington, 1999-2004. Its free distribution and use is permitted, including using multiple copies in classrooms. Resale or use in a commercial product is not permitted without our permission.
If you want to get the PHYLIP package, which is free, you will find it on the Web at this site, and the most comprehensive listing of available phylogeny programs will also be found at that site.
Yes, I know, we should have done it your favorite way. But consider:
Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120.
Kuhner, M. K. and J. Felsenstein. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution 11: 459-468 (Erratum 12: 525 1995).