DNATREE


version 1.3

Joe Felsenstein
Department of Genome Sciences
University of Washington
Box 355065
Seattle, Washington 98195-5065
 
joe @ gs.washington.edu

DNATREE is a computer program that simulates the branching of an evolutionary tree, using a model of random branching of lineages. It then evolves a DNA sequence along this tree, and displays the resulting sequences. The user can save the tree and the sequences if they want to use them in other programs. The user can control the expected rate of evolution of the sequences per unit time, and the transition/transversion ratio.

After the tree and the sequences are simulated, the program creates a randomly-chosen tree for the same species, one unrelated in structure to the true tree. The user can then rearrange the tree by hand. The resulting trees are displayed as they are made, and for each the number of base substitutions is computed that is the smallest number that would be necessary to evolve the previously simulated data on that tree. The user can then search by hand for the most parsimonious tree, and see if that recovers the true tree. The number of sites compatible with each tree is also shown, so that the user can alternatively search for the maximum-compatibility estimate of the tree.

DNATREE was created by combining two simulation programs used by our laboratory, together with code from the PHYLIP program DNAMOVE, the interactive DNA parsimony program. The simulation programs are versions of the ones used in the extensive simulations of the behavior of phylogeny methods reported by Kuhner and Felsenstein (1994). Although visually and functionally it is much more primitive, DNATREE can carry out some of the simulation and rearrangement options present in Wayne and David Maddison's excellent program MacClade.

Getting DNATREE

At our web site you will find a number of different kinds of executables for DNATREE, as well as the source code in C.

http://evolution.gs.washington.edu/dnatree

(If you see the contents of the file on your screen when you choose it, use the Save function of your browser to save a copy of it.)

Version 1.3

Version 1.1 (documentation web page is for 1.2)

You are free to give copies to friends, though you cannot resell the program without our permission and the payment of royalties to us. You are also free to modify the source code (the same restrictions on resale apply). The source code is easily recompiled on Unix systems by issuing the command make. On Windows or PowerMac systems compilation is much harder owing to the outrageously incompatible "project" structures of the major C compilers. At this point you should imagine me launching into a tirade about how they assume that no one will ever want to compile the code on anyone else's compiler. (We will omit the details of the tirade here.)


Getting the executables to run

This is fairly easy. For the Windows version and the older PowerMac and 68k Macintosh versions, simply double-click on the DNATREE icon.

The Windows version will also run in a "Command" tool window. It may first appear in a rather small and unreadable window. By clicking on the symbol in the upper-right corner of the window (the one with a single large box) you should be able to get DNATREE to take up the whole screen and be much more visible.

The remainder of this section concerns getting it to run on Unix and Linux. These instructions also apply to the Mac OS X executable. That executable can be run as a Unix program by using the Terminal tool which is in the Applications folder.

For these versions, make sure that the file has "execute permission". It will have this if you made it by compiling it yourself; if not, all you need to do is issue the command

chmod +x dnatree.linux

(or whatever the name of the executable file is). This only has to be done once - the file will thereafter have execute permission and you need not do this any other time you log in. To run the executable once it has execute permission, just type the name of the file (such as dnatree.linux). You can simplify things by renaming the file to dnatree by a command such as

mv dnatree.linux dnatree

Again, that only needs to be done once. If you compiled the program yourself, the executable is already called dnatree.

To run the program, type its name. If your "path" does not include the current directory, you may alternatively have to type

./dnatree

where the "./" ensures that the executable comes from the current directory.


How to use DNATREE


The menus. DNATREE has three menus that the user must use. The first one sets the parameters for simulating the branching of the tree. The second one sets the parameters for the evolution of the DNA along that tree, and the third one sets the parameters for the interactive search for the most parsimonious tree.

Simulating the branching of the tree. The first menu looks something like this:

 DNATREE Molecular evolution and phylogeny version 1.2

 Tree menu:
    S                         Simulate a tree?  Yes, simulate
    N                       Number of species?  10
    C                         Number of sites?  200
    T           Transition/transversion ratio?  2.000000
    0      Graphics type (IBM PC, ANSI, none)?  ANSI

 Type character or type Y to accept these settings

Each of these lines tells of one option, shows its current setting, and gives you the opportunity to change it by typing a single letter (followed by pressing on the Enter key), S, N, C, T, or 0.

For the time being, simulating a tree is the only option provided, so the first menu item need not be changed. For the next two items, you can simulate trees of any size up to 100 species and DNA sequences of up to 2000 sites. For an initial run you might stick to the default values of 10 and 200. For the fourth item, the transition/transversion ratio, we have provided a default value that is typical of mammalian DNA, but you can choose another value if you want, anywhere from 0.5 up. The Graphics type option (0) is probably set correctly for your screen, but if the trees that are printed out have strange characters you will want to try other values for this option.

If you change any of the values, the menu will be redisplayed. Once you have the values you want and are ready to go ahead, select the Y (Yes) option. The program will then ask you for two numbers, the random number seed and the rate of evolution. The random number seed controls the random decisions that the program makes. It can be any number from 1 up to about 1000000000. It is wise to write down the number you used.

The random number seed is used to start the sequence of random numbers that are generated. A property of pseudorandom numbers is that they are not predictable one from another. Thus there are no rules as to what particular numbers lead to what particular kinds of results. If you choose a random number seed of (say) 38, you need not worry about whether you have somehow biased the results as compared to choosing (say) 692548236. But the sequence is deterministic. Thus if you do another run with the same parameter settings and the same random number seed, you should get the same results from DNATREE even on a different computer. This gives you the ability to duplicate any run that gives strange outcomes, if you happen to have written down the random number seed.
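The deterministic behavior of a seeded generator can be illustrated with a minimal Python sketch. This uses Python's own generator, not the one inside DNATREE, so the numbers themselves will not match anything DNATREE produces; the function name draws is made up for this example.

```python
import random

def draws(seed, count=5):
    """Return `count` pseudorandom integers (0..3) from a generator started at `seed`."""
    rng = random.Random(seed)   # an independent generator, seeded explicitly
    return [rng.randrange(4) for _ in range(count)]

first = draws(38)
second = draws(38)              # same seed, so the same sequence again
print(first == second)          # prints True
```

Repeating a run with the same seed reproduces the same draws, which is the property that lets you duplicate a run that gave a strange outcome.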

The rate of evolution is a positive number. It specifies the expected number of substitutions per site per unit time. A unit of time is the time it takes, on average, for a lineage to split. Thus if you use the value 0.1, each site will accumulate approximately 0.1 substitutions along each interior branch in the tree (but this will vary with the exact length of the branch). The terminal branches (the ones that lead to tips) will accumulate fewer changes as they do not continue until they branch.

If you choose a small value (say 0.01) of the evolutionary rate, the sequences will be very similar and differ at few positions. There will scarcely be any sites that will have undergone two or more substitutions in the course of the evolution of these sequences. But if you choose a larger value such as 10.0, you will see so many changes that there will be little similarity between the sequences, and a lot of changes whose effects overlay each other. There will then be a lot of evolutionary noise that will obscure the relatedness of the sequences.
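To get a feeling for how quickly overlaid changes build up, here is a small Python sketch. It assumes that the number of substitutions at one site, along a stretch of the tree where m substitutions are expected, follows a Poisson distribution with mean m. That is a standard assumption for models of this kind, not something taken from the DNATREE source code, and the function name is hypothetical.

```python
import math

def multiple_hit_probability(expected_subs):
    """P(two or more substitutions) when the count is Poisson with this mean."""
    # P(0) = exp(-m) and P(1) = m * exp(-m); everything else is a multiple hit.
    return 1.0 - math.exp(-expected_subs) * (1.0 + expected_subs)

for m in (0.01, 0.1, 1.0, 10.0):
    print(f"{m:5.2f}  {multiple_hit_probability(m):.6f}")
```

With a mean of 0.01 the probability of a multiple hit is about 0.00005, while with a mean of 10 it exceeds 0.999.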

The tree that is printed out will look something like this:


                                                       ,-1:D
  ,---------------------------------------------------19  
  !                                                    `10:G
  !  
  !                                            ,---------2:H
  !                                        ,--15  
  !                               ,-------14   `---------6:E
-11                               !        !  
  !                               !        `-------------5:F
  !      ,-----------------------13  
  !      !                        !                   ,-----4:J
  !      !                        !                ,-18  
  !      !                        `---------------17  `-----9:A
  `-----12                                         !  
         !                                         `-----8:C
         !  
         !                                     ,---------3:I
         `------------------------------------16  
                                               `---------7:B

The horizontal branch lengths are roughly proportional to the evolutionary time that elapsed in the simulation. All the lineages evolved for the same amount of time, measured from the root of the tree at the left. There was a "molecular clock" in that all lineages existed for equal amounts of time, and all evolved at the same average rate. But note that in the above diagram, there seems to be a violation of that, as J and A stick out farther to the right than do the other species. This is because the line from node 17 to node 18 was actually very short, but it has been expanded a bit to make it possible to read the node numbers. Other small inequalities in the horizontal position of the tips can result from rounding off of branch lengths to the discrete number of columns in your page.
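One simple way to produce a clocklike tree of this kind is a pure-birth process in which every lineage splits at rate 1 per unit time. The Python sketch below assumes that model, and also assumes that the simulation stops just before an additional lineage would arise; DNATREE's own branching model may differ in its details, and all the names here are made up for the example.

```python
import random

def simulate_clock_tree(n_species, seed):
    """Pure-birth tree as nested dicts; every lineage splits at rate 1."""
    rng = random.Random(seed)
    root = {"children": [], "start": 0.0}
    active = [root]              # lineages that are alive right now
    now = 0.0
    while True:
        # With k live lineages, the wait until the next split is Exponential(k).
        now += rng.expovariate(len(active))
        if len(active) == n_species:
            break                # stop just before one more lineage would arise
        parent = active.pop(rng.randrange(len(active)))
        parent["end"] = now
        for _ in range(2):
            child = {"children": [], "start": now}
            parent["children"].append(child)
            active.append(child)
    for tip in active:
        tip["end"] = now         # all tips are sampled at the same moment
    return root

def tip_depths(node):
    """Times, measured from the root, at which each tip's branch ends."""
    if not node["children"]:
        return [node["end"]]
    return [d for child in node["children"] for d in tip_depths(child)]

depths = tip_depths(simulate_clock_tree(10, seed=38))
print(len(depths), len(set(depths)))   # 10 tips, one shared depth
```

Each branch has length end minus start, and because every tip ends at the same moment, all tips are the same distance from the root.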

Important note: At the moment, DNATREE has no way of saving the original tree, except by writing it into a file in the standard parenthesis notation. This you may find hard to read, so if you want to compare your original tree to the parsimony estimate, I strongly recommend that you write the tree down by hand. You may miss it later on if you do not. On Macintosh systems you should be able to print a copy of the screen by using the Print command in the File menu.

You will be asked whether you want to write the tree out into a file. You need not do this, unless you want to take this tree and use it as input to some other program. So most people will want to answer N (No).

Simulating the evolution of the DNA. Next you will see the first 60 sites of the DNA sequences whose evolution was simulated at an imaginary locus along that tree. The sequences look something like this:

Here are the sequences that were simulated along that tree.
 A dot (.) means that the base is the same as in the first species.

Here are sites 1 through 60 out of a total of 200.

-----------------------------------------------------------------------------
A           CGTGGGACAA CGACCATCAA AGCCTCGCGG TCTCCCTCAA TCACTGGTGC CGGTTTCGCA
B           .......... .........C .......... .......... G.G......T A.....T...
C           .......... .........C .......... .......... G......... ..........
D           .......... .........C .......... .......... G.G....... A.....TA..
E           .......... .......T.C G......... .......... G......... A.........
F           .......... .........C .......... .......... G......... ..........
G           ....A.TT.. T.GTT....G ..T.AT...T .......T.C G.G.A.A.AT T.......G.
H           G......... .........C .......... .......... G.....A..T ....C.....
I           T.....T... .......G.C .......... ......G... G......... ..........
J           ....A.TTGG .AGTT....C ..T.A....T .......T.. G.G...A..T T.....A.G.
-----------------------------------------------------------------------------

The names (in this case A through J, but in larger cases they can have two letters) are the tips of the tree that you saw earlier. Each sequence is given as up to six blocks of 10 bases each. The dots (".") are to make the differences easier to detect. They mean "the same as in the first species". Thus at the first site in the above data, all species have C except species H and I, which have G and T, respectively. If you find the dots hard to look at, you will in a moment get the opportunity to turn them off and show all the letters ACGT at each site (at which point you'll realize why we had the dots).
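The dot-differencing convention is easy to reproduce. Here is a short Python sketch, using the first ten sites of species A, B, H, and I from the display above; the function names are hypothetical and are not taken from DNATREE.

```python
def dot_difference(sequences):
    """Replace bases that match the first sequence with '.', keeping the first intact."""
    reference = sequences[0]
    shown = [reference]
    for seq in sequences[1:]:
        shown.append("".join("." if b == r else b for b, r in zip(seq, reference)))
    return shown

def in_blocks(seq, size=10):
    """Split a sequence into space-separated blocks of `size` bases."""
    return " ".join(seq[i:i + size] for i in range(0, len(seq), size))

rows = dot_difference(["CGTGGGACAA", "CGTGGGACAA", "GGTGGGACAA", "TGTGGGTCAA"])
for name, row in zip("ABHI", rows):
    print(f"{name:<12}{in_blocks(row)}")
```

The first species is always shown in full, since every other row is defined relative to it.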

The statistical model of base change that this program uses is the Kimura 2-parameter model. This was introduced by the famous theoretical population geneticist Motoo Kimura (1980). It is the simplest possible model of DNA change that embodies unequal rates of transitions (A <-->G and C <-->T) and transversions (the other changes, in which one base is a purine and the other a pyrimidine). It will lead to (approximately) equal frequencies of the four bases.
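For readers who want to see the model in numbers, here is a Python sketch of the standard Kimura 2-parameter change probabilities along one branch. It assumes that the transition/transversion ratio means the expected number of transitions divided by the expected number of transversions, and that branch length is measured in expected substitutions per site. Those conventions, and the function name, are assumptions of this sketch rather than a description of DNATREE's code.

```python
import math

def k2p_probabilities(distance, ts_tv_ratio=2.0):
    """Return (P_same, P_transition, P_each_transversion) under Kimura's model.

    `distance` is the expected number of substitutions per site on the branch.
    `ts_tv_ratio` is assumed to be expected transitions / expected transversions.
    """
    alpha = ts_tv_ratio / (ts_tv_ratio + 1.0)   # rate of the one possible transition
    beta = 0.5 / (ts_tv_ratio + 1.0)            # rate of each of the two transversions
    e1 = math.exp(-4.0 * beta * distance)
    e2 = math.exp(-2.0 * (alpha + beta) * distance)
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_transition = 0.25 + 0.25 * e1 - 0.5 * e2
    p_transversion = 0.25 - 0.25 * e1           # for each of the two transversion targets
    return p_same, p_transition, p_transversion

for d in (0.01, 0.1, 1.0, 10.0):
    same, ts, tv = k2p_probabilities(d)
    print(f"{d:5.2f}  same={same:.4f}  transition={ts:.4f}  transversion={tv:.4f} (each)")
```

On very long branches all the probabilities approach 1/4, which is why the model leads to approximately equal frequencies of the four bases.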

Below the data display, you will see the menu:


Choose what to do next:
  N    See the next group of sites
  D    Turn off using dot-differencing
  Y    Finish looking at sequences
Type one of these letters, then press on Enter key

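The N option shows you successive groups of 60 sites. The P option does the opposite, going back to previous groups of 60. Neither is available when one is at the relevant end of the sequences, which is why P is not offered in the menu above, where the first group of sites is being shown. The D option turns off dot-differencing, so that all the letters ACGT are shown at each site. The Y option is what you type when you are tired of looking at the sequences and want to move on to the next part of the run.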

The program will ask you if you want to save the data in a file. You do not need to do this, unless you want to analyze the data in another program. This program keeps track of the data internally. Thus most users will want to answer N (No) at this point.

Searching for the most parsimonious tree.

After the data has been simulated, it is time to look at possible phylogenies and see how well they fit the data. The program now displays the third menu, which looks like this:


Interactive search for most parsimonious tree

Settings for this run:
  O                           Outgroup root?  No, use as outgroup species  1
  T                 Use Threshold parsimony?  No, use ordinary parsimony
  L               Number of lines on screen?  24

Are these settings correct? (type Y or the letter for one to change)

You will probably not want to change these settings. If you have more than 12 species, you will want to increase the number of lines on the screen, provided that your screen can actually display that many. The outgroup root at species 1 is arbitrary, but it can be changed later.

If you type Y (Yes) and accept the values, the program then generates a phylogeny at random, and displays it. You are then able to rearrange this, watching the parsimony (or compatibility) scores of the tree, to try to find the most parsimonious (or maximum compatibility) tree by yourself. You can also check on individual sites in the sequences, to see how well they fit the tree. The tree will initially look something like this (appearances vary on different types of computers):


  ,--------------------------3:H
  !
  !  ,----------------------10:D
-12  !
  !  !              ,--------6:G   Steps: 319.00000
  !  !              !
  `-19  ,----------15     ,--9:I
     !  !           !  ,-18
     !  !           `-16  `--7:J     118 sites compatible
     !  !              !
     `-11              `-----2:F
        !
        !           ,--------8:B
        !           !
        `----------17     ,--5:C
                    !  ,-14
                    `-13  `--4:E
                       !
                       `-----1:A

NEXT? (Options: R # + - S . T U W O F C H ? X Q) (H or ? for Help)

The tree grows from left to right, and each node on the tree (each fork or tip) has a number, to allow us to refer to it. The names of the species are also available at the tips. In addition, the numbers of changes of state (steps) that are needed to evolve the data set on this tree are shown. The number of sites that are compatible with the tree (that perfectly fit it) is also shown. The tree that is shown is a random one, not a good estimate.
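The step count for a single site can be computed with Fitch's algorithm, a standard method for counting parsimony changes on a binary tree. The Python sketch below is a minimal version for one site; it is not taken from the DNATREE source, and the nested-pair tree representation and the function names are assumptions of this example. It also tests compatibility, using the criterion that a site fits the tree perfectly when no base has to arise more than once.

```python
def fitch_steps(tree, bases):
    """Return (state_set, steps) for one site on a rooted binary tree.

    `tree` is a species name (a tip) or a pair (left, right) of subtrees;
    `bases` maps each species name to its base at this site.
    """
    if isinstance(tree, str):
        return {bases[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], bases)
    right_set, right_steps = fitch_steps(tree[1], bases)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps          # no change needed here
    return left_set | right_set, left_steps + right_steps + 1

def is_compatible(tree, bases):
    """True if the site needs only (number of distinct bases - 1) steps."""
    _, steps = fitch_steps(tree, bases)
    return steps == len(set(bases.values())) - 1

tree = (("A", "B"), ("C", "D"))
good_site = {"A": "C", "B": "C", "C": "T", "D": "T"}
bad_site = {"A": "C", "B": "T", "C": "C", "D": "T"}
print(fitch_steps(tree, good_site)[1], is_compatible(tree, good_site))   # 1 True
print(fitch_steps(tree, bad_site)[1], is_compatible(tree, bad_site))     # 2 False
```

Summing the step counts over all sites gives a total of the kind shown beside the tree, and counting the sites for which is_compatible is true gives the compatibility score.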

You will probably want to rearrange this tree to find the most parsimonious (or alternatively, the most compatible) tree. Thus you will want to minimize the number of Steps, or maximize the number of sites compatible with the tree. The menu at the bottom of the tree shows a number of commands that you can issue. The most important of these are:

R
Rearrange the tree. The program will ask you the number of a node to remove from the tree, and then ask you where to put the group that you have removed. For example, if you answered 17, the program removes the node 17 and all the nodes that are descended from it, and then asks you where you want to put this subtree. You might for example answer 6. The subtree will then be connected to the branch that connects nodes 15 and 6. It puts a new node into that branch, and that node has two descendants, 6 and 17. The resulting tree is then displayed, and if it is a tie or an improvement in the number of Steps the program will say so.
T
Asks for the number of a node, removes it from the tree, and then tries it in all possible places on the tree. Some lines are printed out that show which placements led to improvement, and which to ties. The group is then returned to its original position; it is up to you to use the R command to put it in the best possible position. Note that the T command involves Steps, but not compatible sites.
+, -, S, #
These commands show you the reconstruction of base changes for a site throughout the tree. + shows the next site, - the previous site, and S a particular site. The S command causes the program to ask you the number of the site you want; if you answer 0, the program returns to displaying the tree without the reconstructions. # shows the next site that is not compatible with the tree (that requires some base to arise more than once). Here is a display you might see (exact appearance varies between operating systems as we use special characters to show the bases when we can):

  ccccccccccccccccccccccccccc3:H       SITE   4         
  c 
  c  ttttttttttttttttttttttt10:D   a:A, c:C, g:G, t:T, .:?
 12  t 
  .  t              ccccccccc6:G   Steps: 319.00000
  .  t              c 
  ..19  ccccccccccc15     ccc9:I
     .  c           c  cc18 
     .  c           cc16  ttt7:J     118 sites compatible
     .  c              c 
     ..11              cccccc2:F
        t 
        t           ttttttttt8:B
        t           t 
        ttttttttttt17     ccc5:C
                    t  tt14 
                    tt13  ttt4:E
                       t 
                       tttttt1:A

NEXT? (Options: R # + - S . T U W O F C H ? X Q) (H or ? for Help)

The tree's branches are now made out of a's, c's, g's, or t's, showing which nucleotide is present at site 4 on which branch. In this case the branches leading from 12 to 19 and 19 to 11 have "." indicating ambiguity as to which nucleotide is there. A little consideration will show that this site is reconstructed as having had 4 changes, though there are two alternative ways that they could be placed. On ANSI terminals (such as Xterms on Unix systems) or on PC's, reverse video and graphics characters are used to make the bases easier to see.
O
Roots the tree using the Outgroup node that you specify. If you specified 11, for example, the root of the tree would be placed between nodes 19 and 11. Note that the number of steps or of compatibilities does not depend on where the root is. Thus you may be able to make the tree look more like the true tree by rerooting it using this command.
F
Flips the order of branches coming out of a node. This too does not affect the score of the tree but can make it look better.
W
Writes out the tree to a tree file, in the standard Newick format (which uses nested parentheses and commas).
Q, X
These cause the program to quit. (If you are on a Macintosh system, you will also have to use the File menu in the upper left-hand part of the screen and choose Quit. You will probably not want to save the file dnatree.out).
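The nested-parenthesis notation mentioned under the W command is simple enough to sketch in a few lines of Python. This version writes only the tree's shape, without branch lengths; whether DNATREE's tree file includes branch lengths is not stated here, and the tuple representation and function name are assumptions of this example.

```python
def to_newick(tree):
    """Convert nested tuples of species names to a Newick string (no branch lengths)."""
    def walk(node):
        if isinstance(node, str):
            return node                       # a tip is written as its name
        return "(" + ",".join(walk(child) for child in node) + ")"
    return walk(tree) + ";"                   # a Newick tree ends with a semicolon

tree = (("A", "B"), ("C", ("D", "E")))
print(to_newick(tree))   # ((A,B),(C,(D,E)));
```

Each pair of parentheses encloses the descendants of one interior node, separated by commas.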


Who wrote DNATREE


Much of the code was written by me, Joe Felsenstein. Large parts of the tree-rearrangement code are by Andrew Keeffe, and the C conversion of that code was done by Akiko Fuseki. This program is used by me for teaching my Genome 453 and Genome 570 courses, and may be useful to others as a teaching tool.

Thanks to Tudor Ionescu for finding a bug and fixing it.


Copyright

This program and its documentation are copyright to The University of Washington, 1999-2004. Their free distribution and use are permitted, including using multiple copies in classrooms. Resale or use in a commercial product is not permitted without our permission.


Getting PHYLIP

If you want to get the PHYLIP package, which is free, you will find it on the Web at this site, and the most comprehensive listing of available phylogeny programs will also be found at that site.


Why we did it this way

Yes, I know, we should have done it your favorite way. But consider:

1.
No, we could not easily have made it use a mouse and windows, and respond to the mouse. You see, we wanted to make it work on all operating systems. Sure, we could have written it for Windows only, or for Mac OS X only, or for X Windows only. But to write it for all three, we would have had to either license (expensively) a multi-platform windowing library, or write our own (expensive in time and effort), or use a free multiplatform windowing library such as wxWindows. To do that, we might have to distribute large chunks of it with our source code.
2.
No, we could not have just distributed one version, the one for the greatest operating system, the one that soon everyone will be using, the one that you happen to like. We just don't know which one that is. People keep changing their view on which is the one great computer system which will take over the world. They used (in the early days of my package PHYLIP) to say we should only do it for Apple II's, or for IBM mainframes! And they were very sure of themselves. A couple of years later we'd run into the same person, now advocating something else with equal stridency and not a trace of embarrassment.


Literature Cited

  

Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120.

Kuhner, M. K. and J. Felsenstein. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Molecular Biology and Evolution 11: 459-468 (Erratum 12: 525, 1995).


Joe Felsenstein
2008-04-22