(Previous | Contents | Next)

Bayesian-LAMARC tutorial: How?

This tutorial is designed to be read from beginning to end, but if you like you can jump straight to:

I'm ready to go! What do I do?

First off, you should have read the basic tutorial at least to the point where it links to this document (though it is wise to read the whole thing, as many of the concepts are the same). You should also read our tutorial why you might want to use a Bayesian analysis.

Now do the following:

Step 1: Use the converter to get a LAMARC infile from your data.

Step 2: run LAMARC and tell it about that infile.

For the purposes of illustration, I'm going to assume that you have at least two genetic regions in your data set, coming from at least two populations.

Now we're going to use the menu. To change from a likelihood run to a Bayesian run, first select the 'Search Strategy menu' (S), and toggle the 'Perform Bayesian or Likelihood analysis' option (P). If you're experimenting and just want to see a Bayesian run, that's all you'd need to do--you could hit '.' now and run LAMARC. But let's explore some of the other options available to us in a Bayesian run.

First, you'll notice that you now have a new menu option available to you entitled, 'Bayesian Priors Menu' (B). Select that, and you'll get a sub-menu listing the different active forces, and a summary of what the priors look like for each of them. By default, the priors for all forces but growth are logarithmic, so you'll see something like:

Bayesian Priors Menu
  T  Bayesian priors for Theta                             (all logarithmic)
  M  Bayesian priors for Migration Rate                    (all logarithmic)
     <Return> = Go Up | . = Run | Q = Quit

Hit 'T' and it'll take you to a list of all the priors for the thetas, including a default. You can then edit the default with 'D', or edit one particular prior by selecting that prior's number. This will take you to a menu like the following:

Bayesian Priors Menu for Theta for Population 1
    Priors may be either linear or logarithmic.  The lower bound for
     logarithmic priors must be above zero, in addition to any other
     constraints an evolutionary force might have.
  D  Use the default prior for this force                                Yes
  S  Shape of the prior                                                  log
  U  Upper bound of the prior                                             10
  L  Lower bound of the prior                                          1e-05
     <Return> = Go Up | . = Run | Q = Quit

As you can see, you have two ways to change the prior--you can change the boundaries, and you can change its shape (or density). The 'S' option will toggle the shape between logarithmic and linear, and you can set the upper and lower bounds with the 'U' and 'L' options. This is your opportunity to input what you already know about the parameters you wish to estimate. It's obviously important to get the units right, so be sure to read the forces section of the manual, and figure out any differences between how you typically think about your parameters and how LAMARC uses those parameters. (One typical 'gotcha' is that LAMARC always uses per-site estimates, but some researchers use per-locus estimates.)

OK, I've set my priors. Am I done?

Not quite--the standard search strategy is not really appropriate for a Bayesian run, so we're going to change it. Go to the top menu by hitting 'Return' a few times, then select 'S' ('Search Strategy Menu'), then S again ('Sampling strategy (chains and replicates)'). Once here, change the number of initial chains to 1 and final chains to 1 (options 1 and 5). Finally, change the number of replicates ('R') to 3. (In a production run, you're probably better off changing the 'Final number of samples' to 30,000 instead of changing the number of replicates to 3, but we want a bit more feedback for this sample run, which replicates will give us.)

Now am I done?

Yes! One quick thing, however: from the main menu, select S (Search Strategy Menu), then R (the Rearrangers Menu). Here's where you can change the various rearrangers, including (now) the Bayesian rearranger. This controls how much relative time is spent sampling new parameters (the Bayesian rearranger) vs. sampling new trees (all the other rearrangers). In a run where you are trying to estimate many parameters (say, in a system with several populations), this menu is where you could increase the time spent resampling from those parameters.

But for now, we'll leave it as it is, with the Bayesian rearranger set to the same relative frequency as the Topology rearranger. Hit Run ('.') and I'll walk you through the output.

OK, this 'Initial chain 1' thing is finished.

This probably looks something like:

14:33:01  Initial chain   1:  [====================]         1000 steps
14:35:47  Predicted end of chains for this region:  Fri Apr 22 08:32:28
14:35:47  Accepted    17% | Point Likelihood 2.09739689 | Data lnL -3268.84119
Trees discarded due to too many events:        2
Trees discarded due to too small population sizes:        0
Trees discarded due to an infinitesimal data likelihood:        0
Trees discarded due to extremely long branch lengths:        0
Bayes-Arranger accepted            80/421 proposals
Tree-Arranger accepted             81/468 proposals
Tree-Size-Arranger accepted         9/111 proposals
Number of unique sampled values for each parameter:
    9: Theta for population number 1
    4: Theta for population number 2
   20: Migration rate into population number 1 from population number 2
   31: Migration rate into population number 2 from population number 1
  Class                  Theta
  population number 1   0.002060
  population number 2   0.009636
  Population                     Mig
  population number 1     --------  184.1942
  population number 2     48.03229  --------
14:35:47  Final chain     1:  [|                   ]          325

Much of this is the same as in the basic tutorial, but let's revisit all the pieces anyway.

Initial chain 1
For this Bayesian run, we set up a single initial chain and a single final chain. The initial chain is not used in the final estimation of parameters, but serves (along with 'burn-in', or the discarded samples for each chain) to get the estimates away from their starting values and the trees away from the initial tree. It also gives you a rough idea of how the run is going.

Predicted end of chains
LAMARC's estimate for how long it will take LAMARC to get through all of the replicates for this genetic region.

Accepted 17%
The total acceptance rate for all the various arrangers. While this can be helpful (acceptance rates should normally fall in the 5-50% range, and typically reside around 10% or so), it is usually more helpful to examine the acceptance rates for the individual arrangers.

Point Likelihood 2.09739689
An average of each parameter's posterior point likelihood at its maximum probability. Not all that useful on its own, but it can be compared to other Point Likelihoods for other chains or regions. The higher this number, the thinner (on average) the confidence intervals, while the lower this number, the wider the confidence intervals.

Data lnL -3268.84119
The data log likelihood. This number will probably be very negative, should increase for the first few chains, then level off for the last few chains. It's the probability of the last tree in the chain given your data, which is a measure of how well the tree fits your data. Since a large data set is highly unlikely to be produced by any given tree, the low values in themselves are not a problem; but they should not decrease significantly as LAMARC's search continues.

One other thing you should note is that if you add more sites or more individuals to your data set, this number will go down. A larger data set is intrinsically more unlikely (requires us to posit more events to explain it) than a smaller one. So a tremendously negative data log likelihood is not a symptom of impending doom, just a sign of a big, juicy data set.

Trees discarded due to...
Sometimes, LAMARC will discard trees because the trees themselves are inherently too tricky to deal with. Almost always, these trees would also be rejected from having too low a likelihood, so you shouldn't worry about this too much unless one of these numbers gets very high (say, larger than 5% of the total number of proposed trees). If that happens, your starting parameters might be too extreme, or you might be calling two populations different when they are actually genetically identical (rejections due to 'too many events' can have this cause). Generally, though, it's just LAMARC being efficient.

If absolutely no trees are rejected, you'll see the message "No trees discarded due to limit violations." which means you're fine.

Arranger accepted
This is a more detailed breakdown of the 'Accepted 7.39%', above. The three arrangers on by default in Bayesian LAMARC run are the Bayes arranger, which picks a new value for one of your parameters from that parameter's prior; the Tree-Arranger, which breaks a branch of a tree and then re-attaches it; and the Tree-Size-Arranger, which preserves the topology of the tree but picks new sizes for some or all of the branches. The Bayes-Arranger and Tree-Arranger are absolutely required for a Bayesian run, since the first samples the parameters and the second samples the trees. The Tree-Size-Arranger is more of a helper function, which is why (by default) it only searches 1/5 as much as the Tree-Arranger.

These numbers can vary fairly widely, but each should normally fall in the 5-50% range, with 10% being typical.

Number of unique sampled values for each parameter:
Here, we see the results of the Bayesian acceptance rate. Each parameter is listed, together with the total number of unique points collected for it (each point may have been sampled multiple times, either because proposed new values for that parameter were rejected, or because that parameter had no new proposed values between sampling steps). For the initial chain, it's not important that these numbers be very high, but for the final chain it will be vital. The more data points you have, the better resolution of your peaks you will get, but in general, you'll need at least 100 unique data points to get an okay curve, and probably in the 1000s or greater to get a good curve.

It is important to note that these values are not 'Effective Sample Size' values (ESS). LAMARC does not calculate ESS, but provides an output file that can be used with the program Tracer to calculate ESS.

Theta and Mig
These are the parameters LAMARC is trying to estimate. You will be estimating one theta value for every population in your data, and two migration rates for every pair of populations in your data. In this case, with two populations, that means a theta for each (0.002060 for population 1 and 0.009636 for population 2), and two migration rates (184.1942 for the rate from pop2 to pop1, and 48.03229 for the rate from pop1 to pop2). These are the peaks of the posterior likelihood curves for your parameters. As we are in the Initial chain, this doesn't mean much, but these same estimates from the Final chain will be reported as LAMARC's estimates of your parameters.

OK, it finished 'Final Chain 1', then told me it wrote to a bunch of files. What are they?

The main output file will match the output file from a likelihood run; more information can be found in this section of the main tutorial, and still more information can be found in the Output Files section of the main manual. The principal difference between a Bayesian output file and a likelihood output file is that Bayesian parameter estimates are known as Most Probable Estimates (MPEs) instead of Maximum Likelihood Estimates (MLEs). This is because a Bayesian run produces probability density functions from which we read off the peak, instead of calculating likelihoods.

This difference also shows up in the profile tables of the outfile, where the Point Probabilities of the parameters are reported instead of the Log Likelihoods. Again, this is due to the type of analysis being done. Point probabilities are the absolute values of the probability density function for that parameter value. They can be used to compare their magnitudes at different parameter values, but are meaningless outside of that context.

Finally, the parameters are profiled without regard to other parameters in the run. As noted earlier, this can mask any correlations that might legitimately exist in your data, but the profiles are otherwise accurate.

There are also these 'curve files'. What do I do with them?

Curve files, as you no doubt remember, are the full detailed output from a Bayesian run, and you should definitely look at them. The easiest way to do this is to import the file into a spreadsheet program like Excel (the numbers are tab-delimited), highlight the two columns of data, and select 'make a graph of this' (you'll want an X-Y Scatter Plot type graph). You can then simply look at the resulting curve to see if it's lumpy or nicely monotonic.

So what does it all mean?

An excellent question, and one that we attempt to answer in some detail in Analyzing the Rest of Your Data.

(Previous | Contents | Next)