## Search Strategies for LAMARC

To use LAMARC effectively you must decide on a search strategy.

Your first choice is whether to do a likelihood-based search and analysis or a Bayesian-based search and analysis. The Bayesian search is a new option in version 2.0, and comes with its own set of issues, but setting the search parameters for a Bayesian run is much simpler. These are discussed below, after a discussion of search strategies for a likelihood analysis.

The basic unit of the LAMARC search is a "chain", a sequence of genealogies made using the same working values of the parameters. Basic chain control involves setting the length, number, and kind of chains that are run. Two advanced techniques, heating and replication, are available to improve results in difficult cases.

For each set of techniques, we provide an overview of how they work, followed by concrete advice on what values to use.

## Likelihood-based Search Strategies

### Basic Chain Control

Basic chain control involves setting the number, length, sampling interval, and initial discards of each chain.

The fundamental reason for running more than one chain is that the Metropolis-Hastings sampling algorithm is inefficient and possibly biased if its driving (starting) values are too far from the true values. We can try to pick good driving values, but it is also useful to let the program itself improve its driving values. We do this by running multiple chains, starting each one with the results of the previous one.

LAMARC provides two kinds of chains, "initial" and "final," to support a strategy of several brief initial chains to get driving values, and then one or two much lengthier final chains to narrow in on the final estimate.

The length of a chain controls how much it will be able to refine its estimate. The number of chains controls how many chances the program has to change its starting values.

We do not normally use every genealogy from a chain to construct the parameter estimates. Since successive genealogies are very similar, using all of them would waste time and memory. Instead, we sample at intervals. The larger the sampling interval, the more information each sample contains (because they are more independent).

To reduce the influence of the starting genealogy, it is possible to discard, without sampling, the first few genealogies of each chain (also called "burn-in"). We recommend doing this.

#### Advice on Basic Chain Control

In both the initial and final chains, a reasonable sampling interval is one that has at least one accepted genealogy per sample. Thus, if your run accepts only 5% of proposed genealogies, your sampling interval may as well be 20. Sampling more often will only lead to sampling the same genealogy over and over. (Recall that each Markov chain evolves by rearranging the latest genealogy and deciding whether to accept this rearrangement or retain the old genealogy.) We tend to use 20 as a standard value for sampling interval, but if your acceptance rate is very low, a longer interval may be preferable.

It is probably wise to set burn-in (discard) to discard the first 5% of each chain, especially in cases with many populations. The early genealogies can be very unreasonable. Burn-in is more important for initial than for final chains.

We have found that a good general strategy is to run 5-10 fairly short initial chains. It is not worthwhile to make them very long, as they are only being used to get a rough idea of the parameters. Once the initials chains have established good starting parameters, one can run 1 or 2 final chains which are 10x or 100x longer, and can therefore give a more precise estimate with more accurate error bars.

How many initial chains are needed? Since their purpose is to reach good starting values of the parameters, there should be enough of them that the parameter estimates have stabilized. A symptom of too-few chains is parameter estimates that are still changing directionally at the end of the initial chains:

chain 1: Theta = 0.0100

chain 2: Theta = 0.0157

chain 3: Theta = 0.0210

chain 4: Theta = 0.0248

It seems likely that if more chains were run, the estimate would continue to increase. You should run enough chains that the estimates appear to be varying around a point, rather than continually increasing or decreasing.

How long does an initial chain need to be? If it is too short, it will not be any help in finding better parameter estimates. One symptom of too-short chains is estimates that leap wildly from chain to chain:

chain 1: Theta = 0.0100

chain 2: Theta = 0.2301

chain 3: Theta = 0.0047

chain 4: Theta = 0.0599

Estimates of recombination and migration rates do jump around more than estimates of Theta, but you should be suspicious of estimates that change by orders of magnitude. This probably means that the chains sample so few trees that they get a lopsided view of the likelihood surface.

The actual number of steps needed per chain will depend on your data set; data sets with few individuals, highly variable sequences, and few parameters will stabilize more quickly than others. Please note that the default values in LAMARC are definitely on the short side. We did this to avoid anxious email from new users saying "Nothing is happening," but you should probably increase the settings, and should certainly not decrease them.

You can also use the posterior log-likelihood value ("Posterior lnL") given by the LAMARC progress reports (and repeated in the output report if you ask for "verbose" output) to diagnose too-short or too-few chains. This value should be no more than 2-3 times than the number of parameters you are estimating. For example, in a 3-population case with migration and recombination you are estimating 10 parameters (3 Thetas, 6 migration rates, 1 recombination rate). If the posterior log-likelihood is much greater than 20 or 30 even in your final chains, you should try increasing the number or length of chains. (In complex migration cases you may never succeed in getting the posterior log likelihood to decrease, so this is not an absolute rule, but extremely high numbers--10 times the number of parameters or more--are definitely cause for concern.)

Please do not compare posterior log-likelihood values between runs, or use them for likelihood ratio tests. They do not have any absolute meaning; they only show how much better the ending values were than the starting values. The particular value has no bearing on whether the model is a good one. For information about the probable error of your estimates, you should consult the confidence intervals and the profile likelihood tables in the output report.

One final number to observe in setting your chain lengths is the acceptance rate -- the proportion of proposed trees which are retained rather than discarded. If this is very low, your chains are not moving around in the search space (they are "stuck"), and you will need to use much longer chains to get good results. It is also a good idea to consider heating, a tactic discussed under Advanced Chain Control. An acceptance rate below 1% is certainly a problem, and below 5% is worrisome.

Two additional tools are available for improving your estimates, especially in difficult cases such as migration models with many populations. Replication creates several replicates of each chain, using different random starting points, and can help when the estimates are highly variable from chain to chain. Heating supplements the search process with additional, more adventurous searchers who can report back on good genealogies that they find, and can help when the search tends to remain "stuck" near its starting value or when the acceptance rate is low.

Replication involves repeating an entire set of chains several times using different starting genealogies, and then combining the results (using the algorithm of Geyer 1991). If you run the program several times and its answers are not consistent, replication can help. It is also useful when estimates vary wildly from chain to chain. Running N replicates will slow the program N-fold, plus some additional slowdown to construct the combined result. (In the long run we hope to allow multiple replicates to be run on different processors of a multi-processor machine.)

One advantage of replication is that it produces more accurate error bars. The LAMARC algorithm assesses the likelihood curve very precisely near the values where the chain is run, but less accurately elsewhere. Error bars based on a single chain may therefore be inaccurate (usually too narrow), since they are based on the curvature far from the maximum. Combining several replicates gives a broader region of accuracy for the curve.

Heating, or MCMCMC (Metropolis-Coupled MCMC or "MC cubed"), is a more radical change in the search strategy. It involves splitting each chain into several searches running at different "temperatures." One search, run at a "cold" temperature, explores the normal likelihood surface. The others, run at "hot" temperatures, explore flattened-out, "melted" versions of the surface. This enables them to search more adventurously, but would produce distorted parameter estimates, so we do not use the hot results directly for estimation. Instead, we allow them to swap good genealogies into the cold search. In this way, the cold search has access to possibly-good genealogies that it might otherwise never find. If you run more than one "hot" search, these "hot" searches can be run at different temperatures, and these can swap genealogies among themselves as well.

If you use N different temperatures, the program will slow down approximately N-fold. (Someday we hope to allow multiple replicates to be run on different processors of a multi-processor machine.) However, in cases where the search is performing badly, we find that a heated run with three temperatures is often much more successful at getting good estimates than an unheated run of triple length.

You can set the "static" option and determine the temperatures of the heated chains yourself, or use the "adaptive" option which allows the program to adjust the temperatures as it runs. The adaptive scheme adjusts the upper temperature downwards if the swapping rate falls below 10% and adjusts it upwards if the swapping rate exceeds 40%. The rationale is that too little swapping means that the search is stuck, whereas too much swapping indicates that the chains are all searching the same space and therefore not contributing usefully. Temperatures are re-evaluated after each chain has had approximately one chance to swap (which may be too often; we don't have much experience with adaptive heating yet).

Consider replication if your estimates vary wildly from one run of the program to another or from one chain to the next, and if lengthening the chains is not helping much. Replication is also strongly advised if you need your error bars to be highly accurate; results based on a single set of chains may have error bars that are narrower than they should be.

Consider heating if your parameter estimates do not move away from their starting values, and if this is true for several different sets of starting values. Also consider heating if your acceptance rate is very low (definitely if it is below 1%, and probably if it is below 5%). Heating may also help with the same problems that replication does.

Consider both replication and heating if both of these are true; the program will take a long time to run, but it's your best chance of getting good estimates. It never hurts to try replication and/or heating if you have enough computer time. They should never make the estimates worse.

Three replicates seem to work well unless the estimate is terribly unstable. Each chain will still have to be adequately long, although you can be a little more tolerant of values that leap around from chain to chain. Replication does not decrease the number of initial and final chains required to get a good estimate.

For heating, our limited empirical experience recommends three or more temperatures. The optimal temperatures seem to depend heavily on the data and force model. 1, 1.1 and 1.3 may be good first choices, but it might be necessary to include really hot temperatures if the run still gets stuck. It is important to check whether the chains are able to interact with each other; the chains will stop interacting when the temperature differences are too large. For difficult problems one might need to insert many heated chains, for example with temperatures: 1.0, 1.1, 1.3, 1.6, 3.0, 6.0, 12.0

Note that the final estimate is based on results from the cold search, so the cold search must be at a temperature of 1.0 for correct results. Temperatures below 1.0 are not allowed. No two temperatures should be the same, as the duplicate is wasted.

Adaptive heating is a new and experimental idea. Try it if you have plenty of time to experiment; stick to static heating if you need results quickly.

The LAMARC progress reports include a table of swapping rates which indicate how often genealogies are exchanged between searches. Try to keep this number in a moderate range (10% to 40%). If more than half of the genealogies are swapped, your hot searches are not hot enough--they are exploring the same areas as the cold search. If almost none are swapped, the temperature differences between the chains are too large, and the significant computation effort spent is wasted.

Heating may allow you to decrease the length of each chain. You will probably still need as many initial and final chains as before.

## Bayesian-based Search Strategies

Unlike a likelihood-based search, a Bayesian search is not dependent on driving values during its search. As a result, the strategy of using multiple chains to get the best driving values is not nearly as helpful. In fact, we recommend that you perform only one long chain, instead of the several we recommend if you are doing a likelihood run.

The easiest way to do this is to set the number of initial chains to zero, and the number of final chains to one. Then increase the number of samples, the sampling interval, and the number of samples to discard by about a factor of two each. The increase to the sampling interval is useful because LAMARC is now dividing its time between re-sampling trees and re-sampling the parameters, and so needs twice the time it used to need to visit the same number of trees. The increase in the number of samples may not need to be increased by as much as a factor of two, but doubling this will approximate the amount of time that would be spent on all the initial chains in the likelihood setup. Increasing the number of discarded samples gives you a little more leeway in getting away from the initial de-novo genealogy to ones that better fit the data.

Examining the curvefiles produced by a Bayesian run should tell you more about whether the search parameters you have chosen need to be longer, or if you can get by with shorter searches. A curve with multiple peaks probably needs to be run longer, and might also need larger intervals between samples, particularly in LAMARC runs that attempt to estimate many parameters.

Both of the advanced techniques above (replication and heating) can be used in a Bayesian run, but heating is likely to be much more helpful than replication. Heating decreases your chance of getting 'stuck' in particular areas of tree-space, but keeps the ability to compare different areas of tree-space directly. Replication in a likelihood run is designed to mitigate the effects of particular driving values, but a Bayesian run doesn't have analogous driving values, so a search here merely investigates tree-space starting from a different initial position, without the advantage of being able to compare trees directly. Heating is therefore directly advantageous in a Bayesian run, whereas the only practical use of replication is to give you more feedback on whether your search was adequate. A generously adequate search will be similar among replicates; if the replicates are very dissimilar, either the run is too short or the data are not informative.

The Tracer program of Drummond and Rambaut can be very useful in assessing whether your Bayesian search has run long enough, when used in conjuction with Tracer output from LAMARC (available as of version 2.0.3). A trace, for any parameter, which is still rising or falling systematically at the end of the run indicates a too-short run. Ideally, the rising or falling portion of the graph should be over before the end of burn-in, and by far the majority of the run should be exploring a plateau.

Unfortunately, good-looking Tracer results do not guarantee a sufficiently long run, as the program may be exploring a plateau which does not contain the maximum. There is probably no way, even in principle, to prove that the search has been sufficiently comprehensive, other than to use an exhaustive search (impractical for any but the smallest data sets). Bad-looking Tracer results, on the other hand, are reliable indicators of a bad run.