## Glossary

This article gives informal definitions of terms used in the LAMARC documentation. For more precise definitions, consult our published papers.

• Bayesian analysis. In the LAMARC context, an analysis which places priors on the population parameters and then samples both possible trees and possible parameters. It reports the relative probabilities of the range of parameter values which it visited. The alternative is a Likelihood analysis.

• Brownian-motion model. A mathematical approximation of the stepwise model of microsatellite mutation. It will run much faster, but may be inaccurate if your data are nearly invariant.

• Burn-in. Discarding some steps of the sampler before beginning to record any. This is done because early steps may be highly atypical. For example, if the sampler starts with a deeply unreasonable tree, the first few trees it produces will also be unreasonable and probably should be discarded.

• Chain. LAMARC grinds for a while producing trees, and then makes an estimate based on those trees. It may repeat this cycle many times, especially in a likelihood run. Each such cycle is a "chain." "Initial" chains are used to get reasonable starting values of the parameters, and are generally shorter; "final" chains are used to obtain the actual parameter estimates, and are generally longer. (A Bayesian run is often just one huge final chain.)

• Coalescence. Kingman developed the idea of looking at a genealogy going backwards in time from modern individuals to their ancestors. He named the resulting pattern the "n-coalescent" (later writers have dropped the "n-"). So a coalescent tree is the backwards pattern of relationships among modern sampled individuals, and a coalescence is a point at which two of those individuals have a common ancestor. In LAMARC, we sometimes use "coalescence" to refer to the effect of Theta on the shape of the coalescent tree, as opposed to the effects of migration, recombination, growth, and so forth. The proper name of this evolutionary force is "genetic drift."

• Curve file. In a Bayesian run, a file containing data points from the probability density function that results from collecting the parameters which the chain has visited. It can be read into a spreadsheet (Excel) or math/stat package (Mathematica) to produce a picture of the Bayesian posterior probability curve.

• Data likelihood. The probability of your sequence data given a specific tree (genealogy). This is also computed by maximum-likelihood phylogeny algorithms such as PHYLIP's DNAML. Usually expressed as a log-likelihood because it's so small; the resulting numbers are negative, and are improved by moving closer to zero.

• Data uncertainty. LAMARC runs on nucleotide data can now model data uncertainty. The per-base error rate gives the rate at which each single instance of a nucleotide should be assumed to have been miscalled. A value of 0 indicates that all bases were sequenced correctly; a value of 0.001 indicates that one in one thousand is incorrectly called. Note that this is different from using the IUPAC ambiguity codes, as it privileges one (or more) data values over another. It may be used simultaneously with the IUPAC codes. As of December 2009, the per-base error rate must be set to a single value for each segment. See the data uncertainty section in the "Instructions for using the LAMARC menu" documentation for more information.

• Divergence. Splitting of an ancestral population into two descendant populations. There may optionally be ongoing migration between the descendants after the split.

• Effective population size. A theoretical concept that converts a real population into an idealized Wright-Fisher population. The effective population size is the size of a Wright-Fisher population that would have the same amount of genetic drift (same Theta) as our real one. Most real populations have an effective population size much smaller than their census size due to factors like non-breeding individuals, overlapping generations, and unequal reproductive success between the sexes.

• Epoch time. In LAMARC, an "epoch" is the period of time between one population splitting event and the next, in a model with divergence. The parameter reported for LAMARC epoch times is the time of the population split in generations times the mutation rate per site in mutations per generation, counting backwards from the present (when the samples were collected). Note that it is a time (the time at which the populations split) and not a length of time (the length of time between one population split and the next). If you wish to know the split time in generations or years, you will need an external estimate of the mutation rate.
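As an illustration of the conversion described under Epoch time, here is a minimal Python sketch. The function names, the example mutation rate, and the generation time are hypothetical values chosen for the example; they are not LAMARC output or defaults.

```python
def split_time_in_generations(epoch_time, mu_per_site_per_generation):
    """Convert a reported epoch time (generations x mu) back to generations.

    mu must come from an external estimate; LAMARC does not supply it.
    """
    return epoch_time / mu_per_site_per_generation


def split_time_in_years(epoch_time, mu_per_site_per_generation, years_per_generation):
    """Convert a reported epoch time to years, given a generation time."""
    generations = split_time_in_generations(epoch_time, mu_per_site_per_generation)
    return generations * years_per_generation


# Hypothetical inputs: epoch time 0.002, mu = 1e-8 per site per generation,
# 25 years per generation.
print(split_time_in_generations(0.002, 1e-8))
print(split_time_in_years(0.002, 1e-8, 25.0))
```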

• F84 model. Evolutionary model proposed by Joe Felsenstein (in 1984) in which the frequencies of the four nucleotides and the ratio of transitions to transversions may be varied. Several simpler models such as Kimura 2-Parameter and Jukes-Cantor can easily be expressed as subsets of F84. (They can also be expressed as subsets of GTR but this will be much slower.)

• Growth (g). The parameter governing the exponential growth model used by LAMARC. Theta at any time t (where t is measured increasing into the past, and is in "mutational units" where one unit of time is the expected time until a site mutates once) is equal to modern-day Theta times exp(-gt). Positive values of g thus indicate a growing population and negative values a shrinking one. Interpretation of the actual value requires knowledge of the mutation rate, because of the use of mutational units in defining time. However, even without this, g values from organisms of similar mutation rate can be compared.
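The growth formula in the Growth entry can be tried out with a short Python sketch. The function name and the numbers are hypothetical; only the relationship Theta(t) = Theta(now) times exp(-gt) is taken from the definition.

```python
import math


def theta_at_time(theta_now, g, t):
    """Theta at time t in the past, with t in mutational units.

    Implements theta_now * exp(-g * t) from the glossary definition.
    """
    return theta_now * math.exp(-g * t)


# Hypothetical values: modern-day Theta of 0.01, looking 0.005 mutational
# units into the past under growth, no growth, and shrinkage.
for g in (100.0, 0.0, -100.0):
    print(g, theta_at_time(0.01, g, 0.005))
```

With positive g the past Theta is smaller than the modern one (the population has grown); with negative g it is larger (the population has shrunk).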

• GTR model. The General Time-Reversible Model, the most general easily tractable model of nucleotide sequence data. Can be used to emulate simpler models, but if a model can be expressed as a simplification of F84 instead it will run faster.

• Haplotype. A collection of markers known to come from the same chromosomal copy in an individual. See "Phase."

• Heating. Improving the search performance of a Markov chain Monte Carlo sampler by adding additional searches which see a smoothed-out or "heated" version of the search space. We refer to each additional search as a "temperature". This method is also known as 'Metropolis-coupled Markov chain Monte Carlo', or MCMCMC, or MC3.
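For readers unfamiliar with MC3, the step that couples the temperatures is usually an occasional proposal to swap the states of two searches. The Python sketch below shows the standard textbook acceptance rule for such a swap. It is not taken from LAMARC's source code; the function and argument names are hypothetical, and it assumes a search at temperature T samples from the posterior raised to the power 1/T.

```python
import math


def swap_acceptance(log_post_cold_state, log_post_hot_state, temp_cold, temp_hot):
    """Probability of accepting a state swap between two heated searches.

    log_post_*_state is the unheated log posterior of the state currently
    held by each search. Assumes a search at temperature T targets
    posterior ** (1 / T).
    """
    beta_cold = 1.0 / temp_cold
    beta_hot = 1.0 / temp_hot
    log_ratio = (beta_cold - beta_hot) * (log_post_hot_state - log_post_cold_state)
    if log_ratio >= 0.0:
        return 1.0  # the swap is always accepted when it favors the cold search
    return math.exp(log_ratio)


# The hot search has found a better state: the swap is certain.
print(swap_acceptance(-100.0, -90.0, temp_cold=1.0, temp_hot=2.0))
# The hot search holds a worse state: the swap is accepted only sometimes.
print(swap_acceptance(-90.0, -100.0, temp_cold=1.0, temp_hot=2.0))
```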

• K-Allele model. A mutational model which assumes that there are K states which a marker can be in, and change from any state to any other is equally probable. Jukes-Cantor is a K-Allele model (with K=4) for DNA data (though in LAMARC you will need to use an actual DNA model to get the same effect). The K-Allele model is suitable for data where we do not know the pattern of mutation, such as presence/absence of a chromosomal modification, or for electrophoretic mobility data. It may also be useful for microsatellites which do not fit a stepwise model well.

• Likelihood analysis. In the LAMARC context, an analysis which collects trees at a particular driving value of the parameter(s) and then uses those trees to make a likelihood curve for other values of the parameters. The alternative is a Bayesian analysis.

• Locus. In ordinary genetic terminology, a gene or other defined piece of chromosome. In LAMARC, we use the terms segment and region to indicate sections of genetic information that would be referred to in other places as 'loci'. We try to be consistent in using our terms, but may have slipped on occasion--it's likely we mean 'segment' if we accidentally said 'locus' somewhere.

• Marker. A position along the chromosome for which we have collected data. This is in contrast to "site", which is any position on the chromosome within the area we are considering, even if we didn't collect data for it. SNP data involves choosing a few interesting markers out of a much larger collection of sites.

• Markov chain Monte Carlo. A strategy for integrating functions which cannot be solved directly, such as finding the probability of sequence data given population parameters. "Markov chain" means that each step of the search depends only on the previous step, and "Monte Carlo" means that random choices are used rather than, say, a systematic grid search. Abbreviated MCMC.
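To make the "Markov chain" and "Monte Carlo" parts concrete, here is a minimal Metropolis sampler in Python, run on a toy one-dimensional target rather than on genealogies. It is a generic sketch and not LAMARC's sampler; all names and settings are hypothetical. It also discards a burn-in, since the search is deliberately started from an unreasonable value.

```python
import math
import random


def metropolis(log_target, start, n_steps, burn_in, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x = start
    samples = []
    for step in range(n_steps):
        # Monte Carlo: the proposal is a random perturbation.
        # Markov chain: it depends only on the current state x.
        proposal = x + rng.uniform(-step_size, step_size)
        log_ratio = log_target(proposal) - log_target(x)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal
        if step >= burn_in:  # record only after the burn-in period
            samples.append(x)
    return samples


def log_standard_normal(x):
    """Toy target: log density of a standard normal, up to a constant."""
    return -0.5 * x * x


samples = metropolis(log_standard_normal, start=10.0, n_steps=20000, burn_in=1000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(len(samples), mean, variance)
```

The sample mean and variance should land near 0 and 1, the values for the toy target, even though the search began far away at 10.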

• Maximizer. The part of LAMARC which analyzes stored trees or parameters to determine the best estimate of each parameter. Can be used in conjunction with profiling to estimate error bars for these parameters.

• Maximum likelihood estimate (MLE). The set of parameter values that maximizes the posterior likelihood (the sum over sampled genealogies of the probability of the data given the genealogy and the probability of the genealogy given the parameters). In other words, this is the best solution found by a likelihood run. Contrast with the most probable estimate, the best solution found in a Bayesian analysis.

• Migration rate. LAMARC estimates immigration rates (movement of breeding individuals into a population) and records them as M=m/mu, where m is the chance of immigration per individual per generation, and mu is the chance of mutation per site per generation. If you are interested in 4Nm instead, multiply our estimate of M by our estimate of Theta for the recipient population.
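The conversion to 4Nm mentioned under Migration rate works because M times Theta equals (m/mu) times (4N mu) for a diploid population. A small Python sketch follows; the function name and the values of N, m, and mu are hypothetical and serve only to check the arithmetic.

```python
import math


def four_nm(migration_rate_m_over_mu, theta_recipient):
    """Convert LAMARC's M = m/mu into 4Nm using the recipient population's Theta."""
    return migration_rate_m_over_mu * theta_recipient


# Hypothetical diploid population used to check the identity.
N, m, mu = 10_000, 1e-5, 1e-8
theta = 4 * N * mu  # Theta for diploid nuclear DNA
M = m / mu          # migration rate as LAMARC reports it
print(four_nm(M, theta), 4 * N * m)
assert math.isclose(four_nm(M, theta), 4 * N * m)
```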

• Mixed KS model. A microsatellite mutational model that assumes some proportion of changes are among random states (like K-Allele) and the remainder are among adjacent states (like Stepwise). May be a better fit than either model alone for some microsatellite data. LAMARC allows you to specify the proportion of stepwise to multistep changes or to allow the program to try to optimize this ratio as it runs. The parameter percent_stepwise is the proportion of stepwise changes; percent_stepwise=0 is the K-Allele model and percent_stepwise=1 is the Stepwise model.

• Most probable estimate (MPE). The highest point on the posterior probability curve for a given parameter. Effectively, the point that fell closest to the sampled parameter the highest number of times (think of the tallest bar in a histogram). In other words, this is the best solution found by a Bayesian run. Contrast with maximum likelihood estimate, the best solution found by a likelihood analysis.

• Mu. Neutral mutation rate per site or marker (depending on data type) per generation. If you have multiple data types, the relative mu rate must be set for each to ensure accurate co-estimation of your parameters. Please be careful in comparing LAMARC runs with results from other methods, as these often report mu per segment rather than mu per site.

• Parameter. In LAMARC parlance, a parameter is a numerical quantity that describes an aspect of a population or set of populations. LAMARC can co-estimate effective population sizes (Theta), migration rates, the recombination rate, and growth rates for one or more populations.

• Phase. Phase is information about whether or not several variants seen in the same individual are on the same chromosome of that individual, or different chromosomes. (In other words, whether they are part of the same haplotype.) If data is of "known phase" then we know which variants group together on the same chromosome; if it is of "unknown phase" we only know which variants are present, but not how they are allocated among the gene copies. Another way of saying this is that haplotype data is phase-known and genotype data is phase-unknown.

• Point probability. The height of the posterior probability curve at a given value of an estimated parameter. Point probabilities can be compared to other point probabilities on the same curve (for example, to find the most probable estimate), but are otherwise not as useful as integrating a section of that curve.

• Population. A group of organisms that are more or less freely interbreeding among themselves and isolated from other groups.

• Posterior likelihood. The probability of the observed genealogies at their best (MLE) parameter values, divided by their probability at the values which produced them ("driving values"). This is quoted because it can be useful in diagnosing a poorly performing run; if it is very large, the driving values were poor ones and the run should be extended until better ones can be found. It cannot be used in a likelihood ratio test between runs, because it is a ratio of two independent likelihoods, neither of which we can actually compute on its own. Posterior likelihoods are produced in a likelihood analysis.

• Posterior probability curve. A probability density function that describes the relative probabilities that the true value of a parameter is a particular value. Useful when comparing the relative probabilities of two or more possible values for the parameter, or when integrating to find the total probability that the true value of the parameter is within a range of values (as during profiling). Posterior probability curves are produced in a Bayesian analysis, and exported as curve files by LAMARC. See also Point probability.
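The two uses of a posterior probability curve described above (comparing point probabilities and integrating over a range) can be sketched in Python. The sketch assumes the curve has already been loaded into two lists of equal length, parameter values and densities; it does not read LAMARC's curve file format, and the triangular toy curve and all function names are hypothetical.

```python
def area_under_curve(xs, ys):
    """Trapezoid-rule area under a curve given as sorted points."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))


def probability_in_range(xs, ys, lo, hi):
    """Probability that the parameter lies in [lo, hi].

    Assumes lo and hi coincide with points on the curve.
    """
    inside = [(x, y) for x, y in zip(xs, ys) if lo <= x <= hi]
    return area_under_curve([x for x, _ in inside], [y for _, y in inside])


def most_probable_estimate(xs, ys):
    """Parameter value at the highest point of the curve."""
    return xs[ys.index(max(ys))]


# Toy curve: a triangular density on [0, 2] that peaks at 1.
xs = [i / 10 for i in range(21)]
ys = [1.0 - abs(x - 1.0) for x in xs]

print(area_under_curve(xs, ys))
print(probability_in_range(xs, ys, 0.5, 1.5))
print(most_probable_estimate(xs, ys))
```

The total area is 1, as required of a probability density function, while the individual heights are meaningful only relative to one another.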

• Prior. A LAMARC prior for a Bayesian run is a curve that describes your prior knowledge of the possible range of values for a given parameter. LAMARC uses only 'flat' priors, giving the parameter an equal chance to be anywhere within the allowed range, but the density of those priors may be either linear or logarithmic, depending on how the parameter is expected to vary.

• Probability Density Function. A curve that describes a probability distribution in terms of integrals (the area under a probability density function must be 1.0). More informally, it can be viewed as a 'smoothed out' version of a histogram.

• Profiles. A picture of the uncertainty in LAMARC's results. In a likelihood run, profiles are produced by holding one parameter constant at a series of different values, and for each value maximizing all other parameters. This is like taking a slice through the multi-dimensional surface to reveal the landscape with respect to one parameter. In a Bayesian run, profiles are produced by showing the probability curve for one parameter at a time; no information is available about how other parameters co-vary with the chosen one. In either case, percentile profiles show the results at percentiles of the distribution (for example, the 95% marks) while the less expensive fixed profiles show the results at arbitrary points chosen in advance.

• Recombination rate (r). LAMARC measures recombination rate as C/mu, where C is the rate of recombination per inter-site link per generation, and mu is the mutation rate per site per generation. Be careful in comparing this with other estimates, which are often of 4NC, or per-locus rather than per-site.

• Region (or 'Linked Segment Region'). A stretch of linked markers along a chromosome. The usual term is "locus", but a LAMARC region can contain what a geneticist might consider multiple loci (e.g. coding regions for several genes) as long as they are all linked and their map is known. In the program and in our documentation, we usually refer to this kind of region as a "linked segment region", because our regions can contain multiple linked segments.

• Replicate. An internal repetition of a LAMARC analysis used to produce a more refined result, and to measure whether the run is long enough to produce consistent results. In a likelihood analysis, when multiple replicates are run, their results are combined via Geyer's reverse logistic regression method to produce a joint estimate that should be more accurate than the individual ones. In a Bayesian analysis, when multiple replicates are run, the resulting posterior probability curves are simply averaged.

• Runtime report. The on-screen report showing each chain and its acceptance rates, parameter estimates, times, and so forth; also reprinted as the last entry in the output file.

• Segment (or Coherent Segment). A contiguous stretch of sites containing markers of all the same data type. Multiple segments can be linked together within a region, even when the data types differ.

• Site. A position along the chromosome in the area under study, whether or not you have observed any data at that position. Contrasts with "marker", which is a site at which you have observed data. Sites are important because we must consider every site as a possible location for a recombination, whether or not it is a marker; the total chance of recombination is proportional to the number of sites.

• Stepwise model. A mutational model that assumes mutations happen between adjacent states. It is generally used for microsatellites, and in this case the assumption is that microsatellites increase or decrease by one repeat unit at a time. (It may still work well even if this assumption is occasionally violated; if you believe it is often violated, the K-Allele or Mixed KS models may be more appropriate.) The Brownian model approximates Stepwise and runs much faster, but may break down if population sizes are very small.

• Theta. Population parameter controlling the amount of genetic diversity in a population, and equal to two times the neutral mutation rate per site times the number of heritable gene copies in the population. For a diploid population, the Theta for nuclear DNA is equal to 4N(mu).
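Given an external estimate of the mutation rate, the definition of Theta can be inverted to recover a population size. A minimal Python sketch follows; the function names and example values are hypothetical, and the result is an effective size, not a census size.

```python
def gene_copies_from_theta(theta, mu_per_site_per_generation):
    """Number of heritable gene copies, from Theta = 2 * mu * (gene copies)."""
    return theta / (2.0 * mu_per_site_per_generation)


def diploid_size_from_theta(theta, mu_per_site_per_generation):
    """Effective size N for diploid nuclear DNA, from Theta = 4 * N * mu."""
    return theta / (4.0 * mu_per_site_per_generation)


# Hypothetical values: Theta of 0.004 with mu = 1e-8 per site per generation.
print(gene_copies_from_theta(0.004, 1e-8))
print(diploid_size_from_theta(0.004, 1e-8))
```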

• Tip. A single observed sequence; generally a haplotype, though if you do not know the phase of your data you can make fictional haplotypes to use as initial tips. Called "tip" because it appears at one of the tips of the genealogical tree.

• Tree. An informal word for the coalescent genealogy which relates the sampled sequences. Formally, genealogies with recombination should be called Ancestral Recombination Graphs, but we frequently call them trees even though they are not strictly tree-like.