This tutorial is designed to be read from beginning to end, but if you like you can jump straight to:
Lamarc is a tool for studying populations. It can estimate:
First off, you need genetic data. This might be standard genetic data like DNA or SNP sequences, or microsatellite counts, or it could (with version 2.0) be observations with genetic underpinnings, like phenotypic observations or electrophoretic data. More data will result in more accurate results, but it's important to note that the addition of more individuals, after a certain point (10-15 per population), is much less helpful than the addition of more independent genetic regions. The history of a known region in an additional individual is probably similar to that of its cohorts. The history of a new unlinked region from the same (or new) set of individuals is almost certainly very different from that of the old region, and therefore more likely to hold new information.
Secondly, you need a computer. In particular, you probably need a computer with a lot of memory. A good LAMARC work-out can use hundreds of megabytes of RAM, and for particularly ambitious analyses with lots of data, gigabytes are probably called for. For learning purposes, LAMARC can run on more limited computers, if you're careful to give it a limited data set and not allow it to record too many observations.
Finally, you need some time. The total amount of time can vary from data set to data set, but it's not unusual for a complete, solid run of LAMARC to take a week or two. Larger data sets requiring full error analysis (profiling) can take a month or longer. However, it would be a mistake to throw all your data at LAMARC in January, then write to us in May asking when you could expect results. The answer would probably be a tragic, "LAMARC ran out of memory, and inexplicably failed to crash." Much better, instead, to test LAMARC on subsets of your data and get preliminary results in a day, which will then help you determine how best to analyze your full data set. This tutorial is designed to help you do just that.
One caveat: some enterprising souls, having lots of computers and little time, may think, 'Aha! I can solve my problem of having no time by running the parallelized version of LAMARC!' Which would be a great idea, if such a thing existed. There is a parallel version of MIGRATE, so if you only want to estimate effective population sizes and migration rates, that option is open to you. If your LAMARC run contains multiple regions or does multiple replicates, you can separate them into different files, run with summary files turned on and recombine them using poor man's parallelization . But if those options don't appeal to you, you're stuck waiting for single runs to finish. Sorry.
Hang on, let's take this slowly. If this is your first time using LAMARC, I strongly recommend using only a subset of your data, to get a feel for how the program works, first.
If you have multiple populations, let's select two of them to start with. If you know something about the migration patterns of the populace already, pick two populations you know share migrants. Then select around 10 individuals from each population. If you're only studying one population, pick 10-15 individuals. If you want accurate results, it's vitally important that you pick these individuals randomly from your data set, and not on the basis of them being the 'most different' or 'most interesting'. You can grossly distort your results should you rely on such a scheme.
Next, pick one region to analyze. If you have DNA or SNP data, that means one continuous stretch of sequence. If you have microsat data, that probably means just one microsatellite, though if you happen to have linked microsat data, it could be those linked microsats.
Armed with the knowledge of which data you want to analyze, it's now time to assault the converter. Lamarc 2.1's file conversion program can be used in both a batch mode and as a graphical tool. You can read about how to use the converter here.
You are ready to run LAMARC. The very first thing LAMARC asks you for is the name of a file, and the converter-created file is what it wants. You can edit the contents of that file if you wish (it's all XML), but it's probably much easier to change things in the menu. If you like, you can even enter the name of your file at the command line ('lamarc yourfile.xml'), and if you don't want to do anything in the menu, you can run LAMARC in 'batch mode' by adding a '-b' flag ('lamarc yourfile.xml -b').
You can read about the Bayesian capabilities of LAMARC in why you might want to run Bayesian-LAMARC and how you would do so. The latter picks up from right here. From here on, this tutorial focuses on the Likelihood-LAMARC path.
For a first crack at your data, yes. It'll use a lot of default values that are perhaps not the best choice for your data set, but they'll work for now. After you see how it runs, we'll get around to changing the defaults.
I'll be here.
After creating a tree, LAMARC is rearranging it to see if it can find better ones. Each time it does so, it increments the counter. Using the defaults, it first counts to zero from -1000 without sampling anything (this is the phase known as 'burn-in'), then it starts counting up to 10,000, sampling every 20th tree. Those trees will be used to estimate parameters, which you'll see in a moment.
It's busy calculating the max--
OK, your screen probably looks something like this:
15:39:34 Initial chain 1: [====================] 10000 steps 15:40:09 Predicted end of chains for this region: Tue Feb 8 16:06:43 2005 15:40:09 Accepted 7.39% | Posterior lnL 27.0287215 | Data lnL -627.243228 Trees discarded due to too many events: 2 Trees discarded due to too small population sizes: 0 Trees discarded due to an infinitesimal data likelihood: 0 Trees discarded due to extremely long branch lengths: 0 Tree-Arranger accepted 284/8268 proposals Tree-Size-Arranger accepted 455/1732 proposals Class Theta Population 1 0.039795 Population 2 0.065072 Population Mig Population 1 -------- 164.6673 Population 2 142.1566 -------- 15:40:09 Initial chain 2: [=| ] 35
Let's take a look at each piece in turn.
By default, LAMARC runs 10 'Initial' chains and 2 'Final' chains. The initial chains take 10,000 steps, and the final chains take 200,000 steps. This is LAMARC's attempt to solve a basic problem in likelihood analysis: LAMARC does a great search in the area right around the driving values, but not so great a search elsewhere. By using multiple chains, LAMARC moves to driving values as close to the true maximum as possible, so that it can find that maximum efficiently. The first 10 chains are an attempt to find the right neighborhood, and the first final chain is an attempt to narrow it down even further. The last final chain is used to actually estimate your parameters, and to provide support intervals for those estimates.
If you can see the message "Beginning profiling, please be patient", it means that LAMARC has entered a computationally-intense phase where it calculates the support intervals for its point estimates. We call this phase 'Profiling', since it provides profiles of the likelihood surface as seen from the perspective of each parameter in turn. After a while, your screen will look something like this:
03:44:15 Beginning profiling, please be patient 04:06:09 Finished profile 1 of 7. Predicted end of this set of profiles: Sat Apr 9 06:17:33 2005 04:36:23 Finished profile 2 of 7. Predicted end of this set of profiles: Sat Apr 9 06:46:43 2005 04:37:26 Finished profile 3 of 7. Predicted end of this set of profiles: Sat Apr 9 05:48:20 2005
It can take an extremely variable amount of time to finish profiling, but a good rule of thumb is 'about half again as long as it's taken so far'. LAMARC provides estimates of when it expects to be done as it finishes each profile in turn, predicting that each profile it has to go will take about as long as the the average profile so far. For an exploratory run like we've assembled here, a few hours is not unreasonable.
If you have more than one genetic region in your data, it first goes through the chains for the first region, then calculates the profiles for that region, then goes on to the next region. So what you're seeing is the beginnings of LAMARC calculating the data from the second region. Back when we were getting started, I suggested only using data from one region, and this is why--each region goes through the same process, which can be rather lengthy. But, no harm done--it just means a longer wait.
Let's go through the outfile a section at a time. More information on outfiles can be found in this section.
This first section contains the bulk of the most critical information about your parameters: their best estimates and their support intervals. Your outfile might look something like:
Theta | Migration Rate Population Theta1 Theta2 | M21 M12 Best Val (MLE) 0.005697 0.002163 | 716.8337 1478.605 Percentile | 99% 0.005 0.003973 0.001279 | 213.8190 551.3366 95% 0.025 0.004314 0.001424 | 301.2110 718.4079 90% 0.050 0.004503 0.001518 | 355.3201 816.1701 75% 0.125 0.004824 0.001706 | 451.1927 987.5201 50% 0.250 0.005162 0.001903 | 553.9211 1174.315 MLE 0.005697 0.002163 | 716.8337 1478.605 50% 0.750 0.006309 0.002432 | 902.0924 1832.927 75% 0.875 0.006793 0.002636 | 1047.330 2115.421 90% 0.950 0.007349 0.002867 | 1211.883 2438.987 95% 0.975 0.007735 0.003025 | 1324.255 2661.947 99% 0.995 0.008568 0.003364 | 1561.403 3136.894 Theta1: Theta for Pop1 Theta2: Theta for Pop2 M21: Migration rate into Pop1 from Pop2 M12: Migration rate into Pop2 from Pop1
There are four parameters LAMARC is estimating: two Thetas (population sizes) and two migration rates. The estimate can be read across the first line of numbers: the theta for population one is .005697, and is .002163 for population two. The migration rate estimate into population one is 716.8337, and is 1478.605 for the rate into population two.
The rest of the table is a result of choosing percentile profiling, and is what LAMARC was doing after the end of 'Final Chain 2'. Each line represents a percentile, or, values for which the true value of a parameter is that percentage likely to be lower than. So, for example, since the .025 percentile for Theta 1 is 0.004314, the true value for Theta 1 is 2.5% likely to be lower than that value, and 97.5% likely to be higher than that value. To find the 95% support interval, then, we take the lower 2.5% and the upper 2.5% percentiles, and report the spread between them--in this case, LAMARC is 95% certain that Theta 1 lies between .004314 and .007735. Similar confidence intervals can be found using this table for the 99%, 90%, 75%, and 50% confidence intervals, and can give you a better idea of LAMARC's picture of the data.
In this section, you are given many more details about the profiles listed in the 'Maximum Likelihood Estimates' section. The first section will always be "Overall Profile Tables", and since in this example we only used one genetic region, that's all we have. If we had used more than one region, we would start with this "Overall Profile Tables" again, which would be followed by "Regional Profile Tables", containing the same sort of information considering each genetic region separately in turn.
The first swath of data will look something like this:
Log Likelihoods: Percentile Theta1 | Ln(L) 0.005 0.003973 | -3.158082 0.025 0.004314 | -1.761258 0.050 0.004503 | -1.193397 0.125 0.004824 | -0.502098 0.250 0.005162 | -0.068077 MLE 0.005697 | 0.159471 0.750 0.006309 | -0.068013 0.875 0.006793 | -0.502098 0.950 0.007349 | -1.193396 0.975 0.007735 | -1.761353 0.995 0.008568 | -3.157959
The first two columns we have seen before, in the first section. Now, however, we are given the actual values for the posterior log likelihoods when that parameter (Theta 1, in this case) is constrained to be held at each particular value. The log likelihoods here are actually what determine the percentiles, instead of the other way around. If you assume a gamma distribution of probability, and know that your point of maximum likelihood has a log value of 0.159471 (as it does in the above table), then the points at which the likelihood has a log value of approximately -1.761 will be your .025 and .975 percentiles.
The next section gives us information about parameter correlation:
Best fit parameters with Theta1 held constant: Percentile Theta1 | Theta2 M21 M12 0.005 0.003973 | 0.002155 722.4041 1475.974 0.025 0.004314 | 0.002157 721.1656 1476.671 0.050 0.004503 | 0.002158 720.5119 1477.008 0.125 0.004824 | 0.002159 719.4664 1477.540 0.250 0.005162 | 0.002161 718.4165 1478.015 MLE 0.005697 | 0.002163 716.8337 1478.605 0.750 0.006309 | 0.002165 715.0486 1479.142 0.875 0.006793 | 0.002166 713.6108 1479.514 0.950 0.007349 | 0.002167 711.9080 1479.921 0.975 0.007735 | 0.002166 710.6590 1480.219 0.995 0.008568 | 0.002164 707.7885 1480.974
Once again, the first two columns are repeats of what we've seen before. But now, we get to see what happens to the rest of the parameters if we modify the first one. The basic idea here is that we hold Theta1 to a particular value, then allow our other parameters to vary as we find a new point of joint maximum likelihood (and it is this maximum likelihood that we report in the above table). In other words, LAMARC asks the data, "OK, I know that the best value for Theta1 is .005697. But if I was wrong, and the best value was at .003973, what would Theta2, M21, and M12 be?" In this case, the answer is "Not very different." There's a slight correlation between Theta1 and all three other parameters, but it's not very strong. Contrast the above table with this, taken from the same example file:
Percentile M21 | Theta1 Theta2 M12 0.005 213.8190 | 0.005733 0.001852 1545.118 0.025 301.2110 | 0.005727 0.001907 1525.316 0.050 355.3201 | 0.005721 0.001960 1508.983 0.125 451.1927 | 0.005711 0.002050 1487.323 0.250 553.9211 | 0.005704 0.002113 1479.440 MLE 716.8337 | 0.005697 0.002163 1478.605 0.750 902.0924 | 0.005692 0.002191 1480.951 0.875 1047.330 | 0.005689 0.002203 1482.800 0.950 1211.883 | 0.005685 0.002213 1484.520 0.975 1324.255 | 0.005683 0.002218 1485.435 0.995 1561.403 | 0.005678 0.002226 1486.743
Here, we see how varying the migration rate M21 affects the estimation of the other three parameters. In contrast with varying Theta1, which only caused variation in the third or fourth significant digit, we can see here that varying M21 causes variation in the second or third significant digit of the other parameters. Again, not huge, but more noticeable.
Sorry. OK, here's the basic questions these sections answer:
Since this is clearly too many digits of precision, you might round off to:
If you thought this was confusing and were willing to sacrifice precision, you could average this error to:
even though it's, you know, wrong. Also keep in mind that results from likelihood LAMARC are believed to estimate too-narrow confidence intervals due to its reliance on the driving values. Runs with replication get away from this problem.
Gladly! Actually the next bits don't need a lot of explanation--the next section ("User Specified Options") is a summary of your starting values, search strategy, and filenames. The next section ("Data summary") is an overview of what your data looked like, what data models you used, and an actual copy of the raw data. And the final section ("Run Reports by Region") is a copy of most of what was sent to the screen during the LAMARC run (sans the progress bar, of course).
Now you have decisions to make, and since these decisions will be based on your data, which I don't have with me just now, so you're going to have to make the majority of these decisions on your own. If you're up for another question and answer session, though, skip ahead to Analyzing the Rest of Your Data. Or if you're keen to discover how to do a Bayesian run, read on.(Previous | Contents | Next)