Monday, February 28, 2011

Parsimony Tree for I1 clade

The PHYLIP package (code is here) has many tools for constructing phylogenetic trees from genetic data. One such program is called DNAPENNY. This program constructs a parsimony tree. A parsimony tree is the tree that explains a given dataset with the minimum number of mutations. Since the mutation rates are small for Y-DNA STR data and even smaller for SNP data, the tree (or trees) with the fewest required mutations will generally be the correct tree. PHYLIP has many other useful programs, but we will just discuss parsimony trees here.

The trouble with PHYLIP is that it is made for regular DNA sequences, that is, strings of GATC characters rather than microsatellite repeat counts. David Hamm suggested converting the microsatellite data to GATC format, and he provides a program for Windows computers. I don't have Windows, so I didn't try his program, but I wrote my own version.

The conversion algorithm looks like this. First, let's ignore unequal mutation rates and SNP data, which we will discuss in a moment.

Go through each column (or allele) for a set of haplotypes (e.g. people) and find the minimum and maximum value. If they are the same, you can ignore this column since all the values agree and it is non-informative from a parsimony perspective. Now assume they are different. For example, the minimum might be 13 and the maximum might be 16. This then requires three GATC characters. In general, if the difference is N, it requires N characters. It doesn't matter which characters you use. I use A to mean a count of zero and G to mean a count of 1, so that 13 would be AAA, 14 would be AAG, 15 would be AGG and 16 would be GGG. Now you can see that the number of A-to-G changes is in direct correspondence with the difference between the marker values.

Missing data can be handled by almost all PHYLIP programs. Missing values are denoted by the X character, so that if this column is missing for some haplotypes, it codes to XXX (or however many X characters this column requires).
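
To make the conversion concrete, here is a minimal Python sketch of the encoding just described (the function names are only illustrative, not my actual program):

def encode_column(values):
    """Unary-encode one column: A = 0 steps above the minimum, G = 1 step, X = missing (None)."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = hi - lo                        # characters needed for this column
    if width == 0:
        return [""] * len(values)          # non-informative column: drop it
    out = []
    for v in values:
        if v is None:
            out.append("X" * width)                      # missing data
        else:
            n = v - lo
            out.append("A" * (width - n) + "G" * n)      # 13 -> AAA, 14 -> AAG, ..., 16 -> GGG
    return out

def encode_haplotypes(columns):
    """columns: one list of marker values per column; returns one GATC/X string per haplotype."""
    encoded = [encode_column(col) for col in columns]
    return ["".join(col[i] for col in encoded) for i in range(len(columns[0]))]

# For example, encode_column([13, 16, None]) gives ['AAA', 'GGG', 'XXX'].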

Finally, we need to address how to treat varying mutation rates and also SNP data. We treat SNP data the same way as STR data except that it is either on (G) or off (A). We should also use mutation rates for SNPs that are much, much smaller than for STRs so that they take precedence in the parsimony criterion. This brings us to how to treat unequal mutation rates.

Assume for the moment that we have just two different mutation rates: a fast one, say 0.0100, and a slow one, say 0.0025. Note that the fast one is 4 times faster, so it should mutate roughly 4 times as frequently. Basically, we would like to give the slow allele about 4 times as much weight: one mutation at the slow allele should be roughly as parsimonious as four mutations at the fast one.

This is easy enough to do. Just multiply the data in the slow column by 4. Now a change of one repeat in this column shows up as four mutations and therefore receives the correct weighting. What makes this correct, mathematically speaking, is that (1-mu)^4 is approximately equal to 1-4*mu when mu is a small number. (This could be made more exact.)

To generalize this, find the fastest mutation rate, MU_FAST, and set the column weight to MU_FAST/mu, rounded to the nearest whole number. This rounding introduces some inexactness into the process, but mutation rates are not known that accurately anyway. SNPs have much, much smaller mutation rates than STRs, and if we used the exact values, we would get enormous weights for those columns, resulting in very long strings of GATC characters. In practice, I find that using mutation rates for SNPs that are only a few times smaller than the smallest STR rate works well enough to guarantee that the SNP structure is respected.
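
In code, the weighting can be folded into the same conversion by scaling each column before the unary encoding; roughly like this (continuing the sketch above, with illustrative names):

def column_weight(mu, mu_fast):
    """Weight = fastest rate divided by this column's rate, rounded to the nearest whole number."""
    return max(1, round(mu_fast / mu))

def apply_weight(values, weight):
    """Scale a column so that a change of one repeat costs `weight` mutations in the encoding."""
    return [None if v is None else v * weight for v in values]

# With mu_fast = 0.0100 and mu = 0.0025, column_weight gives 4, so a one-step
# difference in that column becomes four A/G characters after encoding.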

Without using unequal mutation rates or SNPs, we find that a typical 67-marker haplotype codes into about 100 characters. For example, when applying this to the set of modals for subclades of clade I1, I find the modal for the subclade I1d-NuN14 (Ken Nordtvedt's subclades) codes into

AGAGGGAAAAGAGGAGGGGAGAGAGAGGAAGGAGGAGAGAGAGAAGGGAGAAGAAAGGGGAAGGAAGGAGAAGGAAAGAAGGGAAAGGAAGAGAAGAGAG

When I include unequal mutation rates and also SNPs, I find it requires about 6630 characters per line. That is after neglecting the two fastest alleles, CDY-a and CDY-b, but including the rest. If you skip a few more of the fastest ones, you can get some more compression.

Nonetheless, PHYLIP programs like DNAPENNY run fairly quickly on such data.

DNAPENNY can search every possible tree and find the most parsimonious one. However, the time required increases exponentially with the number of haplotypes. It can generally finish successfully in 5-6 minutes on my laptop for about 15 haplotypes; beyond that it takes longer. You can tell it to stop after searching a fixed number of trees, the default being 10 million. In that case, it is not guaranteed to find the most parsimonious tree, but I believe it comes very close.

When it finishes, it usually reports many trees with equal parsimony scores. You can then run a program called CONSENSE, which finds the consensus tree.

So, to conclude: I start with a spreadsheet containing the 67-marker STR haplotype data, the mutation rates, and columns for the relevant SNPs. I save this to a .csv file which my program can read. My program converts this to a file that is readable by PHYLIP programs. I copy that to a file called infile. I run DNAPENNY at the unix command line and wait a few minutes for it to finish or quit. Then I copy the output file called outtree to one called intree. Then I run CONSENSE at the command line. It finishes very quickly and produces a file called outfile.
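
For what it's worth, the file shuffling can be scripted. The sketch below assumes the PHYLIP executables are on the PATH, that they will accept their default menu settings when fed a Y, and that the converted data sit in a file I'll call converted.phy (that name is just for illustration):

import os, shutil, subprocess

def clean(*names):
    # remove leftover PHYLIP output files so the programs don't pause to ask about them
    for name in names:
        if os.path.exists(name):
            os.remove(name)

shutil.copy("converted.phy", "infile")     # PHYLIP programs read "infile" by default
clean("outfile", "outtree")
subprocess.run(["dnapenny"], input="Y\n", text=True, check=True)   # Y accepts the default settings

shutil.copy("outtree", "intree")           # CONSENSE reads trees from "intree"
clean("outfile", "outtree")
subprocess.run(["consense"], input="Y\n", text=True, check=True)

print(open("outfile").read())              # the consensus tree report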

The following is an example of such an outfile. This was run on the I1 subclade modals computed by Ken Nordtvedt, found here.

It seems to do a very good job constructing a tree that respects both the SNP mutations and Ken's naming conventions. In fact, even without including the SNPs, it does a very good job, though there are slight changes. The numbers at the nodes are from CONSENSE; they give the fraction of the minimum-parsimony trees that contain that same node. This run of DNAPENNY did not finish, so this may not be the exact most parsimonious tree, but it should be close.



Here is another tree for my close matches.



I also tried the MEGA program. It was a little buggy but works pretty well, and the trees look prettier. It has its own parsimony program, which I ran on the same data. The result is below.

Wednesday, February 16, 2011

Applying priors on TMRCA based on surname



The match or non-match of a surname can be used as a prior to improve the accuracy of TMRCA calculations. The TMRCA calculation just computes a likelihood based on the DNA data, but the surnames carry more information. If you have different surnames, it is unlikely that your TMRCA is in the last 500 years. Or, to turn that around in a Bayesian fashion: if your TMRCA is within the last 500 years, it is unlikely that your surnames will be different, while if your TMRCA is, say, 2000 years ago, it is very likely that your surnames will be different. However, there is some chance that they will end up the same by luck.

So I have worked out a simple prior that encodes this information and applied it to two people who have similar likelihoods, but one is from Sweden (different surname) and one is from Scotland with my Johnston surname. See the plot above. The black curve is the likelihood from the DNA data. The red and green solid curves show the priors for a surname match (green) and non-match (red). The dashed blue lines show the product of the likelihood and the surname prior, i.e. the result depending on whether the surnames match or not. I chose 1200 years ago as roughly the time at which people with a TMRCA of that date would have a 50-50 chance of ending up with the same surname, and chose an effective width for this sigmoid function of roughly 100 years. I also decided not to let this function get too low at distant times, since there is always some chance that you end up with the same surname by luck; I picked 10% for that parameter. Clearly one can fiddle around with these parameters.
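
A short Python sketch of such a prior (the particular sigmoid form and the names here are just one way to implement what is described above):

import numpy as np

T_HALF = 1200.0   # years ago at which a shared surname is a 50-50 proposition
WIDTH = 100.0     # effective width of the sigmoid, in years
FLOOR = 0.10      # chance of sharing a surname by pure luck at distant times

def p_same_surname(t):
    """Probability of sharing a surname given a TMRCA of t years ago."""
    s = 1.0 / (1.0 + np.exp((t - T_HALF) / WIDTH))   # ~1 for recent TMRCA, ~0 for distant
    return FLOOR + (1.0 - FLOOR) * s

def posterior(t, likelihood, surname_match):
    """Multiply the DNA likelihood by the surname prior and renormalize on the grid t."""
    prior = p_same_surname(t) if surname_match else 1.0 - p_same_surname(t)
    post = likelihood * prior
    return post / post.sum()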

The effect of the prior is to move the TMRCA for the Johnston match to slightly more recent years and to trim the tail at more distant years. The effect on the person from Sweden is the opposite. Just from the surname and geographical difference, we are unlikely to have a TMRCA within the period of surname creation, and that fact moves the TMRCA to more distant times. The Johnston person's maximum a posteriori TMRCA ends up at about half that of the person from Sweden.

Purpose of the blog

This blog is just a place to keep track of and share some notes on genealogy.