Monday, February 28, 2011

Parsimony Tree for I1 clade

The PHYLIP Package (code is here ) has many tools for constructing phylogenic trees from genetic data. One such program is called DNAPENNY. This program constructs a parsimony tree. A parsimony tree is the tree that results in a given dataset using the minimum number of mutations. Since the mutation rates are small for Y-DNA STR data and even smaller for SNP data, the tree (or trees) with the fewest required mutations will generally be the correct tree. PHYLIP has many other useful programs but we will just discuss parsimony trees here.

The trouble with PHYLIP is that it is made to be used for regular DNA sequences; that is strings of GATC characters not microsatellite numbers. David Hamm suggested that we should convert the microsatellite data to GATC format. He also provides a program for Windows computers. I don't have Windows so didn't try his program but I wrote my own version.

The conversion algorithm looks like this. First, lets ignore unequal mutation rates and SNP data which we will discuss in a moment.

Go through each column (or allele) for a set of haplotypes (e.g. people) and find the minimun and maximum value. If they are the same, you can ignore this column since they are all the same and so non-informative from a parsimony perspective. Assume they are different. For example, the minimum might be 13 and the maximum might be 16. This then requires three GATC characters. In general, if the difference is N, it requires N-1 characters. It doesn't matter which characters you use. I use A to mean a count of zero and G to mean a count of 1 so that 13 would be AAA, 14 would be AAG and 15 would be AGG and 16 would be GGG. Now you can see that the number of A-G transitions is in direct correspondence with the difference between the markers.

Missing data can be handled by almost all PHYLIP programs. Missing values are denoted by the X character so that if this column is missing for some haplotypes, it codes to XXX (or however many X are required for this column).

Finally, we need to address how to treat varying mutation rates and also SNP data. We treat SNP data the same way as STR data except that it is either on (G) or off (A). We also should use mutation rates for SNPs that are much, much smaller than STRs so that they take preference in the parsimony criteria. This brings us how to treat unequal mutation rates.

Assume for the moment that we have just two different mutation rates: the fast one, say 0.0100 and the slow one, say 0.0025. Note that the fast one is 4 times faster. It should occur roughly 4 times as frequent. So basically we would like to give the slow allele about 4 times as much weight. One mutation at the slow allele should be roughly as parsimonious as four mutations at the fast one.

This is easy enough to do. Just multiply the data in the slow column by 4. So now, a change by one marker in this column shows up as four mutations and therefore receives the correcting weighting. What makes this correct mathematically speaking is that (1-mu)^4 is approximately equal to 1-4*mu when mu is a small number. (This could be made more exact.)

To generalize this, find the fastest mutation rate, MU_FAST, and set the column weight to MU_FAST/mu, rounded to the nearest whole number. This rounding business introduces some inexactness into the process but mutation rates are not know that accurately anyway. SNPs have much, much smaller mutation rates than STRs and if we used the exact values, we would get enormous weights for those columns resulting in very long strings of GATC characters. In practice, I find that using mutation rates for SNPs that are only a few times smaller than the smallest STR rate works well enough to guarantee that the SNP structure is respected.

Without using unequal mutation rates or SNPs we find that a typical 67 allele set of haplotype codes into about 100 characters. For example, when applying this to the set of modals for subclades of clade I1, I find the modal for the subclade I1d-NuN14 (Ken Nordtvedt's subclades) codes into

AGAGGGAAAAGAGGAGGGGAGAGAGAGGAAGGAGGAGAGAGAGAAGGGAGAAGAAAGGGGAAGGAAGGAGAAGGAAAGAAGGGAAAGGAAGAGAAGAGAG

When I include unequal mutation rates and also SNPs, I find it requires about 6630 characters per line. That is after neglecting the two fastest alleles CDY-a and CDY-b but including the rest. If you skip a few more of the fastest ones, you can get some more compression.

Nonetheless, PHYLIP programs like DNAPENNY run fairly quickly on such data.

DNAPENNY can search every possible tree and find the most parsimonious one. However, the time required increases exponentially with the number of haplotypes. It can generally finish successfully in 5-6 minutes on my laptop for about 15 haplotypes. Beyond that it takes longer. You can tell it to stop after searching a fixed number of tress, the default being 10 million. In that case, it is not guaranteed to find the most parsimonious tree but I believe it comes very close.

When it finishes, it usually has many tress with equal parsimony scores. You can then run a program called CONSENSE which finds the consensus tree.

So to conclude, I start with a spreadsheet with the 67 marker STR haplotype data, the mutation rates as well as columns for the relevant SNPs. I save this to a .csv file which my program can read. My program converts this to a file that is readable by PHYLIP programs. I copy that to one called infile. I run DNAPENNY at the unix command line and wait a few minutes for it to finish or quit. Then I copy the output file called outtree to one called intree. The I run CONSENSE at the command line. It finishes very quickly and then I have one called outfile.

The following is an example of such an outfile. This was run on the I1 subclade modals computed by Ken Nordtvedt found here .

It seems to do a very good job constructing a tree that respects both the SNP mutations and also Ken's naming conventions. In fact even without including the SNPs, it does a very good job though there are slight changes. The numbers at the nodes are from CONSENSE. They tell the fraction of the total minimum parsimony trees that have that same node. This run of DNAPENNY did not finish so this may not be the exact most parsimonious tree but it should be close.



Here is another tree for my close matches



I also tried the MEGA program. It was a little buggy but works pretty well. The trees looks prettier. It has it's own Parsimony program which I ran on the same data. The result is below.

2 comments:

  1. Hi David,

    I must say that it's a pleasure to see Ken's clades in tree form. Simplifies things. Nice summary concept. I like it.

    I've been working on something along the same lines in the event that Rollo's DNA turns out to be I1. (For my DNA Project folks.) A tree helps visualize things.

    However good it is to see, I would have done it differently. (I am not fond of parsimony.)

    For an easier method, I would have suggested taking the clade modals and plugging them into Dean McGee's utility. Using the PHYLIP output, it could then be run through the kitsch program and displayed with the MEGA software. (The MEGA software, by the way, can build a consensus tree from the main menu.)Easier, but I can't say that the results would be much better.

    The graphic output from MEGA can be edited with a paint program to enhance the labeling. (It would be prettier, for what that's worth.)

    With a paint program, for example, you could label where I1 is on the tree. And perhaps label country of origin as well as other footnotes or things of that nature.

    Nice work. I have enjoyed viewing this.

    ReplyDelete
  2. Thanks. I tried the KITSCH program before I did the parsimony approach. I believe that parsimony is better. KITSCH just uses pairwise distances. There is more information about the structure of a tree that isn't in pairwise distances. Plus, you can impose the constraints of SNPs through parsimony. I don't know how to do that with KITSCH but maybe there is some way.

    I should try KITSCH on these same modals to see how it differs.

    I would guess that you could use MEGA on the outputs of DNAPENNY just as well as KITSCH since the tree formats of PHYLIP outputs tend to be uniform. I would be happy to send you the outputs if you want.

    ReplyDelete