Principles of Evolution Web Page
                        Lectures
 



LECTURES 20 and 21:    PHYLOGENY RECONSTRUCTION

I.  The Problem

    A.  How do we determine which lineages (taxa) are most closely related to which other lineages?

    B.   How do we construct a phylogeny, which portrays the relationships between different lineages?

        1.  Intuitive notion is that taxa that are more similar to each other morphologically, behaviorally,
             and physiologically are likely to be more closely related than taxa that are less similar in
             these characters.

        2.  This notion is based on the principle that two taxa that are recently descended from a common
             ancestral lineage will not have diverged much genetically and therefore should be similar
             morphologically, behaviorally, and physiologically.

        3.  But there are several problems with using overall similarity as a measure of relatedness.

             a.  What characters does one use to measure similarity?
             b.  How does one measure degree of similarity or dissimilarity?
             c.  Similarity may reflect convergence rather than recent descent from a common ancestor.

II.  Approaches to Phylogeny Reconstruction

    A.  Phenetic or Distance approaches.

        1.  These approaches use some metric of divergence (distance) separating pairs of taxa as a basis for
             grouping similar taxa.

        2.  One example of such a metric is Nei's genetic distance, which was discussed in conjunction with
             Coyne and Orr's study of speciation in Drosophila.

        3.  Basic approach

             a.  Calculate all pairwise distances between taxa using the chosen metric
             b.  Group together the two taxa that exhibit the smallest distance--these are considered to
                  be the two most closely related taxa.
             c.  Consider the two combined taxa to represent one new taxon, and calculate the distance
                 of this new taxon to all the other taxa (e.g. calculate the average of the distances to each
                 of the two taxa grouped together)
             d. Group the next two taxa exhibiting the smallest distance together to form the next closely
                 related group.
             e.  Repeat steps c. and d. until all taxa have been grouped.

        4.  Disadvantages:  Sensitive to similarities due to convergent evolution
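
        The grouping procedure in steps a-e can be sketched in Python as follows. This is a
        minimal illustration, not a definitive implementation: the taxon names W, X, Y, Z and
        their pairwise distances are hypothetical, and the distance from a merged group to any
        other group is taken to be the simple average of the distances to its two members, as
        suggested in step c.

```python
from itertools import combinations


def group_by_distance(taxa, distances):
    """Repeatedly merge the two closest groups; return the grouping as nested tuples.

    distances maps frozenset({a, b}) to the distance between taxa a and b.
    """
    d = dict(distances)  # working copy; entries for merged groups are added to it
    groups = list(taxa)
    while len(groups) > 1:
        # Steps b and d: find the pair of groups separated by the smallest distance.
        a, b = min(combinations(groups, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        rest = [g for g in groups if g != a and g != b]
        # Step c: the distance from the merged group to each remaining group is
        # the average of the distances to its two members.
        for g in rest:
            d[frozenset((merged, g))] = (d[frozenset((a, g))] + d[frozenset((b, g))]) / 2
        groups = rest + [merged]
    return groups[0]


# Hypothetical pairwise distances among four taxa (illustration only).
example_distances = {
    frozenset(("W", "X")): 2.0,
    frozenset(("W", "Y")): 6.0,
    frozenset(("X", "Y")): 6.0,
    frozenset(("W", "Z")): 10.0,
    frozenset(("X", "Z")): 10.0,
    frozenset(("Y", "Z")): 10.0,
}

tree = group_by_distance(["W", "X", "Y", "Z"], example_distances)
print(tree)
```

        With these made-up distances, W and X are grouped first, Y joins them next, and Z is
        added last.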

    B.  Parsimony Approach

        1.  Examines all possible phylogenetic trees and calculates for each one the minimum number of
             character-state changes required to explain observed character-state distribution among taxa

        2.  Chooses tree with the minimum number of required changes as the most likely tree.

        3.  Example

             a.  Consider three taxa, I, II, and III, which exhibit character states A, A, and B, respectively.
             b.  For three taxa, there are three possible topologies for a phylogeny:
 


 

           c.  Usually, to distinguish among these three topologies, we employ an outgroup, which is a
                fourth taxon that we can be reasonably sure from other evidence is more distantly related
                to each of the other three taxa than any of those taxa are related to each other.
           d.  Let us suppose the character state of this outgroup (designated by O) is B. This gives the following three
                topologies to choose from:
 

           e.  Let us now consider each in turn, and ask what is the minimum number of character
                transitions required to account for the observed distribution of character states among
                taxa.
           f.  For the first topology, only one transition is required, a transition from B to A, at the indicated
               position.
           g.  For the second topology, a minimum of two transitions are required:

                i.  either B to A in each of the final lineages leading to taxa I and II, or
                ii.  B to A in the lineage between the common ancestor and the point at which taxon II
                     branches off and then a reversion, A to B, in the final lineage leading to taxon III.
           h.  Similarly,  a minimum of two transitions are required for the third topology.
           i.   Consequently, we deem topology 1 more likely because it requires fewer evolutionary
                transitions to produce the observed distribution of character states.
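
        The counting in steps e-i can be checked with a short Python sketch. It uses Fitch's
        algorithm, a standard way of finding the minimum number of changes on a fixed tree,
        which the lecture itself does not name. The nested-tuple encoding of the trees is an
        assumption made here, as is the assignment of the labels "topology 2" and "topology 3"
        to the two trees in which I and II are not sister taxa.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes) for a rooted tree.

    A tree is either a taxon name (a leaf) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change is needed here.
        return shared, left_changes + right_changes
    # The subtrees cannot agree: one additional change is required.
    return left_set | right_set, left_changes + right_changes + 1


# Character states from the example: taxa I and II show A, taxon III and outgroup O show B.
states = {"I": "A", "II": "A", "III": "B", "O": "B"}

# The three ingroup topologies, each rooted with the outgroup O.
topologies = {
    "topology 1": ((("I", "II"), "III"), "O"),
    "topology 2": ((("I", "III"), "II"), "O"),
    "topology 3": ((("II", "III"), "I"), "O"),
}

changes = {name: fitch(tree, states)[1] for name, tree in topologies.items()}
best = min(changes, key=changes.get)
print(changes)
print(best)
```

        Under these assumptions the sketch requires one change for topology 1 and two for each
        of the others, in agreement with steps f-h.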

    C.  Maximum likelihood approach

       1.  This approach is usually applied to amino acid or nucleotide sequence data.

       2.  It assumes an explicit model of how amino acid or nucleotide substitutions occur.

       3.  Before attempting to understand this approach, we need to understand the general principles of
            likelihood analysis.

III. The Principle of Maximum Likelihood.

    A.  Example of a statistical problem: Estimating the probability a flipped coin will come up heads.

        1.  A fair coin will, when flipped, land heads-up 50% of the time.

        2.  However, how do we know whether any particular coin is fair?

        3.  This question is equivalent to asking what the probability is, for that coin, of landing heads-up on
             a coin toss.

    B.  Estimating probability of heads

        1.  Suppose we take our coin and flip it N times, and we get n1 heads.

        2.  Suppose further that the true probability of obtaining a heads is p.

        3.  What is the probability, given p , that we would obtain exactly n1 heads in N flips?

        4.  Statistically, this is the same problem as estimating the probability of obtaining n1  black
             marbles in a sample of N marbles drawn from an urn in which a fraction p of the marbles are black, a
             problem encountered in our consideration of changes in gene frequency due to drift.

        5.  The solution is the same: writing n for the number of heads (n1 above), the probability is
             given by the binomial distribution:

                   Prob[n heads in N flips] = [N! / (n! (N-n)!)] p^n (1-p)^(N-n)

        6.  This is our probability model.  Actually, it is a family of models: each possible value of p corresponds
             to a distinct model.

        7.  The problem of estimating the probability of heads is then equivalent to the problem of choosing the
             most likely value of p, which in turn is equivalent to choosing the most likely individual model from the
             family of models.

        8.  We need some criterion to make this choice. To describe this criterion, we define the Likelihood of a
             model as

              L(model | n heads in N flips ) = L(p | n heads in N flips ) = Prob[ n heads in N flips | model]

        9.  We then say that the most reasonable model is the one with the largest likelihood, i.e. the one for
             which the probability of obtaining the data is highest.

    C.  Numerical Example

        1.  Suppose that in 10 flips of a coin, 7 heads and 3 tails were obtained.

        2.  We can calculate the likelihood for a series of possible values of p.

        3.  Suppose we assume p = 0.5.

        4.  Then L(p=0.5 | 7 heads in 10 flips) = 0.1172

        5.  Suppose we assume p=0.3.

        6.  Then L(p=0.3 | 7 heads in 10 flips) = 0.0090.

        7.  Suppose we assume p = 0.7.

        8.  Then L(p=0.7 | 7 heads in 10 flips) = 0.2668

        9.  In fact, we can calculate the likelihood for any value of p:
 
 

   [FIGURE: likelihood of p, given 7 heads in 10 flips, plotted against p]

        10.  From the figure, it can be seen that the value of p with the highest likelihood, i.e. the model with
               the highest likelihood, is p=0.7.

        11.  Hence, our best estimate of the probability that the coin will turn up heads is 0.7.
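
        The numerical example can be reproduced with a few lines of Python. This is a sketch
        under simple assumptions: the function name is illustrative, and the maximum is located
        by scanning a grid of candidate values of p in steps of 0.01 rather than by calculus.

```python
from math import comb


def likelihood(p, n_heads, n_flips):
    """Binomial probability of n_heads in n_flips when the probability of heads is p."""
    return comb(n_flips, n_heads) * p**n_heads * (1 - p) ** (n_flips - n_heads)


# Likelihoods for the three values of p considered in the text (7 heads in 10 flips).
for p in (0.5, 0.3, 0.7):
    print(p, round(likelihood(p, 7, 10), 4))

# Scan candidate values p = 0.00, 0.01, ..., 1.00 and keep the most likely one.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: likelihood(p, 7, 10))
print(best_p)
```

        On this grid the likelihood is largest at p = 0.7, the observed fraction of heads.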

    D.  Conclusion.

        1.  The Maximum Likelihood criterion provides a method for choosing among possible models that
             differ in the value of one or more parameters (in the coin-toss example, there was only one parameter,
             p).

        2.  The criterion says: calculate the likelihood of the model given the data, which is just equal to the
             probability of obtaining the data under the assumptions of the model, and choose the model with
             the highest likelihood.

        3.  This is the model that has the highest probability of generating the observed data.

IV.  Phylogenetic Reconstruction using Likelihood

    A.  Data

        1.  For likelihood analyses, the data takes the form of either nucleotide sequences or amino acid sequences

        2.  Example:  The following are the first 50 nucleotides of the gene Chalcone Synthase, which codes for the
             first enzyme in the anthocyanin pigment pathway in plants:

          ATGGTGACCGTCGAGGAAGTACGGAAGGCGCAACGTGCCCAGGGTCCGGC
          ATGGTGACTGTTGAGGAGGTTCGTAGGGCTCAGAGGGCTGAGGGACCGGC
          ATGGTGACAGTCGAGGAGTATCGTAAGGCACAACGTGCTGAAGGTCCAGC

            Each row gives the sequence for a different species: 1st row--morning glories, 2nd row--snapdragon,
            3rd row--petunia.  A phylogeny could be constructed from this data.

        3.  The overall data consists of a set of data partitions

            a.  For nucleotide sequences, a data partition corresponds to a single nucleotide position in
                 the sequence, and consists of a vector whose elements correspond to the nucleotide at
                 that position in the different taxa.
            b.  The first partition of the above dataset is the vector (A,A,A).  The second partition is the
                 vector (T,T,T), etc.
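
        The idea of a data partition can be illustrated with the Chalcone Synthase alignment
        above. The following Python sketch assumes the three sequences are already aligned and
        of equal length; the variable names are illustrative.

```python
# The three aligned sequences from the example, keyed by species.
alignment = {
    "morning glory": "ATGGTGACCGTCGAGGAAGTACGGAAGGCGCAACGTGCCCAGGGTCCGGC",
    "snapdragon":    "ATGGTGACTGTTGAGGAGGTTCGTAGGGCTCAGAGGGCTGAGGGACCGGC",
    "petunia":       "ATGGTGACAGTCGAGGAGTATCGTAAGGCACAACGTGCTGAAGGTCCAGC",
}

# zip(*...) walks the sequences in parallel and yields one tuple per nucleotide
# position; each tuple is one data partition.
partitions = list(zip(*alignment.values()))

print(len(partitions))  # number of positions in the alignment
print(partitions[0])    # first data partition
print(partitions[1])    # second data partition
```

        Each element of partitions holds the nucleotides of morning glory, snapdragon, and
        petunia, in that order, at one position.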

    B.  The method in outline, with Example:  three taxa, two "nucleotides" per site (for a detailed description
          of the method, visit the Handout)
 

        1.  Objectives:

            a.  List all possible tree topologies
            b.  Calculate likelihood of each tree given the data
            c.  Identify tree with highest likelihood as most likely phylogeny

        2.  Likelihood of tree topology given all the data is product of likelihoods for individual data partitions;
             therefore, start by calculating likelihood of each tree for each data partition.

            a.  Suppose the data are :

                    Taxon 1       A B B A . . .
                    Taxon 2       A B A A . . .
                    Taxon 3       B B A B . . .

            b.  First data partition is A,A,B

        3.  With three taxa there are three possible tree topologies for this data partition:
 

        4.  For each topology there are four possible "arrangements" of  unknown ancestral states i and j.  For
             example, for tree topology 1, the four possible arrangements are:
 

        5.  The likelihood of a given topology is the weighted average of the likelihoods of its four possible
             arrangements; we will not here calculate this average (see Handout for details).  Instead, we will
             describe how the likelihood of an individual arrangement is calculated.

        6.  The likelihood model: the likelihood of an arrangement

            a.  Assuming evolutionary substitution is a Poisson process, the probability A has been replaced
                 by B after t time units is

                            P[A->B|t] = 1/2 - (1/2) e^(-r t)

                 where r is probability per unit time that A is replaced by B.
            b.  Similarly, the probability A has not been replaced by B after t time units is

                            P[A->A|t] =  1/2 + (1/2) e^(-r t)

            c.  Consider the following arrangement of topology 1 for the first data partition:

            d.  This arrangement may be characterized in the following way:

                        P[arrangement| A,A,B] = P[A->B|t1]  x  P[A->A|t2]  x  P[A->A|t3]  x  P[A->A|t4]

                                        =  [1/2 - (1/2)e^(-r t1)] [1/2 + (1/2)e^(-r t2)] [1/2 + (1/2)e^(-r t3)] [1/2 + (1/2)e^(-r t4)]

            e.  The probabilities of the other arrangements of tree topology 1 are calculated in similar
                 fashion and are averaged to give the probability of tree topology 1 given the first data
                 partition.

              f.   But, since the likelihood of the topology given the data partition is just the probability of
                   the data given the topology, that average also gives the likelihood of tree topology 1 given
                   the first data partition.

            g.  When this average is calculated (see Handout for details), we find that the likelihood of tree
                 topology 1 is

                                L[topology 1|A,A,B] = (1/8) [2 - e^(-r(2 t2 + t3))]

            h. Similarly, the likelihoods of tree topologies 2 and 3 are

                                L[topology 2|A,A,B] = L[topology 3|A,A,B] =  (1/8) [1 - e^(-2 r t3)]

            i.  Normally, one would then find the combination of parameters r, t2 and t3 that maximizes the likelihood
                for each topology, then choose the topology with the highest likelihood.  In this case, however, one can
                compare the likelihoods directly:   [2 - e^(-r(2 t2 + t3))] > 1 and [1 - e^(-2 r t3)]  < 1, so tree topology 1
                has the highest likelihood, given data partition 1.

            j.  This example illustrates the general rule for any informative data partition (i.e. any partition
                in which two of the taxa exhibit one nucleotide and the other taxon exhibits the other
                nucleotide): the topology in which the two taxa with identical nucleotides are most closely
                related has the highest likelihood.  The data partition is said to "favor" that topology.
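
            The two-state model in items a-i can be written out as a short Python sketch. The
            function names are illustrative, and the values of r, t2 and t3 are hypothetical
            numbers chosen only to show the comparison in item i. The two topology likelihoods
            are coded exactly as stated in items g and h; the averaging over arrangements that
            produces them (see Handout) is not reproduced here.

```python
import math


def prob_change(r, t):
    """P[A->B | t]: probability that A has been replaced by B after t time units."""
    return 0.5 - 0.5 * math.exp(-r * t)


def prob_no_change(r, t):
    """P[A->A | t]: probability that A has not been replaced by B after t time units."""
    return 0.5 + 0.5 * math.exp(-r * t)


def arrangement_probability(r, t1, t2, t3, t4):
    """Probability of the arrangement in item d: a change over t1, none over t2, t3, t4."""
    return (prob_change(r, t1) * prob_no_change(r, t2)
            * prob_no_change(r, t3) * prob_no_change(r, t4))


def likelihood_topology_1(r, t2, t3):
    """Likelihood of topology 1 given partition A,A,B, as stated in item g."""
    return (2 - math.exp(-r * (2 * t2 + t3))) / 8


def likelihood_topology_2_or_3(r, t3):
    """Likelihood of topology 2 (or 3) given partition A,A,B, as stated in item h."""
    return (1 - math.exp(-2 * r * t3)) / 8


# Hypothetical parameter values (illustration only).
r, t2, t3 = 0.1, 1.0, 3.0
like_1 = likelihood_topology_1(r, t2, t3)
like_23 = likelihood_topology_2_or_3(r, t3)
print(like_1, like_23, like_1 > like_23)
```

            For any positive r, t2 and t3 the first expression exceeds 1/8 and the second falls
            below it, which is the direct comparison made in item i.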

        7.  Combining data partitions

            a.  Objective: To determine which topology has the highest likelihood given the entire data set
            b.  General rule:  The topology favored by the most informative data partitions is the most likely
                 topology, given the entire data set.  (see Handout for proof).
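
        The general rule in item 7 amounts to a tally over informative data partitions. The
        Python sketch below applies it to the four sites shown in item 2a (the elided sites are
        not filled in). The function and variable names are illustrative assumptions, and the
        tally is a shortcut justified by the rule above rather than a likelihood calculation.

```python
from collections import Counter


def favored_pair(partition, names):
    """Return the two taxa that share a state, or None if the partition is uninformative."""
    a, b, c = partition
    if a == b != c:
        return (names[0], names[1])
    if a == c != b:
        return (names[0], names[2])
    if b == c != a:
        return (names[1], names[2])
    return None  # all three taxa identical (or, with real nucleotides, all different)


# The first four sites of the example data set.
data = {"Taxon 1": "ABBA", "Taxon 2": "ABAA", "Taxon 3": "BBAB"}
names = list(data)

# Count how many informative partitions favor each pair of sister taxa.
support = Counter()
for partition in zip(*data.values()):
    pair = favored_pair(partition, names)
    if pair is not None:
        support[pair] += 1

best = max(support, key=support.get)
print(support)
print(best)
```

        For these four sites, two partitions favor grouping Taxon 1 with Taxon 2 and one favors
        grouping Taxon 2 with Taxon 3, so the topology with Taxon 1 and Taxon 2 as sister taxa
        is preferred.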

    C.  Conclusions:

        1.  For a three-taxon phylogeny, the most likely topology is the one that is supported by the most data
             partitions.

        2.  For real data, with four possible nucleotides rather than 2, the same rules still hold:

            a.  "informative" data partitions are those in which two taxa have the same nucleotide
                 and the third taxon has a different nucleotide.
            b.  For any informative data partition, that topology is favored that places the two taxa
                 with the same nucleotide as sister taxa.
            c.   The topology favored by the most informative data partitions is the most likely topology,
                  given the entire data set.

        3.  For more than three taxa, expressions for the likelihood of the data can be calculated in similar fashion
             for all possible tree topologies.  For each topology, the values of the parameters that yield the
             maximum likelihood are found.  The topology with the highest maximum likelihood is then the most
             likely phylogeny.


V.  Comparison of the Methods

    A.  The Problem

        1.  An investigator, when wishing to reconstruct the phylogeny of the group in which he/she is interested,
            is confronted with the problem of which method of reconstruction to employ.

        2.  A great deal of effort has been expended in attempts, using computer-generated phylogenies, to
             determine which of the three major approaches (i.e. Distance, Parsimony, or Likelihood) most reliably
             reconstructs the true phylogeny.

        3.  To date, there is great disagreement among proponents of the different methods as to which is most
             reliable.

        4.  More recently, however, it has been recognized that the proper way to evaluate the performance of the
             different methods is to compare their ability to reconstruct known phylogenies.

    B.  Example:  Cunningham et al. 

        1.  Procedure and Rationale

            a.  Allowed 4 lines of bacteriophage T7 descended from a common ancestor to evolve in
                 laboratory cultures to generate a known phylogeny.
            b.  Sequenced sections of the T7 genome from each line.
            c.  Used the different techniques to reconstruct the phylogeny of the 4 lines.
            d.  Compared reconstructed phylogenies to known phylogeny to determine which method
                 was most accurate.

        2.  Creating Experimental lines

            a.  Two lines were begun from a common ancestor and each was propagated for 20 lytic
                 cycles.
            b.  Each lineage was then "bottlenecked", i.e. reduced to a single individual used to propagate
                 the culture.
            c.  Two sublineages were then begun from each bottlenecked lineage, giving a total of four
                 lineages altogether.
            d.  Each of these four lineages was bottlenecked after 50, 100, and 150 lytic cycles, and samples
                 of each lineage were preserved at each of these bottlenecking events:

 

     
        3.  Experimental phylogenies

            a.  Three types of "known" phylogenies were established using these experimental lines by
                 varying which samples were used as the "terminal" taxa:

            b.  One type (I in figure) used the samples at 150 lytic cycles for each of the
                 lineages K, L, U and V.

                i. This phylogeny represents equal rates of evolution in the four lineages

            c.  A second type (II in figure) used samples from the K and V lineages at 150 lytic cycles,
                 and samples from the L and U lineages at 100 lytic cycles
       
                i.  This phylogeny represents more rapid evolution in lineages K and V than in
                     lineages L and U.

            d.  A third type (III in figure) used samples from the K and V lineages at 150 lytic cycles,
                 and samples from the L and U lineages at 50 lytic cycles.

                i.  This phylogeny represents even more rapid evolution in lineages K and V.

        4.  Results

            a.  Comparing Parsimony with Likelihood.

                i.  The figure to the left portrays the best phylogenies reconstructed
                    using Parsimony and using three variants of the Likelihood method.

                ii.  When rates of evolution are similar along all four lineages (Row A; known
                     phylogeny is I in previous figure), Parsimony and Likelihood both reconstruct the
                     appropriate  phylogeny with similar reliability.  (The numbers above the internal
                     branch represent an index of reliability).

                iii.  Same result when rates of evolution along K and V are slightly faster
                      than along L and U (Row B; known phylogeny is II in previous figure)

                iv.  However, when rates of evolution along K and V are much faster than
                      along L and U (Row C; known phylogeny is III in previous figure), the
                      likelihood approach reconstructs the correct tree, while the Parsimony
                      method does not.

            b.  Similar results were obtained when the performance of the Distance and Likelihood
                 methods was compared:

                i.  The two methods performed equally well when the rate of evolution was
                    comparable along all four branches.

                ii.  However, when rates differed, the Likelihood method performed better.

        5.  Conclusion:  Likelihood methods are best because they perform as well as or better than Parsimony or
             Distance methods.  In particular, Likelihood performs better when there are unequal rates of
             substitution in different lineages.






 

