I. The Problem
A. How do we determine which lineages (taxa) are most closely related to which other lineages?
B. How do we construct a phylogeny, which portrays the relationships between different lineages?
1. The intuitive notion is that taxa that are more similar to each other
morphologically, behaviorally,
and physiologically are likely to be more closely related than taxa that
are less similar in
these characters.
2. This notion is based on the principle that two taxa that are recently
descended from a common
ancestral lineage will not have diverged much genetically and therefore
should be similar
morphologically, behaviorally, and physiologically.
3. But there are several problems with using overall similarity as a measure of relatedness.
a. What characters does one use to measure similarity?
b. How does one measure degree of similarity or dissimilarity?
c. Similarity may reflect convergence rather than recent descent
from a common ancestor.
II. Approaches to Phylogeny Reconstruction
A. Phenetic or Distance approaches.
1. These approaches use some metric of divergence (distance) separating
pairs of taxa as a basis for
grouping similar taxa.
2. One example of such a metric is Nei's genetic
distance, which was discussed in conjunction with
Coyne and Orr's study of speciation in Drosophila.
3. Basic approach
a. Calculate all pairwise distances between taxa using the chosen
metric
b. Group together the two taxa that exhibit the smallest distance; these
are considered to
be the two most closely related taxa.
c. Consider the two combined taxa to represent one new taxon, and
calculate the distance
of this new taxon to all the other taxa (e.g. calculate the average of
the distances to each
of the two taxa grouped together)
d. Group the next two taxa exhibiting the smallest distance together to
form the next closely
related group.
e. Repeat steps c. and d. until all taxa have been grouped.
4. Disadvantages: Sensitive to similarities due to convergent evolution
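The grouping procedure in steps a. through e. can be sketched in a few lines of Python. This is only a minimal illustration: the function name `cluster_by_distance`, the taxon names W, X, Y, Z and the distance values are all hypothetical, and the distance from a merged group to another taxon is taken as the simple average of the two member distances, as suggested in step c.

```python
from itertools import combinations

def cluster_by_distance(dist):
    """Group taxa by repeatedly merging the closest pair.

    dist maps frozenset({x, y}) -> distance between taxa x and y.
    Returns the nested grouping as a tuple, e.g. (("W", "X"), "Y").
    """
    dist = dict(dist)  # work on a copy so the caller's table is untouched
    taxa = set()
    for pair in dist:
        taxa |= pair
    while len(taxa) > 1:
        # Steps b/d: find the pair of current taxa with the smallest distance.
        a, b = min(combinations(sorted(taxa, key=str), 2),
                   key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        taxa -= {a, b}
        # Step c: distance from the merged taxon to each remaining taxon is
        # the average of the distances to its two members.
        for k in taxa:
            dist[frozenset((merged, k))] = (
                dist[frozenset((a, k))] + dist[frozenset((b, k))]) / 2
        taxa.add(merged)
    return taxa.pop()

# Hypothetical pairwise distances among four taxa (step a).
distances = {
    frozenset(("W", "X")): 2.0,
    frozenset(("W", "Y")): 6.0,
    frozenset(("X", "Y")): 6.0,
    frozenset(("W", "Z")): 10.0,
    frozenset(("X", "Z")): 10.0,
    frozenset(("Y", "Z")): 10.0,
}
tree = cluster_by_distance(distances)
print(tree)
```

The nested tuples record the order of grouping: the innermost pair was joined first and is therefore treated as the most closely related.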
B. Parsimony Approach
1. Examines all possible phylogenetic trees and calculates for each
one the minimum number of
character-state changes required to explain observed character-state distribution
among taxa
2. Chooses tree with the minimum number of required changes as the most likely tree.
3. Example
a. Consider three taxa, I, II, and III, which exhibit character states
A, A, and B respectively.
b. For three taxa, there are three possible topologies for a phylogeny:
c. Usually, to distinguish among these three topologies, we employ
an outgroup, which is a
fourth taxon that we can be reasonably sure from other evidence is more
distantly related
to each of the other three taxa than any of those taxa are related to each
other.
d. Let us suppose the character state of this outgroup (designated
by O) is B. This gives the following three
topologies to choose from:
e. Let us now consider each in turn, and ask what is the minimum
number of character
transitions required to account for the observed distribution of character
states among
taxa.
f. For the first topology, only one transition is required, a transition
from B to A, at the indicated
position.
g. For the second topology, a minimum of two transitions are required:
i. either B to A in each of the final lineages leading to taxa I
and II, or
ii. B to A in the lineage between the common ancestor and the point
at which taxon II
branches off and then a reversion, A to B, in the final lineage leading
to taxon III.
h. Similarly, a minimum of two transitions are required for
the third topology.
i. Consequently, we deem topology 1 more likely because it
requires fewer evolutionary
transitions to produce the observed distribution of character states.
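The counting of minimum transitions in this example can be checked with a short Python sketch. It uses Fitch's counting rule for a single character, which is a standard way to obtain the minimum number of changes on a fixed tree but is not named in these notes. The nested-tuple tree encoding and the function name `fitch` are assumptions made for the illustration, and the trees are labeled by which ingroup taxa are sisters rather than by topology number, since the original figures are not reproduced here.

```python
def fitch(tree, states):
    """Return (possible_states, min_changes) for a nested-tuple binary tree."""
    if isinstance(tree, str):            # leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                           # the two subtrees can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

# Character states from the example: I = A, II = A, III = B, outgroup O = B.
states = {"I": "A", "II": "A", "III": "B", "O": "B"}

# The three ingroup topologies, each rooted with the outgroup O.
topologies = {
    "I and II sisters":   ("O", (("I", "II"), "III")),
    "I and III sisters":  ("O", ("II", ("I", "III"))),
    "II and III sisters": ("O", ("I", ("II", "III"))),
}
changes = {name: fitch(tree, states)[1] for name, tree in topologies.items()}
for name, count in changes.items():
    print(name, "requires at least", count, "transition(s)")
```

The tree with the smallest count is the one parsimony prefers.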
C. Maximum likelihood approach
1. This approach is usually applied to amino acid or nucleotide sequence data.
2. It assumes an explicit model of how amino-acid or nucleotide substitutions occur.
3. Before attempting to understand this approach, we need to understand
the general principles of
likelihood analysis.
III. The Principle of Maximum Likelihood.
A. Example of a statistical problem: Estimating the probability a flipped coin will come up heads.
1. A fair coin will, when flipped, land heads-up 50% of the time.
2. However, how do we know whether any particular coin is fair?
3. This question is equivalent to asking what the probability is,
for that coin, of landing heads-up on
a coin toss.
B. Estimating probability of heads
1. Suppose we take our coin and flip it N times, and we get n1 heads.
2. Suppose further that the true probability of obtaining a heads is p.
3. What is the probability, given p , that we would obtain exactly n1 heads in N flips?
4. Statistically, this is the same problem as estimating the probability
of obtaining n1 black
marbles in a sample of N marbles drawn from an urn in which a fraction
p of the marbles are black, a
problem encountered in our consideration of changes in gene frequency due
to drift.
5. The solution is the same: the probability is given by the binomial distribution (writing n for n1):
Prob[n heads in N flips] = [N! / (n! (N-n)!)] p^n (1-p)^{N-n}
6. This is our probability model. Actually, it is a family
of models: each possible value of p corresponds
to a distinct model.
7. The problem of estimating the probability of heads is then equivalent
to the problem of choosing the
most likely value of p, which in turn is equivalent to choosing the
most likely individual model from the
family of models.
8. We need some criterion to make this choice. To describe this criterion,
we define the Likelihood of a
model as
L(model | n heads in N flips ) = L(p | n heads in N flips ) = Prob[ n heads in N flips | model]
9. We then say that the most reasonable model is the one with the
largest likelihood, i.e. the one for
which the probability of obtaining the data is highest.
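For the binomial family of models, the model with the largest likelihood can also be found analytically. The following is a short sketch using the standard step of maximizing the logarithm of the likelihood; the symbol \hat{p} for the maximizing value is notation introduced here, not in the notes.

```latex
\begin{align*}
\ln L(p \mid n \text{ heads in } N \text{ flips})
  &= \ln\binom{N}{n} + n \ln p + (N-n)\ln(1-p) \\
\frac{d}{dp}\ln L
  &= \frac{n}{p} - \frac{N-n}{1-p} = 0
  \quad\Longrightarrow\quad \hat{p} = \frac{n}{N}
\end{align*}
```

Under this model, the most likely value of p is the observed fraction of heads.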
C. Numerical Example
1. Suppose that in 10 flips of a coin, 7 heads and 3 tails were obtained.
2. We can calculate the likelihood for a series of possible values of p.
3. Suppose we assume p = 0.5.
4. Then L(p=0.5 | 7 heads in 10 flips) = 0.1172
5. Suppose we assume p=0.3.
6. Then L(p=0.3 | 7 heads in 10 flips) = 0.0090.
7. Suppose we assume p=0.7.
8. Then L(p=0.7 | 7 heads in 10 flips) = 0.2668
9. In fact, we can calculate the likelihood for any value of p:
[Figure: likelihood L(p | 7 heads in 10 flips) plotted as a function of p]
10. From the figure, it can be seen that the value of p with
the highest likelihood, i.e. the model with
the highest likelihood, is p=0.7.
11. Hence, our best estimate of the probability that the coin will turn up heads is 0.7.
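The calculations in this numerical example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `likelihood` and the grid of candidate values (steps of 0.01) are choices made for the illustration.

```python
from math import comb

def likelihood(p, heads, flips):
    """Binomial probability of `heads` heads in `flips` flips when P(heads) = p."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Likelihood of three candidate models, given 7 heads in 10 flips.
for p in (0.3, 0.5, 0.7):
    print(f"L(p={p} | 7 heads in 10 flips) = {likelihood(p, 7, 10):.4f}")

# Scan a grid of candidate values of p and keep the one with the largest likelihood.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: likelihood(p, 7, 10))
print("most likely p on the grid:", best_p)
```

A finer grid, or a numerical optimizer, would be used in the same way when the parameter cannot be read off by inspection.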
D. Conclusion.
1. The Maximum Likelihood criterion provides a method for choosing
among possible models that
differ in the value of one or more parameter (in the coin-toss example,
there was only one parameter,
p).
2. The criterion says: calculate the likelihood of the model given
the data, which is just equal to the
probability of obtaining the data under the assumptions of the model, and
choose the model with
the highest likelihood.
3. This is the model that has the highest probability of generating the observed data.
IV. Phylogenetic Reconstruction using Likelihood
A. Data
1. For likelihood analyses, the data take the form of either nucleotide sequences or amino acid sequences.
2. Example: The following is the first 50 nucleotides of the
gene Chalcone Synthase, which codes for the
first enzyme in the anthocyanin pigment pathway in plants:
ATGGTGACCGTCGAGGAAGTACGGAAGGCGCAACGTGCCCAGGGTCCGGC
ATGGTGACTGTTGAGGAGGTTCGTAGGGCTCAGAGGGCTGAGGGACCGGC
ATGGTGACAGTCGAGGAGTATCGTAAGGCACAACGTGCTGAAGGTCCAGC
Each row gives the sequence for a different species: 1st row--morning glories,
2nd row--snapdragon,
3rd row--petunia. A phylogeny could be constructed from this data.
3. The overall data consists of a set of data partitions
a. For nucleotide sequences, a data partition corresponds to a single
nucleotide position in
the sequence, and consists of a vector whose elements correspond to the
nucleotide at
that position in the different taxa.
b. The first partition of the above dataset is the vector (A,A,A).
The second partition is the
vector (T,T,T), etc.
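Splitting the three Chalcone Synthase sequences into data partitions can be sketched in Python as follows. The sketch assumes the three rows are already aligned position by position, as they are presented above; the dictionary keys and the variable name `partitions` are chosen for the illustration.

```python
# The three 50-nucleotide sequences given in the example.
sequences = {
    "morning glory": "ATGGTGACCGTCGAGGAAGTACGGAAGGCGCAACGTGCCCAGGGTCCGGC",
    "snapdragon":    "ATGGTGACTGTTGAGGAGGTTCGTAGGGCTCAGAGGGCTGAGGGACCGGC",
    "petunia":       "ATGGTGACAGTCGAGGAGTATCGTAAGGCACAACGTGCTGAAGGTCCAGC",
}

# zip(*...) walks the aligned sequences column by column, so each element
# is one data partition: the nucleotide at that position in each taxon.
partitions = list(zip(*sequences.values()))

print(partitions[0])    # first data partition
print(partitions[1])    # second data partition
print(len(partitions))  # number of data partitions
```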
B. The method in outline, with Example: three taxa, two "nucleotides"
per site (for a detailed description
of the method, visit the Handout)
1. Objectives:
a. List all possible tree topologies
b. Calculate likelihood of each tree given the data
c. Identify tree with highest likelihood as most likely phylogeny
2. Likelihood of tree topology given all the data is product of likelihoods
for individual data partitions;
therefore, start by calculating likelihood of each tree for each data partition.
a. Suppose the data are :
Taxon 1 A B B A . . .
Taxon 2 A B A A . . .
Taxon 3 B B A B . . .
b. First data partition is A,A,B
3. With three taxa there are three possible tree topologies for this
data partition:
4. For each topology there are four possible "arrangements" of
unknown ancestral states i and j. For
example, for tree topology 1, the four possible arrangements are:
5. The likelihood of a given topology is the weighted average of
the likelihoods of its four possible
arrangements; we will not here calculate this average (see Handout for
details). Instead, we will
describe how the likelihood of an individual arrangement is calculated.
6. The likelihood model: the likelihood of an arrangement
a. Assuming evolutionary substitution is a Poisson process, the probability
A has been replaced
by B after t time units is
P[A->B|t] = 1/2 - (1/2) e^{-rt}
where r is probability per unit time that A is replaced by B.
b. Similarly, the probability A has not been replaced by B after
t time units is
P[A->A|t] = 1/2 + (1/2) e^{-rt}
c. Consider the following arrangement of topology 1 for the first data partition:
P[arrangement | A,A,B] = P[A->B|t1] x P[A->A|t2] x P[A->A|t3] x P[A->A|t4]
= [1/2 - (1/2) e^{-r t1}] [1/2 + (1/2) e^{-r t2}] [1/2 + (1/2) e^{-r t3}] [1/2 + (1/2) e^{-r t4}]
e. The probabilities of the other arrangements of tree topology 1
are calculated in similar
fashion and are averaged to give the probability of tree topology 1 given
the first data
partition.
f. But, since the likelihood of the topology given the data
partition is just the probability of
the data given the topology, that average also gives the likelihood of
tree topology 1 given
the first data partition.
g. When this average is calculated (see Handout
for details), we find that the likelihood of tree
topology 1 is
L[topology 1|A,A,B] = (1/8) [2 - e^{-r(2 t2 + t3)}]
h. Similarly, the likelihoods of tree topologies 2 and 3 are
L[topology 2|A,A,B] = L[topology 3|A,A,B] = (1/8) [1 - e^{-2 r t3}]
i. Normally, one would then find the combination of parameters r,
t2 and t3 that maximizes the likelihood
for each topology, then choose the topology with the highest likelihood.
In this case, however, one can
compare the likelihoods directly: [2 - e^{-r(2 t2 + t3)}] > 1 and
[1 - e^{-2 r t3}] < 1, so tree topology 1 has the highest
likelihood, given data partition 1.
j. This example illustrates the general rule for any informative
data partition (i.e. any partition
in which two of the taxa exhibit one nucleotide and the other taxon exhibits
the other
nucleotide): the topology in which the two taxa with identical nucleotides
are most closely
related has the highest likelihood. The data partition is said
to "favor" that topology.
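The pieces of the likelihood model in this section can be written out as a short Python sketch. The two transition probabilities and the two closed-form topology likelihoods are copied from the expressions quoted in the text (the Handout derivation is not reproduced); the function names and the numerical values of r and the branch lengths are hypothetical, chosen only for illustration.

```python
from math import exp

def p_change(r, t):
    """P[A->B | t]: the site shows the other state after time t."""
    return 0.5 - 0.5 * exp(-r * t)

def p_same(r, t):
    """P[A->A | t]: the site shows the same state after time t."""
    return 0.5 + 0.5 * exp(-r * t)

def arrangement_probability(r, t1, t2, t3, t4):
    """Product of branch probabilities for the arrangement in the text:
    one branch with a change (t1) and three branches without (t2, t3, t4)."""
    return p_change(r, t1) * p_same(r, t2) * p_same(r, t3) * p_same(r, t4)

def topology1_likelihood(r, t2, t3):
    """Closed form quoted in the text for topology 1 and partition (A, A, B)."""
    return (2 - exp(-r * (2 * t2 + t3))) / 8

def other_topology_likelihood(r, t3):
    """Closed form quoted in the text for topologies 2 and 3."""
    return (1 - exp(-2 * r * t3)) / 8

# Hypothetical parameter values, for illustration only.
r, t1, t2, t3, t4 = 0.1, 3.0, 1.0, 2.0, 2.0
print("arrangement:", arrangement_probability(r, t1, t2, t3, t4))
print("topology 1:", topology1_likelihood(r, t2, t3))
print("topologies 2 and 3:", other_topology_likelihood(r, t3))
```

With these expressions, the ordering of the topologies stays the same when r, t2 and t3 are changed to other positive values, which is the point made in item i.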
7. Combining data partitions
a. Objective: To determine which topology has the highest likelihood
given the entire data set
b. General rule: The topology favored by the most informative
data partitions is the most likely
topology, given the entire data set. (see Handout for proof).
C. Conclusions:
1. For a three-taxon phylogeny, the most likely topology is the one
that is supported by the most data
partitions.
2. For real data, with four possible nucleotides rather than 2, the same rules still hold:
a. "informative" data partitions are those in which two taxa have
the same nucleotide
and the third taxon has a different nucleotide.
b. For any informative data partition, that topology is favored that
places the two taxa
with the same nucleotide as sister taxa.
c. The topology favored by the most informative data partitions
is the most likely topology,
given the entire data set.
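Rules a. through c. can be applied mechanically, as in the following Python sketch, which counts informative data partitions in the three Chalcone Synthase sequences given earlier. It is only an illustration of the counting rule on a 50-nucleotide fragment, assuming the rows are aligned as presented; it is not a phylogenetic result from these notes, and the variable names are chosen for the example.

```python
from collections import Counter

sequences = {
    "morning glory": "ATGGTGACCGTCGAGGAAGTACGGAAGGCGCAACGTGCCCAGGGTCCGGC",
    "snapdragon":    "ATGGTGACTGTTGAGGAGGTTCGTAGGGCTCAGAGGGCTGAGGGACCGGC",
    "petunia":       "ATGGTGACAGTCGAGGAGTATCGTAAGGCACAACGTGCTGAAGGTCCAGC",
}
names = list(sequences)

votes = Counter()
for column in zip(*sequences.values()):
    if len(set(column)) != 2:
        continue  # all three alike, or all three different: not informative
    # Rule b: the two taxa sharing a nucleotide are placed as sister taxa.
    counts = Counter(column)
    sisters = tuple(n for n, base in zip(names, column) if counts[base] == 2)
    votes[sisters] += 1

# Rule c: the topology favored by the most informative partitions.
for sisters, n in votes.most_common():
    print(sisters, "favored by", n, "informative partitions")
```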
3. For more than three taxa, expressions for the likelihood of the
data can be calculated in similar fashion
for all possible tree topologies. For each topology, the values of
the parameters that yield the
maximum likelihood are found. The topology with the highest maximum
likelihood is then the most
likely phylogeny.
V. Comparison of the Methods
A. The Problem
1. An investigator wishing to reconstruct the phylogeny of a group of interest
is confronted with the problem of which method of reconstruction to employ.
2. A great deal of effort has been expended in attempts, using computer-generated
phylogenies, to determine which of the three major approaches (i.e. Distance,
Parsimony, or Likelihood) most reliably reconstructs the true phylogeny.
3. To date, there is great disagreement among proponents of the different methods
as to which is most reliable.
4. However, it has recently been recognized that the proper way to evaluate the
performance of the different methods is to compare their ability to reconstruct
known phylogenies.
B. Example: Cunningham et al.
1. Procedure and Rationale
a. Allowed 4 lines of bacteriophage T7 descended from a common ancestor to evolve in
laboratory cultures to generate a known phylogeny.
b. Sequenced sections of the T7 genome from each line.
c. Used the different techniques to reconstruct the phylogeny of the 4 lines.
e. Compared reconstructed phylogenies to known phylogeny to determine which method
was most accurate.
2. Creating Experimental lines
a. Two lines were begun from a common ancestor and each was propagated for 20 lytic
cycles.
b. Each lineage was then "bottlenecked", i.e. reduced to a single individual used
to propagate the culture.
c. Two sublineages were then begun from each bottlenecked lineage, giving a total
of four lineages altogether.
d. Each of these four lineages was bottlenecked after 50, 100, and 150 lytic cycles,
and samples of each lineage were preserved at each of these bottlenecking events: