I. The Molecular Clock
A. Example of molecular evolution: cytochrome C
B. Rates of Amino Acid sequence evolution--Zuckerkandl and Pauling
1. In the early 1960s, protein sequencing had recently been invented,
and this allowed
Zuckerkandl and Pauling to compare the sequences of homologous proteins
across a
number of different taxa.
2. One obvious pattern that emerged was that closely related taxa
had relatively similar sequences for a
given protein, while more distantly related taxa had more divergent
sequences for the same protein.
3. This pattern forms the basis for phylogeny reconstruction using
molecular sequences, which is
considered in the next lecture.
4. A second pattern that became evident was that protein sequences
seemed to diverge at a more-or-
less constant rate, which gave rise to the notion of the "molecular
clock".
C. Example: molecular clock for a globin.
1. Data collection:
a. Compare sequences of pairs of species and determine the number
of sites at which the two
species of a pair differ in amino-acid composition (number of substitutions).
b. For each pair, also estimate (from paleontological data) the
time since the common ancestor
of the pair.
c. Plot, for each pair, number of substitutions separating the
pair vs. time since separation.
2. The figure below plots this data for a globin.
a.
Time axis is in millions of years
b.
Each point represents a comparison
of the indicated species with humans.
3. Conclusion: Although there is some scatter about the
line, the data strongly suggest that the number
of substitutions separating
a species pair is linearly proportional to the amount of time those species
have been diverging.
4. The linearity indicates that the rate of divergence is approximately
constant, i.e. the molecular clock
ticks with a constant rate.
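The fitting step described above can be sketched in a few lines of Python. This is only an illustration: the function name clock_rate and the four data points are hypothetical placeholders, not values read from the figure, and the fit assumes the line passes through the origin (zero substitutions at zero divergence time).

```python
# Least-squares fit of a line through the origin: substitutions = rate * time.
# The data points below are HYPOTHETICAL placeholders, not values from the figure.

def clock_rate(times_myr, substitutions):
    """Return the slope of the best-fit line through the origin."""
    numerator = sum(t * d for t, d in zip(times_myr, substitutions))
    denominator = sum(t * t for t in times_myr)
    return numerator / denominator

# Hypothetical divergence times (millions of years) and substitution counts.
times_myr = [100, 200, 300, 400]
substitutions = [21, 39, 62, 78]

rate = clock_rate(times_myr, substitutions)
print(f"Estimated rate: {rate:.3f} substitutions per million years")
```

If the clock ticks at a constant rate, the points should scatter closely around the fitted line, and the slope is the estimated number of substitutions per unit time.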
D. Another example.
1. One is not restricted to examining just one protein to estimate
the relationship between number of
substitutions and divergence time.
2. In this example, Walter Fitch combined the data from seven
different gene products (cytochrome c,
fibrinopeptides A and B, α and β hemoglobins, the insulin C-peptide,
and myoglobin) and compared
sixteen pairs of mammal species:
E. Generality of the phenomenon:
1. The constant divergence rate of protein sequences is the rule
for virtually all proteins and taxa that
have been examined.
2. A phenomenon this universal demands some sort of general evolutionary explanation.
II. Explanations for the Molecular Clock--The Neutralist and Selectionist Views
A. The Neutralist View.
1. The Neutralist view maintains that most protein evolution is due to genetic drift.
2. Specifically, the neutralist view asserts the following points,
with regard to non-synonymous,
or replacement, mutations:
a. Most mutations are deleterious, and are therefore eliminated
by natural selection as soon
as they occur.
b. Some mutations have negligible effect on fitness, and are therefore
effectively neutral. A small
but real fraction of these mutations will be fixed by genetic drift
and thus cause evolutionary
change in protein sequence. (Remember, an allele need not have
strictly no selective
advantage or disadvantage for its frequency to be governed by drift;
the strength of selection
just must be small, i.e. satisfy the criterion 4Ns <<1.)
This fixation of effectively neutral
alleles by drift accounts for most protein evolution.
c. Mutations that are advantageous are very rare, because organisms
are close to optimally
adapted and therefore very few "improvements" can be made.
B. The Selectionist View
1. The Selectionist view maintains that most protein evolution is due to natural selection.
2. Specifically, the Selectionist view asserts the following
points with regard to
non-synonymous mutations:
a. Most mutations are deleterious, and are therefore eliminated
by natural selection as soon
as they occur.
b. Relatively few mutations have negligible effects on fitness,
and are thereby effectively
neutral and governed by drift.
c. A substantial number of mutations are selectively advantageous
and are fixed by natural
selection when they arise. These mutations account for most protein
evolution.
C. Comparison of Neutralist and Selectionist Views
1. Both views agree that natural selection very commonly eliminates
deleterious mutations when they
arise--a phenomenon known as "purifying selection".
2. The views disagree on the relative abundance of neutral vs. selectively advantageous mutations.
3. Since only these latter two types of mutations can cause evolutionary
change in protein sequences,
the two views disagree about the relative importance of genetic drift
and natural selection in causing
protein evolution.
4. Nature of the disagreement
a. It is easiest to understand this disagreement in terms of enzyme
function (remember many,
if not most, proteins are enzymes).
b. An enzyme is characterized not only by the identity and ordering
of its amino acids, but
also by a complex three-dimensional structure.
c. Embedded within that three-dimensional structure are pockets
or cavities known as
active sites, where the enzyme interacts with its substrates.
The precise configuration
of the active site is usually crucial for proper enzyme functioning.
The active site normally
involves only a small fraction of all the amino acids in the enzyme.
Three-dimensional
structure of the enzyme hexokinase. In (a) the active site of
the enzyme lies in the center groove. (b) portrays the substrate,
glucose, in the active site. Note that complexing with substrate
causes a change in enzyme configuration. (From N. A. Campbell.
1990. Biology. Benjamin/Cummings.)
d. The remainder of the three-dimensional structure of the enzyme
may also be important
for enzyme functioning, e.g. the surface charge of the enzyme may affect
its solubility
properties.
e. We are fairly ignorant, however, about the general effect
of an amino acid substitution in
positions away from the active site on the functioning of the enzyme.
f. This ignorance provides plenty of room for selectionists
and neutralists to argue about the
effects of such substitutions.
g. On the one hand, the neutralists maintain that substitutions
away from the active site,
particularly on the inside of an enzyme away from the outer surface,
are not likely to cause
significant changes in the chemical properties of the enzyme, and thus
are not likely to affect
its functioning.
h. For this reason, they assert that most amino-acid substitutions
are neutral.
i. By contrast, the selectionists assert that many amino
acid substitutions will cause changes
in the chemical properties of enzymes, simply by virtue of change in
charge or polarity.
And of those that don't alter these characteristics, virtually
all,
even those far away from the active site, have the potential to cause,
for example,
slight changes in "tension" in the protein chain, which
may ramify down the chain to the active site and distort it. In
most cases these effects
will be detrimental, but sometimes, especially when a lineage has recently
experienced an
environmental change, they will be beneficial and therefore favored
by natural selection.
j. The basic nature of the disagreement between selectionists
and neutralists thus resides in
their views of how mutations are expected to affect enzyme functioning
in general--a
disagreement that is not easily open to direct experimental assessment.
D. Explanation of the Molecular Clock by the Neutralists.
1. The molecular clock follows directly from the properties of genetic
drift, as long as one additional
assumption is made: the rate of mutation remains more or less constant
over time.
2. Derivation:
a. Let N be the number of individuals in a population, and
m be the mutation rate (i.e.
number of mutations/gene copy/generation).
b. Typical mutation rates for genes are on the order of 10^-6
to 10^-4/gene copy/generation.
c. The number of mutations arising per generation in a population
is given by
No. new mutations = (No. copies of gene) x mutation rate
= 2N x m.
d. The initial probability that a newly arisen neutral mutation
will be fixed by genetic drift
is just its initial frequency, i.e.
Prob. of fixation = 1/2N.
e. The rate of fixation of new mutations is just
Mutations fixed/generation = (New mutations/generation) x (Prob. of fixation)
= (2N x m) x (1/2N)
= m .
f. This remarkable result says that for alleles governed by
drift, the rate of fixation of mutations
is equal to the mutation rate.
g. Assuming constant mutation rates over time, this result says
that the rate at which novel
neutral mutations accumulate is constant, i.e. there should be a molecular
clock.
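The cancellation of N in the derivation above can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a simple Wright-Fisher model (each generation's 2N gene copies are drawn at random from the previous generation), a small hypothetical population (N = 10), and function names of my own choosing. It first estimates the fixation probability of a single new neutral copy by simulation, then evaluates (2N x m) x (1/2N) for several population sizes.

```python
import random

def expected_fixations_per_generation(pop_size, mutation_rate):
    """(2N x m) new mutations per generation, each fixed with probability 1/(2N)."""
    new_mutations = 2 * pop_size * mutation_rate
    prob_fixation = 1 / (2 * pop_size)
    return new_mutations * prob_fixation

def simulate_fixation_probability(pop_size, trials, rng):
    """Estimate the chance that one new neutral copy drifts to fixation
    in a Wright-Fisher population of 2N gene copies."""
    copies_total = 2 * pop_size
    fixed = 0
    for _ in range(trials):
        count = 1  # a single new mutant copy
        while 0 < count < copies_total:
            freq = count / copies_total
            # Binomial sampling of the next generation's gene copies.
            count = sum(1 for _ in range(copies_total) if rng.random() < freq)
        if count == copies_total:
            fixed += 1
    return fixed / trials

rng = random.Random(1)  # fixed seed so the run is repeatable
N = 10                  # hypothetical, deliberately small population
estimate = simulate_fixation_probability(N, trials=10000, rng=rng)
print(f"simulated P(fixation) = {estimate:.3f}, theory 1/2N = {1 / (2 * N):.3f}")

# The expected fixation rate equals m regardless of population size.
for n in (100, 10_000, 1_000_000):
    print(n, expected_fixations_per_generation(n, 1e-6))
```

The simulated fixation probability should fall close to 1/2N, and the last loop shows that the expected number of fixations per generation stays at m (here an assumed 10^-6) whatever the value of N.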
E. Explanation of the Molecular Clock by the Selectionists.
1. Those who believe that most amino acid substitutions are favored
by natural selection do not have
such an elegant explanation for the molecular clock.
2. Rather, they simply postulate that for some reason, averaged
over long geological time spans, the
rate at which natural selection fixes new amino acid substitutions is
roughly constant.
3. According to this view, there may be rate heterogeneities over
short geological periods (e.g. on the
order of thousands to tens of thousands of generations), but these fluctuations
average out over
longer periods of time (millions of generations) to produce an apparent
constancy of fixation rate
over this time scale.
4. Although this explanation does not follow logically
from a few simple assumptions like the
neutralists' argument, it may of course be correct.
F. Conclusions
1. Both the neutralists and the selectionists have an explanation for the molecular clock.
2. Which explanation is correct will depend more generally on whether
the neutralists' or selectionists'
view of the cause of protein sequence evolution is correct.
3. We therefore now turn to consideration of evidence bearing on
this question.
III. Distinguishing between Selectionist and Neutralist Views: Examples of Adaptation
A. John Gillespie has argued that there are numerous examples of
amino-acid substitutions that are clearly
adaptive, and that this evidence argues strongly in favor of the notion
that many, if not most amino-acid
substitutions have been caused by natural selection.
B. Example: Evolution of insulin in rodents.
1. In most animals, insulin in storage form is a hexamer, which
is made up of three dimers; each
pair of monomers in a dimer is held together by a Zinc ion that
binds to a Histidine at
position B10.
Schematic
portrayal of pig insulin monomer. The location of some of the amino-acid
changes separating pigs from guinea pigs are indicated. Note that
many of the changes have occurred in the region of dimer contact. (From
J. H. Gillespie. 1991. The Causes of Molecular Evolution.
Cambridge Univ. Press)
2. In hystricomorph rodents (guinea pigs and relatives), however,
the storage form of
insulin is a monomer.
3. Although the evolutionary reason for this change is not understood,
it is clear that
a number of amino-acid substitutions have occurred that act to stabilize
insulin in
monomeric form.
4. Most importantly, the B10 Histidine has been substituted by other
amino acids, so that
Zinc is no longer able to bind the monomers.
5. In addition, there have been quite a few substitutions of more
hydrophilic for less hydrophilic
amino acids in the regions in which the dimers contact. These substitutions
help prevent
the monomers from coming close enough together to form dimers.
6. In short, these substitutions all occur on the surface of the
molecule and make the monomers
more thermodynamically stable in an aqueous medium.
7. These changes have resulted in a greatly accelerated rate of
evolutionary change in insulin
protein sequence within the hystricomorph rodents.
Numbers
of amino acid substitutions separating the insulin sequence of several
species. The hystricomorph rodents are those surrounded by the
dashed line. Note the greatly accelerated rate of amino-acid substitution
in this group, compared to the 3 substitutions separating mouse and pig.
(From J. H. Gillespie. 1991. The Causes of Molecular Evolution.
Cambridge Univ. Press)
C. Example: Lysozyme
1. Many animals secrete lysozyme C as an antibiotic in saliva that attacks bacterial cell walls.
2. Artiodactyls and langurs have independently recruited lysozyme
into the stomach to aid in digestion
of the bacteria that these animals harbor to break down cellulose and
other plant materials.
3. Novel conditions encountered by lysozyme in artiodactyl and langur stomachs:
a. Presence of digestive enzymes that will break down lysozyme
b. low pH
4. In both artiodactyls and langurs, rate of amino-acid sequence
evolution is accelerated in conjunction
with recruitment of lysozyme into stomach:
Left: phylogeny
of pigs and artiodactyls showing accelerated rate of amino-acid substitution
in lysozyme between points A and B. Right: phylogeny of primates,
showing accelerated rate of substitution in Langurs. (From J. H. Gillespie.
1991. The Causes of Molecular Evolution. Cambridge Univ.
Press)
5. Accelerated evolution is correlated with convergent evolution:
five of the 10 amino-acid substitutions
in langurs have identical substitutions at identical positions in cows.
6. It is highly improbable that such convergence occurred by chance;
it therefore must represent
adaptive convergence.
7. That these changes were adaptive suggests that many of the other
changes that occurred during the
period of accelerated amino-acid substitution in both artiodactyls and
langurs were also probably
adaptive.
D. Conclusions
1. There are many good examples of adaptive amino-acid substitution,
which suggests that much
protein sequence evolution may be caused by natural selection.
2. However, a list of examples can not definitively prove that most
amino-acid substitutions occur
due to selection rather than drift. One can always claim that the
examples aren't representative and
constitute exceptions to the general pattern.
3. Consequently, other methods must be used to distinguish between
the neutralist and selectionist
views.
IV. Distinguishing between Selectionist and Neutralist Views: Generation Time Effects.
A. Explanation of Generation Time Effect
1. Neutral theory predicts that the rate of protein evolution should
be proportional to the mutation rate,
but, as we've seen, the mutation rate is measured in terms of events/generation
(e.g. mutations/copy/generation).
2. On the other hand, the molecular clock is usually calibrated
in terms of substitutions/year (see above
graphs).
3. The relationship between these two quantities is
No. substitutions/year = [No. substitutions/generation] / [No. years/generation]
= [No. mutations/generation] / [No. years/generation]
4. The theory says that [No. substitutions/generation] is constant,
because we assume that mutation rate
is more or less constant.
5. Assuming that mutation rate per generation is comparable for most organisms,
then the above relationship says that
the number of substitutions/year should be inversely proportional to generation time, [No. years/generation].
6. Another way of saying this is that the substitution rate/year
should be smaller for organisms with
longer generation times, i.e. the molecular clock should tick more slowly
for such organisms.
7. Consequently, if one separately plots No. substitutions vs. years
since common ancestor, as in the
figures above, for two groups of organisms (one group with short generation
times, the other with
long generation times), the relationship should have a higher slope for
the group with the shorter
generation time.
8. The existence of a generation time effect would thus be consistent
with the neutralist theory, whereas
the absence of a generation time effect would be inconsistent with that
theory and thus favor the
selectionist theory.
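A short numerical sketch may make the predicted generation time effect concrete. The mutation rate and the two generation times below are hypothetical values chosen only for illustration, and the function name is my own.

```python
def substitutions_per_year(mutation_rate_per_generation, years_per_generation):
    """Neutral prediction: substitutions/year = (mutations/generation) / (years/generation)."""
    return mutation_rate_per_generation / years_per_generation

m = 1e-6  # assumed mutation rate per gene copy per generation (hypothetical)

short_gen = substitutions_per_year(m, 0.5)   # e.g. a species breeding twice a year
long_gen = substitutions_per_year(m, 20.0)   # e.g. a species with a 20-year generation

ratio = short_gen / long_gen
print(f"short-generation rate: {short_gen:.2e} per year")
print(f"long-generation rate:  {long_gen:.2e} per year")
print(f"ratio of clock rates:  {ratio:.0f}")
```

With the same per-generation mutation rate, the ratio of the two per-year rates equals the ratio of the generation times, which is the difference in slope that the neutral theory leads one to expect.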
B. The Data
1. Generation time effect for silent substitutions
a. Silent substitutions are nucleotide substitutions that do not
result in amino-acid substitution.
b. It is generally believed that most silent substitutions are effectively
neutral.
c. Therefore, we expect a generation time effect to be exhibited by this class
of substitutions, and can use
this as a control for the existence of a generation-time effect.
d. The table below illustrates the generation time effect for 21
loci from primates, rodents, and
artiodactyls that have been sequenced. In general, the rate of silent
substitutions is
greatest for rodents, which have the shortest generation time, and slowest
for primates,
which have the longest generation time.
Rates of
evolution of silent substitutions. Rates are expressed relative
to the artiodactyl rate. (From J. H. Gillespie. 1991.
The Causes of Molecular Evolution. Cambridge Univ. Press.
Original data from Li. et al. 1987. J. Molec. Evol. 25: 330-342.)
Rates of
silent substitutions for additional animals. (From J. H. Gillespie.
1991. The Causes of Molecular Evolution. Cambridge Univ. Press.)
e. Data from silent sites thus confirms that a generation time effect
exists when substitutions
are caused primarily by genetic drift.
2. Lack of generation time effect for amino-acid substitutions (Wilson
et al. 1977)
a.
Examined amino-acid sequences of 12 proteins for 25 pairs of species.
b.
In each species pair, there was one species (e.g. elephant or grey seal)
that had a long generation time and one species (e.g. mouse) that
had a
short generation time.
c.
Plotted the number of amino-acid substitutions from the common
ancestor for the long-generation time species on one axis and the
corresponding number of substitutions for the short-generation
time
species on the other axis.
d.
In such a plot, the points should lie along the 45-degree line if there
is no
generation time effect (i.e. the number of substitutions since the common
ancestor should be similar for both species of the pair),
whereas they
should lie along a line of much higher slope if substitution rates
are higher
in the shorter-generation species.
e.
Found that the points lay along the 45-degree line, indicating an absence
of a detectable generation-time effect:
(Figure from M.
Ridley. 1993. Evolution. Blackwell Scientific Pubs.;
original
data from Wilson et al. 1977. Ann. Rev. Biochem. 46: 573-639)
f. Conclusion: Substitution rates not consistent with neutralist
expectations. Data therefore
supports selectionist view.
g. Caveats: Unfortunately, there are counterarguments that
can reconcile this lack of a
generation time effect with the neutralist view:
i. Organisms with longer generation times tend to be larger and
have smaller
populations. This means that any given mutation with a specified
selection
coefficient s is more likely to be effectively neutral (satisfy 4Ns <<
1) for the
species with longer generation times. This in turn means that the
effective
rate of production of neutral mutations is likely to be higher for those
organisms,
which would tend to counteract the expected generation time effect and
perhaps
completely compensate for it.
ii. Some investigators have argued that longer-lived organisms are
likely to undergo
more germ-line cell divisions per generation, which could also produce
a higher
mutation rate in these organisms that could counteract the expected generation
time effect.
h. Conclusion: Whether lack of a generation time effect is
as serious a blow to the neutralist
theory as some selectionists believe is open to question.
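The first caveat above turns on the quantity 4Ns. As a rough illustration, the sketch below evaluates 4Ns for one selection coefficient in two populations of different size. Both population sizes and the value of s are hypothetical, and the notes give no numerical cutoff for "<< 1", so the code only prints the values rather than classifying them.

```python
def scaled_selection(pop_size, s):
    """Return 4Ns, the quantity compared against 1 in the neutrality criterion."""
    return 4 * pop_size * s

s = 1e-6  # hypothetical selection coefficient

small_pop = scaled_selection(1_000, s)       # e.g. a large-bodied, long-generation species
large_pop = scaled_selection(1_000_000, s)   # e.g. a small-bodied, short-generation species

print(f"4Ns in the small population: {small_pop}")
print(f"4Ns in the large population: {large_pop}")
```

The same mutation gives a value far below 1 in the small population but a value above 1 in the large one, so it would behave as effectively neutral only in the former.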
V. Is the molecular clock episodic?
A. The issue
1. Neutral theory has traditionally assumed that amino-acid substitution
caused by drift is a process that
can be modelled as a Poisson process.
a. A Poisson process is one that behaves like radioactive decay:
b. In any given time interval there is a constant probability that
an event will occur (e.g.
an atomic nucleus will decay; or an amino acid at a particular site in
a protein will be
replaced by another amino acid)
c. The actual number of events per unit time may differ from the
average, and in fact is
given by the Poisson distribution:
Prob[x events in time unit t] = (m^x / x!) e^(-m) ,
where m is the average number of events per unit time
interval. (Do not confuse this m with the
m that stands for mutation rate. Unfortunately,
the symbol used is the same.)
d. Characteristic Poisson distributions:
This figure
shows a series of Poisson distributions with different mean numbers of
events per unit time, m . Y-axis is proportion of time units
(samples) with indicated number of events (x axis). (From R. R.
Sokal and F. J. Rohlf. 1969. Biometry. W. H. Freeman.)
2. It is therefore legitimate to ask whether actual amino-acid substitutions
occur according to a Poisson
process.
a. If so, this supports the neutralists' contentions.
b. If not, this may constitute evidence for the selectionists' contentions.
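For readers who want to see numbers, the Poisson formula given above, (m^x / x!) e^(-m), can be evaluated directly. This is a minimal sketch; the function name poisson_pmf and the mean m = 2 are illustrative choices, not values from the lecture.

```python
import math

def poisson_pmf(x, m):
    """Probability of exactly x events in a time unit when the mean is m."""
    return (m ** x / math.factorial(x)) * math.exp(-m)

m = 2.0  # hypothetical mean number of substitutions per time unit
probs = [poisson_pmf(x, m) for x in range(10)]

for x, p in enumerate(probs):
    print(f"x = {x}: {p:.4f}")
```

A useful property for what follows is that the variance of a Poisson distribution equals its mean, which is why a variance-to-mean ratio near 1 is the signature of a Poisson clock.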
B. Testing for a Poisson process
1. Whether substitutions follow a Poisson distribution can be tested
by a variety of statistical procedures,
all of which are too complicated for consideration here.
2. However, it is possible to give a heuristic description of these tests.
3. One approach is to use "star phylogenies": a phylogeny in which
several different lineages converge
virtually simultaneously to the same common ancestor.
4. An example of a star phylogeny is given by the radiation of mammals,
in which a number of different
mammalian families arose at approximately the same time (approximately
140 million yrs ago at the end of
the Jurassic period):
Radiation
of several families of modern mammals from ancestral mammals of the Jurassic.
Numbers indicate estimated numbers of amino-acid substitutions along
indicated lineages for β-globin. (From J. H. Gillespie.
1991. The Causes of Molecular Evolution. Cambridge Univ.
Press.; original data from M. Kimura. 1987. J. Mol. Evol.
26: 24-33.)
5. The time since the common ancestor is the same for all lineages in a star phylogeny.
6. Using techniques that we will discuss in a couple lectures, it
is possible to infer, for a particular
protein, the amino acid sequence of the common ancestor, given the sequences
of the modern
descendants.
7. One can then calculate the number of amino-acid substitutions that occurred in each lineage.
8. To test whether the molecular clock follows a Poisson process,
one must assume that the underlying
rates of amino-acid substitution are the same for the different lineages.
9. This rate may then be calculated as the average observed rate, averaged over the different lineages.
10. Given this common rate, the Poisson equation above predicts
how much variation among the lineages
in observed substitution rates one should expect due to chance effects.
11. One can then ask whether the actual variation in observed rates
is greater than or less than what
is expected for a Poisson process.
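The heuristic above can be sketched as a variance-to-mean comparison. The example below is an assumption-laden illustration, not one of the formal tests: it computes the ratio of the sample variance to the mean of the per-lineage substitution counts, which should be close to 1 under a Poisson clock. The lineage counts are hypothetical placeholders, not the numbers from the star phylogeny figure, and published estimators of this kind of index may include corrections that are omitted here.

```python
from statistics import mean, variance

def index_of_dispersion(counts):
    """Ratio of the sample variance to the mean of per-lineage substitution counts."""
    return variance(counts) / mean(counts)

# HYPOTHETICAL substitution counts for six lineages of a star phylogeny.
lineage_counts = [4, 12, 7, 15, 2, 9]

r = index_of_dispersion(lineage_counts)
print(f"mean = {mean(lineage_counts):.2f}, R = {r:.2f}")
```

A value well above 1 would indicate more variation among lineages than a Poisson process predicts; deciding whether the excess is statistically significant requires the more elaborate procedures mentioned above.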
C. The data
1. As an example, Kimura examined 5 proteins for the star phylogeny depicted in the Figure above.
2. All five exhibited deviations from expectations under the Poisson clock.
Estimates
of an "index of dispersion", R, for a radiation of mammals. R=1 if
amino-acid substitutions follow a Poisson clock. R>1 indicates
overdispersion. (From J. H. Gillespie. 1991. The Causes
of Molecular Evolution. Cambridge Univ. Press.; original data from
M. Kimura. 1983. The Neutral Allele Theory of Molecular Evolution.
Cambridge Univ. Press.)
3. Other studies have obtained similar results:
(Analysis
of applicability of Poisson model to sequence data for 20 loci from rodents,
primates, and artiodactyls. The relevant column of this table is
the one headed by Rr(t), which is comparable to
the index of dispersion from the previous example. In this example,
12 of 20 loci exhibited statistically significant deviations from the Poisson
clock in the direction of overdispersion. (From J. H. Gillespie.
1991. The Causes of Molecular Evolution. Cambridge Univ.
Press.; original data from Li et al. 1987. J. Molec. Evol. 25:
330-342.)
4. In general, most studies have found that observed substitution
rates are overdispersed--there
are more lineages that have high observed rates of substitution and more
lineages that have
low rates of substitution than is expected under a Poisson process.
5. Another way of saying this is that the molecular clock tends
to be episodic: there are short bursts
of several substitutions followed by periods with few substitutions, followed
by a short burst,
and so on.
6. This pattern is easily explained by the selectionists:
adaptations come in bursts in response to
periodic environmental changes experienced by organisms.
7. Neutralists have more difficulty explaining this episodicity--it
can be done by assuming that
genetic drift gives rise to a substitution process that is more complicated
than a Poisson process,
but the biological justification for making such an assumption is unclear.
D. Conclusion: Episodicity of molecular clock is most easily
explained by assuming adaptation to changed
environments is episodic, but the possibility that genetic drift produces
episodic change can not be
ruled out definitively.
VI. dN/dS Ratios
A. The advent of high-throughput DNA sequencing (as opposed to protein
sequencing) has fostered the development
of analytical approaches to quantifying patterns of DNA variation both
within and among species.
1. Coupled with the development of coalescence
theory (an extension of the neutral theory), these approaches
allow predictions to be made regarding the patterns of nucleotide variation
that are expected under
neutrality vs. the historical operation of natural selection.
2. In the next sections,
we consider some of these predictions, and how they have been tested.
As we shall see,
this line of research has essentially settled the neutralist-selectionist
debate.
B. dN/dS ratio: (see handout)
dN/dS = (No. non-synonymous substitutions / No. non-synonymous sites) / (No. synonymous substitutions / No. synonymous sites)
1. Under strict neutrality (i.e. most
amino-acid substitutions are due to drift), expect dN/dS = 1.
2. A value of dN/dS < 1
indicates operation of purifying selection, but doesn't rule out operation
of repeated positive selection.
3. A value of dN/dS > 1 indicates
the operation of repeated positive selection.
C. These relations suggest that evidence for repeated positive
selection can be obtained from cases in which dN/dS > 1.
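The ratio defined above can be computed as follows. This is a simplified sketch: it uses raw counts of substitutions and sites, ignores the corrections for multiple substitutions at one site that real analyses apply, and does not test whether a ratio differs significantly from 1. All counts are hypothetical and the function names are my own.

```python
def dn_ds(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """(non-synonymous substitutions per site) / (synonymous substitutions per site)."""
    dn = nonsyn_subs / nonsyn_sites
    ds = syn_subs / syn_sites
    return dn / ds

def interpret(ratio):
    """Map a dN/dS value onto the three cases listed in the outline."""
    if ratio > 1:
        return "consistent with repeated positive selection"
    if ratio < 1:
        return "consistent with purifying selection"
    return "consistent with strict neutrality"

# HYPOTHETICAL counts for one pair of species.
ratio = dn_ds(nonsyn_subs=30, nonsyn_sites=600, syn_subs=10, syn_sites=250)
print(f"dN/dS = {ratio:.2f}: {interpret(ratio)}")
```

Dividing by the number of sites in each class matters because a typical gene has several times more non-synonymous than synonymous sites, so raw substitution counts alone are not comparable.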
D. Example: evolution of chitinase in Arabidopsis (Arabis) and close relatives. (Bishop, Dean and Mitchell-Olds, 2000, PNAS 97: 5322-5327)
1. Chitinase is a defensive
enzyme in plants that breaks down cell walls of fungal pathogens.
2. Sequenced chitinase from 15 species
of Arabis and one species of the related genus Brassica.
3. Calculated and plotted
dN vs. dS (or Ka vs. Ks) for all pairs of species:
4. These results indicate that a
substantial number of comparisons exhibit a dN/dS ratio significantly >
1,
pointing to repeated positive selection as the cause of
many of the amino acid substitutions.
5. In addition, the spatial pattern
of substitutions also points to positive selection:
a. For class
1 Chitinases (the type examined in this study), the per-site rate of amino-acid
substitution
is the same for sites within and
outside of the enzyme's active site.
b. By contrast,
for other enzymes (e.g. Lysozyme and Class III chitinases), the per-site rate
of amino-acid substitution
is much smaller in the active site.
c. It
is possible, using techniques that cannot be described here, to infer WHICH
amino-acid substitutions
were due to positive selection and
which to drift. For the chitinases in this study, approximately
15% of the substitutions were inferred
to be due to selection. ALL OF THESE OCCURRED
IN THE ACTIVE SITE OF THE
ENZYME, a highly non-random pattern.
d. This
pattern suggests that coevolution between plants and pathogens drives adaptive
amino-acid substitution
in chitinase.
6. Similar examples of dN/dS >
1 have been found in a variety of organisms. Interestingly, most
of these
examples involve proteins that are
involved in male reproductive characters or gamete recognition.
Together, these examples demonstrate
that repeated positive selection may be a common phenomenon.
However, these cases admittedly
do not involve a random sample of proteins, and so they do not
constitute evidence that most protein
evolution is due to selection rather than genetic drift.
VII. The McDonald-Kreitman Test
A. What we have seen so far is that various approaches to distinguishing
between neutralist and
selectionist theories have been problematical.
B. What is exciting is that over the past five to ten
years, new techniques, based on analysis of DNA
sequence variation, have been developed that in principle can resolve
this question.
C. The prototype study using these techniques was performed
by McDonald and Kreitman and focused
on attempting to determine the extent to which natural selection and genetic
drift have been
responsible for protein sequence divergence at the Adh locus between
Drosophila species.
1. Like most other proteins, between-species comparisons of the Adh
sequence show fixed
differences, indicating evolutionary change in protein primary structure.
2. Like most proteins, Adh is also variable within populations,
both at silent sites and also
in amino-acid composition.
3. The McDonald-Kreitman test uses this variation within populations
to deduce the nature of
the evolutionary processes that have caused the between-species differences.
D. Logic underlying the M-K test.
1. Suppose that the neutralists are right, that virtually all evolutionary
change in proteins is due
to genetic drift involving neutral alleles.
2. This supposition implies that the fixed amino-acid differences
between species are due to
genetic drift, and we can therefore make the following prediction:
a. Divide both within-species and between-species variation in nucleotide
sites into two
categories
i. those due to non-synonymous nucleotide substitutions, and
ii. those due to synonymous substitutions
b. Then the ratio of synonymous to non-synonymous substitutions should
be the same
between species as it is within species.
3. Explanation of this prediction:
a. Let the mutation rate per site for neutral synonymous mutations
be ms
b. Let the mutation rate per site for neutral non-synonymous mutations
be mn
c. Then, because both classes of mutations are governed by genetic
drift, they
have the same probabilities of being eliminated, fixed, or polymorphic
at any given
time after they arise.
d. This means that the ratio of synonymous to non-synonymous
polymorphic sites
at any time should be
(No. of synonymous sites polymorphic/gene) / (No. of non-synonymous sites polymorphic/gene) = msKs / mnKn (1)
where Ks and Kn are the number of synonymous
and non-synonymous sites in the gene.
e. Next, consider fixed differences between two species.
f. Fixed differences at either synonymous or non-synonymous sites
arise due to
divergence from the sequence of the common ancestor.
g. Consider the expected number of synonymous sites differing between
that ancestor
and one of the species. This is given by
No. differences/gene = (No. mutations/gene/time) x (divergence time) x (prob a mutation is fixed)
= ( msKs ) x t x (1/2N)
= tmsKs / 2N
h. Now, assuming the mutation rate and population size of the other
species are similar,
the expected number of synonymous sites differing between that species
and the
ancestor is also tmsKs / 2N.
j. Consequently, the expected number of synonymous differences
between the two
living species is just the sum of these two numbers, i.e.
Expected No. synonymous differences/gene = tmsKs / 2N + tmsKs / 2N = tmsKs / N (2)
k. A similar line of reasoning indicates that the expected number
of non-synonymous
differences between the two living species is just
Expected No. non-synonymous differences/gene = tmnKn / N (3)
l. Using these results (Equations (2) and (3) ), the expected ratio
of synonymous to
non-synonymous fixed differences between the two species is just
(No. synonymous differences/gene) / (No. non-synonymous differences/gene) = msKs / mnKn (4)
m. Comparing Equations (1) and (4), it is clear that, assuming
most amino-acid
differences between species are due to drift,
(No. synonymous differences between species/gene) / (No. non-synonymous differences between species/gene)
= (No. of synonymous sites polymorphic within species/gene) / (No. of non-synonymous sites polymorphic within species/gene) (5)
n. In other words, under the neutralist view, in which most
polymorphisms as well as
most fixed differences in proteins are selectively neutral, the ratio of
variable
synonymous sites to variable non-synonymous sites should be the same for
within-species polymorphisms and for fixed differences between species.
o. Suppose, however, that most differences between species in non-synonymous
sites are
due to selection fixing advantageous mutants.
i. These show up as fixed differences between species, but they don't
show up
as non-synonymous polymorphisms within a species because they are rapidly
fixed soon after they arise.
ii. This effectively means that, relative to the number of fixed
non-synonymous differences, the denominator of the right-hand side of
equation (5) is smaller than it would be under strict neutrality of
non-synonymous mutations.
iii. Consequently, if selection is responsible for a substantial
proportion of
fixed non-synonymous differences, we should have
(No. synonymous differences between species/gene) / (No. non-synonymous differences between species/gene)
<
(No. of synonymous sites polymorphic within species/gene) / (No. of non-synonymous sites polymorphic within species/gene) (6)
p. Thus, the selectionist and neutralist views can be distinguished
on the basis of
equations (5) and (6).
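The neutral expectation in Equations (1) through (4) can be checked numerically: because t and N appear identically in Equations (2) and (3), they cancel in the ratio. The following is a minimal sketch rather than part of the original derivation; the function names and all numerical values of ms, mn, Ks, Kn, t and N are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def expected_fixed_differences(m, K, t, N):
    """Expected fixed differences between two living species for one site class.

    Follows Equations (2) and (3): each lineage contributes t*m*K/(2N),
    so the two lineages together contribute t*m*K/N.
    """
    per_lineage = Fraction(m) * K * t / (2 * N)
    return 2 * per_lineage

# Hypothetical illustrative values (not taken from the lecture).
ms, mn = Fraction(1, 1000), Fraction(1, 4000)
Ks, Kn = 200, 600

def neutral_ratio(t, N):
    """Ratio of synonymous to non-synonymous fixed differences, Equation (4)."""
    syn = expected_fixed_differences(ms, Ks, t, N)
    non = expected_fixed_differences(mn, Kn, t, N)
    return syn / non

polymorphism_ratio = (ms * Ks) / (mn * Kn)  # Equation (1)

# Whatever t and N are, the divergence ratio equals the polymorphism ratio.
for t, N in [(10**5, 10**4), (10**6, 10**6), (5 * 10**6, 500)]:
    assert neutral_ratio(t, N) == polymorphism_ratio
print(polymorphism_ratio)
```

Exact fractions are used so that the equality in Equation (5) holds without floating-point rounding.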
E. The data for Adh
1. McDonald and Kreitman applied this type of analysis to variation
at the Adh (alcohol
dehydrogenase) locus in D. melanogaster and D. simulans.
2. They sequenced 12 Adh alleles from D. melanogaster
and six alleles from D. simulans.
3. They found the following results for number of differences within and
between species:
Site type         Within species   Between species
Synonymous              42               17
Non-synonymous           2                7
4. We thus have
(No. synonymous differences between species/gene) / (No. non-synonymous differences between species/gene) = 17/7 = 2.43
<<
(No. of synonymous sites polymorphic within species/gene) / (No. of non-synonymous sites polymorphic within species/gene) = 42/2 = 21
5. In other words, the data satisfy inequality (6) rather than equation (5).
6. The pattern of variation at the Adh locus thus supports
the notion that a large fraction of the
between-species differences at non-synonymous sites are due to natural
selection fixing
advantageous mutants.
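The comparison of the two ratios for Adh can be reproduced in a few lines. The sketch below uses the four counts given above and adds a two-sided Fisher's exact test as one common way of asking whether a 2x2 table departs from equal ratios; these notes do not say which statistical test the authors used, so the choice of test and the function name are assumptions made for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given the observed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables no more probable than the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_obs * (1 + 1e-9))

# Counts from the Adh data above.
syn_within, syn_between = 42, 17
non_within, non_between = 2, 7

ratio_between = syn_between / non_between  # left-hand side of (5) and (6)
ratio_within = syn_within / non_within     # right-hand side of (5) and (6)
p_value = fisher_exact_two_sided(syn_within, syn_between,
                                 non_within, non_between)
print(round(ratio_between, 2), ratio_within, p_value < 0.05)
```

A small p-value here says only that the two ratios differ by more than chance would readily produce; the direction of the difference is what points to inequality (6).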
F. M-K tests on a genomic scale
1. As with the analyses described above, one or a few significant
cases of repeated positive selection
revealed by the M-K test do not permit a generalization about the
role of selection in protein
evolution.
2. Recently, however, with the advent of
high-throughput sequencing, it has been possible to examine a large
sample of proteins using the McDonald-Kreitman approach.
a. Fay et al. 2002. Nature 415: 1024-1026. M-K approach
using 45 different genes from Drosophila melanogaster and
D. simulans:
                          Polymorphism   Divergence
Non-syn. substitutions         65            598
Syn. substitutions            224            950
--Under neutral
evolution, expect only 276 non-synonymous substitutions (950 x 65/224) in divergence between
the two species, rather
than the observed 598. This indicates an excess of
non-synonymous substitutions compared to the neutral expectation,
and these excess substitutions must have been due to positive
selection. Best estimate is that 54% of amino-acid substitutions between
the two species were caused by natural selection.
b. Fay
et al. 2001. Genetics 158: 1227-1234. In
a similar study on humans and primates, found that 35% of amino-acid substitutions
were caused by selection.
c. Smith and Eyre-Walker. 2002. Nature 415: 1022-1024. Similar study comparing Drosophila simulans and D. yakuba. Found
that 45% of amino-acid substitutions
were due to selection. They estimated that, on average, one adaptive substitution
has occurred
in these lineages
every 45 years.
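The arithmetic behind the Fay et al. (2002) figures can be made explicit. The sketch below takes the four counts from the table in item (a) and computes the non-synonymous divergence expected if the divergence ratio matched the polymorphism ratio, together with the fraction of observed non-synonymous divergence in excess of that expectation. The function names are hypothetical, and this is only the simple ratio calculation, not necessarily the full estimation procedure used in the paper.

```python
def neutral_expected_nonsyn_divergence(pn, ps, ds):
    """Non-synonymous divergence expected if Dn/Ds equalled Pn/Ps."""
    return ds * pn / ps

def adaptive_fraction(pn, ps, dn, ds):
    """Fraction of non-synonymous divergence exceeding the neutral expectation."""
    expected = neutral_expected_nonsyn_divergence(pn, ps, ds)
    return (dn - expected) / dn

# Counts from Fay et al. (2002) as given in item (a) above.
pn, ps = 65, 224    # polymorphism: non-synonymous, synonymous
dn, ds = 598, 950   # divergence:   non-synonymous, synonymous

expected = neutral_expected_nonsyn_divergence(pn, ps, ds)
alpha = adaptive_fraction(pn, ps, dn, ds)
print(round(expected), round(alpha, 2))
```

The two printed values correspond to the 276 expected substitutions and the 54% estimate quoted in item (a).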
VIII. Conclusions
1. Based on current evidence,
which represents a reasonably large, more-or-less random sample of proteins from Drosophila
and from primates, approximately
45% of amino-acid substitutions are adaptive, i.e.
caused by natural selection. The remainder are caused by genetic
drift.
2. This conclusion supports
the Selectionist viewpoint more than the Neutralist viewpoint. Selectionists
did not deny that some substitutions were due
to genetic drift; they denied only that drift accounted for a large majority. Neutralists,
by contrast, argued that adaptive substitution was very rare and that
most substitutions
are caused by drift. The data indicate that this is not the
case.
IX. Assessing Historical Patterns of Selection: Polymorphism at the Adh locus
A. Reconstructing history
1. The previous example, besides demonstrating how analysis of sequence
data can address
the neutralist-selectionist controversy, illustrates how such data can also
be used to
infer characteristics about past evolutionary processes (e.g. that
evolutionary change of
amino-acid sequence of the Adh locus has been governed largely by
selection rather than
drift).
2. This is just one illustration of the power of sequence data
to reveal the nature of historical
evolutionary processes.
3. Beginning today, we shall examine other methods for reconstructing
evolutionary history.
4. Today, however, we will continue to examine ways of detecting
past history of selection.
B. The Adh locus.
1. Alcohol dehydrogenase is an enzyme that metabolizes alcohol.
Since alcohol is a common
component of rotting fruit, a common larval habitat of Drosophila, proper
functioning of the enzyme
is necessary for avoiding alcohol poisoning in developing larvae.
2. Drosophila melanogaster is distributed worldwide, and throughout
its range it exhibits
a polymorphism for electrophoretic variants.
3. The variants come in two classes, Fast and Slow, names indicative
of the rates at which
these variants migrate on an electrophoretic gel.
4. The difference between Fast and Slow alleles is due to the
substitution of threonine in the
Fast alleles for lysine in the Slow alleles at position 1490 of the Adh
gene.
5. Within an allele class (Fast or Slow), there is a great deal
of variation at other nucleotide
sites, so that the nucleotide sequences of all Fast alleles are not the same,
nor are the
sequences of all Slow alleles.
6. Virtually all of this variation within an allele class is
silent, however, i.e. it does not result
in an amino-acid substitution.
7. Because of the world-wide distribution of the Fast-Slow polymorphism,
it has been
conjectured that this is an ancient polymorphism actively maintained by some
form
of "balancing selection".
a. Balancing selection is any process that prevents either allele from
being eliminated.
b. Heterozygote superiority (e.g. sickle-cell
polymorphism) is one type of balancing
selection.
c. Frequency-dependent selection (e.g. selection
maintaining tristyly in plants) is
another type of balancing selection.
C. Equilibrium levels of silent-site variation under genetic drift without balancing selection
1. In this and the next sections, it is shown that long-term balancing
selection should result in
excess silent-site variation at sites immediately surrounding the site of
the Fast-Slow
polymorphism, an excess that should not be present if there has not been
long-term balancing
selection.
2. Here we first consider the accumulation of genetic variation in
the absence of balancing
selection.
3. Think of a stretch of DNA, i.e. a gene. Let us say for convenience
that this gene is 1000 base pairs long.
Consider the silent sites in this gene, at which variants are selectively
neutral and hence
governed by genetic drift.
4. Suppose initially all copies of this gene are identical in a population.
5. Over time, mutations will arise at many of the silent sites.
6. Because the fate of these mutations will be governed by genetic
drift, some of the mutant
alleles will drift to intermediate frequencies, while others will be eliminated.
7. After a short time, then, there will be several sites that are variable in the population.
8. At a later time, more mutations will have arisen, increasing the number of variable sites.
9. On the other hand, some of the variable sites present earlier will
become fixed due to genetic
drift, which will decrease the number of variable sites.
10. These two tendencies (8 and 9) oppose each other, and will eventually
come to balance one
another: at some point, on average, the increase in the number of variable
sites due to
mutation will be just counterbalanced by the decrease due to fixation,
yielding an
equilibrium.
11. This equilibrium can be described graphically by a simple model:
a. We plot first the number of new sites becoming variable per
unit time.
b. Since sites that are already variable can not become new variable
sites,
the number of sites becoming newly variable per unit time should be
proportional to the number of sites that are not already variable.
c. Consequently, the number of sites becoming newly variable
is a
decreasing function of the number of sites already variable, as shown in
the Figure below.
d. Next we plot the number of variable sites that become fixed
per unit time.
e. Since only sites that are currently variable can become fixed,
the number
of sites becoming fixed should be proportional to the number of sites that
are currently variable.
f. Consequently, the number of sites becoming newly fixed is
an increasing function of the number
of sites already variable, as shown in the Figure.
12. An alternative way of portraying this equilibrium is to plot the number of sites variable vs. time
a. Starting with a situation in which there are just a few variable
sites, the population
will be at point 1 on the x-axis (number of variable sites) of the left figure:
b. Because the number of new variable sites created is greater than
the number of variable sites fixed (green line),
there will be an increase in the number of variable sites equal to the difference
between the red and blue lines
(equal to length of green line). This is portrayed by the number of
variable sites at time 1 in the right graph.
c. Going back to the left graph, the population at time 2 (difference
between point 1 and point 2 on x-axis equals the
height of the green line at point 1, which is also equal to the horizontal
teal line starting at point 1) still experiences
a higher number of variable sites created than variable sites fixed, though
the magnitude of the difference
(green line at point 2) is less than in the previous time period.
d. The total number of variable sites is increased by this difference,
and plotted on the right graph as the point
corresponding to 2 on the x-axis.
e. This process is repeated for subsequent points, generating the red
line on the right graph.
f. Conclusion: The number of variable sites increases at
first, but then levels off at an equilibrium.
13. This analysis indicates that by themselves, the processes of drift
and mutation lead to a balance that puts an upper
limit on the amount of variation in a gene, as measured by number of variable
silent sites.
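The graphical argument in items 11 and 12 can be sketched as a simple iteration. In the sketch below, the number of sites becoming newly variable in each time step is proportional to the number of sites not yet variable, and the number becoming fixed is proportional to the number currently variable. The function name and the values of the two proportionality constants are hypothetical, chosen only so that the leveling off is easy to see; the notes themselves give no numerical rates.

```python
def simulate_variable_sites(total_sites, gain_rate, loss_rate, start, steps):
    """Iterate the two-line graphical model of silent-site variation.

    gain per step = gain_rate * (sites not yet variable): the decreasing line
    loss per step = loss_rate * (sites currently variable): the increasing line
    """
    trajectory = [float(start)]
    for _ in range(steps):
        v = trajectory[-1]
        gained = gain_rate * (total_sites - v)
        lost = loss_rate * v
        # The net change is the vertical gap between the two lines at v.
        trajectory.append(v + gained - lost)
    return trajectory

# Hypothetical parameter values (not from the lecture).
K, gain, loss = 1000, 0.002, 0.018
traj = simulate_variable_sites(K, gain, loss, start=5, steps=500)

# The two lines cross where gain * (K - V) = loss * V.
equilibrium = gain * K / (gain + loss)
print(round(traj[1], 2), round(traj[100], 2), round(traj[-1], 2), equilibrium)
```

Each step adds a smaller increment than the one before, so the trajectory rises quickly at first and then flattens as it approaches the crossing point of the two lines.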
D. Levels of silent-site variation with balancing selection but no recombination
1. Next, consider what is expected to happen to the number of variable
silent sites if there is a non-synonymous
site in the gene that is kept polymorphic by balancing selection (as is hypothesized
for the amino-acid
polymorphism in Adh).
2. Suppose that initially the population starts out with the alleles
at the non-synonymous site at roughly equal
frequencies, and that at all other sites these two alleles are identical.
3. For the time being, also assume that there is no recombination between these two alleles.
4. As before, we expect mutations to arise at silent sites in both alleles, and for some of these mutations to drift to fixation.
5. The key thing to recognize though, is that different sites will mutate
and become fixed on the different alleles.
6. Over time, more and more mutations will become fixed, again, with
different mutants becoming fixed on the
two allele types.
7. In this case, however, there is no process analogous to the fixation
we saw earlier to counteract this steady build
up of differences between the two allele types: when a new mutant becomes
fixed within one allele type, this increases,
rather than decreases, the number of silent sites that differ between the
two alleles.
8. The two alleles will thus gradually and steadily diverge over time
in the number of sites that differ. Another
way of saying this is that if we compare pairwise sites on one allele with
sites on the other allele, the number of
sites that are variable will increase steadily with time (blue line in figure
below):
E. Levels of silent-site variation with balancing selection and recombination
1. In the previous section we assumed that there is no recombination between the two allele types.
2. This is likely to be a good approximation near the site of the amino-acid substitution (the site of balancing selection).
3. Away from that site, however, recombination is likely to be common.
4. Consider, in particular, the opposite extreme: a stretch of DNA far
removed from the amino-acid substitution, so that
recombination occurs freely between this stretch and the site of amino-acid
substitution.
5. Then the situation is exactly the same as considered originally
(section C above): only drift and mutation are
operating. What's going on in the region near the amino-acid substitution
has no influence on this region because
recombination between the two regions essentially decouples them.
6. Consequently, in the region far from the amino-acid substitution,
comparing pairwise variation at sites corresponding
to the two alleles is just like comparing two DNA sequences at random from
a population in which there is no
selection.
7. We therefore expect an equilibrium level of silent-site variation
between allele types to be established--an equilibrium
level that will be far lower than the level of silent-site variation built
up in the area surrounding the amino-acid
substitution.
8. There are thus two extremes: what happens in an area of no
recombination around a site of balancing selection, and
what happens in an area far removed from that site.
9. In between, the level of variation is between these two extremes,
as might be expected: As one moves from the site
of balancing selection to the distant site (i.e. toward free recombination),
the effects of recombination become more
and more dominant, and the amount of silent-site variation decreases toward
the equilibrium level at the distant site.
10. There is thus expected to be a peak of silent-site variation centered
on the site of balancing selection; as one moves
away from this site in either direction, the amount of variation is expected
to decay to the equilibrium level:
F. The data
1. To estimate levels of silent-site variation at different distances
from the Adh amino-acid polymorphism, Kreitman and
Hudson sequenced 11 copies (5 Fast alleles and 6 Slow alleles) of the region
surrounding and including the Adh
region in D. melanogaster:
Boxes represent
exons, lines between boxes represent introns. Numbers below the
figures indicate
the number of polymorphic sites, and number of fixed differences between
D. melanogaster and D. simulans, within
indicated gene regions. Adh: the Adh gene. Adh-dup:
a duplication of the Adh gene, thought to be a pseudogene (From M.
Kreitman and R. R. Hudson. 1991. Genetics 127: 565-582).
2. After aligning these sequences, they compared each Slow allele with each Fast allele in the following way:
a. They marked off a "window" beginning at the 5' end of the sequences
and extending in the 3' direction far enough
to include 100 silent sites.
b. Within this window, they tabulated the number of sites that differed
between the two sequences.
c. The window's 5' end was then moved to the right (3' direction)
by one silent site, and the 3' end was extended to
include an additional silent site.
d. The number of sites that differed
between the two sequences was then tabulated for that window.
e. Steps (c) and (d) were repeated until the entire sequence
had been examined.
f. The data for a pair of sequences thus consisted of a set of
numbers, corresponding to the number of variable sites
within each window.
3. The number for a particular window was then averaged over
all pairs of Fast and Slow alleles to yield an average
number of polymorphic silent sites for each window.
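The sliding-window procedure described above can be sketched in code. The version below assumes, as a simplification, that each aligned sequence has already been reduced to its silent sites, so that a window of 100 silent sites is just 100 consecutive characters. The function names and the toy sequences are hypothetical and are not the Kreitman and Hudson data.

```python
from itertools import product

def window_differences(seq_a, seq_b, window=100):
    """Count differing sites in each window, sliding one site at a time."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = [int(x != y) for x, y in zip(seq_a, seq_b)]
    if len(mismatches) < window:
        return []
    count = sum(mismatches[:window])
    counts = [count]
    for start in range(1, len(mismatches) - window + 1):
        # Drop the site leaving at the 5' end, add the one entering at the 3' end.
        count += mismatches[start + window - 1] - mismatches[start - 1]
        counts.append(count)
    return counts

def average_fast_slow_profile(fast_alleles, slow_alleles, window=100):
    """Average the per-window counts over every Fast-Slow pair."""
    profiles = [window_differences(f, s, window)
                for f, s in product(fast_alleles, slow_alleles)]
    return [sum(col) / len(col) for col in zip(*profiles)]

# Toy data (hypothetical): silent sites only, window of 4 sites.
fast = ["AAAATTTTGG", "AAAATTTAGG"]
slow = ["AAAACTTTGG", "AAGACTTTGG"]
profile = average_fast_slow_profile(fast, slow, window=4)
print(profile)
```

Plotting each averaged value against the midpoint of its window would give a curve of the kind described in the results, with higher values wherever Fast and Slow alleles differ at many nearby silent sites.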
G. The results
1. Using the data just described, the authors plotted number of variable
silent sites vs. position of the corresponding
window (where that position is defined by the midpoint location of the window):
Variability
at silent sites in the region of the D. melanogaster Adh locus
portrayed using the sliding window technique. The observed variability
shows a marked peak centered on the site of amino acid substitution between
Fast and Slow alleles (arrow). The expected curve portrays the amount
of variability expected under the assumption of no balancing selection.
(From M. Kreitman and R. R. Hudson. 1991. Inferring the evolutionary
histories of the Adh and Adh-dup loci in Drosophila
melanogaster from patterns of polymorphism and divergence.
Genetics 127: 565-582)
2. The data clearly demonstrate the existence of a peak of silent-site
variation centered on the position of the amino-acid
substitution responsible for the Fast-Slow polymorphism.
H. Conclusions
1. The pattern of silent-site variation in the region of the Fast-Slow
polymorphism deviates markedly from the pattern
expected in the absence of balancing selection at that polymorphism.
2. The deviation is in the direction expected if balancing selection
has been acting on that polymorphism for extended
periods.
3. The method used provides a powerful way of detecting the operation of long-term balancing selection.
X. Assessing Historical patterns of selection: Other patterns
A. Selective sweeps
1. Consider a non-synonymous, advantageous mutation that arises in
a population and is fixed rapidly (within a few
hundred generations)--a "selective sweep".
2. At the time of fixation all copies of that allele are recently descended from the original mutant copy.
3. There will not have been much time for mutations to accumulate on
any of those copies in the short time since the
mutation arose.
4. Consequently, all copies will be identical not only at the
selected site, but at most other sites, and in particular, at most
silent sites.
5. The amount of silent-site variation will thus be below the equilibrium level discussed in section IX. C. above.
6. This depletion of variation will drop off as one moves away
from the selected site, however, due to the effects of
recombination, just as we saw previously.
7. The result is that the pattern of silent-site variation characterizing
recent selective sweeps is a local depression of
variation in the region immediately surrounding the selected site--the opposite
of what is seen with balancing selection: