Principles of Evolution Web Page
                        Lectures
 



LECTURES 13, 14, 15 and 16:    MOLECULAR EVOLUTION

I.  The Molecular Clock

    A.  Example of molecular evolution: cytochrome C

    B.  Rates of Amino Acid sequence evolution--Zuckerkandl and Pauling

        1.  In the early 1960s, protein sequencing had recently been invented, and this allowed
             Zuckerkandl and Pauling to compare the sequences of homologous proteins across a
             number of different taxa.

        2.  One obvious pattern that emerged was that closely related taxa had relatively similar sequences for a
             given protein, while more distantly related taxa had more divergent sequences for the same protein.

        3.  This pattern forms the basis for phylogeny reconstruction using molecular sequences, which is
             considered in the next lecture.

        4.  A second pattern that became evident was that protein sequences seemed to diverge at a more-or-
             less constant rate, which gave rise to the notion of the "molecular clock".

    C.  Example: molecular clock for α-globin.

        1.  Data collection:
 
            a.  Compare sequences of pairs of species and determine the number of sites at which the two
                  species of a pair differ in amino-acid composition (number of substitutions).
            b.  For each pair, also estimate (from paleontological data) the time since the common ancestor
                 of the pair.
            c.  Plot, for each pair, number of substitutions separating the pair vs. time since separation.

        2.  The figure below plots these data for α-globin.


 a.  Time axis is in millions of years
 b.  Each point represents a comparison
      of the indicated species with humans.
 
 
 
 
 
 
 
 
 
 
 

        3.  Conclusion:  Although there is some scatter about the line, the data strongly suggest that the number
             of substitutions separating a species pair is proportional to the amount of time those species
             have been diverging.

        4.  The linearity indicates that the rate of divergence is approximately constant, i.e. the molecular clock
             ticks with a constant rate.
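
The data-collection steps in this example can be sketched in a few lines of Python. This is only a minimal illustration, assuming the sequences are already aligned and of equal length. The sequence fragments and divergence times are invented placeholders (not the globin data plotted in the figure), and the function names are hypothetical.

```python
def count_differences(seq_a, seq_b):
    """Number of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def slope_through_origin(times, counts):
    """Least-squares slope of counts vs. time for a line through the origin."""
    return sum(t * c for t, c in zip(times, counts)) / sum(t * t for t in times)


# Hypothetical aligned fragments and divergence times (made up for illustration).
reference = "MVLSPADKTNVKAAW"
comparisons = {
    "species_A": ("MVLTPADKTNVKAAW", 10),   # (sequence, time since ancestor in Myr)
    "species_B": ("MVLTPADRTNVRAAW", 30),
    "species_C": ("MILTPGDRTDVRAAW", 60),
}

times = [t for _, t in comparisons.values()]
counts = [count_differences(reference, s) for s, _ in comparisons.values()]
rate = slope_through_origin(times, counts)

for name, n, t in zip(comparisons, counts, times):
    print(f"{name}: {n} differences after {t} Myr")
print(f"fitted rate: {rate:.3f} differences per Myr")
```

A straight line through the origin is used because two species that have just separated should show no differences. A tight fit to such a line is what the outline calls a constant clock.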

    D.  Another example.

        1.  One is not restricted to examining just one protein to estimate the relationship between number of
             substitutions and divergence time.

        2.  In this example, Walter Fitch combined the data from seven different gene products (cytochrome c,
             fibrinopeptides A and B, α and β hemoglobins, the insulin C-peptide, and myoglobin) and compared
             sixteen pairs of mammal species:
 

        3.  Again, the linearity, as well as the tightness, of the relationship between divergence time and number
             of substitutions is striking.

    E.  Generality of the phenomenon:

        1.  The constant divergence rate of protein sequences is the rule for virtually all proteins and taxa that
             have been examined.

        2.  A phenomenon this universal demands some sort of general evolutionary explanation.

II.  Explanations for the Molecular Clock--The Neutralist and Selectionist Views

    A.   The Neutralist View.

        1.  The Neutralist view maintains that most protein evolution is due to genetic drift.

        2.  Specifically, the neutralist view asserts the following points, with regard to non-synonymous,
             or replacement, mutations:

            a.  Most mutations are deleterious, and are therefore eliminated by natural selection as soon
                 as they occur.
            b.  Some mutations have negligible effect on fitness, and are therefore effectively neutral. A small
                 but real fraction of these mutations will be fixed by genetic drift and thus cause evolutionary
                 change in protein sequence.  (Remember, an allele need not have strictly no selective
                 advantage or disadvantage for its frequency to be governed by drift;  the strength of selection
                 just must be small, i.e. satisfy the criterion 4Ns <<1.)  This fixation of effectively neutral
                 alleles by drift accounts for most protein evolution.
            c.  Mutations that are advantageous are very rare, because organisms are close to optimally
                 adapted and therefore very few "improvements" can be made.

    B.  The Selectionist View

        1.  The Selectionist view maintains that most protein evolution is due to natural selection.

        2.  Specifically, the Selectionist view asserts the following points with regard to non-
             synonymous mutations:

            a.  Most mutations are deleterious, and are therefore eliminated by natural selection as soon
                 as they occur.
            b.  Relatively few mutations have negligible effects on fitness, and are thereby effectively
                 neutral and governed by drift.
            c.  A substantial number of mutations are selectively advantageous and are fixed by natural
                 selection when they arise.  These mutations account for most protein evolution.
 
     C.  Comparison of Neutralist and Selectionist Views

        1.  Both views agree that natural selection very commonly eliminates deleterious mutations when they
             arise--a phenomenon known as "purifying selection".

        2.  The views disagree on the relative abundance of neutral vs. selectively advantageous mutations.

        3.  Since only these latter two types of mutations can cause evolutionary change in protein sequences,
             the two views disagree about the relative importance of genetic drift and natural selection in causing
             protein evolution.

        4.  Nature of the disagreement

            a.  It is easiest to understand this disagreement in terms of enzyme function (remember many,
                 if not most, proteins are enzymes).
            b.  An enzyme is characterized not only by the identity and ordering of its amino acids, but
                 also by a complex three-dimensional structure.
            c.  Embedded within that three-dimensional structure are pockets or cavities known as
                 active sites, where the enzyme interacts with its substrates.  The precise configuration
                 of the active site is usually crucial for proper enzyme functioning.  The active site normally
                 involves only a small fraction of all the amino acids in the enzyme.
 


Three-dimensional structure of the enzyme hexokinase.  In (a) the active site of the enzyme lies in the center groove.  (b) portrays the substrate, glucose, in the active site.  Note that complexing with substrate causes a change in enzyme configuration.  (From N. A. Campbell.  1990.  Biology.  Benjamin/Cummings.)
 
 
 
 
 
 
 

            d.  The remainder of the three-dimensional structure of the enzyme may also be important
                 for enzyme functioning, e.g. the surface charge of the enzyme may affect its solubility
                 properties.
            e.   We are fairly ignorant, however, about the general effect of an amino acid substitution in
                  positions away from the active site on the functioning of the enzyme.
            f.   This ignorance provides plenty of room for selectionists and neutralists to argue about the
                 effects of such substitutions.
            g.  On the one hand, the neutralists maintain that substitutions away from the active site,
                 particularly on the inside of an enzyme away from the outer surface, are not likely to cause
                 significant changes in the chemical properties of the enzyme, and thus are not likely to affect
                 its functioning.
            h.  For this reason, they assert that most amino-acid substitutions are neutral.
            i.   By contrast, the selectionists assert that many amino acid substitutions will cause changes
                 in the chemical properties of enzymes, simply by virtue of change in charge or polarity.
                And of those that don't alter these characteristics,  virtually all,
                even those far away from the active site, have the potential to cause, for example,
                slight changes in "tension" in the protein chain, which
                may ramify down the chain to the active site and distort it.  In most cases these effects
                will be detrimental, but sometimes, especially when a lineage has recently experienced an
                environmental change, they will be beneficial and therefore favored by natural selection.
            j.  The basic nature of the disagreement between selectionists and neutralists thus resides in
                their views of how mutations are expected to affect enzyme functioning in general--a
                disagreement that is not easily open to direct experimental assessment.
 

    D.  Explanation of the Molecular Clock by the Neutralists.

        1.  The molecular clock follows directly from the properties of genetic drift, as long as one additional
             assumption is made:  the rate of mutation remains more or less constant over time.

        2.  Derivation:

            a.  Let N be the number of individuals in a population, and m be the mutation rate (i.e.
                 number of mutations/gene copy/generation).
            b.  Typical mutation rates for genes are on the order of 10^-6 to 10^-4 per gene copy per generation.
            c.   The number of mutations arising per generation in a population is given by

                    No. new mutations = (No. copies of gene) x mutation rate

                                                  =  2Nm.
            d.  The initial probability that a newly arisen neutral mutation will be fixed by genetic drift
                 is just its initial frequency, i.e.

                    Prob. of fixation = 1/2N.

            e.  The rate of fixation of new mutations is just

                    Mutations fixed/generation = (New mutations/generation)  x    (Prob. of fixation)

                                                             = (2N x  m) x  (1/2N)

                                                             =    m .

            f.   This remarkable result says that for alleles governed by drift, the rate of fixation of mutations
                 is equal to the mutation rate.
            g.  Assuming constant mutation rates over time, this result says that the rate at which novel
                 neutral mutations accumulate is constant, i.e. there should be a molecular clock.
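
The key step in this derivation, that a new neutral mutation fixes with probability 1/2N, can be checked with a small simulation. The sketch below assumes a simple Wright-Fisher model of drift (random sampling of 2N gene copies each generation), which the outline does not specify. The population size, mutation rate, replicate count, and function name are illustrative choices only.

```python
import random


def fixation_probability(n_diploid, replicates, seed=0):
    """Fraction of single new neutral mutations that drift to fixation."""
    rng = random.Random(seed)
    copies = 2 * n_diploid              # gene copies in a diploid population
    fixed = 0
    for _rep in range(replicates):
        count = 1                       # one new mutant copy, frequency 1/(2N)
        while 0 < count < copies:
            freq = count / copies
            # Wright-Fisher step: each copy in the next generation is drawn
            # independently according to the current allele frequency.
            count = sum(1 for _ in range(copies) if rng.random() < freq)
        if count == copies:
            fixed += 1
    return fixed / replicates


N = 10        # diploid individuals (kept small so the run is fast)
m = 1e-5      # mutation rate per gene copy per generation (illustrative)

p_fix = fixation_probability(N, replicates=5000)
new_mutations_per_generation = 2 * N * m
substitution_rate = new_mutations_per_generation * (1 / (2 * N))

print(f"simulated P(fixation) = {p_fix:.3f}; expected 1/(2N) = {1 / (2 * N):.3f}")
print(f"substitution rate = {substitution_rate:.2e} per generation (equal to m)")
```

Changing N alters both the number of new mutations and the fixation probability, but the two effects cancel, so the substitution rate stays at m.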

    E.  Explanation of the Molecular Clock by the Selectionists.

        1.  Those who believe that most amino acid substitutions are favored by natural selection do not have
             such an elegant explanation for the molecular clock.

        2.  Rather, they simply postulate that for some reason, averaged over long geological time spans, the
             rate at which natural selection fixes new amino acid substitutions is roughly constant.

        3.  According to this view, there may be rate heterogeneities over short geological periods (e.g. on the
             order of thousands to tens of thousands of generations), but these fluctuations average out over
             longer periods of time (millions of generations) to produce an apparent constancy of fixation rate
             over this time scale.

        4.  Although this explanation does not follow logically from a few simple assumptions, as the
             neutralists' argument does, it may of course be correct.

    F.  Conclusions

        1.  Both the neutralists and the selectionists have an explanation for the molecular clock.

        2.  Which explanation is correct will depend more generally on whether the neutralists' or selectionists'
             view of the cause of protein sequence evolution is correct.

        3.  We therefore now turn to consideration of evidence bearing on this question.
 

III.  Distinguishing between Selectionist and Neutralist Views: Examples of Adaptation

    A.  John Gillespie has argued that there are numerous examples of amino-acid substitutions that are clearly
         adaptive, and that this evidence argues strongly in favor of the notion that many, if not most amino-acid
         substitutions have been caused by natural selection.

    B.  Example:  Evolution of insulin in rodents.

        1.  In most animals, insulin in storage form is a hexamer, which is made up of three dimers; each
             pair of monomers in a dimer is  held together by a Zinc ion that binds to a Histidine at
             position B10.


Schematic portrayal of pig insulin monomer.  The location of some of the amino-acid changes separating pigs from guinea pigs are indicated.  Note that many of the changes have occurred in the region of dimer contact. (From J. H. Gillespie.  1991.  The Causes of Molecular Evolution.  Cambridge Univ. Press)
 
 
 
 
 
 
 
 
 
 
 

        2.  In hystricomorph rodents (guinea pigs and relatives), however, the storage form of
             insulin is a monomer.

        3.  Although the evolutionary reason for this change is not understood, it is clear that
             a number of amino-acid substitutions have occurred that act to stabilize insulin in
             monomeric form.

        4.  Most importantly, the B10 Histidine has been substituted by other amino acids, so that
             Zinc is no longer able to bind the monomers.

        5.  In addition, there have been quite a few substitutions of more hydrophilic for less hydrophilic
             amino acids in the regions in which the dimers contact.  These substitutions help prevent
             the monomers from coming close enough together to form dimers.

        6.  In short, these substitutions all occur on the surface of the molecule and make the monomers
             more thermodynamically stable in an aqueous medium.

        7.  These changes have resulted in a greatly accelerated rate of evolutionary change in insulin
              protein sequence within the hystricomorph rodents.


Numbers of amino acid substitutions separating the insulin sequence of several species.  The hystricomorph rodents are those surrounded by the dashed line.  Note the greatly accelerated rate of amino-acid substitution in this group, compared to the 3 substitutions separating mouse and pig.  (From J. H. Gillespie.  1991.  The Causes of Molecular Evolution.  Cambridge Univ. Press)
 
 
 
 
 

 
 

    C.  Example: Lysozyme

        1.  Many animals secrete lysozyme C as an antibiotic in saliva that attacks bacterial cell walls.

        2.  Artiodactyls and langurs have independently recruited lysozyme into the stomach to aid in digestion
             of the bacteria that these animals harbor to break down cellulose and other plant materials.

        3.  Novel conditions encountered by lysozyme in artiodactyl and langur stomachs:

            a.  Presence of digestive enzymes that will break down lysozyme
            b.  low pH

        4.  In both artiodactyls and langurs, rate of amino-acid sequence evolution is accelerated in conjunction
             with recruitment of lysozyme into stomach:

Left: phylogeny of pigs and artiodactyls showing accelerated rate of amino-acid substitution in lysozyme between points A and B.  Right:  phylogeny of primates, showing accelerated rate of substitution in Langurs. (From J. H. Gillespie.  1991.  The Causes of Molecular Evolution.  Cambridge Univ. Press)

        5.  Accelerated evolution is correlated with convergent evolution: five of the 10 amino-acid substitutions
             in langurs have identical substitutions at identical positions in cows.

        6.  It is highly improbable that such convergence occurred by chance; it therefore must represent
             adaptive convergence.

        7.  That these changes were adaptive suggests that many of the other changes that occurred during the
             period of accelerated amino-acid substitution in both artiodactyls and langurs were also probably
             adaptive.

    D.  Conclusions

        1.  There are many good examples of adaptive amino-acid substitution, which suggests that much
             protein sequence evolution may be caused by natural selection.

        2.  However, a list of examples cannot definitively prove that most amino-acid substitutions occur
             due to selection rather than drift.  One can always claim that the examples aren't representative and
             constitute exceptions to the general pattern.

        3.  Consequently, other methods must be used to distinguish between the neutralist and selectionist
             views.
 

IV.  Distinguishing between Selectionist and Neutralist Views: Generation Time Effects.

    A.  Explanation of Generation Time Effect

        1.  Neutral theory predicts that the rate of protein evolution should be proportional to the mutation rate,
             but, as we've seen, the mutation rate is measured in terms of events/generation
             (e.g. mutations/copy/generation).

        2.  On the other hand, the molecular clock is usually calibrated in terms of substitutions/year (see above
             graphs).

        3.  The relationship between these two quantities is

                   No. substitutions/year =  [No. substitutions/generation] / [No. years/generation]

                                                     =  [No. mutations/generation] / [No. years/generation]

        4.  The theory says that [No. substitutions/generation] is constant, because we assume that mutation rate
             is more or less constant.

        5.  Assuming that mutation rate is comparable for most organisms, then the above relationship says that
             the number of substitutions/year should be inversely proportional to [No. years/generation].

        6.  Another way of saying this is that the substitution rate/year should be smaller for organisms with
             longer generation times, i.e. the molecular clock should tick more slowly for such organisms.

        7.  Consequently, if one separately plots No. substitutions vs. years since common ancestor, as in the
             figures above, for two groups of organisms (one group with short generation times, the other with
             long generation times), the relationship should have a higher slope for the group with the shorter
             generation time.

        8.  The existence of a generation time effect would thus be consistent with the neutralist theory, whereas
             the absence of a generation time effect would be inconsistent with that theory and thus favor the
             selectionist theory.
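
The arithmetic behind the predicted generation time effect can be made concrete with a short sketch. The mutation rate and the two generation times below are illustrative assumptions, not measured values, and the function name is hypothetical.

```python
def substitutions_per_year(m_per_generation, years_per_generation):
    """Neutral substitution rate per year for a given generation time."""
    if years_per_generation <= 0:
        raise ValueError("generation time must be positive")
    # Under neutrality, substitutions/generation equals the mutation rate m,
    # so dividing by years/generation converts the rate to a per-year clock.
    return m_per_generation / years_per_generation


m = 1e-5                                       # assumed equal in both groups
short_gen = substitutions_per_year(m, 0.5)     # e.g. a rodent-like generation time
long_gen = substitutions_per_year(m, 20.0)     # e.g. an elephant-like generation time

print(f"short generation time: {short_gen:.2e} substitutions/year")
print(f"long generation time:  {long_gen:.2e} substitutions/year")
print(f"ratio of clock speeds: {short_gen / long_gen:.0f}x")
```

With the same per-generation mutation rate, a 40-fold difference in generation time gives a 40-fold difference in the per-year rate, which is the difference in slope that item 7 describes.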

    B.  The Data

        1.  Generation time effect for silent substitutions

            a.  Silent substitutions are nucleotide substitutions that do not result in amino-acid substitution.
            b.  It is generally believed that most silent substitutions are effectively neutral.
            c.  Therefore, we expect a generation-time effect to be exhibited by this class of substitutions, and
                 can use this as a control for the existence of a generation-time effect.
            d.  The table below illustrates the generation time effect for 21 loci from primates, rodents, and
                 artiodactyls that have been sequenced.  In general, the rate of silent substitutions is
                 greatest for rodents, which have the shortest generation time, and slowest for primates,
                 which have the longest generation time.

Rates of evolution of silent substitutions.  Rates are expressed relative to the artiodactyl rate.  (From J. H. Gillespie.  1991.  The Causes of Molecular Evolution.  Cambridge Univ. Press.  Original data from Li. et al. 1987.  J. Molec. Evol. 25: 330-342.)
 
 
 
 
 

Rates of silent substitutions for additional animals.  (From J. H. Gillespie.  1991.  The Causes of Molecular Evolution.  Cambridge Univ. Press.)
 
 
 
 

 
            e.  Data from silent sites thus confirms that a generation time effect exists when substitutions
                 are caused primarily by genetic drift.

        2.  Lack of generation time effect for amino-acid substitutions (Wilson et al. 1977)
 
 


a.  Examined amino-acid sequences of 12 proteins for 25 pairs of species.
b.  In each species pair, there was one species (e.g. elephant or grey seal)
     that had a long generation time and one species (e.g. mouse) that had a
     short generation time.
c.  Plotted the number of amino-acid substitutions from the common
     ancestor for the long-generation time species on one axis and the
     corresponding number of substitutions for the short-generation time
     species on the other axis.
d.  In such a plot, the points should lie along the 45-degree line if there is no
     generation-time effect (i.e. the number of substitutions since the common
     ancestor should be similar for both species of the pair), whereas they
     should lie along a line of much higher slope if substitution rates are higher
     in the shorter-generation species.
e.  Found that the points lay along the 45-degree line, indicating an absence
     of a detectable generation-time effect:

(Figure from M. Ridley.  1993.  Evolution.  Blackwell Scientific Pubs.;
  original data from Wilson et al.  1977.  Ann. Rev. Biochem. 46: 573-639)
 
 
 

            f.  Conclusion:  Substitution rates not consistent with neutralist expectations.  Data therefore
                supports selectionist view.
            g.  Caveats:  Unfortunately, there are counterarguments that can reconcile this lack of a
                 generation time effect with the neutralist view:

                i.  Organisms with longer generation times tend to be larger and have smaller
                    populations.  This means that any given mutation with a specified selection
                    coefficient s is more likely to be effectively neutral (satisfy 4Ns << 1) for the
                    species with longer generation times.  This in turn means that the effective
                    rate of production of neutral mutations is likely to be higher for those organisms,
                    which would tend to counteract the expected generation time effect and perhaps
                    completely compensate for it.

                ii.  Some investigators have argued that longer-lived organisms are likely to undergo
                    more germ-line cell divisions per generation, which could also produce a higher
                    mutation rate in these organisms that could counteract the expected generation
                    time effect.

            h.  Conclusion:  Whether lack of a generation time effect is as serious a blow to the neutralist
                 theory as some selectionists believe is open to question.
 

V.  Is the Molecular Clock Episodic?

    A.  The issue

        1.  Neutral theory has traditionally assumed that amino-acid substitution caused by drift is a process that
             can be modelled as a Poisson process.

            a.  A Poisson process is one that behaves like radioactive decay:
            b.  In any given time interval there is a constant probability that an event will occur (e.g.
                 an atomic nucleus will decay; or an amino acid at a particular site in a protein will be
                 replaced by another amino acid)
            c.  The actual number of events per unit time may differ from the average, and in fact is
                 given by the Poisson distribution:

                            Prob[x events in one unit time interval] = (m^x / x!) e^(-m) ,

                 where m is the average number of events per unit time interval.  (Do not confuse this m with the
                 m that stands for mutation rate.  Unfortunately, the symbol used is the same.)
            d.  Characteristic Poisson distributions:

 
This figure shows a series of Poisson distributions with different mean numbers of events per unit time, m .  Y-axis is proportion of time units (samples) with indicated number of events (x axis).  (From R. R. Sokal and F. J. Rohlf. 1969.  Biometry.  W. H. Freeman.)
 
 
 
 
 
 
 
 
 
 

        2.  It is therefore legitimate to ask whether actual amino-acid substitutions occur according to a Poisson
             process.

            a.  If so, this supports the neutralists' contentions.
            b.  If not, this may constitute evidence for the selectionists' contentions.
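
The Poisson formula given earlier in this section can be evaluated directly. The sketch below tabulates the distribution for a few mean values chosen purely for illustration; the function name is hypothetical.

```python
import math


def poisson_pmf(x, m):
    """Probability of exactly x events when the mean number per interval is m."""
    return (m ** x) / math.factorial(x) * math.exp(-m)


for mean in (0.5, 2.0, 5.0):
    probs = [poisson_pmf(x, mean) for x in range(10)]
    row = " ".join(f"{p:.3f}" for p in probs)
    print(f"m = {mean}: {row}")
```

Each row lists the probabilities of 0 through 9 events. For a Poisson distribution the variance equals the mean, which is the property that tests of the Poisson clock rely on.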

    B.  Testing for a Poisson process

        1.  Whether substitutions follow a Poisson distribution can be tested by a variety of statistical procedures,
             all of which are too complicated for consideration here.

        2.  However, it is possible to give a heuristic description of these tests.

        3.  One approach is to use "star phylogenies": a phylogeny in which several different lineages converge
             virtually simultaneously to the same common ancestor.

        4.  An example of a star phylogeny is given by the radiation of mammals, in which a number of different
             mammalian families arose at approximately the same time (approximately 140 million yrs ago at the end of
             the Jurassic period):


Radiation of several families of modern mammals from ancestral mammals of the Jurassic.  Numbers indicate estimated numbers of amino-acid substitutions along indicated lineages for β-globin.  (From J. H. Gillespie.  1991.  The Causes of Molecular Evolution.  Cambridge Univ. Press.; original data from M. Kimura.  1987.  J. Mol. Evol. 26: 24-33.)
 
 
 

        5.  The time since the common ancestor is the same for all lineages in a star phylogeny.

        6.  Using techniques that we will discuss in a couple of lectures, it is possible to infer, for a particular
             protein, the amino acid sequence of the common ancestor, given the sequences of the modern
             descendants.

        7.  One can then calculate the number of amino-acid substitutions that occurred in each lineage.

        8.  To test whether the molecular clock follows a Poisson process, one must assume that the underlying
             rates of amino-acid substitution are the same for the different lineages.

        9.  This rate may then be calculated as the average observed rate, averaged over the different lineages.

        10.  Given this common rate, the Poisson equation above predicts how much variation among the lineages
               in observed substitution rates one should expect due to chance effects.

        11.  One can then ask whether the actual variation in observed rates is greater than or less than what
               is expected for a Poisson process.
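
The heuristic test described in this section amounts to comparing the variance of substitution counts among lineages with their mean, since the two are equal under a Poisson process. The sketch below computes that ratio for invented counts. It is a simplified stand-in, not the actual estimators used in the published studies, which include corrections not covered here.

```python
def index_of_dispersion(counts):
    """Ratio of sample variance to mean for substitution counts per lineage."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two lineages")
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean


# Hypothetical substitution counts along six lineages of a star phylogeny.
lineage_counts = [10, 14, 6, 22, 8, 12]
R = index_of_dispersion(lineage_counts)

print(f"mean = {sum(lineage_counts) / len(lineage_counts):.1f}, R = {R:.2f}")
print("overdispersed relative to Poisson" if R > 1 else "not overdispersed")
```

A value near 1 is what a Poisson clock predicts. Deciding whether a value above 1 is larger than chance alone would produce requires a formal statistical test.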

    C.  The data

        1.  As an example, Kimura examined 5 proteins for the star phylogeny depicted in the Figure above.

        2.  All five exhibited deviations from expectations under the Poisson clock.

Estimates of an "index of dispersion", R, for a radiation of mammals.  R=1 if amino-acid substitutions follow a Poisson clock.  R>1 indicates overdispersion.  (From J. H. Gillespie.  1991.  The Causes of Molecular Evolution.  Cambridge Univ. Press.; original data from M. Kimura.  1983.  The Neutral Theory of Molecular Evolution.  Cambridge Univ. Press.)
 
 
 
 

        3.  Other studies have obtained similar results:


Analysis of applicability of Poisson model to sequence data for 20 loci from rodents, primates, and artiodactyls.  The relevant column of this table is the one headed by Rr(t), which is comparable to the index of dispersion from the previous example.  In this example, 12 of 20 loci exhibited statistically significant deviations from the Poisson clock in the direction of overdispersion.  (From J. H. Gillespie.  1991.  The Causes of Molecular Evolution.  Cambridge Univ. Press.; original data from Li et al. 1987.  J. Molec. Evol. 25: 330-342.)
 
 
 
 
 
 
 
 

        4.  In general, most studies have found that observed substitution rates are overdispersed--there
             are more lineages that have high observed rates of substitution and more lineages that have
             low rates of substitution than is expected under a Poisson process.

        5.  Another way of saying this is that the molecular clock tends to be episodic:  there are short bursts
             of several substitutions followed by periods with few substitutions, followed by a short burst,
             and so on.

        6.  This pattern is easily explained by the selectionists:  adaptations come in bursts in response to
             periodic environmental changes experienced by organisms.

        7.  Neutralists have more difficulty explaining this episodicity--it can be done by assuming that
             genetic drift gives rise to a substitution process that is more complicated than a Poisson process,
             but the biological justification for making such an assumption is unclear.

    D.  Conclusion:  Episodicity of molecular clock is most easily explained by assuming adaptation to changed
          environments is episodic, but the possibility that genetic drift produces episodic change cannot be
          ruled out definitively.

VI.  dN/dS Ratios
 
    A.  The advent of high-throughput DNA sequencing (as opposed to protein sequencing) has fostered the development
                of analytical approaches to quantifying patterns of DNA variation both within and among species.

         1.  Coupled with the development of coalescence theory (an extension of the neutral theory), these approaches
                allow predictions to be made regarding the patterns of nucleotide variation that are expected under
                neutrality vs. the historical operation of natural selection.  
   
         2.   In the next sections, we consider some of these predictions, and how they have been tested.  As we shall see,
                this line of research has essentially settled the neutralist-selectionist debate.

     B.  dN/dS ratio:   (see handout)

                 (No. non-synonymous substitutions/ No. non-synonymous sites)/ (No. synonymous substitutions/ No. synonymous sites)

          1.  Under strict neutrality (i.e. most amino-acid substitutions are due to drift), expect  dN/dS = 1.

           2.  A value of  dN/dS <  1 indicates operation of purifying selection, but doesn't rule out operation of repeated positive selection.

           3.   A value of dN/dS > 1  indicates the operation of repeated positive selection.

    C.   These relations suggest that evidence for repeated positive selection can be obtained from cases in which dN/dS > 1.
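
The ratio defined in B can be written out as a small function. This is a sketch of the raw ratio only, assuming the substitution and site counts have already been obtained. Published analyses typically also correct for multiple substitutions at the same site, which is omitted here. All numbers and the function name are illustrative.

```python
def dn_ds(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Raw dN/dS: per-site non-synonymous rate over per-site synonymous rate."""
    dn = nonsyn_subs / nonsyn_sites
    ds = syn_subs / syn_sites
    if ds == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn / ds


def interpret(ratio):
    """Rough reading of the ratio, following the three cases listed in B."""
    if ratio > 1:
        return "consistent with repeated positive selection"
    if ratio < 1:
        return "purifying selection is operating"
    return "consistent with strict neutrality"


# Two hypothetical species comparisons.
constrained = dn_ds(nonsyn_subs=12, nonsyn_sites=300, syn_subs=10, syn_sites=100)
adaptive = dn_ds(nonsyn_subs=30, nonsyn_sites=300, syn_subs=5, syn_sites=100)

print(f"dN/dS = {constrained:.2f}: {interpret(constrained)}")
print(f"dN/dS = {adaptive:.2f}: {interpret(adaptive)}")
```

Dividing by the number of sites of each kind matters because a typical gene has more non-synonymous than synonymous sites, so raw substitution counts alone would be misleading.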

    D.  Example:  evolution of  chitinase in Arabidopsis (Arabis) and close relatives. (Bishop, Dean and Mitchell-Olds, 2000, PNAS 97: 5322-5327)

            1.  Chitinase is a defensive enzyme in plants that breaks down cell walls of fungal pathogens.

            2.  Sequenced chitinase from 15 species of Arabis and one species of the related genus Brassica.  

            3.  Calculated and plotted dN vs. dS (or Ka vs. Ks) for all pairs of species:

               

           4.  These results indicate that a substantial number of comparisons exhibit a dN/dS ratio significantly > 1,
                    pointing to repeated positive selection as the cause of many of the amino acid substitutions.  

           5.  In addition, the spatial pattern of substitutions also points to positive selection:


                 

               a.  For class 1 Chitinases (the type examined in this study), the per-site rate of amino-acid substitution
                        is the same for sites within and outside of the enzyme's active site.

                b.  By contrast, for other enzymes (e.g. Lysozyme and Class III chitinases), the per-site rate of amino-acid substitution
                        is much smaller in the active site.

                c.   It is possible, using techniques that cannot be described here, to infer WHICH amino-acid substitutions
                        were due to positive selection and which to drift.  For the chitinases in this study, approximately
                        15% of the substitutions were inferred to be due to selection.  ALL OF THESE OCCURRED
                         IN THE ACTIVE SITE OF THE ENZYME, a highly non-random pattern.  

                d.   This pattern suggests that coevolution between plants and pathogens drives adaptive amino-acid substitution
                        in chitinase.

            6.  Similar examples of dN/dS > 1 have been found in a variety of organisms.  Interestingly, most of these
                        examples involve proteins that are involved in male reproductive characters or gamete recognition.
                        Together, these examples demonstrate that repeated positive selection may be a common phenomenon.
                        However, these cases admittedly do not involve a random sample of proteins, and so they do not
                        constitute evidence that most protein evolution is due to selection rather than genetic drift.


VII.  Resolving the Neutralist-Selectionist Controversy

    A.  What we have seen so far is that various approaches to distinguishing between neutralist and
          selectionist theories have been problematic.

    B.    What is exciting is that over the past five to ten years, new techniques,  based on analysis of  DNA
            sequence variation,  have been developed that in principle can resolve this question.

    C.   The prototype study using these techniques was performed by McDonald and Kreitman and focused
           on attempting to determine the extent to which natural selection and genetic drift have been
           responsible for protein sequence divergence at the Adh locus between Drosophila species.

        1.  Like most other proteins, between-species comparisons of the Adh sequence show fixed
             differences, indicating evolutionary change in protein primary structure.

        2.  Like most proteins, Adh is also variable within populations, both at silent sites and also
             in amino-acid composition.

        3.  The McDonald-Kreitman test uses this variation within populations to deduce the nature of
             the evolutionary processes that have caused the between-species differences.

    D.  Logic underlying the M-K test.

        1.  Suppose that the neutralists are right, that virtually all evolutionary change in proteins is due
             to genetic drift involving neutral alleles.

        2.  This supposition implies that the fixed amino-acid differences between species are due to
             genetic drift, and we can therefore make the following prediction:

            a.  Divide both within-species and between-species variation in nucleotide sites into two
                 categories

                i.  those due to non-synonymous nucleotide substitutions, and
                ii. those due to synonymous substitutions

            b.  Then the ratio of synonymous to non-synonymous substitutions should be the same
                 between species as it is within species.

        3.  Explanation of this prediction:

            a.  Let the mutation rate per site for neutral synonymous mutations be ms
            b.  Let the mutation rate per site for neutral non-synonymous mutations be  mn
            c.  Then, because both classes of mutations are governed by genetic drift, they
                 have the same probabilities of being eliminated, fixed, or polymorphic at any given
                 time after they arise.
            d.   This means that the ratio of synonymous to non-synonymous polymorphic sites
                 at any time should be

    (No. of synonymous sites polymorphic/gene)/(No. of non-synonymous sites polymorphic/gene)    = msKs / mnKn   (1)

                 where Ks and Kn are the number of synonymous and non-synonymous sites in the gene.
            e.  Next, consider fixed differences between two species.
            f.  Fixed differences at either synonymous or non-synonymous sites arise due to
                divergence from the sequence of the common ancestor.
            g.  Consider the expected number of synonymous sites differing between that ancestor
                 and one of the species.  This is given by

     No. differences/gene = (No. mutations/gene/time) x (divergence time) x (prob a mutation is fixed)

                                     =           ( msKs )                  x            ( t )            x                (1/2N)

                                     =          tmsKs / 2N

            h.  Now, assuming the mutation rate and population size of the other species are similar,
                 the expected number of synonymous sites differing between that species and the
                 ancestor is also  tmsKs / 2N.
            j.   Consequently, the expected number of synonymous differences between the two
                 living species is just the sum of these two numbers, i.e.

                         Expected No. synonymous differences/gene =  tmsKs / 2N  +  tmsKs / 2N  =   tmsKs / N    (2)

            k.  A similar line of reasoning indicates that the expected number of non-synonymous
                 differences between the two living species is just

                        Expected No. non-synonymous differences/gene =    tmnKn / N         (3)

            l.  Using these results (Equations (2) and (3) ), the expected ratio of synonymous to
                non-synonymous fixed differences between the two species is just

                        (No. synonymous differences/gene) / (No. non-synonymous differences/gene) = (tmsKs / N) / (tmnKn / N) = msKs / mnKn       (4)

            m.  Comparing Equations  (1) and (4), it is clear that, assuming most amino-acid
                  differences between species are due to drift,

    (No. synonymous differences between species/gene) / (No. non-synonymous differences between species/gene)

         =  (No. of synonymous sites polymorphic within species/gene) / (No. of non-synonymous sites polymorphic within species/gene)         (5)

            n.  In other words, under the neutralist view, in which most polymorphisms as well as
                 most fixed differences in proteins are selectively neutral, the ratio of variable
                 synonymous sites to variable non-synonymous sites should be the same for
                 within-species polymorphisms and for fixed differences between species.

            o.  Suppose, however, that most differences between species in non-synonymous sites are
                 due to selection fixing advantageous mutants.

                i.  These show up as fixed differences between species, but they rarely show up
                    as non-synonymous polymorphisms within a species because they are fixed
                    soon after they arise.
                ii.  This effectively means that, relative to the number of fixed non-synonymous
                    differences, the denominator of the right-hand side of equation (5) is smaller
                    than it would be under strict neutrality of non-synonymous mutations.
                iii.  Consequently, if selection is responsible for a substantial proportion of
                     fixed non-synonymous differences, we should have

    (No. synonymous differences between species/gene) / (No. non-synonymous differences between species/gene)

         <  (No. of synonymous sites polymorphic within species/gene) / (No. of non-synonymous sites polymorphic within species/gene)         (6)

            p.  Thus, the selectionist and neutralist views can be distinguished on the basis of
                 equations (5) and (6).

    E.  The data for Adh

        1.  McDonald and Kreitman applied this type of analysis to variation at the Adh (alcohol
             dehydrogenase) locus in D. melanogaster and D. simulans.

        2.  They sequenced 12 Adh alleles from D. melanogaster and six alleles from D. simulans.
 
        3.  Found the following results for number of differences within and between species:

                                           Site type                   Within species         Between species
 
                                           Synonymous                      42                           17
 
                                           Non-synonymous                 2                             7

        4.  We thus have

    (No. synonymous differences between species/gene) / (No. non-synonymous differences between species/gene)  =  17/7  =  2.43

         <<   21  =  42/2  =  (No. of synonymous sites polymorphic within species/gene) / (No. of non-synonymous sites polymorphic within species/gene)
 

        5.  In other words, the data fit the inequality in equation (6), not the equality in equation (5).

        6.  The pattern of variation at the Adh locus thus supports the notion that a large fraction of the
             between-species differences at non-synonymous sites are due to natural selection fixing
             advantageous mutants.
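
One way to ask whether the departure from equation (5) in the table above could be due to chance is a test of independence on the 2 x 2 table. The sketch below uses a one-sided Fisher exact test written with only the standard library. The choice of test and the function name `mk_fisher_one_sided` are assumptions made for illustration and are not taken from the lecture or necessarily from the original paper.

```python
from math import comb


def mk_fisher_one_sided(syn_poly, syn_fixed, nonsyn_poly, nonsyn_fixed):
    """One-sided Fisher exact P value for a McDonald-Kreitman table.

    Returns the probability, with all row and column totals held fixed,
    of seeing this few (or fewer) non-synonymous polymorphisms.
    """
    total_poly = syn_poly + nonsyn_poly        # all within-species differences
    total_fixed = syn_fixed + nonsyn_fixed     # all between-species differences
    total_nonsyn = nonsyn_poly + nonsyn_fixed  # all non-synonymous differences
    total = total_poly + total_fixed
    denom = comb(total, total_nonsyn)
    p = 0.0
    for k in range(nonsyn_poly + 1):
        # Hypergeometric probability of exactly k non-synonymous polymorphisms.
        p += comb(total_poly, k) * comb(total_fixed, total_nonsyn - k) / denom
    return p


# Counts from the Adh table above.
between_ratio = 17 / 7   # synonymous / non-synonymous, between species
within_ratio = 42 / 2    # synonymous / non-synonymous, within species
p_value = mk_fisher_one_sided(syn_poly=42, syn_fixed=17, nonsyn_poly=2, nonsyn_fixed=7)
print(f"between = {between_ratio:.2f}, within = {within_ratio:.0f}, one-sided P = {p_value:.4f}")
```

For these counts the one-sided P value is below 0.01, so the deficit of non-synonymous polymorphisms is unlikely to be a sampling accident.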

    F.  M-K tests on a genomic scale

        1.  As with previous analyses described above, one or a few significant cases of repeated positive selection
                as revealed by the M-K test does not permit a generalization about the role of selection in protein
                 evolution.

         2.  Recently, however, with the advent of high-throughput sequencing, it has been possible to examine a large
                  sample of proteins using the McDonald-Kreitman approach.

                a.  Fay et al.  2002.  Nature 415: 1024-1026.  M-K approach using 45 different genes from Drosophila melanogaster and
                        D. simulans :

                                                                 Polymorphism      Divergence

                            Non-syn.  substitutions            65                    598

                             Syn . substitutions                  224                   950


                            --Under neutral evolution, expect only 276 non-synonymous substitutions (950 x 65/224) in divergence between the two species, rather
                                    than the observed 598.  This indicates an excess of non-synonymous substitutions compared to the neutral expectation,
                                    and this excess is attributed to positive selection.  Best estimate is that 54% of amino-acid substitutions between
                                    the two species ((598 - 276)/598) were caused by natural selection.

                b.  Fay et al.  2001.  Genetics  158: 1227-1234.  In a similar study on humans and primates, found that 35% of amino-acid substitutions
                           were caused by selection.

                c.  Smith and Eyre-Walker.  2002.  Nature 415: 1022-1024.  Similar study comparing Drosophila simulans and D. yakuba.  Found
                            that 45% of amino-acid substitutions were due to selection.  They estimated that one adaptive substitution has occurred
                            in these lineages every 45 years, on average.
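
The arithmetic behind the Drosophila figures in (a) can be reproduced directly from the table. The sketch below is a minimal illustration: the function name `mk_expected_and_alpha` and the symbol `alpha` for the adaptive fraction are labels chosen here, not notation from the lecture.

```python
def mk_expected_and_alpha(pn, ps, dn, ds):
    """Neutral expectation for non-synonymous divergence, and the excess fraction.

    pn, ps: non-synonymous and synonymous polymorphisms (within species)
    dn, ds: non-synonymous and synonymous fixed differences (between species)
    """
    # Under neutrality dn/ds should equal pn/ps, so expected dn = ds * pn / ps.
    expected_dn = ds * pn / ps
    # Fraction of observed non-synonymous divergence in excess of that expectation.
    alpha = 1 - expected_dn / dn
    return expected_dn, alpha


# Counts from the 45-gene table above.
expected_dn, alpha = mk_expected_and_alpha(pn=65, ps=224, dn=598, ds=950)
print(f"expected non-synonymous divergence = {expected_dn:.0f}")
print(f"estimated adaptive fraction = {alpha:.0%}")
```

The two printed values correspond to the 276 expected substitutions and the 54% estimate quoted in (a).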


VIII.  Conclusions

            1.  Based on current evidence, which represents a reasonably large, more-or-less random sample of proteins from Drosophila and from primates, approximately
                    45% of amino-acid substitutions are adaptive, i.e. caused by natural selection.  The remainder are caused by genetic drift.

            2.  This conclusion supports the Selectionist viewpoint more than the Neutralist viewpoint.  Selectionists did not deny that some substitutions were due
                    to genetic drift, just not a large majority.  Neutralists, by contrast, argued that adaptive substitution was very rare and that most substitutions
                    are caused by drift.  This clearly is not the case.

IX.  Assessing Historical Patterns of Selection:  Polymorphism at the Adh locus

    A.  Reconstructing history

        1.  The previous example, besides demonstrating how analysis of sequence data can address
             the neutralist-selectionist controversy, illustrates how such data can also be used to
             infer characteristics about past evolutionary processes (e.g. that evolutionary change of
             amino-acid sequence of the Adh locus has been governed largely by selection rather than
             drift).

        2.   This is just one illustration of the power of sequence data to reveal the nature of historical
              evolutionary processes.

        3.   Beginning today, we shall examine other methods for reconstructing evolutionary history.
        4.   Today, however, we will continue to examine ways of detecting past history of selection.

    B.  The Adh locus.

        1.  Alcohol dehydrogenase is an enzyme that metabolizes alcohol.  Since alcohol is a common
             component of rotting fruit, a common larval habitat of Drosophila, its proper functioning
             is necessary for avoiding alcohol poisoning in developing larvae.

        2.  Drosophila melanogaster is distributed worldwide, and throughout its range it exhibits
             a polymorphism for electrophoretic variants.

        3.  The variants come in two classes, Fast and Slow, names indicative of the rates at which
              these variants migrate on an electrophoretic gel.

        4.   The difference between Fast and Slow alleles is due to the substitution of threonine in the
              Fast alleles for lysine in the Slow alleles at position 1490 of the Adh gene.

        5.   Within an allele class (Fast or Slow), there is a great deal of variation at other nucleotide
              sites, so that the nucleotide sequences of all Fast alleles are not the same, nor are the
              sequences of all Slow alleles.

        6.   Virtually all of this variation within an allele class is silent, however, i.e. it does not result
              in an amino-acid substitution.

        7.   Because of the world-wide distribution of the Fast-Slow polymorphism, it has been
              conjectured that this is an ancient polymorphism actively maintained by some form
              of "balancing selection" .

            a.  Balancing selection is any process that prevents either allele from being eliminated.
            b.  Heterozygote superiority (e.g. sickle-cell polymorphism) is one type of balancing
                 selection
            c.  Frequency-dependent selection (e.g. selection maintaining tristyly in plants) is
                 another type of balancing selection

    C.  Equilibrium levels of silent-site variation under genetic drift without balancing selection

        1.  In this and the next sections, it is shown that long-term balancing selection should result in
             excess silent-site variation at sites immediately surrounding the site of the Fast-Slow
             polymorphism, an excess that should not be present if there has not been long-term balancing
             selection.

        2.  Here we first consider the accumulation of genetic variation in the absence of balancing
             selection.

        3.  Think of a stretch of DNA, i.e. a gene.  Let us say for convenience that this gene is 1000 base pairs long.
            Consider the silent sites in this gene, at which variants are selectively neutral and hence
            governed by genetic drift.

        4.  Suppose initially all copies of this gene are identical in a population.

        5.  Over time, mutations will arise at many of the silent sites.

         6.  Because the fate of these mutations will be governed by genetic drift,  some of the mutant
              alleles will drift to intermediate frequencies, while others will be eliminated.

        7.  After a short time, then, there will be several sites that are variable in the population.

        8.  At a later time, more mutations will have arisen,  increasing the number of variable sites.

        9.  On the other hand, some of the variable sites present earlier will become fixed due to genetic
             drift, which will decrease the number of variable sites.

        10.  These two tendencies (8 and 9) oppose each other, and will eventually come to balance one
               another: at some point, on average, the increase in the number of variable sites due to
               mutation  will be just counterbalanced by the decrease due to fixation, yielding an
               equilibrium.

        11.  This equilibrium can be described graphically by a simple model:

            a.  We plot first the number of new sites becoming variable per  unit time.
            b.   Since sites that are already variable can not become new variable sites,
                  the number of sites becoming newly variable per unit time should be
                  proportional to the number of sites that are not already variable.
            c.   Consequently, the number of sites becoming newly variable is a
                 decreasing function of the number of sites already variable, as shown in  the Figure below.
            d.   Next we plot the number of variable sites that become fixed per unit time.
            e.   Since only sites that are currently variable can become fixed, the number
                  of sites becoming fixed should be proportional to the number of sites that are currently variable.
            f.   Consequently, the number of sites becoming newly fixed is an  increasing function of the number
                 of sites already variable, as shown in the Figure.
 

            g.  But when the number of newly arising variable sites is equal to the number
                 of variable sites becoming fixed per unit time, there is no net change in the
                 number of variable sites, i.e. that number is at an equilibrium.
            h.  This equilibrium corresponds to where the two lines in the figure intersect.
            i.   Once this equilibrium has been reached, there will be no change in the number  of variable sites.
            j.   However, which sites are variable will constantly change as mutations arise at some sites and become
                 fixed at others.
            k.  We thus say that this equilibrium is a dynamic equilibrium.

        12.  An alternative way of portraying this equilibrium is to plot the number of sites variable vs. time

            a.  Starting with a situation at which there are just a few variable sites, the population
                 will be at point 1 on the time axis of the left figure:


 

            b.  Because the number of new variable sites created is greater than the number of variable sites fixed (green line),
                 there will be an increase in the number of variable sites equal to the difference between the red and blue lines
                 (equal to length of green line).  This is portrayed by the number of variable sites at time 1 in the right  graph.

            c.  Going back to the left graph, the population at time 2 (difference between point 1 and point 2 on x-axis equals the
                 height of the green line at point 1, which is also equal to the horizontal teal line starting at point 1) still experiences
                 a higher number of variable sites created than variable sites fixed, though the magnitude of the difference
                 (green line at point 2) is less than in the previous time period.

            d.  The total number of variable sites is increased by this difference, and plotted on the right graph as the point
                 corresponding to 2 on the x-axis.

            e.  This process is repeated for subsequent points, generating the red line on the right graph.
            f.   Conclusion:  The number of variable sites increases at first, but then levels off at an equilibrium.

        13.  This analysis indicates that by themselves, the processes of drift and mutation lead to a balance that puts an upper
               limit on the amount of variation in a gene, as measured by number of variable silent sites.
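
The graphical argument in points 11 and 12 can be sketched as a simple iteration. The rates and the number of silent sites below are arbitrary illustrative assumptions (the lecture gives no numerical values), and `simulate_variable_sites` and `equilibrium_sites` are hypothetical helper names.

```python
def simulate_variable_sites(total_sites=1000, gain_rate=0.01, fix_rate=0.04,
                            start=0.0, steps=200):
    """Iterate the number of variable silent sites through time.

    gain_rate: fraction of currently invariant sites that become variable per time step
    fix_rate:  fraction of currently variable sites that become fixed per time step
    """
    v = start
    trajectory = [v]
    for _ in range(steps):
        gained = gain_rate * (total_sites - v)  # decreasing function of v (point 11c)
        fixed = fix_rate * v                    # increasing function of v (point 11f)
        v += gained - fixed                     # net change per time step (point 12b)
        trajectory.append(v)
    return trajectory


def equilibrium_sites(total_sites, gain_rate, fix_rate):
    """Number of variable sites at which gains exactly balance fixations."""
    return gain_rate * total_sites / (gain_rate + fix_rate)


trajectory = simulate_variable_sites()
v_star = equilibrium_sites(1000, 0.01, 0.04)
print(f"after 200 steps: {trajectory[-1]:.1f} variable sites, equilibrium: {v_star:.1f}")
```

With these assumed rates the trajectory rises quickly at first and then levels off near the equilibrium value, which is the behavior described in point 12f.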

    D.  Levels of silent-site variation with balancing selection but no recombination

        1.  Next, consider what is expected to happen to the number of variable silent sites if there is a non-synonymous
             site in the gene that is kept polymorphic by balancing selection (as is hypothesized for the amino-acid
             polymorphism in Adh).

        2.  Suppose that initially the population starts out with the alleles at the non-synonymous site at roughly equal
             frequencies, and that at all other sites these two alleles are identical.

        3.  For the time being,  also assume that there is no recombination between these two alleles.

        4.  As before, we expect mutations to arise at silent sites in both alleles, and for some of these mutations to drift to fixation.

        5.  The key thing to recognize, though, is that different sites will mutate and become fixed on the different alleles.
 
        6.  Over time, more and more mutations will become fixed, again, with different mutants becoming fixed on the
             two allele types.

        7.  In this case, however, there is no process analogous to the fixation we saw earlier to counteract this steady build-up
             of differences between the two allele types:  when a new mutant becomes fixed within one allele type, this increases,
             rather than decreases, the number of silent sites that differ between the two alleles.
 
         8.  The two alleles will thus gradually and steadily diverge over time in the number of sites that differ.  Another
              way of saying this, is that if we compare pairwise sites on one allele with sites on the other allele, the number of
              sites that are variable will increase steadily with time (blue line in figure below):
 

        9.  This means that, unlike when just genetic drift and mutation are operating, there is effectively no limit to the amount of
             variation (between allele types) that will accumulate.

    E.  Levels of silent-site variation with balancing selection and recombination

        1.  In the previous section we assumed that there is no recombination between the two allele types.

        2.  This is likely to be a good approximation near the site of the amino-acid substitution (the site of balancing selection).

        3.   Away from that site, however, recombination is likely to be common.

        4.  Consider, in particular, the opposite extreme: a stretch of DNA far removed from the amino-acid substitution, so that
             recombination occurs freely between this stretch and the site of amino-acid substitution.

        5.  Then the situation is exactly the same as considered originally (section C above):  only drift and mutation are
             operating.  What's going on in the region near the amino-acid substitution has no influence on this region because
             recombination between the two regions essentially decouples them.
 
         6.  Consequently, in the region far from the amino-acid substitution, comparing pairwise variation at sites corresponding
              to the two alleles is just like comparing two DNA sequences at random from a population in which there is no
              selection.

        7.  We therefore expect an equilibrium level of silent-site variation between allele types to be established--an equilibrium
             level that will be far lower than the level of silent-site variation built up in the area surrounding the amino-acid
             substitution.

        8.  There are thus two extremes:  what happens in an area of no recombination around a site of balancing selection, and
             what happens in an area far removed from that site.

        9.  In between, the level of variation is between these two extremes, as might be expected:  As one moves from the site
             of balancing selection to the distant site (i.e. toward free recombination), the effects of recombination become more
             and more dominant, and the amount of silent-site variation decreases toward the equilibrium level at the distant site.
 
        10.  There is thus expected to be a peak of silent-site variation centered on the site of balancing selection; as one moves
               away from this site in either direction, the amount of variation is expected to decay to the equilibrium level:
 

    F.  Assessing levels of silent-site variation: the "sliding-window" technique (Kreitman and Hudson 1991)

        1.  To estimate levels of silent-site variation at different distances from the Adh amino-acid polymorphism, Kreitman and
             Hudson sequenced 11 copies (5 Fast alleles and 6 Slow alleles) of the region surrounding and including the Adh
             locus in D. melanogaster:


Boxes represent exons, lines between boxes represent introns.  Numbers below the figures indicate the number of polymorphic sites, and number of fixed differences between D. melanogaster and D. simulans, within indicated gene regions.  Adh: the Adh gene.  Adh-dup: a duplication of the Adh gene, thought to be a pseudogene (From M. Kreitman and R. R. Hudson.  1991.  Genetics 127: 565-582).
 
 
 

        2.  After aligning these sequences, they compared each Slow allele with each Fast allele in the following way:

            a.  They marked off a "window" beginning at the 5' end of the sequences and extending in the 3' direction far enough
                 to include 100 silent sites.
            b.  Within this window, they tabulated the number of sites that differed between the two sequences.
            c.   The window's 5' end was then moved to the right (3' direction) by one silent site, and the 3' end was extended to
                  include an additional silent site.
            d.   The number of sites that differed between the two sequences was then tabulated for that window.
            e.   Steps (c) and (d) were repeated until the entire sequence had been examined.
            f.   The data for a pair of sequences thus consisted of a set of numbers, corresponding to the number of variable sites
                 within each window.

        3.   The number for a particular window was then averaged over all pairs of Fast and Slow alleles to yield an average
              number of polymorphic silent sites for each window.
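
The windowing procedure in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each input string has already been aligned and reduced to its silent sites only, and the function names and the toy sequences are hypothetical.

```python
def sliding_window_differences(seq_a, seq_b, window=100):
    """Count differing sites in every window of `window` consecutive silent sites.

    seq_a and seq_b are assumed to be aligned and to contain silent sites only.
    The window advances by one silent site at a time (steps 2c to 2e).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = [int(a != b) for a, b in zip(seq_a, seq_b)]
    if len(diffs) < window:
        return []
    count = sum(diffs[:window])        # first window, at the 5' end
    counts = [count]
    for start in range(1, len(diffs) - window + 1):
        # Drop the site leaving at the 5' end, add the site entering at the 3' end.
        count += diffs[start + window - 1] - diffs[start - 1]
        counts.append(count)
    return counts


def average_over_pairs(slow_alleles, fast_alleles, window=100):
    """Average each window's count over all Slow-Fast pairs (step 3)."""
    per_pair = [sliding_window_differences(s, f, window)
                for s in slow_alleles for f in fast_alleles]
    return [sum(column) / len(per_pair) for column in zip(*per_pair)]


# Toy data: one Slow and two Fast sequences of 8 silent sites, window of 4.
toy_slow = ["AAAAAAAA"]
toy_fast = ["AAATTAAA", "AAAAAAAA"]
single_pair = sliding_window_differences(toy_slow[0], toy_fast[0], window=4)
averaged = average_over_pairs(toy_slow, toy_fast, window=4)
print(single_pair)
print(averaged)
```

Plotting the averaged counts against the midpoint of each window would give a profile of the kind described in the next section.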

    G.  The results

        1.  Using the data just described, the authors plotted number of variable silent sites vs. position of the corresponding
             window (where that position is defined by the midpoint location of the window):

Variability at silent sites in the region of the D. melanogaster Adh locus portrayed using the sliding window technique.  The observed variability shows a marked peak centered on the site of amino acid substitution between Fast and Slow alleles (arrow).  The expected curve portrays the amount of variability expected under the assumption of no balancing selection. (From M. Kreitman and R. R. Hudson.  1991.  Inferring the evolutionary histories of the Adh  and Adh-dup loci in Drosophila melanogaster  from patterns of polymorphism and divergence.  Genetics 127: 565-582)
 
 
 
 

        2.  The data clearly demonstrate the existence of a peak of silent-site variation centered on the position of the amino-acid
             substitution responsible for the Fast-Slow polymorphism.

    H.  Conclusions

        1.  The pattern of silent-site variation in the region of the Fast-Slow polymorphism deviates markedly from the pattern
             expected in the absence of balancing selection at that polymorphism.

        2.  The deviation is in the direction expected if balancing selection has been acting on that polymorphism for extended
             periods.

        3.  The method used provides a powerful way of detecting the operation of long-term balancing selection.

X.  Assessing Historical patterns of selection:  Other patterns

    A.  Selective sweeps

        1.  Consider a non-synonymous, advantageous mutation that arises in a population and is fixed rapidly (within a few
             hundred generations)--a "selective sweep"

        2.  At the time of fixation all copies of that allele are recently descended from the original mutant copy.

        3.  There will not have been much time for mutations to accumulate on any of those copies in the short time since the
             mutation arose.

        4.   Consequently, all copies will be identical not only at the selected site, but at most other sites, and in particular, at most
              silent sites.

        5.   The amount of silent-site variation will thus be below the equilibrium level discussed in section IX.C above.

        6.   This depletion of variation will drop off as one moves away from the selected site, however, due to the effects of
              recombination, just as we saw previously.

        7.   The result is that the pattern of silent-site variation characterizing recent selective sweeps is a local depression of
              variation in the region immediately surrounding the selected site--the opposite of what is seen with balancing selection:
 

        8.   Thus, one is able to infer the occurrence of recent selective sweeps by local depression of silent-site variation.

        9.   This is another example of how patterns of variation within a gene may be used to infer something about the historical
              pattern of selection that has operated on that gene.
 
 





