Principles of Evolution Web Page
        Handout: Quantitative Genetics Statistics



FROM LECTURE 10:   QUANTITATIVE GENETICS

HANDOUT: Essential Statistics for Quantitative Genetics

To understand quantitative genetics, it's essential to know some statistics. The best approach is to take a good course in statistics, but for the time being, here's a brief introduction. Our objective is to describe measurements of a particular quantitative trait made on a sample of individuals from a population. The three main concepts presented here are (a) central tendency, (b) variability (or dispersion), and (c) the relationship between two traits (or variables).
 
 

Central tendency

Consider collecting data on a randomly selected sample of individuals from a specific population. Think of taking measurements on two quantitative traits, x and y. Constructing a histogram for one of the traits would give us some indication of the distribution of this trait in the population. Remember our histogram of height for the students in the class. The first descriptive statistic we would want to compute is the average, or sample mean, which reflects the central tendency of the distribution. For x, the mean of a sample of n individuals with measurements x_1, ..., x_n is simply:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

A remarkable property of the sample mean, which makes it such a valuable statistical tool, follows from the Central Limit Theorem, which states that whenever n is moderately large, the sample mean has approximately a normal distribution, regardless of the distribution of the underlying variable x (provided that distribution has finite variance). This is worth proving to yourself: generate a collection of numbers with any distribution, repeatedly take fairly large samples from that collection, and calculate the mean of each sample. Then construct a histogram of the means you calculated. What might this tell us about quantitative traits, remembering that we generally consider these traits to be controlled by many loci with additive effects?
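The exercise suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exponential "population", the sample size of 50, the 2,000 repeated samples, and the helper names sample_means and text_histogram are all choices made here, not part of the lecture.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples random samples (with replacement) and return their means."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


def text_histogram(values, n_bins=10, width=50):
    """Return a crude text histogram of values as a list of lines."""
    lo, hi = min(values), max(values)
    bin_width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls in the last bin.
        idx = min(int((v - lo) / bin_width), n_bins - 1)
        counts[idx] += 1
    peak = max(counts)
    return [
        f"{lo + i * bin_width:6.2f} | " + "#" * round(width * c / peak)
        for i, c in enumerate(counts)
    ]


rng = random.Random(1)  # fixed seed so the run is repeatable

# Assumed population: strongly right-skewed (exponential with mean 1).
population = [rng.expovariate(1.0) for _ in range(20_000)]

# Repeatedly sample 50 values and record the mean of each sample.
means = sample_means(population, sample_size=50, n_samples=2_000, rng=rng)

print("Population (skewed):")
print("\n".join(text_histogram(population)))
print()
print("Means of samples of size 50 (roughly bell-shaped):")
print("\n".join(text_histogram(means)))
```

The first histogram should pile up near zero with a long right tail, while the histogram of means should look roughly symmetric around the population mean. Swapping in a different starting distribution is a quick way to check that the shape of the second histogram does not depend much on the first.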
 
 

Variability

The two measures of variability most often considered are the sample variance and the standard deviation. Both measure the width of the distribution of the trait of interest. First, here's the sample variance for x:

$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

This describes the variability in terms of an "average" of squared deviations from the sample mean (we divide by n-1 rather than n because this gives an unbiased estimator of the population variance). However, the units of the sample variance are the square of the units of the original variable, so an alternative measure of variability is the standard deviation, which is expressed in the original units. The sample standard deviation is the square root of the variance:

$$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
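As a small worked example, the sample variance and standard deviation can be computed directly from their definitions. The five heights below are made-up numbers, and the function names sample_variance and sample_sd are just labels chosen for this sketch. Python's standard statistics module also uses the n-1 divisor, so it serves as a cross-check.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance (same units as the data)."""
    return math.sqrt(sample_variance(xs))


# Hypothetical heights in cm, invented for illustration.
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]

print(sample_variance(heights_cm))  # 62.5 (units: cm squared)
print(round(sample_sd(heights_cm), 2))  # 7.91 (units: cm)

# Cross-check against the standard library, which also divides by n - 1.
print(math.isclose(sample_variance(heights_cm), statistics.variance(heights_cm)))
print(math.isclose(sample_sd(heights_cm), statistics.stdev(heights_cm)))
```

Here the mean is 170, the squared deviations are 100, 25, 0, 25, and 100, and their sum of 250 divided by n-1 = 4 gives 62.5.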
 
 






Relationship between variables

We will consider two general ways of describing the relationship between variables: regression and correlation. There are some very important distinctions between the applications of these two analyses that we won't go into here (but they would be an important part of a statistics course dealing with these topics).

The simplest idea to think about for the relationship between two variables is the correlation. For example, what is the correlation between height and weight in humans? There is certainly a positive relationship, but to quantify it we calculate the sample correlation coefficient r. For two variables x and y with sample standard deviations s_x and s_y, the sample correlation coefficient is:

$$r = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x \, s_y}$$

which is the covariance between x and y (in the numerator) divided by a measure of the variation in x and the variation in y (the product of their standard deviations). The correlation coefficient ranges in value from -1 to +1, with zero indicating no linear relationship. The covariance is another descriptive quantity; it measures the average amount that two variables "covary."
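A minimal sketch of this calculation, using made-up height and weight values, is shown here. The function names sample_covariance and sample_correlation are illustrative choices, and the variance of a variable is computed as its covariance with itself.

```python
import math


def sample_covariance(xs, ys):
    """Average product of paired deviations from the means (n - 1 divisor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # variance = covariance with itself
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical paired measurements, invented for illustration.
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]
weights_kg = [55.0, 60.0, 68.0, 70.0, 77.0]

r = sample_correlation(heights_cm, weights_kg)
print(f"covariance = {sample_covariance(heights_cm, weights_kg):.1f}")  # 67.5
print(f"r = {r:.3f}")  # 0.989
```

The covariance (67.5) carries the units cm times kg, while r is unitless. Dividing by the two standard deviations is what removes the units and confines the result to the range -1 to +1.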

A regression can also be used to describe the relationship between x and y (again, there are some technical differences). In the context of parent-offspring regressions, we are interested in finding the best-fit regression line for a plot of y against x. The equation for a regression line is similar to the familiar y = mx + b. It is:

$$y = \beta_0 + \beta_1 x$$

where β0 is the intercept of the line and β1 is the slope of the line. To find the best-fit regression line, a commonly used method is least squares, which determines the line that minimizes the sum of the squared deviations from the line (measured as vertical distances from the line). With this method the best-fitting straight line is determined by the formulas:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

and

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

The "hats" on these parameters indicate they are estimates. Note the similarity between the estimate for the slope of the line and the correlation coefficient.
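To make the least-squares calculation and its link to the correlation coefficient concrete, here is a short sketch with invented parent and offspring values. The function name least_squares_fit and the data are assumptions made for illustration. The last lines check numerically that the slope equals r multiplied by s_y / s_x, which is the similarity noted above: both share the same numerator and differ only in how they are scaled.

```python
import math


def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical parent and offspring trait values, invented for illustration.
parent_cm = [160.0, 165.0, 170.0, 175.0, 180.0]
offspring_cm = [164.0, 166.0, 171.0, 173.0, 176.0]

intercept, slope = least_squares_fit(parent_cm, offspring_cm)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# Sample standard deviations and correlation coefficient for comparison.
n = len(parent_cm)
mean_x, mean_y = sum(parent_cm) / n, sum(offspring_cm) / n
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in parent_cm) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in offspring_cm) / (n - 1))
cov_xy = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(parent_cm, offspring_cm)
) / (n - 1)
r = cov_xy / (s_x * s_y)

# The slope is the correlation rescaled by the ratio of standard deviations.
print(math.isclose(slope, r * s_y / s_x))
```

For these numbers the sum of cross products is 155 and the sum of squared deviations in x is 250, so the slope is 155/250 = 0.62 and the intercept is 170 - 0.62 × 170 = 64.6.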
 

