

Lab 7: Estimating Population Structure
Revised October 12, 2012

Introduction

Real populations are often subdivided into smaller, more homogeneous spatial units called subpopulations or demes, whose degree of differentiation depends on the amount of genetic drift, the rates of gene flow and mutation, and the spatial variability of selection. The main goals of this lab are to:

1) Demonstrate the opposing effects of genetic drift and gene flow on population differentiation,
2) Gain hands-on experience with estimating and interpreting statistics that characterize population structure, and
3) Use Bayesian inference to simultaneously infer population structure and population assignment.

Population Structure

The extent of population structure is most commonly characterized by three inter-related parameters, $F_{IT}$, $F_{ST}$, and $F_{IS}$, which are known as F-statistics (or F-coefficients):

Parameter: $F_{IT}$
Meaning/Formula: Deviation from Hardy-Weinberg Equilibrium (HWE) in the total population: $F_{IT} = \frac{H_T - H_O}{H_T}$
Probability of IBD for: 2 alleles sampled from the same individual relative to 2 alleles sampled from the total population

Parameter: $F_{ST}$
Meaning/Formula: Genetic differentiation among subpopulations (or the proportion of genetic variation accounted for by allele frequency differences among subpopulations): $F_{ST} = \frac{H_T - H_S}{H_T}$
Probability of IBD for: 2 alleles sampled from the same subpopulation relative to 2 alleles sampled from the total population

Parameter: $F_{IS}$
Meaning/Formula: Deviation from HWE in a subpopulation: $F_{IS} = \frac{H_S - H_O}{H_S}$
Probability of IBD for: 2 alleles sampled from the same individual relative to 2 alleles sampled from the same subpopulation

where $H_T$ is the expected heterozygosity in the total population, $H_S$ is the mean expected heterozygosity in subpopulations, and $H_O$ is the mean observed heterozygosity in subpopulations. The relationship between these parameters is:

$$1 - F_{IT} = (1 - F_{IS})(1 - F_{ST}).$$

The intuitive interpretation of this relationship is that a pair of alleles sampled from the total population will not be IBD ($1 - F_{IT}$) if they escape the effects of mating between relatives ($1 - F_{IS}$) and the effects of mating occurring primarily within subpopulations ($1 - F_{ST}$).
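The three F-statistics and the relationship between them can be checked numerically. The sketch below is only an illustration: the function name `f_statistics` and the heterozygosity values are hypothetical and do not come from the lab data.

```python
def f_statistics(h_o, h_s, h_t):
    """Return (F_IS, F_ST, F_IT) from three heterozygosities.

    h_o: mean observed heterozygosity in subpopulations (H_O)
    h_s: mean expected heterozygosity in subpopulations (H_S)
    h_t: expected heterozygosity in the total population (H_T)
    """
    f_is = (h_s - h_o) / h_s
    f_st = (h_t - h_s) / h_t
    f_it = (h_t - h_o) / h_t
    return f_is, f_st, f_it


# Hypothetical heterozygosities, chosen only for illustration.
f_is, f_st, f_it = f_statistics(h_o=0.30, h_s=0.40, h_t=0.50)
print(f"F_IS = {f_is:.3f}, F_ST = {f_st:.3f}, F_IT = {f_it:.3f}")

# The identity 1 - F_IT = (1 - F_IS)(1 - F_ST) should hold up to rounding.
assert abs((1 - f_it) - (1 - f_is) * (1 - f_st)) < 1e-12
```

With these made-up inputs, the deficit of heterozygotes within subpopulations and the differentiation among subpopulations combine multiplicatively to give the total deficit.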


Analysis of Molecular Variance (AMOVA)

AMOVA is a statistical method that allows genetic variation (measured as the genetic distance among alleles or haplotypes) to be partitioned into various hierarchical levels. If, for example, subpopulations can be logically grouped into multiple regions within the total population, the among-subpopulation component of genetic variation $F_{ST}$ can be further partitioned into variation explained by among-region differences ($F_{RT}$) and variation explained by differences among subpopulations within regions ($F_{SR}$). The estimates of hierarchical F-coefficients from AMOVA are called Φ-statistics and are calculated by partitioning variation into different sources very much like in the conventional Analysis of Variance (ANOVA).

Parameter: $F_{ST}$
Estimate in AMOVA: $\Phi_{ST} = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_a^2 + \sigma_b^2 + \sigma_c^2}$
Meaning/Formula: Genetic differentiation among subpopulations (the proportion of genetic variation which is accounted for by allele frequency differences among subpopulations): $F_{ST} = \frac{H_T - H_S}{H_T}$
Probability of IBD for: 2 alleles sampled from the same subpopulation relative to 2 alleles sampled from the total population

Parameter: $F_{SR}$
Estimate in AMOVA: $\Phi_{SC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_c^2}$
Meaning/Formula: Genetic differentiation among subpopulations within a region (the proportion of genetic variation which is accounted for by allele frequency differences among subpopulations within a region): $F_{SR} = \frac{H_R - H_S}{H_R}$
Probability of IBD for: 2 alleles sampled from the same subpopulation relative to 2 alleles sampled from the same region

Parameter: $F_{RT}$
Estimate in AMOVA: $\Phi_{CT} = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_b^2 + \sigma_c^2}$
Meaning/Formula: Genetic differentiation among regions (the proportion of genetic variation which is accounted for by allele frequency differences among regions): $F_{RT} = \frac{H_T - H_R}{H_T}$
Probability of IBD for: 2 alleles sampled from the same region relative to 2 alleles sampled from the total population

where $\sigma_a^2$ is the variance component due to differences among regions, $\sigma_b^2$ is the variance component due to differences among subpopulations, and $\sigma_c^2$ is the variance component due to differences among alleles within subpopulations.

Problem 1 (5 points). The file human_struc.xls (which is already in GenAlEx format) contains data for 10 microsatellite loci used to genotype 41 human populations from a worldwide sample.

a) Five regions are already defined in the file (AFRICA, AMERICA, EAST ASIA, EURASIA, and OCEANIA). Convert the file into Arlequin format and perform AMOVA based on this grouping of populations within regions. How do you interpret these results? Report values of Φ-statistics and their statistical significance.

b) Do you think that any of these regions can justifiably be divided into subregions? Pick a region, form a hypothesis for what would be a reasonable grouping of populations into subregions (see information in Appendix 1 and map in Appendix 2), then run AMOVA only for the region you selected. Was your hypothesis supported by the data?


c) GRADUATE STUDENTS ONLY: Which of the 5 initially defined regions has the highest diversity in terms of effective number of alleles? What is your biological explanation for this? Make sure that you cite your sources, and avoid dubious internet sites.
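When reading AMOVA output for Problem 1, it can help to recompute the Φ-statistics by hand from the three variance components defined in the AMOVA section. The sketch below is only an illustration: the function name `phi_statistics` and the variance components are hypothetical and are not Arlequin output.

```python
def phi_statistics(var_a, var_b, var_c):
    """Return (Phi_ST, Phi_SC, Phi_CT) from AMOVA variance components.

    var_a: variance component among regions
    var_b: variance component among subpopulations
    var_c: variance component among alleles within subpopulations
    """
    total = var_a + var_b + var_c
    phi_st = (var_a + var_b) / total
    phi_sc = var_b / (var_b + var_c)
    phi_ct = var_a / total
    return phi_st, phi_sc, phi_ct


# Hypothetical variance components, chosen only for illustration.
phi_st, phi_sc, phi_ct = phi_statistics(var_a=0.05, var_b=0.10, var_c=0.85)
print(f"Phi_ST = {phi_st:.4f}, Phi_SC = {phi_sc:.4f}, Phi_CT = {phi_ct:.4f}")

# The definitions imply 1 - Phi_ST = (1 - Phi_SC)(1 - Phi_CT).
assert abs((1 - phi_st) - (1 - phi_sc) * (1 - phi_ct)) < 1e-12
```

The final assertion mirrors the multiplicative relationship among the F-statistics given earlier in this handout.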

Guidelines for Problem 1

• Download file human_struc.xls and convert it to Arlequin format. Use the default options.

• Once you load the data into Arlequin, go to Settings and select AMOVA.

• Select AMOVA computations and do not change anything else.

• The output of AMOVA is straightforward, so you should be able to easily extract the estimated values of the Φ-statistics (reported as F-statistics in Arlequin output).

• Make sure you look at the bootstrap P-values. If these are reported as 0, the F-statistic is significantly different from 0.

• Once you complete the two AMOVAs required for a) you will need to pick a region and form a hypothesis of how subpopulations within that region can be grouped. For example, you can take region AMERICA and form subregions NORTH AMERICA and SOUTH AMERICA.

• After you pick a region and a grouping within the region, go back to the original file human_struc.xls and delete the data for all other regions, then make changes to reflect the dimensions of the newly formed data set (see below).

• For example, the sample file below contains data for 6 microsatellite loci used to genotype 24 individuals from 8 populations, which are grouped into 2 regions.

• Assume that you decide to focus on Region 1 and form a group consisting of populations 1 and 2 and a group consisting of populations 3 and 4:

- First, delete all rows containing data for all other populations (in this case, populations 5, 6, 7, and 8 or individuals 13-24).

- Then, make changes to the first line of the file to reflect the new dimensions of the data (see below). Be sure that the order of populations on the top line matches the order in which they appear in the sample genotypes.

• At this point you are ready to export the data into Arlequin format and run AMOVA as in a).


Bayesian Inference of Population Structure

Because a priori identification of population structure is inevitably subjective (and often completely incorrect), there has been a great deal of interest in developing methods of inferring population structure objectively from genotypic data. The approach implemented in the Structure program (Pritchard et al. 2000) has been by far the most popular of these methods. The underlying idea of this method is to assign individuals to K groups (or clusters) in a way that minimizes the deviations from Hardy-Weinberg and linkage equilibrium in each group. Thus, this approach uses the multilocus genotypes of a sample of individuals to simultaneously estimate the number of genetically distinct groups (K) and assign individuals to these groups. When some degree of gene flow among the K groups is assumed, Structure does not assign individuals to groups categorically, but rather estimates the admixture proportions (i.e., the proportion of the genome derived from each of the K groups) for each individual.

The method is entirely based on Bayesian inference through Markov Chain Monte Carlo (MCMC) simulations in which parameter values are ‘visited’ with frequencies proportional to their posterior probabilities. MCMC methods work well only when enough iterations are run to wash out the residual effect of the random initial parameter values (this is called Burnin or Dememorization period) and enough iterations are run after that to accurately approximate the posterior probability distributions of the estimated parameters. It is not trivial to provide a definitive answer to the question “What is enough?” but a good general principle is that more is better. This means that when an MCMC approach is taken for the solution of a problem, one needs to have either a lot of patience or a fairly powerful computer (and preferably both).

The first step in analyses using Structure is to determine the number of distinct genetic groups K. The program outputs an estimated log-likelihood for each value of K being tested. If the question of interest is whether there are two or three distinct genetic groups, for example, a fictitious output from the program could be:

K    Log-likelihood Ln[P(Data | K)]
2    -1235
3    -1238


The hypothesis K = 2 has a higher log-likelihood, and if the prior probabilities of the two hypotheses are assumed to be the same, their posterior probabilities are:

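$$P(K=2 \mid Data) = \frac{e^{-1235}}{e^{-1235} + e^{-1238}} = \frac{e^{-1235}}{e^{-1235}(1 + e^{-3})} = \frac{1}{1 + e^{-3}} = 0.9526,$$

$$P(K=3 \mid Data) = \frac{e^{-1238}}{e^{-1235} + e^{-1238}} = \frac{e^{-1238}}{e^{-1238}(e^{3} + 1)} = \frac{1}{e^{3} + 1} = 0.0474.$$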
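The same calculation can be scripted for any set of K values. The sketch below assumes equal prior probabilities, as in the example above; the function name `posterior_probabilities` is hypothetical. Because a term such as e^(-1235) underflows to zero in floating-point arithmetic, the code subtracts the largest log-likelihood before exponentiating, which is the same algebraic step as factoring out the common term in the equations above.

```python
import math


def posterior_probabilities(log_likelihoods):
    """Posterior probability of each K, assuming equal priors.

    log_likelihoods: dict mapping K to Ln[P(Data | K)].
    """
    # Shift by the maximum so the exponentials do not underflow to zero.
    max_ll = max(log_likelihoods.values())
    weights = {k: math.exp(ll - max_ll) for k, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# The fictitious log-likelihoods from the example in the text.
posteriors = posterior_probabilities({2: -1235.0, 3: -1238.0})
for k, p in sorted(posteriors.items()):
    print(f"P(K={k} | Data) = {p:.4f}")
# Prints 0.9526 for K=2 and 0.0474 for K=3.
```

For the lab, the inputs would be the mean log-likelihoods for each K; any number of K values can be compared at once.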

This approach often does not provide a definitive answer because the differences in likelihood ratios are sometimes subtle, and there is no theoretical basis for calculating significance. An alternative is to identify the most likely value of K by examining the rate of change in likelihoods between adjacent values of K, which provides a robust, graphical approach to identifying the most appropriate K value (Evanno et al., 2005). This is known as the ΔK method, and is now a standard component of population structure analyses. We will implement and compare both of these approaches in this lab exercise.

Once the value of K that has the highest posterior probability has been determined, the admixture proportions estimated for individuals and subpopulations defined a priori can be scrutinized in an attempt to identify a biologically meaningful, objective pattern of population genetic structure. The admixture coefficients in the bar plot shown below, for example, clearly illustrate the existence of two distinct genetic groups, as well as their admixture in a zone of hybridization between two cottonwood species (Populus fremontii and Populus angustifolia) along the Weber River in Utah.
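The ΔK statistic mentioned above can be sketched in a few lines. This is a simplified illustration rather than the Structure Harvester implementation: it takes ΔK as the absolute second-order difference of the mean log-likelihood, |L(K+1) - 2L(K) + L(K-1)|, divided by the standard deviation of L(K) across replicate runs, following Evanno et al. (2005). The function name `delta_k` and the replicate values are hypothetical.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: delta K} for every K that has both neighbors K-1 and K+1.

    lnp_by_k: dict mapping K to a list of Ln P(Data | K) values,
              one value per replicate run.
    """
    means = {k: mean(runs) for k, runs in lnp_by_k.items()}
    sds = {k: stdev(runs) for k, runs in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        # Delta K is undefined at the smallest and largest K tested,
        # and when all replicates for K give the same likelihood.
        if k - 1 in means and k + 1 in means and sds[k] > 0:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_diff / sds[k]
    return result


# Hypothetical replicate log-likelihoods, three runs per K.
runs = {
    1: [-1300.0, -1302.0, -1298.0],
    2: [-1235.0, -1236.0, -1234.0],
    3: [-1238.0, -1240.0, -1236.0],
    4: [-1245.0, -1250.0, -1240.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, "-> best-supported K:", best_k)
```

Note that ΔK cannot be computed for the smallest or largest K tested, so the method cannot by itself support K = 1.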

Problem 2.

NOTE: The instructions given during the lab session were a bit misleading, so many of you did not follow the instructions detailed below. A revised version of Problem 2 is provided here, followed by the original version. If you answer the revised version, you will receive full credit. If you answer the original problem, you may receive up to 3 points of extra credit. Answer only one.

Revised Problem 2 (worth 5 points): Use Structure to determine the best-supported number of groups for the full Problem 1 dataset.

a) Calculate the posterior probabilities to test whether:
   i. There are four genetically distinct groups, or
   ii. There are five genetically distinct groups.

If the posterior probability cannot be calculated, you can base your inferences on the average likelihood scores for each value of K.

b) Use the ΔK method to determine the most likely number of groups. How does this compare to the method based on posterior probabilities?

c) How do the groupings of subpopulations compare to your expectations from Problem 1?

d) Is there evidence of admixture among the groups? If so, include a table or figure showing the proportion of each subpopulation assigned to each group.

e) GRADUATE STUDENTS ONLY: Provide a brief, literature-based explanation for the groupings you observe.

[Figure: bar plot of admixture coefficients, with sections labeled P. fremontii, Hybrid zone, and P. angustifolia.]


Original Problem 2 (worth 8 points if you do this instead of the revised version): Use Structure to further test the hypotheses you developed in Problem 1b.

a) Calculate the posterior probabilities to test whether:
   i. All subpopulations form a single, genetically homogeneous group.
   ii. There are two genetically distinct groups within your selected region.
   iii. There are three genetically distinct groups within your selected region.

b) Use the ΔK method to determine the most likely number of groups. How does this compare to the method based on posterior probabilities?

c) How do the groupings of subpopulations compare to your expectations from Problem 1?

d) Is there evidence of admixture among the groups? If so, include a table or figure showing the proportion of each subpopulation assigned to each group.

e) GRADUATE STUDENTS ONLY: Provide brief, literature-based explanation for the groupings you observe.

Guidelines for Problem 2.

• The Structure user interface is a bit finicky, so please follow these instructions carefully!

• Go to the data sheet in the file you created for Problem 1b. Select GenAlEx / Export Data / Structure. Click OK on the Structure Export Parameters window and save the file as an MS-DOS Formatted Text file in the Documents folder.

• The rest of this lab will work best if you reboot your computer with Mac OSX. Files can be found in the Windows 7 link on the desktop.

• Launch Structure from the program tray or the programs folder.

• Under File, select New Project, give a name to your project, select a directory in which the project parameters and results will be stored (e.g., Documents), and load the data file you exported using GenAlEx. Then click on Next.


Notes:

• You can get to the class folders by browsing to “iMac”, “Volumes” and “Windows 7.” But the Documents folder is the most convenient place to access the files with this program.

• If you open the directory rather than just select it when browsing, the program will not accept your choice. Just click on the data folder once.

• Set Number of individuals (refer to top line of GenAlEx file if you forget), Number of loci = 10, Missing data value = -9. Then click Next.

• Check Row of marker names and Data file stores data for individuals in a single line, then click Next.


• Check Individual ID for each individual, Putative population of origin for each individual, and USEPOPINFO selection flag, then click Finish and Proceed on the Confirmation window.

• Once the data have been successfully loaded in the project, select Parameter Set / New. Set Length of Burnin Period = 10,000 and Number of MCMC Reps after Burnin = 10,000, click OK, give a name to this parameter set, then click OK again.

• Select Project / Start a job. Set K from 1 to 5, set the number of iterations to 5, and click on the parameter set you generated, then click Start. If everything is working properly, you should see numbers scrolling in the black window at the bottom of your screen.


• Be patient, the analyses will take a while. When the runs are done, a window will pop up to inform you. If you cannot switch to the Structure menu by clicking with the mouse, try navigating with the alt+tab or function_tab. The java windows sometimes get lost.

• The results of these runs can be accessed through the Structure interface and copied to Excel for further analyses. However, it can be tedious to navigate through the files and copy and paste all of the key outputs. Fortunately, a convenient analysis pipeline, Structure Harvester, has been developed by Dent Earl. However, leave Structure open, because we will be returning to your output.

• Download structureHarvester.tar.gz from the class website or go to http://users.soe.ucsc.edu/~dearl/software/struct_harvest/

• Save the file to the desktop.

• Double click on “structureHarvester.py.tar.gz” and open the folder structureHarvester that it creates.

• Put the “Results” folder from Structure output in the structureHarvester folder on the desktop.

• Open X11 (the white box icon with a big black X inside, in your program tray) or Finder: Applications/Utilities/X11.app

• At the command prompt, type cd /Users/biology/Desktop/structureHarvester/

• Type ./structureHarvester.py --dir=Results --out=output --evanno and hit return.

• Now you should see two files (evanno.txt and summary.txt) in the output folder.

• Open evanno.txt in Excel. Make a line chart with delta K as the Y-value and K as the X-value.

• Use the summary.txt file to calculate posterior probabilities using the average values at the top of the sheet.

• To determine the proportion of each subpopulation assigned to each cluster, you should first identify which run had the highest log likelihood among the runs for the best K value, as determined in parts a) and b). These values are conveniently summarized at the bottom of the Structure Harvester output.

• Return to the Structure program and select the correct run under the Results pane on the right. The individual admixture coefficients (which are further down in the same window) can be visualized by selecting Bar plot / Show / Group by Pop ID. The bar plots can be exported as images in JPEG format (click on Save).


• Population names are not retained in Structure, but the order from the original MS Excel file is preserved. Bar plots containing population names can be produced with the programs CLUMPP and Distruct (Rosenberg 2004).

Bibliography:

Cann, H. M., C. de Toma, L. Cazes, M. F. Legrand, V. Morel, L. Piouffre, J. Bodmer, W. F. Bodmer, B. Bonne-Tamir, A. Cambon-Thomsen, Z. Chen, J. Y. Chu, C. Carcassi, L. Contu, R. F. Du, L. Excoffier, G. B. Ferrara, J. S. Friedlaender, H. Groot, D. Gurwitz, T. Jenkins, R. J. Herrera, X. Y. Huang, J. Kidd, K. K. Kidd, A. Langaney, A. A. Lin, S. Q. Mehdi, P. Parham, A. Piazza, M. P. Pistillo, Y. P. Qian, Q. F. Shu, J. J. Xu, S. Zhu, J. L. Weber, H. T. Greely, M. W. Feldman, G. Thomas, J. Dausset, and L. L. Cavalli-Sforza. 2002. A human genome diversity cell line panel. Science 296:261-262.

Evanno, G., S. Regnaut, and J. Goudet. 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Molecular Ecology 14:2611-2620.

Excoffier, L., P. Smouse, and J. M. Quattro. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: applications to human mitochondrial DNA restriction data. Genetics 131:479-491.

Hartl, D., and A. G. Clark. 1997. Principles of Population Genetics. Sinauer Associates, Inc., Sunderland, MA. Chapter 4.

Hedrick, P. W. 2011. Genetics of Populations. Jones and Bartlett, Sudbury, MA. Chapter 7.

Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics 155:945-959.

Rosenberg, N. A. 2004. DISTRUCT: a program for the graphical display of population structure. Molecular Ecology Notes 4:137-138.

Rosenberg, N. A., J. K. Pritchard, J. L. Weber, H. M. Cann, K. K. Kidd, L. A. Zhivotovsky, and M. W. Feldman. 2002. Genetic structure of human populations. Science 298:2381-2385.


Appendix 1

Geographic Origin | Coordinates | Population

SUBSAHARAN AFRICA
Central African Republic | 4N, 17E | Biaka Pygmy*
Democratic Republic of Congo | 1N, 29E | Mbuti Pygmy*
Senegal | 12N, 12W | Mandenka
Nigeria | 6-10N, 2-8E | Yoruba
Namibia | 21S, 20E | San
Kenya | 3S, 37E | Bantu NE (12)
S. Africa Bantu S.E. | 29S, 30E | Bantu S.E. Pedi (1)
S. Africa Bantu S.E. | 29S, 29E | Bantu S.E. Sotho (1)
S. Africa Bantu S.E. | 28S, 24E | Bantu S.E. Tswana (2)
S. Africa Bantu S.E. | 28S, 31E | Bantu S.E. Zulu (1)
S. Africa Bantu S.W. | 22S, 19E | Bantu S.W. Herero (2)
S. Africa Bantu S.W. | 19S, 18E | Bantu S.W. Ovambo (1)

NORTH AFRICA
Algeria (Mzab) | 32N, 3E | Mozabite

MIDDLE EAST
Israel (Negev) | 31N, 35E | Bedouin*
Israel (Carmel) | 32N, 35E | Druze
Israel (Central) | 32N, 35E | Palestinian

ASIA
Pakistan | 30-31N, 66-67E | Brahui
Pakistan | 30-31N, 66-67E | Balochi
Pakistan | 33-34N, 70E | Hazara
Pakistan | 26N, 62-66E | Makrani
Pakistan | 24-27N, 68-70E | Sindhi*
Pakistan | 32-35N, 69-72E | Pathan
Pakistan | 35-37N, 71-72E | Kalash
Pakistan | 36-37N, 73-75E | Burusho
China | 26-39N, 108-120E | Han*
China | 29N, 109E | Tujia (minority)
China | 28N, 103E | Yizu (Yi) (minority)
China | 28N, 109E | Miaozu (Miao) (minority)
China | 48-53N, 122-131E | Oroqen (minority)*
China | 48-49N, 124E | Daur (minority)
China | 48-49N, 118-120E | Mongola (minority)*
China | 47-48N, 132-135E | Hezhen (minority)*
China | 43-44N, 81-82E | Xibo (minority)
China | 44N, 81E | Uygur (minority)
China | 21N, 100E | Dai (minority)
China | 22N, 100E | Lahu (minority)
China | 27N, 119E | She (minority)
China | 26N, 100E | Naxi (minority)
China | 36N, 101E | Tu (minority)
Siberia | 62-64N, 129-130E | Yakut*
Japan | 38N, 138E | Japanese*
Cambodia | 12N, 105E | Cambodian

OCEANIA
New Guinea | 4S, 143E | Papuan
Bougainville | 6S, 155E | NAN Melanesian

EUROPE
France | 46N, 2E | French (various regions)
France | 43N, 0 | Basque
Italy | 40N, 9E | Sardinian
Italy | 46N, 10E | from Bergamo
Italy | 43N, 11E | Tuscan
Orkney Islands | 59N, 3W | Orcadian
Russia Caucasus | 44N, 39E | Adygei
Russia | 61N, 39-41E | Russian

AMERICA
Mexico | 29N, 108W | Pima (relative pairs)
Mexico | 19N, 91W | Maya (relative pairs)
Colombia | 3N, 68W | Piapoco and Curripaco*
Brazil | 10S, 63W | Karitiana (relative pairs)
Brazil | 11S, 62W | Surui (relative pairs)


Appendix 2