Lecture 2: Population Structure
02-715 Advanced Topics in Computational Genomics
What is population structure?
• Population Structure
  – A set of individuals characterized by some measure of genetic distinction
  – A "population" is usually characterized by a distinct distribution over genotypes
  – Example: genotypes aa, aA, AA
[Figure: genotype distributions for Population 1 and Population 2]
1000 Genomes Project
Motivation
• Reconstructing individual ancestry: The Genographic Project
  – https://genographic.nationalgeographic.com/genographic/index.html
• Studying human migration
  – Out of Africa
  – Multi-regional hypothesis
• Study of various traits
  – Lactose intolerance
  – Origins in Europe?
  – Infer from
    • Migration studies
    • Mutation studies in populations
[Figure: human migration, labeled 200,000, 50,000, 30,000, and 10,000 years ago. Source: https://genographic.nationalgeographic.com/genographic/index.html]
Overview
• Background
  – Hardy-Weinberg Equilibrium
  – Genetic drift
  – Wright's FST
• Inferring population structure from genotype data
  – Structure (Falush et al., 2003)
  – Matrix factorization/dimensionality reduction methods (Engelhardt & Stephens, 2010)
Hardy-Weinberg Equilibrium
• Hardy-Weinberg Equilibrium
  – Under random mating, both allele and genotype frequencies in a population remain constant over generations.
  – Assumptions of standard random mating
    • Diploid organism
    • Sexual reproduction
    • Nonoverlapping generations
    • Random mating
    • Large population size
    • Equal allele frequencies in the sexes
    • No migration/mutation/selection
  – Chi-square test for Hardy-Weinberg equilibrium
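The chi-square test named above can be sketched as follows. This is a minimal illustration for one biallelic locus, not the lecture's own code: the function name `hwe_chi_square` and the example counts are hypothetical, and the p-value uses the closed form for a chi-square distribution with 1 degree of freedom.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    Takes observed genotype counts at one biallelic locus and returns
    (chi-square statistic, p-value with 1 degree of freedom).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # estimated frequency of allele A
    q = 1.0 - p                        # estimated frequency of allele a
    observed = [n_AA, n_Aa, n_aa]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    # Skip classes with zero expected count (monomorphic locus).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts: the first sample matches HWE, the second has no heterozygotes.
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)
```

A small p-value suggests the genotype counts deviate from Hardy-Weinberg proportions, for example because of hidden population structure.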
Hardy-Weinberg Equilibrium
• D, H, R: genotype frequencies for AA, Aa, aa, respectively.
• p, q: allele frequencies of A and a
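A sketch of the standard relations between these quantities, using the symbols D, H, R, p, q defined above (each heterozygote contributes half of its alleles to A and half to a):

```latex
p = D + \tfrac{1}{2}H, \qquad
q = R + \tfrac{1}{2}H, \qquad
p + q = D + H + R = 1
```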
Hardy-Weinberg Equilibrium
• The genotype and allele frequencies of the offspring
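Under the random mating assumptions, the offspring frequencies can be written out as below. This is the textbook derivation with primes marking the next generation (the prime notation is an assumption of this sketch); it shows why the allele frequency does not change.

```latex
D' = p^2, \qquad H' = 2pq, \qquad R' = q^2
\\[4pt]
p' = D' + \tfrac{1}{2}H' = p^2 + pq = p\,(p + q) = p
```

Genotype frequencies therefore reach p^2, 2pq, q^2 after one generation and stay there.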
Genetic Drift
• The change in allele frequencies in a population due to random sampling
• Neutral process, unlike natural selection
  – But genetic drift can eliminate an allele from a given population.
• The effect of genetic drift is larger in a small population
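Drift as random sampling can be illustrated with a small simulation. The sketch below assumes a Wright-Fisher style model (each generation redraws all 2N allele copies from the current frequency); the lecture does not name a model, so this choice and the function name `wright_fisher` are assumptions.

```python
import random

def wright_fisher(p0, n_individuals, generations, rng):
    """Simulate the frequency of one allele under pure drift.

    Each generation, all 2N allele copies of a diploid population are
    resampled from the current frequency (binomial sampling).
    Returns the list of frequencies, starting with p0.
    """
    copies = 2 * n_individuals
    p = p0
    trajectory = [p]
    for _ in range(generations):
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
        trajectory.append(p)
    return trajectory

rng = random.Random(0)
small = wright_fisher(0.5, 10, 50, rng)     # small population
large = wright_fisher(0.5, 1000, 50, rng)   # large population
```

Once the frequency hits 0 or 1 it stays there, which is the allele loss mentioned above. Comparing `small` and `large` over repeated runs shows larger fluctuations in the small population.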
Population Divergence
• Wright's FST
  – Statistic used to quantify the extent of divergence among multiple populations relative to the overall genetic diversity
  – Summarizes the average deviation of a collection of populations away from the mean
  – FST = Var(pk) / (p'(1 - p'))
    • p': the overall frequency of an allele across all subpopulations
    • pk: the allele frequency within population k
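The FST formula above can be computed directly for a single allele. This sketch assumes equally sized subpopulations, so p' is the unweighted mean of the pk and Var(pk) is the population variance; the function name `fst` is hypothetical.

```python
def fst(subpop_freqs):
    """Wright's FST for one allele: Var(p_k) / (p' * (1 - p')).

    Assumes equally weighted subpopulations, so p' is the plain mean
    of the subpopulation frequencies p_k.
    """
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k
    if p_bar == 0.0 or p_bar == 1.0:
        return 0.0  # allele fixed or absent everywhere: no diversity to partition
    variance = sum((p - p_bar) ** 2 for p in subpop_freqs) / k
    return variance / (p_bar * (1.0 - p_bar))

identical = fst([0.5, 0.5])   # no divergence
fixed = fst([0.0, 1.0])       # complete divergence
mild = fst([0.4, 0.6])
```

Identical subpopulations give 0, subpopulations fixed for different alleles give 1, and intermediate cases fall in between.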
Scenarios of How Populations Evolve
Methods for Learning Population Structure from Genetic Markers
• Low-dimensional projection
  – PCA-based methods (Patterson et al., PLoS Genetics 2006)
• Clustering
  – Distance-based (Bowcock et al., Nature 1994)
  – Model-based
    • STRUCTURE (Pritchard et al., Genetics 2000)
    • mStruct (Shringarpure & Xing, Genetics 2008)
Probabilistic Models for Population Structure
• Mixture model
  – Cluster individuals into K populations
• Admixture model
  – The genotypes of each individual are an admixture of multiple ancestor populations
  – Assumes alleles are in linkage equilibrium
• Linkage model
  – Model recombination, correlation in alleles across chromosome
• F model
  – Model correlation in alleles in ancestry
Mixture Model
• K populations
• z(i): population of origin of individual i
• For each of the K populations
  – pklj: the frequency of allele j at locus l in population k
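Given the allele frequencies pklj, the population of origin z(i) can be scored with Bayes' rule. The sketch below is an illustration, not the STRUCTURE implementation: it assumes known, strictly positive frequencies and prior, independent loci, and Hardy-Weinberg equilibrium within each population. The name `population_posterior` and the toy numbers are hypothetical.

```python
import math

def population_posterior(genotype, p, prior):
    """Posterior over z(i) for one individual under the mixture model.

    genotype[l] is a pair of allele indices at locus l,
    p[k][l][j] is the frequency of allele j at locus l in population k,
    prior[k] is the prior probability of population k.
    """
    log_weights = []
    for k, prior_k in enumerate(prior):
        log_w = math.log(prior_k)
        for l, alleles in enumerate(genotype):
            for j in alleles:                 # both allele copies at locus l
                log_w += math.log(p[k][l][j])
        log_weights.append(log_w)
    top = max(log_weights)                     # subtract max for numerical stability
    weights = [math.exp(w - top) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

# Two populations, two biallelic loci (toy frequencies).
p = [
    [[0.9, 0.1], [0.8, 0.2]],   # population 0
    [[0.1, 0.9], [0.2, 0.8]],   # population 1
]
posterior = population_posterior([(0, 0), (0, 0)], p, [0.5, 0.5])
```

An individual homozygous for allele 0 at both loci is assigned almost entirely to population 0, where that allele is common.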
Admixture Model
• Relaxes the mixture model's assumption of one ancestral population per individual
• Individuals can have ancestors in multiple different populations
• qk(i): proportion of individual i's genome derived from population k
• Alleles at different loci can come from different populations
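The generative process described above can be sketched as a sampler: for every allele copy, first draw a population from q(i), then draw an allele from that population's frequencies. The function names and the toy frequencies below are hypothetical, and the loci are drawn independently (no linkage).

```python
import random

def sample_categorical(probs, rng):
    """Draw an index with probability probs[index]."""
    u = rng.random()
    cumulative = 0.0
    for index, prob in enumerate(probs):
        cumulative += prob
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point rounding

def sample_admixed_individual(q, p, rng):
    """Sample one diploid individual under the admixture model.

    q[k] is the individual's ancestry proportion from population k,
    p[k][l][j] is the frequency of allele j at locus l in population k.
    Returns (genotype, origins): per-locus pairs of alleles and of
    population labels.
    """
    genotype, origins = [], []
    for l in range(len(p[0])):
        alleles, labels = [], []
        for _ in range(2):                     # two allele copies per locus
            k = sample_categorical(q, rng)     # population of origin of this copy
            labels.append(k)
            alleles.append(sample_categorical(p[k][l], rng))
        genotype.append(tuple(alleles))
        origins.append(tuple(labels))
    return genotype, origins

# Toy setting: each population is fixed for a different allele at both loci.
p = [
    [[1.0, 0.0], [1.0, 0.0]],   # population 0 carries only allele 0
    [[0.0, 1.0], [0.0, 1.0]],   # population 1 carries only allele 1
]
rng = random.Random(0)
pure_genotype, pure_origins = sample_admixed_individual([1.0, 0.0], p, rng)
mixed_genotype, mixed_origins = sample_admixed_individual([0.5, 0.5], p, rng)
```

With q = [1, 0] the model reduces to the mixture model for that individual; with q = [0.5, 0.5] different allele copies can trace back to different populations.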
Structure Model
• Hypothesis: Modern populations are created by an intermixing of ancestral populations.
• An individual's genome contains contributions from one or more ancestral populations.
• The contributions of populations can be different for different individuals.
• Other assumptions
  – Hardy-Weinberg equilibrium
  – No linkage disequilibrium
  – Markers are i.i.d. (independent and identically distributed)
Linkage Model
• From the admixture model, replace the assumption that the ancestry labels zil for individual i, locus l are independent with the assumption that adjacent zil are correlated.
• Use a Poisson process to model the correlation between neighboring alleles
  – dl: distance between locus l and locus l+1
  – r: recombination rate
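The correlation between adjacent ancestry labels can be sketched as a Markov transition matrix. The form used here is an assumption based on the linkage model of Falush et al. (2003): with probability exp(-dl r) no breakpoint falls between the two loci and the label is kept, otherwise a new label is drawn from the admixture proportions q. The function name `ancestry_transition` is hypothetical.

```python
import math

def ancestry_transition(q, d, r):
    """Transition matrix T[a][b] = P(z at locus l+1 = b | z at locus l = a).

    q: admixture proportions of the individual,
    d: distance between locus l and locus l+1,
    r: recombination rate.
    """
    stay = math.exp(-d * r)   # probability of no breakpoint in the interval
    k = len(q)
    return [
        [(stay if a == b else 0.0) + (1.0 - stay) * q[b] for b in range(k)]
        for a in range(k)
    ]

q = [0.3, 0.7]
no_recombination = ancestry_transition(q, 1.0, 0.0)
moderate = ancestry_transition(q, 0.5, 2.0)
free_recombination = ancestry_transition(q, 1.0, 1e6)
```

With r = 0 the matrix is the identity, so an entire chromosome shares one label. As r grows, every row approaches q, so the labels become independent draws from q, as in the admixture model.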
Linkage Model
• As the recombination rate r goes to infinity, all loci become independent and the linkage model becomes the admixture model.
• The recombination rate r can be viewed as being related to the number of generations since admixture occurred.
• Use an MCMC algorithm to fit the unknown parameters.
F Model
• Introduce correlations in allele frequencies among ancestral populations
  – pAl: allele frequencies in the ancestral population, modeled with a symmetric Dirichlet distribution
  – Subpopulations of the ancestral population go through genetic drift at different rates Fk
  – Individuals are admixtures of those K populations, which went through genetic drift from the common ancestral population
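Drift away from the ancestral frequencies can be sketched by drawing each population's frequencies from a Dirichlet distribution centered on pAl. The parameterization pAl (1 - Fk) / Fk is an assumption taken from Falush et al. (2003), not something shown on the slide. The function name `sample_drifted_frequencies` is hypothetical, and the Dirichlet draw is built from normalized gamma variates.

```python
import random

def sample_drifted_frequencies(p_ancestral, f_k, rng):
    """Draw allele frequencies at one locus for population k.

    p_ancestral: ancestral allele frequencies at the locus (all > 0),
    f_k: drift parameter of population k, with 0 < f_k < 1.
    Samples from Dirichlet(p_ancestral * (1 - f_k) / f_k); very small
    parameters can underflow, so this is only meant for moderate values.
    """
    scale = (1.0 - f_k) / f_k
    draws = [rng.gammavariate(p * scale, 1.0) for p in p_ancestral]
    total = sum(draws)
    return [x / total for x in draws]

rng = random.Random(1)
ancestral = [0.2, 0.8]
little_drift = sample_drifted_frequencies(ancestral, 1e-4, rng)
more_drift = sample_drifted_frequencies(ancestral, 0.3, rng)
```

A small Fk keeps the population close to the ancestral frequencies, while a larger Fk allows the frequencies to wander further, which is how the model captures populations that diverged at different rates.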
F Model
• Relationship between Fk and FST
• Designed to distinguish between closely related populations with similar allele frequencies
Scenarios of How Populations Evolve
Unknown Parameters To Be Estimated
• qi: the admixture proportions of individual i
• pk: allele frequencies of population k
• zi: population label for each locus of individual i
• r: recombination rate
• Fk: estimate of population divergence from the ancestral population
Population Structure from Ancestry Proportion of Each Individual
• How to display population structure?
[Figure: Genetic Structure of Human Populations (Rosenberg et al., Science 2002). Ancestral proportion of each individual, grouped by region: Africa, Europe, Mid-East, Cent./S. Asia, East Asia, Oceania]
Population of Origin Assignments of a Single Individual
[Figure: true origin, estimated origin (unphased data), and estimated origin (phased data)]
Admixture vs Divergence
Posterior Distribution of Recombination Rate
• Using the original dataset
• After permuting the genotype loci
Distinguishing Between Two Closely Related Populations
Three Sources of Linkage Disequilibrium
• Mixture LD
  – Due to variation in ancestry across individuals, which induces correlation among markers at different loci
  – Modeled by the admixture model
• Admixture LD
  – Due to unbroken chunks of DNA derived from an ancestral population
  – Modeled by the linkage model
• Background LD
  – Due to LD within populations
  – Decays at a smaller scale
Low-dimensional Projections
• Genetic data is very large
  – Number of markers may range from a few hundred to hundreds of thousands
  – Thus each individual is described by a high-dimensional vector of marker configurations
  – A low-dimensional projection allows easy visualization
• Technique used
  – Factor analysis
  – Many statistical methods exist: ICA, PCA, NMF, etc.
  – Principal Components Analysis (next slide)
• Allows projection of individuals into a low-dimensional space
• Usually projected to 2 dimensions to allow visualization
Principal Component Analysis
• Most common form of factor analysis
• The new variables/dimensions ...
  – Are linear combinations of the original ones
  – Are uncorrelated with one another
    • Orthogonal in original dimension space
  – Capture as much of the original variance in the data as possible
  – Are called Principal Components
• Demo at http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html
What are the new axes?
[Figure: data plotted against Original Variable A and Original Variable B, with the PC 1 and PC 2 axes overlaid]
• Orthogonal directions of greatest variance in data
• Projections along PC1 discriminate the data most along any one axis
Principal Components
• First principal component is the direction of greatest variability (covariance) in the data
• Second is the next orthogonal (uncorrelated) direction of greatest variability
  – So first remove all the variability along the first component, and then find the next direction of greatest variability
• And so on ...
Dimensionality Reduction
Can ignore the components of lesser significance.
You do lose some information, but if the eigenvalues are small, you don't lose much
– n dimensions in original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– final data set has only p dimensions
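The steps above can be sketched in plain Python. To keep the example dependency-free it finds eigenvectors by power iteration and removes each component's variability before finding the next (deflation), rather than calling a linear algebra library; this is an illustrative choice, and all function names and the toy data are hypothetical.

```python
import math

def center_and_covariance(rows):
    """Center the data and return (centered rows, sample covariance matrix)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return centered, cov

def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix (power iteration)."""
    d = len(matrix)
    v = [float(j + 1) for j in range(d)]   # arbitrary start; fails only if orthogonal to the answer
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
    return sum(x * y for x, y in zip(v, mv)), v

def pca(rows, p):
    """Return (eigenvalues, components, projected data) for the first p components."""
    centered, cov = center_and_covariance(rows)
    d = len(cov)
    eigenvalues, components = [], []
    for _ in range(p):
        lam, v = top_eigenpair(cov)
        eigenvalues.append(lam)
        components.append(v)
        # Deflation: remove the variability along the component just found.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    projected = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
                 for c in centered]
    return eigenvalues, components, projected

# Toy data lying close to a line, so one component captures most of the variance.
rows = [[0.0, 0.0], [1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
eigenvalues, components, projected = pca(rows, 2)
_, _, reduced = pca(rows, 1)   # keep only the first component: 1 dimension per sample
```

The ratio of each eigenvalue to their sum gives the share of variance captured by that component, which is what guides the choice of p.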
PCA Analysis (Cavalli-Sforza, 1978)
• Plot of geographical distribution of 3 PCs (intensity proportional to value of each component)
  – First: blue
  – Second: green
  – Third: red
Matrix Factorization and Population Structure
• Matrix factorization for learning population structure
  – Genotype Data (NxP matrix) = Individuals' ancestry proportions (NxK matrix) x Subpopulation Allele Frequencies (KxP matrix)
  – N: number of samples
  – P: number of genotypes
  – K: number of subpopulations
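The factorization can be made concrete with a toy product of an NxK ancestry matrix and a KxP allele frequency matrix. The numbers and names below are hypothetical, and the result is interpreted here, as an assumption, as each individual's expected allele frequency at each marker.

```python
def matmul(a, b):
    """Multiply an NxK matrix by a KxP matrix (lists of lists)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# N = 3 individuals, K = 2 subpopulations; each row sums to 1.
ancestry = [
    [1.0, 0.0],   # entirely from subpopulation 0
    [0.5, 0.5],   # evenly admixed
    [0.0, 1.0],   # entirely from subpopulation 1
]
# K = 2 subpopulations, P = 3 markers; entries are allele frequencies.
allele_freqs = [
    [0.1, 0.9, 0.5],
    [0.7, 0.3, 0.5],
]
expected = matmul(ancestry, allele_freqs)   # N x P
```

Unadmixed individuals reproduce their subpopulation's row, and the admixed individual gets the average of the two rows. Inference runs the other way: given the NxP genotype data, estimate both factors.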
Unifying Framework of Matrix Factorization
• Admixture
  – Based on probability models: rows of Λ and columns of F should sum to 1.
  – Works well if the individuals are admixtures of discretely separated populations
• PCA
  – Based on eigendecomposition: columns of Λ are orthogonal, rows of F are orthonormal.
  – Works well for the case of isolation-by-distance (continuous variation of populations among individuals)
• Sparse factor model
  – Sparsity via automatic relevance determination prior
Discrete/Admixed Populations
[Figure: Loading 1, Loading 2, and Loading 3 for SFA, PCA, and Admixture]
Isolation-by-Distance Models
Clustered Populations in 1d Habitat
[Figure: results for SFA, Admixture, and PCA; panels labeled "Assume two populations" and "Assume five populations"]
Analysis of European Genotype Data
[Figure: panels for PCA, SFAm, and Admixture]
Comparison of Different Methods
PCA
• Advantages
  – Statistical tests for significance of results (Patterson et al. 2006)
  – Easy visualization
• Disadvantages
  – No intuition about underlying processes
Model-based Clustering
• Advantages
  – Generative process that explicitly models admixture
  – Clustering is probabilistic: it is possible to assign a confidence level to clusters
• Disadvantages
  – Computationally more demanding
  – Based on assumptions of evolutionary models
    • Structure: no models of mutation or recombination
    • Mutation added in mStruct
    • Recombination added in extension by Falush et al.