Lecture 2: Population Structure

02-715 Advanced Topics in Computational Genomics


What is population structure?

•  Population structure
    –  A set of individuals characterized by some measure of genetic distinction
    –  A “population” is usually characterized by a distinct distribution over genotypes
    –  Example: [Figure: distributions over the genotypes aa, aA, and AA in Population 1 and Population 2]


1000 Genomes Project


Motivation

•  Reconstructing individual ancestry: The Genographic Project
    –  https://genographic.nationalgeographic.com/genographic/index.html
•  Studying human migration
    –  Out of Africa
    –  Multi-regional hypothesis
•  Study of various traits
    –  Lactose intolerance
    –  Origins in Europe?
    –  Infer from
        •  Migration studies
        •  Mutation studies in populations


[Figure: human migration, with stages labeled 200,000 years ago, 50,000 years ago, 30,000 years ago, and 10,000 years ago. Source: https://genographic.nationalgeographic.com/genographic/index.html]


Overview

•  Background
    –  Hardy-Weinberg Equilibrium
    –  Genetic drift
    –  Wright’s FST
•  Inferring population structure from genotype data
    –  Structure (Falush et al., 2003)
    –  Matrix factorization/dimensionality reduction methods (Engelhardt & Stephens, 2010)


Hardy-Weinberg Equilibrium

•  Hardy-Weinberg Equilibrium
    –  Under random mating, both allele and genotype frequencies in a population remain constant over generations.
    –  Assumptions of the standard random-mating model
        •  Diploid organism
        •  Sexual reproduction
        •  Nonoverlapping generations
        •  Random mating
        •  Large population size
        •  Equal allele frequencies in the sexes
        •  No migration/mutation/selection
    –  Chi-square test for Hardy-Weinberg equilibrium



•  D, H, R: genotype frequencies for AA, Aa, aa, respectively.
•  p, q: allele frequencies of A and a



•  The genotype and allele frequencies of the offspring
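The slide's own table is not reproduced in the text, so the following is only a minimal sketch of the computation. It assumes the standard Hardy-Weinberg relations: p = D + H/2, and offspring genotype frequencies p², 2pq, q² for AA, Aa, aa. The function names are illustrative and not taken from the lecture.

```python
def allele_freqs(D, H, R):
    """Allele frequencies (p, q) of A and a from genotype frequencies of AA, Aa, aa."""
    p = D + H / 2.0          # every AA carries two A alleles, every Aa carries one
    return p, 1.0 - p


def offspring_genotype_freqs(p):
    """Genotype frequencies (AA, Aa, aa) after one round of random mating."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing observed genotype counts with HWE expectations.

    Assumes both alleles are present in the sample (otherwise an expected
    count is zero). For two alleles the statistic has 1 degree of freedom,
    so values above about 3.84 reject HWE at the 5% level.
    """
    n = n_AA + n_Aa + n_aa
    p, _ = allele_freqs(n_AA / n, n_Aa / n, n_aa / n)
    expected = [n * f for f in offspring_genotype_freqs(p)]
    observed = [n_AA, n_Aa, n_aa]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For example, parents with (D, H, R) = (0.5, 0.2, 0.3) have p = 0.6. Their offspring have genotype frequencies (0.36, 0.48, 0.16), which again give p = 0.36 + 0.48/2 = 0.6: the allele frequency is unchanged, and the genotype frequencies stay fixed from the next generation on.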


Genetic Drift

•  The change in allele frequencies in a population due to random sampling
•  Neutral process unlike natural selection
    –  But genetic drift can eliminate an allele from the given population.
•  The effect of genetic drift is larger in a small population
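One way to see random sampling at work is a small simulation. The sketch below assumes the Wright-Fisher model, which the slides do not name: each generation, all 2N allele copies are drawn independently using the previous generation's allele frequency. The function name and parameters are illustrative.

```python
import random


def wright_fisher(p0, n_individuals, n_generations, seed=0):
    """Simulate the frequency of allele A under pure drift (no selection or mutation).

    Returns the list of allele frequencies, starting with p0.
    """
    rng = random.Random(seed)
    n_alleles = 2 * n_individuals            # diploid population
    p = p0
    freqs = [p]
    for _ in range(n_generations):
        # Binomial sampling: each allele copy is A with probability p.
        count = sum(1 for _ in range(n_alleles) if rng.random() < p)
        p = count / n_alleles
        freqs.append(p)
    return freqs
```

Running this with a small `n_individuals` (say 10) and again with a large one (say 1000) shows much larger generation-to-generation jumps in the small population. Once the frequency reaches 0 or 1 it stays there, which is how drift eliminates an allele.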


Population Divergence

•  Wright’s FST
    –  Statistic used to quantify the extent of divergence among multiple populations relative to the overall genetic diversity
    –  Summarizes the average deviation of a collection of populations away from the mean
    –  FST = Var(pk) / (p’(1 - p’))
        •  p’: the overall frequency of an allele across all subpopulations
        •  pk: the allele frequency within population k
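The formula can be checked on a toy example. This sketch makes two assumptions that the slide leaves open: all subpopulations are weighted equally, and Var(pk) is the population variance (dividing by K rather than K - 1). The function name is illustrative.

```python
def fst(subpop_freqs):
    """Wright's FST for one allele from its frequency in each subpopulation.

    Assumes equal-sized subpopulations and that the allele is neither
    absent from nor fixed in the pooled population (0 < p' < 1).
    """
    k = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / k                              # p': overall frequency
    var = sum((p - p_bar) ** 2 for p in subpop_freqs) / k      # Var(pk)
    return var / (p_bar * (1.0 - p_bar))
```

With frequencies 0.2 and 0.8 in two populations, p’ = 0.5 and Var(pk) = 0.09, so FST = 0.09 / 0.25 = 0.36. Identical frequencies give FST = 0, and an allele fixed in one population and absent from the other gives FST = 1.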


Scenarios of How Populations Evolve


Methods for Learning Population Structure from Genetic Markers

•  Low-dimensional projection
    –  PCA-based methods (Patterson et al., PLoS Genetics 2006)
•  Clustering
    –  Distance-based (Bowcock et al., Nature 1994)
    –  Model-based
        •  STRUCTURE (Pritchard et al., Genetics 2000)
        •  mStruct (Shringarpure & Xing, Genetics 2008)


Probabilistic Models for Population Structure

•  Mixture model
    –  Clusters individuals into K populations
•  Admixture model
    –  The genotypes of each individual are an admixture of multiple ancestral populations
    –  Assumes alleles are in linkage equilibrium
•  Linkage model
    –  Models recombination, i.e. correlation between alleles along the chromosome
•  F model
    –  Models correlation in allele frequencies among populations due to shared ancestry


Mixture Model

•  K populations
•  z(i): population of origin of individual i
•  For each of the K populations
    –  pklj: the frequency of allele j at locus l in population k
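Given the allele frequencies pklj, the population of origin z(i) can be scored for each individual. The sketch below makes assumptions not spelled out on this slide: a uniform prior over the K populations, Hardy-Weinberg equilibrium within each population, and independent loci, so that the likelihood is a product of allele frequencies. The data layout (`p[k][l][j]`, genotype as tuples of allele indices) is illustrative.

```python
import math


def population_posterior(genotype, p):
    """Posterior probability of z(i) = k for one individual.

    genotype[l] is a tuple of allele indices observed at locus l
    (two entries for a diploid individual).
    p[k][l][j] is the frequency of allele j at locus l in population k,
    assumed strictly positive for every observed allele.
    """
    log_liks = []
    for p_k in p:
        ll = 0.0
        for l, alleles in enumerate(genotype):
            for j in alleles:
                ll += math.log(p_k[l][j])
        log_liks.append(ll)
    # Normalize in log space for numerical stability (uniform prior cancels).
    m = max(log_liks)
    weights = [math.exp(ll - m) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]
```

In a full analysis the allele frequencies are unknown as well, so z(i) and pklj would be estimated jointly rather than with p held fixed as here.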


Admixture Model

•  Relax the mixture model's assumption of a single ancestral population per individual
•  Individuals can have ancestors in multiple different populations
•  qk(i): proportion of individual i’s genome derived from population k
•  Alleles at different loci can come from different populations
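A generative reading of these bullets: for every allele copy, first draw its population of origin from qk(i), then draw the allele from that population's frequencies. Marginally, the probability of allele j at locus l is the sum over k of qk(i) · pklj. The sketch below reuses the pklj notation from the mixture model; the function names and the `p[k][l][j]` layout are illustrative assumptions.

```python
import random


def allele_probability(q_i, p, l, j):
    """Probability that an allele copy of individual i at locus l is allele j."""
    return sum(q_k * p_k[l][j] for q_k, p_k in zip(q_i, p))


def sample_genotype(q_i, p, rng, ploidy=2):
    """Draw one genotype under the admixture model (loci treated as independent)."""
    genotype = []
    for l in range(len(p[0])):
        copies = []
        for _ in range(ploidy):
            # Population of origin for this allele copy.
            k = rng.choices(range(len(q_i)), weights=q_i)[0]
            # Allele drawn from that population's frequencies at locus l.
            j = rng.choices(range(len(p[k][l])), weights=p[k][l])[0]
            copies.append(j)
        genotype.append(tuple(copies))
    return genotype
```

For instance, with q(i) = (0.25, 0.75) and allele 0 at frequency 0.9 in population 1 and 0.1 in population 2, the probability of seeing allele 0 is 0.25 · 0.9 + 0.75 · 0.1 = 0.3.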


Structure Model

•  Hypothesis: Modern populations are created by an intermixing of ancestral populations.
•  An individual’s genome contains contributions from one or more ancestral populations.
•  The contributions of populations can be different for different individuals.
•  Other assumptions
    –  Hardy-Weinberg equilibrium
    –  No linkage disequilibrium
    –  Markers are i.i.d. (independent and identically distributed)


Linkage Model

•  From admixture model, replace the assumption that the ancestry labels zil for individual i, locus l are independent with the assumption that adjacent zil are correlated.
•  Use Poisson process to model the correlation between neighboring alleles
    –  dl: distance between locus l and locus l+1
    –  r: recombination rate
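The slide text does not spell out the transition probabilities. The sketch below assumes the form used by Falush et al. (2003): with probability exp(-dl · r) no breakpoint falls between the two loci and the ancestry label is copied, and otherwise a new label is drawn from the individual's admixture proportions q. The function name is illustrative.

```python
import math


def ancestry_transition(q_i, d_l, r):
    """Matrix T with T[k][k2] = Pr(z at locus l+1 is k2 | z at locus l is k).

    q_i: admixture proportions of the individual (sums to 1)
    d_l: distance between locus l and locus l+1
    r:   recombination rate
    """
    stay = math.exp(-d_l * r)     # probability of no breakpoint in the interval
    n = len(q_i)
    return [
        [(stay if k == k2 else 0.0) + (1.0 - stay) * q_i[k2] for k2 in range(n)]
        for k in range(n)
    ]
```

With r = 0 the matrix is the identity, so the whole chromosome shares one label. As r grows, every row approaches q, which means adjacent labels become independent draws from the admixture proportions.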


Linkage Model

•  As the recombination rate r goes to infinity, all loci become independent and the linkage model becomes the admixture model.
•  The recombination rate r can be viewed as being related to the number of generations since admixture occurred.
•  Use an MCMC algorithm to fit the unknown parameters.


F Model

•  Introduce correlations in allele frequencies among ancestral populations
    –  pAl: allele frequencies in the common ancestral population, modeled with a symmetric Dirichlet distribution
    –  Subpopulations of the ancestral population go through genetic drift at different rates Fk
    –  Individuals are admixtures of those K populations, which went through genetic drift from the common ancestral population
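To make the role of Fk concrete, the sketch below assumes the parameterization of Falush et al. (2003), which this slide does not state: the allele frequencies of population k at a locus are drawn from a Dirichlet distribution with parameters pAl · (1 - Fk)/Fk. A small Fk keeps the population close to the ancestral frequencies, and a large Fk lets it drift far away. The function name is illustrative.

```python
import random


def sample_drifted_freqs(p_A, F_k, rng):
    """Draw allele frequencies for population k at one locus.

    p_A: ancestral allele frequencies at the locus (all > 0, summing to 1)
    F_k: drift parameter of population k, with 0 < F_k < 1
    """
    scale = (1.0 - F_k) / F_k
    # A Dirichlet draw is a vector of independent Gamma draws, normalized.
    draws = [rng.gammavariate(p_Aj * scale, 1.0) for p_Aj in p_A]
    total = sum(draws)
    return [g / total for g in draws]
```

Averaged over many draws, the sampled frequencies center on pAl, so drift in this model adds spread around the ancestral frequencies without a systematic shift.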


F Model

•  Relationship between Fk and FST
•  Designed to distinguish between closely related populations with similar allele frequencies


Scenarios of How Populations Evolve


Unknown Parameters To Be Estimated

•  qi: the admixture proportions of individual i
•  pk: allele frequencies of population k
•  zi: population label for each locus of individual i
•  r: recombination rate
•  Fk: estimate of population divergence from the ancestral population


Population Structure from Ancestry Proportion of Each Individual

•  How to display population structure?

[Figure: Genetic Structure of Human Populations (Rosenberg et al., Science 2002). Ancestral proportion of each individual, grouped by region: Africa, Europe, Mid-East, Cent./S. Asia, East Asia, Oceania]


Population of Origin Assignments of a Single Individual

[Figure panels: true origin; estimated origin (unphased data); estimated origin (phased data)]


Admixture vs Divergence


Posterior Distribution of Recombination Rate

•  Using the original dataset

•  After permuting the genotype loci


Distinguishing Between Two Closely Related Populations


Three Sources of Linkage Disequilibrium

•  Mixture LD
    –  Due to variation in ancestry across individuals that induces correlation among markers at different loci
    –  Modeled by admixture model
•  Admixture LD
    –  Due to unbroken chunks of DNA derived from an ancestral population.
    –  Modeled by linkage model
•  Background LD
    –  Due to LD within populations
    –  Decays at smaller scale


Low-dimensional Projections

•  Genetic data is very large
    –  Number of markers may range from a few hundred to hundreds of thousands
    –  Thus each individual is described by a high-dimensional vector of marker configurations
    –  A low-dimensional projection allows easy visualization
•  Technique used
    –  Factor analysis
    –  Many statistical methods exist: ICA, PCA, NMF etc.
    –  Principal Components Analysis (next slide)
•  Allows projection of individuals into a low-dimensional space
•  Usually projected to 2 dimensions to allow visualization


Principal Component Analysis

•  Most common form of factor analysis

•  The new variables/dimensions ...
    –  Are linear combinations of the original ones
    –  Are uncorrelated with one another
        •  Orthogonal in original dimension space
    –  Capture as much of the original variance in the data as possible
    –  Are called Principal Components
•  Demo at http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html


What are the new axes?

[Figure: scatter plot with axes Original Variable A and Original Variable B, with the directions PC 1 and PC 2 overlaid]

•  Orthogonal directions of greatest variance in data
•  Projections along PC1 discriminate the data most along any one axis


Principal Components

•  First principal component is the direction of greatest variability (covariance) in the data
•  Second is the next orthogonal (uncorrelated) direction of greatest variability
    –  So first remove all the variability along the first component, and then find the next direction of greatest variability
•  And so on …
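The "find a direction, remove its variability, repeat" description maps directly onto power iteration with deflation. The sketch below uses that technique because it mirrors the bullets; it is an illustrative choice, not necessarily what PCA software does internally (an eigendecomposition or SVD is the usual route). All function names are illustrative.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of data given as a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_eigenpair(mat, n_iter=1000):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    d = len(mat)
    v = [float(i + 1) for i in range(d)]          # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(mat[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                            # no variability left
            break
        v = [x / norm for x in w]
    eigval = sum(v[a] * sum(mat[a][b] * v[b] for b in range(d)) for a in range(d))
    return eigval, v


def principal_components(rows, n_components):
    """Return [(variance, direction), ...] for the leading principal components."""
    mat = covariance_matrix(rows)
    d = len(mat)
    components = []
    for _ in range(n_components):
        eigval, v = top_eigenpair(mat)
        components.append((eigval, v))
        # Deflation: remove the variability along the component just found.
        mat = [[mat[a][b] - eigval * v[a] * v[b] for b in range(d)] for a in range(d)]
    return components
```

Each returned direction is a unit vector, and its eigenvalue is the variance of the data projected onto it. The sign of a direction is arbitrary.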


Dimensionality Reduction

Can ignore the components of lesser significance.

You do lose some information, but if the eigenvalues are small, you don’t lose much

–  n dimensions in original data
–  calculate n eigenvectors and eigenvalues
–  choose only the first p eigenvectors, based on their eigenvalues
–  final data set has only p dimensions
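The slide does not say how to choose p from the eigenvalues. One common heuristic, shown here purely as an assumption, is to keep the smallest number of components whose eigenvalues add up to a target fraction of the total variance. The function name and the default threshold are illustrative.

```python
def n_components_for_variance(eigenvalues, target=0.95):
    """Smallest p such that the p largest eigenvalues reach `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)
```

With eigenvalues 5, 3, 1, 1, the first two components already carry 8/10 of the variance, so a 75% target keeps p = 2, while a 95% target needs all four.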


PCA Analysis (Cavalli-Sforza, 1978)

•  Plot of geographical distribution of 3 PCs (intensity proportional to value of each component)
    –  First: blue
    –  Second: green
    –  Third: red


Matrix Factorization and Population Structure

•  Matrix factorization for learning population structure
    –  Genotype data (N×P matrix) = individuals’ ancestry proportions (N×K matrix) × subpopulation allele frequencies (K×P matrix)
    –  N: number of samples
    –  P: number of genotypes (markers)
    –  K: number of subpopulations
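A toy instance of the factorization, with made-up numbers: three individuals (N = 3), two subpopulations (K = 2), and three markers (P = 3). The product of the two factors gives the expected allele frequency for each individual at each marker. An observed genotype matrix would only match it approximately and up to the coding of genotypes, which is an assumption of this sketch rather than something stated on the slide.

```python
def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical ancestry proportions (N x K): each row sums to 1.
ancestry = [
    [1.0, 0.0],   # individual entirely from subpopulation 1
    [0.0, 1.0],   # individual entirely from subpopulation 2
    [0.5, 0.5],   # evenly admixed individual
]

# Hypothetical subpopulation allele frequencies (K x P).
allele_freqs = [
    [0.9, 0.2, 0.5],
    [0.1, 0.8, 0.5],
]

# Expected allele frequency per individual and marker (N x P).
expected = matmul(ancestry, allele_freqs)
```

The unadmixed individuals reproduce the rows of `allele_freqs`, while the admixed individual gets the average of the two rows. Fitting the model runs this in reverse: given the N×P data, find the two factors.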


Unifying Framework of Matrix Factorization

•  Admixture
    –  Based on probability models: rows of Λ and columns of F should sum to 1.
    –  Works well if the individuals are admixtures of discretely separated populations
•  PCA
    –  Based on eigendecomposition: columns of Λ are orthogonal, rows of F are orthonormal.
    –  Works well for the case of isolation-by-distance (continuous variation of populations among individuals)
•  Sparse factor model
    –  Sparsity via automatic relevance determination prior


Discrete/Admixed Populations

[Figure: Loading 1, Loading 2, and Loading 3 as estimated by SFA, PCA, and Admixture]


Isolation-by-Distance Models


Clustered Populations in 1d Habitat

[Figure: results from SFA, Admixture, and PCA, with panels labeled “Assume two populations” and “Assume five populations”]


Analysis of European Genotype Data

[Figure panels: PCA, SFAm, Admixture]

Comparison of Different Methods

Advantages
•  PCA
    –  Statistical tests for significance of results (Patterson et al. 2006)
    –  Easy visualization
•  Model-based clustering
    –  Generative process that explicitly models admixture
    –  Clustering is probabilistic: it is possible to assign a confidence level to clusters

Disadvantages
•  PCA
    –  No intuition about underlying processes
•  Model-based clustering
    –  Computationally more demanding
    –  Based on assumptions of evolutionary models:
        •  Structure: No models of mutation, recombination
        •  Mutation added in mStruct
        •  Recombination added in extension by Falush et al.

