Statistical Tools for Predicting Ancestry from Genetic Datafaculty.washington.edu/tathornt/IPC2015/IPC... · Single base pair mutations create Single Nucleotide Polymorphisms (SNPs)

Statistical Tools for Predicting Ancestry from GeneticData

Timothy Thornton

Department of BiostatisticsUniversity of Washington

March 1, 2015

1 / 33

Basic Genetic TerminologyA gene is the most fundamental unit of heredity that controls thetransmission and expression of one or more traits.Genes are sequences of DNA (deoxyribonucleic acid) and can befound on chromosomes.The four bases of DNA are adenine, guanine, cytosine, and thymine(A, G, C, and T). We can think of DNA as a sequence of these fourletters....TATACGTGCACCTGGATTACAGATTACAGATTACAGATTACATTGCATCGATCGAA...

2 / 33

Basic Genetic Terminology

Chromosomes are long strands of DNA. In humans, each celltypically contains 23 pairs of chromosomes. For any chromosome pair,one chromosome is inherited from the father of the individual and theother chromosome is inherited from the mother of the individual.

3 / 33

Genetic Mutations

DNA strands are formed by sequences of base-pairs

99.9% of the 3 billion base pairs in the human genome are identicalfor all humans

0.1% differs because of mutations/deletions/insertions

Most common mutation: a single base pair

4 / 33

Genetic Mutations

DNA strands are formed by sequences of base-pairs

99.9% of the 3 billion base pairs in the human genome are identicalfor all humans

0.1% differs because of mutations/deletions/insertions

Most common mutation: a single base pair

5 / 33

Single Base Pair Mutations

Single base pair mutations create Single NucleotidePolymorphisms (SNPs)

The two variations of a SNP are called the alleles of the SNPI Original base pair - Major allele (d)I Mutated base pair - Minor allele (D)

For a chromosome pair, the two alleles at a SNP are called anindividual’s genotype, where one allele is passed down from themother and the other allele from the father.

The are generally three possible genotypes for a SNP in a population:DD, Dd , or dd

6 / 33

Single Nucleotide Polymorphisms (SNPS)

SNPs occur randomly throughout the DNA

More than 20 Million SNPs in the human genome have been mapped

Most SNPs have no effect (Non-functional SNPs)

A little less than 1% are believed to damage the function of a proteinor regulatory element (Functional SNPs)

7 / 33

Introduction

Genetic association studies are widely used for the identification ofgenes that influence human traits.

To date, hundreds of thousands of individuals have been included ingenome-wide association studies (GWAS) for the mapping of complextraits.

Large-scale genomic studies often have high-dimensional dataconsisting of

I Tens of thousands of individualsI Genotypes data on a million (or more!) SNPs for all individuals in the

studyI Phenotype or Trait values of interest such as Height, BMI, HDL

cholesterol, blood pressure, diabetes, etc.

The vast majority of these studies, however, have been conducted inpopulations of European ancestry

Non-European populations have largely been underrepresented ingenetic studies, despite often bearing a disproportionately high burdenfor some common diseases.

8 / 33

Introduction

Several recent and ongoing genetic studies have focused on recentlyadmixed populations: populations who have experienced admixingamong continentally divided ancestral populations within the past 200to 500 hundred years.

Admixed populations have largely arisen as a consequence ofhistorical events such as the transatlantic slave trade, the colonizationof the Americas and other long-distance migrations.

Examples of admixed populations includeI African Americans and Hispanics in the U.SI Latinos from throughout Latin AmericaI Uyghur population of Central AsiaI Cape VerdeansI South African ”Coloured” population

9 / 33

Ancestry Admixture

Ancestral�Pop. B

Ancestral�Pop. A

Today

The chromosomes of an admixed individual represent a mosaic ofchromosomal blocks from the ancestral populations.

Can be substantial genetic heterogeneity in admixed populations

10 / 33

Association Mapping in Admixed Populations

Genetic association studies in recently admixed populations offerexciting opportunities for the identification of variants underlyingphenotypic diversity in populations.

Two ongoing genetic studies focusing on recently admixedpopulations:

I Women’s Health Initiative: Minority cohort of 12,008 AfricanAmericans and Hispanics

I The Hispanic Community Heath Study / Study of Latinos(HCHS/SoL) is, to date, the largest genetic study of Hispanics/Latinoswith more than 16,000 participants

11 / 33

Opportunities and Challenges

Genetic association studies in recently admixed populations offerexciting opportunities for the identification of novel variantsunderlying phenotypic diversity in populations.

There are several challenges with genetic association mapping insamples from admixed populations

Heterogeneous genomes of individuals from admixed populations canbe a confounder leading to spurious association

Big Data! More than 1 million SNPs directly assayed (and with 20million additional SNPs imputed to 1000 Genomes)

12 / 33

Confounding due to Ancestry

Ethnicgroups (and subgroups) often share distinct dietary habits andother lifestyle characteristics that leads to many traits of interestbeing correlated with ancestry and/or ethnicity.

13 / 33

Case-Control Association TestingBelow is a simple example to illustrate association testing at a geneticmarker with two allelic types labeled A and aStatistics for identifying an association could compare allelefrequencies (A or a) between the cases (affected individuals) and thecontrols (unaffected individuals).Genotype frequencies (AA, Aa, or aa) could also be comparedbetween the two groups.

Cases

AA Aa AA

AA AA Aa

Controls

Aa aa Aa

Aa Aa aa

14 / 33

Spurious Association

Case/Control association testI Comparison of allele frequency between cases and controls.

Consider a sample from 2 populations:

I Red population overrepresented among cases in the sample.I Genetic markers that are not influencing the disease but with

significant differences in allele frequencies between the populations=⇒ spurious association between disease and genetic marker

15 / 33

Spurious Association

Quantitative trait association testI Test for association between genotype and trait value

Consider sampling from 2 populations:

Histogram of Trait Values

Population 1Population 2

I Blue population has higher trait values.I Different allele frequency in each population

=⇒ spurious association between trait and genetic marker for samplescontaining individuals from both populations

16 / 33

Identifying Genetic Ancestry Differences:Unsupervised Learning

Suppose a genetic association study consists of a sample of Nindividuals

Assume that genotype data is available at S SNPs in agenome-screen, where S can be very large (e.g. hundreds ofthousands).

For SNP s define Gs = (G s1 , . . .G

sn )T is n×1 vector of the genotypes,

where G si = 0, 12 , or 1, according to whether individual i has,

respectively, 0, 1 or 2 copies of the reference allele at SNP s.

We define Z to be an N×S standardized matrix with (i ,s)-th entry

Zis =G si − p̂s√

p̂s(1− p̂s)

and p̂s will typically be an allele frequency estimate for SNP scalculated using all individuals in the sample.

17 / 33


Can obtain a genetic similarity matrix (GSM) Ψ̂ where

Ψ̂ =1

SZZT

The (i , j)-th entry of Ψ̂ is a measure of the average genetic similarityfor individuals i and j in the sample.

Ψ̂ij =1

S

S

∑s=1

(G si − p̂s)(G s

j − p̂s)

p̂s(1− p̂s)

Principal Components Analysis (PCA) is a dimension reduction thatcan be applied to GSM to identify ancestry differences among sampleindividuals

PCA is performed by obtaining the eigendecomposition of the GSMΨ̂.

18 / 33


Orthogonal axes of variation, i.e. linear combinations of SNPs, thatbest explain the genotypic variability amongst the n sampleindividuals are identified.

For the eigendecomposition we have Ψ̂ = VDVT , whereV = [V1,V2, . . . ,Vn] is an n×n matrix with orthogonal columnvectors, and D corresponding to a diagonal matrix of the length neigenvalue vector Λ = (λ1,λ2, . . . ,λn)T

The eigenvalues are in decreasing order, λ1 > λ2 > .. . > λn. The d th

principal component (eigenvector) corresponds to eigenvalue λd ,where λd is proportional to the percentage of variability in thegenome-screen data that is explained by Vd .

19 / 33

PCA of Europeans

The top principal components are viewed as continuous axes ofvariation that reflect genetic variation due to ancestry in the sample.

Individuals with similar values for a particular top principal componentwill have similar ancestry for that axes.

A application of principal components to genetic data from Europeansamples (Novembre et al., Nature 2008) showed that amongEuropeans for whom all four grandparents originated in the samecountry, the first two principal components computed using 200,000SNPs could map their country of origin quite accurately in the plane

20 / 33

PCA of Europeans

21 / 33

Ancestry Inference with PCA for HCHS/SoL

Caribbean

Central/South American

African

European

East Asian

Central/South American

1

22 / 33

Ancestry Inference with PCA for HCHS/SoL

−0.02 −0.01 0.00 0.01 0.02

−0.

04−

0.03

−0.

02−

0.01

0.00

0.01

0.02

PC1

PC

2

−0.01 0.00 0.01 0.02 0.03

−0.

03−

0.02

−0.

010.

000.

010.

020.

03

PC3

PC

4

−0.04 −0.02 0.00 0.02 0.04 0.06

−0.

25−

0.20

−0.

15−

0.10

−0.

050.

00

PC5

PC

6

●

●

●

●

●

●

●

●

DominicanCentralAmCubanMexicanPuertoRicanSouthAmMulti/OtherUnknown

−0.04 −0.02 0.00 0.02 0.04

−0.

04−

0.02

0.00

0.02

0.04

PC7

PC

8

23 / 33

Supervised Learning for Ancestry Admixture

The chromosomes of an admixed individual represent a mosaic ofchromosomal blocks from the ancestral populations.

24 / 33

Supervised Learning for Ancestry Admixture

Methods have recently been developed for supervised learning ofancestry proportions for an admixed individuals using high-densitySNP data.

Most use either a hidden Markov model (HMM) or anExpectation-Maximization (EM) algorithm to infer ancestry

Example: Suppose we are interested in identifying the ancestryproportions for an admixed individual

Observed sequence on a chromosome for an admixed individual:...TATACGTGCACCTGGATTACAGATTACAGATTACAGATTACATTGCATCGATCGAA...

Observed sequence on a chromosome for samples selected from a”homogenous” reference population:...TGATCCTGAACCTAGATTACAGATTACAGATTACAGATTACAATGCTTCGATGGAC......AGATCCTGAACCTAGATTACAGATTACAGATTACAGATACCAATGCTTCGATGGAC......CGATCCTGAACCTAGATTACAGATTACAGATTTGCGTATACAATGCTTCGATGGAC...

25 / 33

Women’s Health Initiative

The Womens Health Initiative (WHI) is a national health studyfocusing on strategies for preventing chronic diseases inpostmenopausal women.

A total of 161,808 women aged 50-79 yrs. old were recruited from 40clinical centers in the US between 1993 and 1998.

The WHI SNP Health Association Resource (SHARe) minority cohortincludes 8421 self-identified African American women from and 3587self-identified Hispanic women

Genotype data for 1 million SNPs from across the genome

26 / 33

Supervised Ancestry Estimation:WHI-SHARe

Genome-screen data was used to estimate the admixture proportions

Estimated genome-wide ancestry proportions of every individual usingFRAPPE (Tang et al., 2005) software

Used genotype data from available reference panels for three”ancestral” populations

I HapMap YRI for West African ancestryI HapMap CEU samples for northern and western European ancestryI HGDP Native American samples for Native American ancestry.

27 / 33

Supervised Ancestry InferenceAncestry of WHI African Americans

Individual

Anc

estr

y

0.0

0.2

0.4

0.6

0.8

1.0

EuropeanAfricanNative American

28 / 33

Table: Average Estimated Ancestry Proportions for Self-reported AfricanAmericans

Estimated Ancestry Proportions (SD)Population European African Native American

African Americans .209 (.153) .767 (.155) .024 (.023)

29 / 33

WHI Hispanic Subpopulations

Of the total cohort of WHI Hispanics, 3,036 (or 84.6%) reportedspecific nationality of origin in a follow-up questionnaire (Form 41).

I 395 labeled themselves as Puerto RicanI 1,548 as Mexican or ChicanoI 258 as CubanI The remaining 835 as Latino from a category other than those three.

30 / 33

WHI−SHARE Ancestry Admixture: Puerto Ricans

Puerto Rican Individuals

Cum

ulat

ive

Anc

estr

y

0.0

0.3

0.6

0.9

WHI−SHARE Ancestry Admixture: Mexicans

Mexican Individuals

Cum

ulat

ive

Anc

estr

y

0.0

0.3

0.6

0.9

WHI−SHARE Ancestry Admixture: Cubans

Cuban Individuals

Cum

ulat

ive

Anc

estr

y

0.0

0.3

0.6

0.9

31 / 33

Table: Average Estimated Ancestry Proportions within Self-reported EthnicGroups

Estimated Ancestry Proportions (SD)Population European African Native American

Puerto Ricans .648 (.216) .215 (.208) .136 (.078)Mexicans .547 (.160) .030 (.058) .423 (.158)Cubans .840 (.197) .098 (.179) .062 (.099)

32 / 33

Acknowledgement of Research Funding

Research was supported in part by the National Institutes of Health grants:

NCI K01CA148958

NIGMS P01GM099568

NHLBI HHSN268201300005

33 / 33

Contact Information

Please feel free to contact me:

Timothy Thornton, PhD

email: [email protected]

phone: 206-543-8004

34 / 33

Using R for Ancestry Inference

We now will use R to perform PCA on real genetic data for ancestryinference

Access the short course website for the data files, exercises, and Rscript:

http://faculty.washington.edu/tathornt/IPC2015.html

35 / 33

Documents

Statistical Tools for Predicting Ancestry from Genetic Datafaculty.washington.edu/tathornt/IPC2015/IPC... · Single base pair mutations create Single Nucleotide Polymorphisms (SNPs)