Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Statistical Tools for Predicting Ancestry from GeneticData
Timothy Thornton
Department of BiostatisticsUniversity of Washington
March 1, 2015
1 / 33
Basic Genetic TerminologyA gene is the most fundamental unit of heredity that controls thetransmission and expression of one or more traits.Genes are sequences of DNA (deoxyribonucleic acid) and can befound on chromosomes.The four bases of DNA are adenine, guanine, cytosine, and thymine(A, G, C, and T). We can think of DNA as a sequence of these fourletters....TATACGTGCACCTGGATTACAGATTACAGATTACAGATTACATTGCATCGATCGAA...
2 / 33
Basic Genetic Terminology
Chromosomes are long strands of DNA. In humans, each celltypically contains 23 pairs of chromosomes. For any chromosome pair,one chromosome is inherited from the father of the individual and theother chromosome is inherited from the mother of the individual.
3 / 33
Genetic Mutations
DNA strands are formed by sequences of base-pairs
99.9% of the 3 billion base pairs in the human genome are identicalfor all humans
0.1% differs because of mutations/deletions/insertions
Most common mutation: a single base pair
4 / 33
Genetic Mutations
DNA strands are formed by sequences of base-pairs
99.9% of the 3 billion base pairs in the human genome are identicalfor all humans
0.1% differs because of mutations/deletions/insertions
Most common mutation: a single base pair
5 / 33
Single Base Pair Mutations
Single base pair mutations create Single NucleotidePolymorphisms (SNPs)
The two variations of a SNP are called the alleles of the SNPI Original base pair - Major allele (d)I Mutated base pair - Minor allele (D)
For a chromosome pair, the two alleles at a SNP are called anindividual’s genotype, where one allele is passed down from themother and the other allele from the father.
The are generally three possible genotypes for a SNP in a population:DD, Dd , or dd
6 / 33
Single Nucleotide Polymorphisms (SNPS)
SNPs occur randomly throughout the DNA
More than 20 Million SNPs in the human genome have been mapped
Most SNPs have no effect (Non-functional SNPs)
A little less than 1% are believed to damage the function of a proteinor regulatory element (Functional SNPs)
7 / 33
Introduction
Genetic association studies are widely used for the identification ofgenes that influence human traits.
To date, hundreds of thousands of individuals have been included ingenome-wide association studies (GWAS) for the mapping of complextraits.
Large-scale genomic studies often have high-dimensional dataconsisting of
I Tens of thousands of individualsI Genotypes data on a million (or more!) SNPs for all individuals in the
studyI Phenotype or Trait values of interest such as Height, BMI, HDL
cholesterol, blood pressure, diabetes, etc.
The vast majority of these studies, however, have been conducted inpopulations of European ancestry
Non-European populations have largely been underrepresented ingenetic studies, despite often bearing a disproportionately high burdenfor some common diseases.
8 / 33
Introduction
Several recent and ongoing genetic studies have focused on recentlyadmixed populations: populations who have experienced admixingamong continentally divided ancestral populations within the past 200to 500 hundred years.
Admixed populations have largely arisen as a consequence ofhistorical events such as the transatlantic slave trade, the colonizationof the Americas and other long-distance migrations.
Examples of admixed populations includeI African Americans and Hispanics in the U.SI Latinos from throughout Latin AmericaI Uyghur population of Central AsiaI Cape VerdeansI South African ”Coloured” population
9 / 33
Ancestry Admixture
Ancestral�Pop. B
Ancestral�Pop. A
Today
The chromosomes of an admixed individual represent a mosaic ofchromosomal blocks from the ancestral populations.
Can be substantial genetic heterogeneity in admixed populations
10 / 33
Association Mapping in Admixed Populations
Genetic association studies in recently admixed populations offerexciting opportunities for the identification of variants underlyingphenotypic diversity in populations.
Two ongoing genetic studies focusing on recently admixedpopulations:
I Women’s Health Initiative: Minority cohort of 12,008 AfricanAmericans and Hispanics
I The Hispanic Community Heath Study / Study of Latinos(HCHS/SoL) is, to date, the largest genetic study of Hispanics/Latinoswith more than 16,000 participants
11 / 33
Opportunities and Challenges
Genetic association studies in recently admixed populations offerexciting opportunities for the identification of novel variantsunderlying phenotypic diversity in populations.
There are several challenges with genetic association mapping insamples from admixed populations
Heterogeneous genomes of individuals from admixed populations canbe a confounder leading to spurious association
Big Data! More than 1 million SNPs directly assayed (and with 20million additional SNPs imputed to 1000 Genomes)
12 / 33
Confounding due to Ancestry
Ethnicgroups (and subgroups) often share distinct dietary habits andother lifestyle characteristics that leads to many traits of interestbeing correlated with ancestry and/or ethnicity.
13 / 33
Case-Control Association TestingBelow is a simple example to illustrate association testing at a geneticmarker with two allelic types labeled A and aStatistics for identifying an association could compare allelefrequencies (A or a) between the cases (affected individuals) and thecontrols (unaffected individuals).Genotype frequencies (AA, Aa, or aa) could also be comparedbetween the two groups.
Cases
AA Aa AA
AA AA Aa
Controls
Aa aa Aa
Aa Aa aa
14 / 33
Spurious Association
Case/Control association testI Comparison of allele frequency between cases and controls.
Consider a sample from 2 populations:
I Red population overrepresented among cases in the sample.I Genetic markers that are not influencing the disease but with
significant differences in allele frequencies between the populations=⇒ spurious association between disease and genetic marker
15 / 33
Spurious Association
Quantitative trait association testI Test for association between genotype and trait value
Consider sampling from 2 populations:
Histogram of Trait Values
Population 1Population 2
I Blue population has higher trait values.I Different allele frequency in each population
=⇒ spurious association between trait and genetic marker for samplescontaining individuals from both populations
16 / 33
Identifying Genetic Ancestry Differences:Unsupervised Learning
Suppose a genetic association study consists of a sample of Nindividuals
Assume that genotype data is available at S SNPs in agenome-screen, where S can be very large (e.g. hundreds ofthousands).
For SNP s define Gs = (G s1 , . . .G
sn )T is n×1 vector of the genotypes,
where G si = 0, 12 , or 1, according to whether individual i has,
respectively, 0, 1 or 2 copies of the reference allele at SNP s.
We define Z to be an N×S standardized matrix with (i ,s)-th entry
Zis =G si − p̂s√
p̂s(1− p̂s)
and p̂s will typically be an allele frequency estimate for SNP scalculated using all individuals in the sample.
17 / 33
Identifying Genetic Ancestry Differences:Unsupervised Learning
Can obtain a genetic similarity matrix (GSM) Ψ̂ where
Ψ̂ =1
SZZT
The (i , j)-th entry of Ψ̂ is a measure of the average genetic similarityfor individuals i and j in the sample.
Ψ̂ij =1
S
S
∑s=1
(G si − p̂s)(G s
j − p̂s)
p̂s(1− p̂s)
Principal Components Analysis (PCA) is a dimension reduction thatcan be applied to GSM to identify ancestry differences among sampleindividuals
PCA is performed by obtaining the eigendecomposition of the GSMΨ̂.
18 / 33
Identifying Genetic Ancestry Differences:Unsupervised Learning
Orthogonal axes of variation, i.e. linear combinations of SNPs, thatbest explain the genotypic variability amongst the n sampleindividuals are identified.
For the eigendecomposition we have Ψ̂ = VDVT , whereV = [V1,V2, . . . ,Vn] is an n×n matrix with orthogonal columnvectors, and D corresponding to a diagonal matrix of the length neigenvalue vector Λ = (λ1,λ2, . . . ,λn)T
The eigenvalues are in decreasing order, λ1 > λ2 > .. . > λn. The d th
principal component (eigenvector) corresponds to eigenvalue λd ,where λd is proportional to the percentage of variability in thegenome-screen data that is explained by Vd .
19 / 33
PCA of Europeans
The top principal components are viewed as continuous axes ofvariation that reflect genetic variation due to ancestry in the sample.
Individuals with similar values for a particular top principal componentwill have similar ancestry for that axes.
A application of principal components to genetic data from Europeansamples (Novembre et al., Nature 2008) showed that amongEuropeans for whom all four grandparents originated in the samecountry, the first two principal components computed using 200,000SNPs could map their country of origin quite accurately in the plane
20 / 33
PCA of Europeans
21 / 33
Ancestry Inference with PCA for HCHS/SoL
Caribbean
Central/South American
African
European
East Asian
Central/South American
1
22 / 33
Ancestry Inference with PCA for HCHS/SoL
−0.02 −0.01 0.00 0.01 0.02
−0.
04−
0.03
−0.
02−
0.01
0.00
0.01
0.02
PC1
PC
2
−0.01 0.00 0.01 0.02 0.03
−0.
03−
0.02
−0.
010.
000.
010.
020.
03
PC3
PC
4
−0.04 −0.02 0.00 0.02 0.04 0.06
−0.
25−
0.20
−0.
15−
0.10
−0.
050.
00
PC5
PC
6
●
●
●
●
●
●
●
●
DominicanCentralAmCubanMexicanPuertoRicanSouthAmMulti/OtherUnknown
−0.04 −0.02 0.00 0.02 0.04
−0.
04−
0.02
0.00
0.02
0.04
PC7
PC
8
23 / 33
Supervised Learning for Ancestry Admixture
The chromosomes of an admixed individual represent a mosaic ofchromosomal blocks from the ancestral populations.
24 / 33
Supervised Learning for Ancestry Admixture
Methods have recently been developed for supervised learning ofancestry proportions for an admixed individuals using high-densitySNP data.
Most use either a hidden Markov model (HMM) or anExpectation-Maximization (EM) algorithm to infer ancestry
Example: Suppose we are interested in identifying the ancestryproportions for an admixed individual
Observed sequence on a chromosome for an admixed individual:...TATACGTGCACCTGGATTACAGATTACAGATTACAGATTACATTGCATCGATCGAA...
Observed sequence on a chromosome for samples selected from a”homogenous” reference population:...TGATCCTGAACCTAGATTACAGATTACAGATTACAGATTACAATGCTTCGATGGAC......AGATCCTGAACCTAGATTACAGATTACAGATTACAGATACCAATGCTTCGATGGAC......CGATCCTGAACCTAGATTACAGATTACAGATTTGCGTATACAATGCTTCGATGGAC...
25 / 33
Women’s Health Initiative
The Womens Health Initiative (WHI) is a national health studyfocusing on strategies for preventing chronic diseases inpostmenopausal women.
A total of 161,808 women aged 50-79 yrs. old were recruited from 40clinical centers in the US between 1993 and 1998.
The WHI SNP Health Association Resource (SHARe) minority cohortincludes 8421 self-identified African American women from and 3587self-identified Hispanic women
Genotype data for 1 million SNPs from across the genome
26 / 33
Supervised Ancestry Estimation:WHI-SHARe
Genome-screen data was used to estimate the admixture proportions
Estimated genome-wide ancestry proportions of every individual usingFRAPPE (Tang et al., 2005) software
Used genotype data from available reference panels for three”ancestral” populations
I HapMap YRI for West African ancestryI HapMap CEU samples for northern and western European ancestryI HGDP Native American samples for Native American ancestry.
27 / 33
Supervised Ancestry InferenceAncestry of WHI African Americans
Individual
Anc
estr
y
0.0
0.2
0.4
0.6
0.8
1.0
EuropeanAfricanNative American
28 / 33
Table: Average Estimated Ancestry Proportions for Self-reported AfricanAmericans
Estimated Ancestry Proportions (SD)Population European African Native American
African Americans .209 (.153) .767 (.155) .024 (.023)
29 / 33
WHI Hispanic Subpopulations
Of the total cohort of WHI Hispanics, 3,036 (or 84.6%) reportedspecific nationality of origin in a follow-up questionnaire (Form 41).
I 395 labeled themselves as Puerto RicanI 1,548 as Mexican or ChicanoI 258 as CubanI The remaining 835 as Latino from a category other than those three.
30 / 33
WHI−SHARE Ancestry Admixture: Puerto Ricans
Puerto Rican Individuals
Cum
ulat
ive
Anc
estr
y
0.0
0.3
0.6
0.9
WHI−SHARE Ancestry Admixture: Mexicans
Mexican Individuals
Cum
ulat
ive
Anc
estr
y
0.0
0.3
0.6
0.9
WHI−SHARE Ancestry Admixture: Cubans
Cuban Individuals
Cum
ulat
ive
Anc
estr
y
0.0
0.3
0.6
0.9
31 / 33
Table: Average Estimated Ancestry Proportions within Self-reported EthnicGroups
Estimated Ancestry Proportions (SD)Population European African Native American
Puerto Ricans .648 (.216) .215 (.208) .136 (.078)Mexicans .547 (.160) .030 (.058) .423 (.158)Cubans .840 (.197) .098 (.179) .062 (.099)
32 / 33
Acknowledgement of Research Funding
Research was supported in part by the National Institutes of Health grants:
NCI K01CA148958
NIGMS P01GM099568
NHLBI HHSN268201300005
33 / 33
Contact Information
Please feel free to contact me:
Timothy Thornton, PhD
email: [email protected]
phone: 206-543-8004
34 / 33
Using R for Ancestry Inference
We now will use R to perform PCA on real genetic data for ancestryinference
Access the short course website for the data files, exercises, and Rscript:
http://faculty.washington.edu/tathornt/IPC2015.html
35 / 33