Workshop in Bioinformatics
Eran Halperin
The Human Genome Project
“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”
“But our work previously has shown… that having one genetic code is important, but it's not all that useful.”
“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”
Washington, DC, June 26, 2000
The Vision of Personalized Medicine
Genetic and epigenetic variants + measurable environmental/behavioral factors would be used for personalized treatment and diagnosis
Example: Warfarin
An anticoagulant drug, useful in the prevention of thrombosis.
Warfarin was originally used as rat poison.
Optimal dose varies across the population.
Genetic variants (in VKORC1 and CYP2C9) explain part of the variation in the optimal personalized dose.
Association Studies
[Figure: aligned DNA sequences from cases and from controls, with the associated SNP marked]
Where should we look?
SNP = Single Nucleotide Polymorphism
Usually SNPs are bi-allelic
Published Genome-Wide Associations through 6/2009: 439 published GWA at p < 5 × 10^-8
NHGRI GWA Catalog: www.genome.gov/GWAStudies
Environmental Factors
Genetic Factors
Complex disease
Multiple genes may affect the disease.
Therefore, the effect of every single gene may be negligible.
How does it work?
• For every SNP we can construct a contingency table of allele counts:

            A    G    Total
Cases       a    b    n
Controls    c    d    n

$$n = a + b = c + d, \qquad p_1 = \frac{a}{n}, \qquad p_2 = \frac{c}{n}, \qquad p = \frac{p_1 + p_2}{2}$$

$$T = \frac{n\,(p_1 - p_2)^2}{p\,(1 - p)}$$
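The statistic above can be computed directly from the four allele counts. The following is a minimal sketch, not the workshop's own code: the function name `allelic_test` and the example counts are hypothetical, and the formula is the slide's T, which assumes equal totals n in cases and controls.

```python
def allelic_test(a, b, c, d):
    """Return (p1, p2, p, T) for allele counts a, b in cases and c, d in controls.

    Follows the slide's definitions; requires a + b == c + d == n.
    """
    n = a + b
    if n != c + d:
        raise ValueError("this form of the statistic assumes a + b == c + d")
    p1 = a / n               # frequency of allele A in cases
    p2 = c / n               # frequency of allele A in controls
    p = (p1 + p2) / 2        # pooled frequency of allele A
    if p in (0.0, 1.0):
        raise ValueError("SNP is monomorphic; T is undefined")
    T = n * (p1 - p2) ** 2 / (p * (1 - p))
    return p1, p2, p, T


# Hypothetical counts: 60 A / 40 G alleles in cases, 40 A / 60 G in controls.
p1, p2, p, T = allelic_test(a=60, b=40, c=40, d=60)
print(p1, p2, p, T)
```

For equal group sizes, the Pearson chi-square statistic of the same 2×2 table works out to T/2, so both statistics rank SNPs identically.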
Results: Manhattan Plots
The curse of dimensionality: correction for multiple testing
• In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs.
• If we set the p-value threshold for each test to be 0.05, by chance we will “find” about 5% of the SNPs to be associated with the disease.
• This needs to be corrected.
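To see the size of the problem, one can simulate a study in which no SNP is truly associated. The sketch below assumes that p-values under the null hypothesis are uniform on [0, 1] and independent across SNPs; the function name, the number of SNPs, and the seed are illustrative choices.

```python
import random


def count_false_positives(num_snps, alpha, seed=0):
    """Count null SNPs whose (uniform) p-value falls below alpha."""
    rng = random.Random(seed)
    return sum(1 for _ in range(num_snps) if rng.random() < alpha)


num_snps = 200_000           # illustrative; a real GWAS tests millions
alpha = 0.05
expected = num_snps * alpha  # about 5% of all tested SNPs
observed = count_false_positives(num_snps, alpha)
print(expected, observed)
```

With no real signal at all, roughly one SNP in twenty still passes the 0.05 threshold.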
Bonferroni Correction
• If the number of tests is n, we set the threshold to be 0.05/n.
• A very conservative test. If the tests are independent, it is reasonable to use it. If the tests are correlated, this could be bad:
– Example: If all SNPs are identical, then we lose a lot of power; the false positive rate is reduced, but so is the power.
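As a small sketch of the correction, the helper names below are hypothetical and the p-values are made up for illustration:

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / num_tests


def significant_snps(p_values, alpha=0.05):
    """Indices of SNPs whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, pv in enumerate(p_values) if pv < threshold]


# One million tests at alpha = 0.05 give a per-test threshold of 5 x 10^-8.
genome_wide = bonferroni_threshold(0.05, 1_000_000)

# Made-up p-values for four SNPs; the per-test threshold is 0.05 / 4 = 0.0125.
hits = significant_snps([1e-9, 0.01, 0.2, 0.04])
print(genome_wide, hits)
```

In the extreme case where all SNPs are identical there is effectively a single test, yet the threshold is still divided by the full number of SNPs, which is where the power is lost.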
Data
HUJI 2006
HapMap: an international consortium that aims to genotype 270 individuals from four different populations.
- Launched in 2002.
- First phase (2005): ~1 million SNPs for 270 individuals from four populations
- Second phase (2007): ~3.1 million SNPs for 270 individuals from four populations
- Third phase (ongoing): > 1 million SNPs for 1115 individuals across 11 populations
Other Data Sources
• Human Genome Diversity Project
– 50 populations, 1000 individuals, 650k SNPs
• POPRES
– 6000 individuals (controls)
• ENCODE Project
– Resequencing, discovery of new SNPs
• 1000 Genomes Project
• dbGaP
Haplotypes
• Can 1,000,000 SNPs tell us everything?
• No, but they can still tell us a lot about the rest of the genome.
– SNPs in physical proximity are correlated.
– A sequence of alleles along a chromosome is called a haplotype.
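One common way to quantify the correlation between two nearby SNPs is the squared correlation coefficient r² computed from haplotypes. The sketch below assumes haplotypes are strings over the alleles "0" and "1"; the function name and the six example haplotypes are hypothetical.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between SNP i and SNP j across haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n    # frequency of allele 1 at SNP i
    p_j = sum(h[j] == "1" for h in haplotypes) / n    # frequency of allele 1 at SNP j
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ij - p_i * p_j                              # linkage disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom


# Hypothetical haplotypes over three SNPs: SNPs 0 and 1 always carry the
# same allele, while SNP 2 varies independently of SNP 0.
haps = ["110", "110", "001", "000", "111", "000"]
r2_01 = r_squared(haps, 0, 1)
r2_02 = r_squared(haps, 0, 2)
print(r2_01, r2_02)
```

When r² is close to 1, genotyping one of the two SNPs tells us almost everything about the other, which is why a subset of SNPs can be informative about the rest of the genome.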
Haplotype Data in a Block
(Daly et al., 2001) Block 6 from Chromosome 5q31
LD structure
Phasing - haplotype inference
• Cost-effective genotyping technology gives genotypes, not haplotypes.
Haplotypes (mother chromosome / father chromosome) and the resulting genotype:
Genotype: A, {T,G}, {C,A}, C, G, {A,C}
Possible phases:
ATCCGA / AGACGC
ATACGA / AGCCGC
AGACGA / ATCCGC
….
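The ambiguity can be made concrete by enumerating every haplotype pair that is consistent with a genotype. This is a sketch under an assumed encoding: each site is a string holding one allele (homozygous) or two alleles (heterozygous), and the function name `possible_phases` is hypothetical. The example genotype is the six-SNP one from the slide.

```python
from itertools import product


def possible_phases(genotype):
    """List all unordered haplotype pairs consistent with an unphased genotype."""
    het_sites = [k for k, alleles in enumerate(genotype) if len(alleles) == 2]
    if not het_sites:
        hap = "".join(genotype)
        return [(hap, hap)]
    phases = []
    first = het_sites[0]
    # Fix the orientation of the first heterozygous site so that each
    # unordered pair of haplotypes is listed exactly once.
    for choice in product((0, 1), repeat=len(het_sites) - 1):
        pick = dict(zip(het_sites[1:], choice))
        pick[first] = 0
        hap1, hap2 = [], []
        for k, alleles in enumerate(genotype):
            if len(alleles) == 1:
                hap1.append(alleles)
                hap2.append(alleles)
            else:
                hap1.append(alleles[pick[k]])
                hap2.append(alleles[1 - pick[k]])
        phases.append(("".join(hap1), "".join(hap2)))
    return phases


genotype = ["A", "TG", "CA", "C", "G", "AC"]
phases = possible_phases(genotype)
for hap1, hap2 in phases:
    print(hap1, hap2)
```

A genotype with k heterozygous sites has 2^(k-1) possible phases, so the number of candidates grows exponentially with the number of heterozygous SNPs.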
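Inferring Haplotypes From Trios
Genotypes (2 = heterozygous site):
Parent 1: 122112
Parent 2: 210022
Child: 120222
Assumption: No recombination
Haplotype pairs at successive steps, in slide order (? = unresolved):
Parent 1: 1??11? / 1??11?
Parent 2: ?100?? / ?100??
Child: 1?0??? / 1?0???

Parent 1: 10?11? / 11?11?
Parent 2: 1100?? / 0100??
Child: 100??? / 110???

Parent 1: 1??11? / 1??11?
Parent 2: 1100?? / 0100??
Child: 1?0??? / 1?0???

Parent 1: 10011? / 11111?
Parent 2: 11000? / 01001?
Child: 10011? / 11000?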
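The trio logic can be sketched in a few lines. The code assumes the encoding that the slide's example appears to use (0 and 1 for the two homozygous genotypes, 2 for a heterozygous site), assumes no recombination and no Mendelian errors, and uses a hypothetical function name.

```python
def phase_trio(parent1, parent2, child):
    """Return the child's haplotypes as (inherited from parent 1, inherited from parent 2).

    Genotypes are strings over '0', '1' (homozygous) and '2' (heterozygous).
    Sites that cannot be resolved are reported as '?'.
    """
    from_p1, from_p2 = [], []
    for g1, g2, gc in zip(parent1, parent2, child):
        if gc != "2":
            # Child is homozygous: both transmitted alleles are known.
            t1 = t2 = gc
        elif g1 != "2":
            # Parent 1 is homozygous, so it must have transmitted its allele.
            t1 = g1
            t2 = "1" if g1 == "0" else "0"
        elif g2 != "2":
            # Parent 2 is homozygous, so it must have transmitted its allele.
            t2 = g2
            t1 = "1" if g2 == "0" else "0"
        else:
            # All three are heterozygous: the site stays ambiguous.
            t1 = t2 = "?"
        from_p1.append(t1)
        from_p2.append(t2)
    return "".join(from_p1), "".join(from_p2)


hap_from_p1, hap_from_p2 = phase_trio("122112", "210022", "120222")
print(hap_from_p1, hap_from_p2)
```

Only the last site, where both parents and the child are heterozygous, remains unresolved in this example.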
Population Substructure
• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
– Many association signals are going to be found
– The vast majority of them are false
Why?
Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.
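A small numeric sketch shows how an allele-frequency difference between populations alone produces a large test statistic. It uses the allelic statistic T = n(p1 − p2)² / (p(1 − p)) from the association-test slide; the function name and all frequencies below are hypothetical, and the SNP is assumed to have no effect on the disease.

```python
def allelic_T(p1, p2, n):
    """Allelic test statistic for allele frequencies p1 (cases) and p2 (controls),
    with n alleles sampled in each group."""
    p = (p1 + p2) / 2
    return n * (p1 - p2) ** 2 / (p * (1 - p))


# Cases and controls drawn from the same population: same frequency, no signal.
same_population = allelic_T(0.30, 0.30, n=2000)

# Cases from one population (frequency 0.10) and controls from another (0.30):
# the SNP looks strongly associated although it has nothing to do with the disease.
stratified = allelic_T(0.10, 0.30, n=2000)
print(same_population, stratified)
```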
Natural Selection
• Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection on the LCT gene
different allele frequencies in LCT
Genetic Drift
• Even without selection, the allele frequencies in the population are not fixed across time.
• Consider the following case:
– We assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population.
– We assume a constant population size, no mutation, no selection
Genetic Drift: The Wright-Fisher Model
Generation 1: allele frequency 1/9
Generation 2: allele frequency 1/9
Generation 3: allele frequency 1/9
Generation 4: allele frequency 1/3
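The Wright-Fisher model is easy to simulate: every gene copy in the next generation picks its parent copy uniformly at random from the current generation. The sketch below assumes a constant population of 9 gene copies with 1 copy of the allele at the start, mirroring the 1/9 frequency on the slides; the function name, the number of generations, and the seed are illustrative.

```python
import random


def wright_fisher(num_copies, initial_count, generations, seed=None):
    """Simulate allele frequencies under the Wright-Fisher model.

    Constant population size, random mating, no mutation, no selection.
    Returns the list of allele frequencies, one per generation.
    """
    rng = random.Random(seed)
    count = initial_count
    freqs = [count / num_copies]
    for _ in range(generations):
        p = count / num_copies
        # Each copy in the next generation carries the allele with probability p.
        count = sum(1 for _ in range(num_copies) if rng.random() < p)
        freqs.append(count / num_copies)
    return freqs


trajectory = wright_fisher(num_copies=9, initial_count=1, generations=20, seed=1)
print(trajectory)
```

Once the frequency reaches 0 or 1 it stays there, so in a small population an allele is quickly lost or fixed by chance alone.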
Ancestral population → migration → genetic drift → different allele frequencies
Population Substructure
• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
– Many association signals are going to be found
– The vast majority of them are false
What can we do about it?
Jakobsson et al., Nature 451: 998-1003
Principal Component Analysis
• Dimensionality reduction
• Based on linear algebra
• Intuition: find the ‘most important’ features of the data
Principal Component Analysis
Plotting the data on a one-dimensional line for which the ‘spread’ is maximized.
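The direction of maximal spread can be found with power iteration on the centered data. The following is a minimal pure-Python sketch rather than the method used for the plots in this workshop: the function name is hypothetical, genotypes are assumed to be coded as allele counts 0/1/2, and the six individuals are made up.

```python
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (direction, scores) of the first principal component.

    rows: list of individuals, each a list of numeric features (one per SNP).
    Uses power iteration on X^T X, where X is the column-centered data.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(m)]
    for _ in range(iterations):
        # Project every individual on v, then map back: w = X^T (X v).
        scores = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            raise ValueError("data has no variance along the current direction")
        v = [x / norm for x in w]
    scores = [sum(x * vj for x, vj in zip(row, v)) for row in centered]
    return v, scores


# Made-up genotypes: three individuals from each of two populations, four SNPs.
genotypes = [
    [2, 2, 0, 1], [2, 1, 0, 1], [2, 2, 0, 0],   # population A
    [0, 0, 2, 1], [0, 1, 2, 1], [0, 0, 2, 0],   # population B
]
direction, scores = first_principal_component(genotypes)
print(scores)
```

The sign of a principal component is arbitrary, but individuals from the same population fall on the same side of the line.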
Principal Component Analysis
• In our case, we want to look at two dimensions at a time.
• The original data has many dimensions – each SNP corresponds to one dimension.
HapMap Populations
CEU, ASW, CHB, CHD, GIH, JPT, LWK, MEX, MKK, TSI, YRI
HapMap PCA 1-2
HapMap PCA 1-3
HapMap PCA 1,2,4
Ancestry Inference:
• To what extent can population structure be detected from SNP data?
• What can we learn from these inferences?
Novembre et al., 2008
Ancestry inference in recently admixed populations
[Figure: percent racial admixture (0% to 100%) for individual subjects 1-90, Puerto Rican population (GALA study, E. Burchard); ancestry components: European, Native American, African]
Recombination Events
Copy 1 + Copy 2 → child chromosome
Probability r_i for recombination in position i.
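A child chromosome can be simulated as a mosaic of the two parental copies. This sketch makes explicit assumptions that the slide does not state: copying starts from a randomly chosen copy, and before each position i > 0 the source switches with probability r[i]. The function name and example inputs are hypothetical.

```python
import random


def recombine(copy1, copy2, r, seed=None):
    """Build a child chromosome from two parental copies.

    r[i] is the assumed probability of a recombination (a switch of the
    source copy) just before position i; r[0] is ignored.
    Returns (child, source), where source[i] is 1 or 2.
    """
    if not (len(copy1) == len(copy2) == len(r)):
        raise ValueError("copy1, copy2 and r must have the same length")
    rng = random.Random(seed)
    copies = (copy1, copy2)
    current = rng.randrange(2)          # start from a random copy
    child, source = [], []
    for i in range(len(copy1)):
        if i > 0 and rng.random() < r[i]:
            current = 1 - current       # recombination event
        child.append(copies[current][i])
        source.append(current + 1)
    return "".join(child), source


no_recomb, _ = recombine("AAAA", "BBBB", [0, 0, 0, 0], seed=3)
always_recomb, sources = recombine("AAAA", "BBBB", [0, 1, 1, 1], seed=3)
print(no_recomb, always_recomb, sources)
```

With all r[i] = 0 the child inherits one copy intact; larger values break the chromosome into shorter segments that alternate between the two copies.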
Recently Admixed Populations
After generation 1
After generation 2
After generation 10
Chromosome
W       Recombination indicators
Z       Ancestral states
X       Alleles
p, q    Allele frequencies
g       Generations
r       Recombination rate
α       Admixture fraction
Overall Accuracy
Applications:
• Population genetics (admixture events, recombination events, selection forces, migration patterns)
• Potential applications in personalized medicine
• Finding new associations (through admixture mapping)
Admixture Mapping