Workshop in Bioinformatics

Eran Halperin

The Human Genome Project

“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”

“But our work previously has shown… that having one genetic code is important, but it's not all that useful.”

“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”

Washington, DC, June 26, 2000

The Vision of Personalized Medicine

Genetic and epigenetic variants + measurable environmental/behavioral factors would be used for personalized treatment and diagnosis.

Example: Warfarin

An anticoagulant drug, useful in the prevention of thrombosis.

Warfarin was originally used as rat poison.

The optimal dose varies across the population.

Genetic variants in VKORC1 and CYP2C9 account for part of the variation in the optimal personal dose.

Association Studies


AGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGTCAGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCGACATGATAGCCAGAGCAGTCGACAGGTATAGCCTACATGAGATCAACATGAGATCGGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACATGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGTCAGAGCAGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCC

AGAGCAGTCGACATGTATAGTCTACATGAGATCGACATGAGATCGGTAGAGCAGTGAGATCAACATGATAGCCAGAGCAGTCGACATGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCAGTCGACATGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCCAGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGTCAGAGCCGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCAACATGATAGCCAGAGCAGTCGACAGGTATAGTCTACATGAGATCGACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGCCAGAGCCGTCGACAGGTATAGCCTACATGAGATCGACATGAGATCTGTAGAGCCGTGAGATCGACATGATAGCCAGAGCCGTCGACAGGTATAGTCTACATGAGATCAACATGAGATCTGTAGAGCAGTGAGATCGACATGATAGTC

Cases:

Controls:

Associated SNP

Where should we look?

SNP = Single Nucleotide Polymorphism. SNPs are usually bi-allelic.

Published Genome-Wide Associations through 6/2009: 439 published GWA at p < 5 x 10^-8

NHGRI GWA Catalog: www.genome.gov/GWAStudies


Environmental Factors

Genetic Factors

Complex disease

Multiple genes may affect the disease.

Therefore, the effect of every single gene may be negligible.

How does it work?

• For every SNP we can construct a contingency table of allele counts:

              A    G    Total
  Cases       a    b    n
  Controls    c    d    n

$n = a + b = c + d$

$p_1 = a / n$

$p_2 = c / n$

$p = \frac{p_1 + p_2}{2}$

$T = \frac{n (p_1 - p_2)^2}{p (1 - p)}$
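The test above can be tried out with a few lines of Python. This is a minimal sketch, not code from the workshop: the function name `allelic_test` and the example counts are made up for illustration. It computes T as written above and, for comparison, the Pearson chi-square statistic of the same 2x2 table. With equal row totals the Pearson statistic works out to T/2, and it is the quantity that follows a chi-square distribution with one degree of freedom under the null hypothesis, so the p-value below is taken from it.

```python
import math


def allelic_test(a, b, c, d):
    """Return (T, pearson_chi2, p_value) for a 2x2 table of allele counts.

    Rows are cases (a, b) and controls (c, d); columns are alleles A and G.
    Both rows are assumed to have the same total n, as in the text.
    """
    n = a + b
    if n != c + d:
        raise ValueError("the formula assumes equal totals in cases and controls")
    p1 = a / n          # frequency of allele A in cases
    p2 = c / n          # frequency of allele A in controls
    p = (p1 + p2) / 2   # pooled frequency of allele A
    if p in (0.0, 1.0):
        raise ValueError("monomorphic SNP: the statistic is undefined")
    t = n * (p1 - p2) ** 2 / (p * (1 - p))
    # Pearson chi-square: N (ad - bc)^2 / (product of row and column totals), N = 2n
    pearson = 2 * n * (a * d - b * c) ** 2 / (n * n * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(pearson / 2))
    return t, pearson, p_value


t, chi2, pval = allelic_test(a=60, b=40, c=40, d=60)
print(f"T = {t:.2f}, Pearson chi2 = {chi2:.2f}, p = {pval:.4g}")
```

In a genome-wide scan this function would be called once per SNP, which is what makes the multiple-testing issue unavoidable.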

Results: Manhattan Plots

The curse of dimensionality – corrections of multiple testing

• In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs.

• If we set the p-value threshold for each test to be 0.05, then by chance alone we will “find” about 5% of the SNPs to be associated with the disease, even if none of them truly is.

• This needs to be corrected.

Bonferroni Correction

• If the number of tests is n, we set the threshold to be 0.05/n.

• A very conservative test. If the tests are independent then it is reasonable to use it. If the tests are correlated this could be bad:
  – Example: If all SNPs are identical, then we lose a lot of power; the false positive rate is reduced, but so is the power.
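A small simulation makes the effect of the correction concrete. The sketch below is illustrative only: it assumes independent tests with no true association, so every p-value is drawn uniformly from [0, 1), and the number of tests and the random seed are arbitrary choices.

```python
import random


def count_hits(p_values, threshold):
    """Number of tests declared significant at the given threshold."""
    return sum(p < threshold for p in p_values)


rng = random.Random(0)
n_tests = 100_000
# Under the null hypothesis a p-value is uniformly distributed on [0, 1).
null_p_values = [rng.random() for _ in range(n_tests)]

naive = count_hits(null_p_values, 0.05)
bonf = count_hits(null_p_values, 0.05 / n_tests)

print(f"threshold 0.05:      {naive} false positives out of {n_tests}")
print(f"threshold 0.05 / n:  {bonf} false positives out of {n_tests}")
```

The uncorrected threshold flags roughly one SNP in twenty even though nothing is associated, while the Bonferroni threshold flags almost none. The price, as noted above, is lost power when the tests are strongly correlated.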

Data

HUJI 2006

The International HapMap Project: an international consortium that aims to genotype 270 individuals from four different populations.

- Launched in 2002.
- First phase (2005): ~1 million SNPs for 270 individuals from four populations
- Second phase (2007): ~3.1 million SNPs for 270 individuals from four populations
- Third phase (ongoing): > 1 million SNPs for 1115 individuals across 11 populations

Other Data Sources

• Human Genome Diversity Project
  – 50 populations, 1000 individuals, 650k SNPs

• POPRES
  – 6000 individuals (controls)

• ENCODE Project
  – Resequencing, discovery of new SNPs

• 1000 Genomes Project

• dbGaP

Haplotypes

• Can 1,000,000 SNPs tell us everything?

• No, but they can still tell us a lot about the rest of the genome.
  – SNPs in physical proximity are correlated.
  – A sequence of alleles along a chromosome is called a haplotype.

Haplotype Data in a Block

(Daly et al., 2001) Block 6 from Chromosome 5q31

LD structure

Phasing - haplotype inference

• Cost-effective genotyping technology gives genotypes and not haplotypes.

[Figure: haplotypes versus genotype. Haplotypes: ATCCGAAGACGC and ATACGAAGCCGC (mother chromosome, father chromosome). Possible phases: AGACGAATCCGC ….]
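To see why a genotype does not determine the haplotypes, the sketch below collapses the two haplotypes from the slide into an unphased genotype and then lists every haplotype pair that would produce the same genotype. The function names are made up for illustration; with k heterozygous sites there are 2^(k-1) distinct phasings.

```python
from itertools import product


def genotype_of(h1, h2):
    """Unordered pair of alleles at each site: what genotyping actually reports."""
    return [tuple(sorted(pair)) for pair in zip(h1, h2)]


def possible_phases(genotype):
    """Enumerate all haplotype pairs that are consistent with the genotype."""
    het = [i for i, (x, y) in enumerate(genotype) if x != y]
    phases = []
    # The first heterozygous site is kept fixed so that (h1, h2) and (h2, h1)
    # are not listed as two different phasings.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        h1 = [x for x, _ in genotype]
        h2 = [y for _, y in genotype]
        for site, flip in zip(het[1:], flips):
            if flip:
                h1[site], h2[site] = h2[site], h1[site]
        phases.append(("".join(h1), "".join(h2)))
    return phases


genotype = genotype_of("ATCCGAAGACGC", "ATACGAAGCCGC")
for h1, h2 in possible_phases(genotype):
    print(h1, h2)
```

For this example only two sites are heterozygous, so there are two candidate phasings; for a long region the number of candidates grows exponentially, which is why phasing relies on family data or on population-level haplotype structure.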

Inferring Haplotypes From Trios

              Genotype   Inferred haplotypes
  Parent 1    122112     10011? / 11111?
  Parent 2    210022     11000? / 01001?
  Child       120222     10011? / 11000?

(? marks a position whose phase remains unresolved; homozygous positions are filled in first, then positions resolved through the other family members.)

Assumption: No recombination
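The trio logic can be written as a short site-by-site rule. The sketch below is an illustration under stated assumptions, not the workshop's code: genotypes are assumed to be coded with 0 and 1 for the two homozygous states and 2 for a heterozygous site, the function name `phase_trio` is made up, and the input is assumed to be free of Mendelian errors (no consistency check is performed).

```python
def phase_trio(parent1, parent2, child):
    """Phase a trio site by site, assuming no recombination.

    Returns ((t1, u1), (t2, u2)): for each parent, the haplotype transmitted
    to the child and the untransmitted one. '?' marks an unresolved site.
    The child's two haplotypes are t1 and t2.
    """
    def flip(allele):
        return "1" if allele == "0" else "0"

    def transmitted(parent, other, kid):
        if parent != "2":
            return parent        # homozygous parent can pass on only one allele
        if kid != "2":
            return kid           # homozygous child received this allele from both parents
        if other != "2":
            return flip(other)   # heterozygous child: the other parent's allele is known
        return "?"               # all three heterozygous: cannot be resolved

    def untransmitted(parent, t):
        if parent != "2":
            return parent
        return "?" if t == "?" else flip(t)

    t1 = u1 = t2 = u2 = ""
    for a, b, c in zip(parent1, parent2, child):
        x = transmitted(a, b, c)
        y = transmitted(b, a, c)
        t1 += x
        u1 += untransmitted(a, x)
        t2 += y
        u2 += untransmitted(b, y)
    return (t1, u1), (t2, u2)


(p1_t, p1_u), (p2_t, p2_u) = phase_trio("122112", "210022", "120222")
print("Parent 1:", p1_t, p1_u)
print("Parent 2:", p2_t, p2_u)
print("Child:   ", p1_t, p2_t)
```

Applied to the genotypes on the slide, only the last site stays unresolved, because all three family members are heterozygous there.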

Population Substructure

• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
  – Many association signals are going to be found
  – The vast majority of them are false

Why?

Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.

Natural Selection

• Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection in the LCT gene, leading to different allele frequencies in LCT.


Genetic Drift

• Even without selection, the allele frequencies in the population are not fixed across time.

• Consider the following case:

– We assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population.

– We assume a constant population size, no mutation, no selection

Genetic Drift: The Wright-Fisher Model

Generation 1: allele frequency 1/9
Generation 2: allele frequency 1/9
Generation 3: allele frequency 1/9
Generation 4: allele frequency 1/3
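The Wright-Fisher model is easy to simulate: each copy in the next generation picks its parent uniformly at random from the current generation, so the new allele count is a binomial draw with the current frequency. The sketch below is a minimal illustration under the assumptions listed earlier (constant population size, no mutation, no selection). The population of 9 copies mirrors the frequencies shown on the slides; the function name and the seed are arbitrary.

```python
import random


def wright_fisher(n_copies, start_count, generations, seed=0):
    """Simulate allele-frequency drift; returns one frequency per generation."""
    rng = random.Random(seed)
    count = start_count
    freqs = [count / n_copies]
    for _ in range(generations):
        p = count / n_copies
        # Every copy in the next generation independently inherits the allele
        # with probability p (binomial sampling).
        count = sum(rng.random() < p for _ in range(n_copies))
        freqs.append(count / n_copies)
    return freqs


trajectory = wright_fisher(n_copies=9, start_count=1, generations=20)
print(" ".join(f"{f:.2f}" for f in trajectory))
```

Running it with different seeds shows the point of the slides: the frequency wanders without any selection, and in a small population the allele is quickly either lost (frequency 0) or fixed (frequency 1), after which it no longer changes.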

Ancestral population → migration → genetic drift → different allele frequencies

Population Substructure

• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
  – Many association signals are going to be found
  – The vast majority of them are false

What can we do about it?

Jakobsson et al., Nature 451: 998-1003 (2008)

Principal Component Analysis

• Dimensionality reduction
• Based on linear algebra
• Intuition: find the ‘most important’ features of the data

Plotting the data on a one-dimensional line for which the ‘spread’ is maximized.

• In our case, we want to look at two dimensions at a time.

• The original data has many dimensions – each SNP corresponds to one dimension.
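The idea of the line with maximal spread can be sketched without any numerical library by using power iteration on the centered genotype matrix. Everything in this example is an illustrative assumption: the toy genotype matrix (rows are individuals, columns are SNPs, entries are allele counts), the function name, and the choice of power iteration, which is one simple way to find the first principal component. Real analyses would use a linear-algebra library and look at several components.

```python
import math
import random


def first_pc_scores(genotypes, iterations=200, seed=0):
    """Project each individual onto the first principal component."""
    n, m = len(genotypes), len(genotypes[0])
    # Center every SNP (column) so that the component captures variance, not the mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(m)]   # random starting direction
    for _ in range(iterations):
        scores = [sum(xi[j] * v[j] for j in range(m)) for xi in x]            # X v
        v = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(m)]    # X^T (X v)
        norm = math.sqrt(sum(c * c for c in v))   # zero only if all rows are identical
        v = [c / norm for c in v]
    return [sum(xi[j] * v[j] for j in range(m)) for xi in x]


# Toy data: two individuals from each of two populations with different allele frequencies.
toy_genotypes = [
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 1],
    [0, 0, 2, 2, 1],
    [0, 0, 2, 1, 1],
]
for label, score in zip(["pop1", "pop1", "pop2", "pop2"], first_pc_scores(toy_genotypes)):
    print(label, round(score, 3))
```

Individuals from the same toy population land on the same side of zero, which is the behaviour that makes principal components useful for detecting population structure.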

HapMap Populations

[Figure: map of the HapMap populations: CEU, ASW, CHB, CHD, GIH, JPT, LWK, MEX, MKK, TSI, YRI]

HapMap PCA 1-2

HapMap PCA 1-3

HapMap PCA 1,2,4

Ancestry Inference:

• To what extent can population structure be detected from SNP data?
• What can we learn from these inferences?

Novembre et al., 2008

Ancestry inference in recently admixed populations

[Figure: percent racial admixture (0% to 100%) for individual subjects 1-90, Puerto Rican population (GALA study, E. Burchard). Legend: European, Native American, African.]

Recombination Events

[Figure: Copy 1, Copy 2, and the resulting child chromosome]

Probability r_i for recombination in position i.
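The copying process with per-position recombination probabilities can be sketched as follows. This is a simplified illustration: the function name, the toy sequences, and the choice to start from either copy with equal probability are assumptions, not details taken from the slides.

```python
import random


def make_child_chromosome(copy1, copy2, r, rng):
    """Build a child chromosome by copying from one parental copy at a time.

    r[i] is the probability of a recombination event at position i, that is,
    of switching to the other copy just before position i is copied.
    Returns the child sequence and, per position, which copy (1 or 2) it came from.
    """
    copies = (copy1, copy2)
    source = rng.randrange(2)   # assumed: either copy is equally likely to start
    child, origin = [], []
    for i in range(len(copy1)):
        if i > 0 and rng.random() < r[i]:
            source = 1 - source  # recombination event at position i
        child.append(copies[source][i])
        origin.append(source + 1)
    return "".join(child), origin


rng = random.Random(1)
child, origin = make_child_chromosome("AAAAAAAAAA", "CCCCCCCCCC", [0.2] * 10, rng)
print(child)
print("".join(str(o) for o in origin))
```

The child chromosome is a mosaic of segments from the two copies, and lower values of r_i give longer segments.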

Recently Admixed Populations

After generation 1

After generation 2

After generation 10

Chromosome

  W      Recombination indicators
  Z      Ancestral states
  X      Alleles
  p, q   Allele frequencies
  g      Generations
  r      Recombination rate
  α      Admixture fraction
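One way to read these variables is as a generative model for a chromosome from a two-way admixed individual. The sketch below is an assumption-laden illustration rather than the model actually used in the workshop: it assumes a Markov chain along the chromosome in which a recombination indicator W_i fires with probability 1 - exp(-g r) between adjacent SNPs, the ancestral state Z_i is redrawn with admixture fraction α after each recombination, and the allele X_i is drawn from frequency p_i or q_i depending on the ancestry.

```python
import math
import random


def simulate_admixed_chromosome(p, q, alpha, g, r, rng):
    """Draw (W, Z, X) for one chromosome under the assumed model.

    p[i], q[i]: frequency of allele 1 at SNP i in ancestral populations 1 and 2.
    alpha: probability that a segment comes from population 1.
    g: generations since admixture; r: recombination rate between adjacent SNPs.
    """
    switch_prob = 1 - math.exp(-g * r)   # assumed form of the recombination probability
    w, z, x = [], [], []
    ancestry = None
    for i in range(len(p)):
        recomb = i == 0 or rng.random() < switch_prob
        if recomb:
            ancestry = 1 if rng.random() < alpha else 2   # redraw the ancestral state
        w.append(0 if i == 0 else int(recomb))
        z.append(ancestry)
        freq = p[i] if ancestry == 1 else q[i]
        x.append(int(rng.random() < freq))
    return w, z, x


rng = random.Random(0)
n_snps = 30
w, z, x = simulate_admixed_chromosome(
    p=[0.9] * n_snps, q=[0.1] * n_snps, alpha=0.3, g=10, r=0.01, rng=rng
)
print("Z:", "".join(map(str, z)))
print("X:", "".join(map(str, x)))
```

Ancestry inference runs this logic in reverse: given the observed alleles X and estimates of p and q, it infers the hidden ancestral states Z along the chromosome.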

Overall Accuracy

Applications:

• Population genetics (admixture events, recombination events, selection forces, migration patterns)
• Potential applications in personalized medicine
• Finding new associations (through admixture mapping)

Admixture Mapping