Workshop in Bioinformatics

Eran Halperin

The Human Genome Project

“What we are announcing today is that we have reached a milestone…that is, covering the genome in…a working draft of the human sequence.”

“But our work previously has shown… that having one genetic code is important, but it's not all that useful.”

“I would be willing to make a prediction that within 10 years, we will have the potential of offering any of you the opportunity to find out what particular genetic conditions you may be at increased risk for…”

Washington, DC, June 26, 2000

The Vision of Personalized Medicine

Genetic and epigenetic variants + measurable environmental/behavioral factors would be used for personalized treatment and diagnosis

Example: Warfarin

An anticoagulant drug, useful in the prevention of thrombosis.

Warfarin was originally used as rat poison.

Optimal dose varies across the population

Genetic variants (VKORC1 and CYP2C9) affect the variation of the personalized optimal dose.

Example: Warfarin

Association Studies

[Figure: DNA sequences sampled from cases and from controls, with the associated SNP marked.]

Where should we look?

SNP = Single Nucleotide Polymorphism. Usually SNPs are bi-allelic.

Published Genome-Wide Associations through 6/2009: 439 published GWA at p < 5 x 10^-8

NHGRI GWA Catalog: www.genome.gov/GWAStudies

Environmental Factors

Genetic Factors

Complex disease

Multiple genes may affect the disease.

Therefore, the effect of every single gene may be negligible.

How does it work?

• For every SNP (with two alleles, here A and G) we can construct a contingency table of allele counts:

            A    G    Total
Cases       a    b    n
Controls    c    d    n

$n = a + b = c + d$

$p_1 = a / n$

$p_2 = c / n$

$p = \frac{p_1 + p_2}{2}$

$T = \frac{n (p_1 - p_2)^2}{p (1 - p)}$
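As a minimal sketch of the statistic above, the Python snippet below computes T from the four counts of the contingency table. The function names and the example counts are illustrative assumptions, not part of the slides. With n defined as a row total (n = a + b = c + d), the standard Pearson chi-square statistic for the same 2x2 table works out to T/2, so the snippet computes both for comparison.

```python
def association_statistic(a, b, c, d):
    """T = n (p1 - p2)^2 / (p (1 - p)) for a 2x2 allele-count table.

    a, b: counts of alleles A and G in cases; c, d: the same in controls.
    Assumes equal row totals, n = a + b = c + d, as in the table above.
    """
    n = a + b
    if n != c + d:
        raise ValueError("rows must have the same total n")
    p1 = a / n
    p2 = c / n
    p = (p1 + p2) / 2
    if p in (0.0, 1.0):
        return 0.0  # no variation at this SNP (assumed convention)
    return n * (p1 - p2) ** 2 / (p * (1 - p))


def pearson_chi_square(a, b, c, d):
    """Standard Pearson chi-square for the same 2x2 table (1 degree of freedom)."""
    total = a + b + c + d
    numerator = total * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical counts: 100 alleles sampled in each group.
t = association_statistic(30, 70, 50, 50)
chi2 = pearson_chi_square(30, 70, 50, 50)
print(round(t, 3), round(chi2, 3))
```

Large values of T indicate that the allele frequency differs between cases and controls by more than chance alone would explain.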

Results: Manhattan Plots

The curse of dimensionality – corrections for multiple testing

• In a typical Genome-Wide Association Study (GWAS), we test millions of SNPs.

• If we set the p-value threshold for each test to be 0.05, by chance we will “find” about 5% of the SNPs to be associated with the disease.

• This needs to be corrected.

Bonferroni Correction

• If the number of tests is n, we set the threshold to be 0.05/n.

• A very conservative test. If the tests are independent then it is reasonable to use it. If the tests are correlated this could be bad:
  – Example: If all SNPs are identical, then we lose a lot of power; the false positive rate reduces, but so does the power.
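The arithmetic behind the multiple-testing problem and the Bonferroni threshold can be sketched in a few lines of Python. The function names and the choice of one million SNPs are illustrative assumptions.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Expected number of truly unassociated SNPs that pass a per-test threshold alpha."""
    return num_tests * alpha


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test threshold 0.05/n that keeps the family-wise error rate at most alpha."""
    return alpha / num_tests


num_snps = 1_000_000  # illustrative GWAS size
false_hits = expected_false_positives(num_snps)
threshold = bonferroni_threshold(num_snps)
print(false_hits)  # about 50,000 SNPs pass p < 0.05 by chance
print(threshold)   # about 5e-08
```

With one million tests the corrected threshold is 5 x 10^-8, the same cutoff quoted earlier for the published genome-wide associations.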

Data

HUJI 2006

The International HapMap Project: an international consortium that aims to genotype 270 individuals from four different populations.

- Launched in 2002.
- First phase (2005): ~1 million SNPs for 270 individuals from four populations
- Second phase (2007): ~3.1 million SNPs for 270 individuals from four populations
- Third phase (ongoing): > 1 million SNPs for 1115 individuals across 11 populations

Other Data Sources

• Human Genome Diversity Project – 50 populations, 1000 individuals, 650k SNPs

• POPRES – 6000 individuals (controls)

• ENCODE Project – Resequencing, discovery of new SNPs

• 1000 Genomes project

• dbGaP

Haplotypes

• Can 1,000,000 SNPs tell us everything?

• No, but they can still tell us a lot about the rest of the genome.
  – SNPs in physical proximity are correlated.
  – A sequence of alleles along a chromosome is called a haplotype.
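One common way to quantify the correlation between two nearby bi-allelic SNPs is the r^2 measure of linkage disequilibrium, computed from haplotypes. The sketch below assumes haplotypes are given as strings over {0, 1}; the function name and the toy haplotypes are made up for illustration.

```python
def ld_r_squared(haplotypes, i, j):
    """r^2 between SNP i and SNP j from a list of 0/1 haplotype strings."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_ij - p_i * p_j  # deviation from independence
    return d * d / denom


toy_haplotypes = ["000", "000", "110", "111"]
r2_01 = ld_r_squared(toy_haplotypes, 0, 1)  # SNPs 0 and 1 always agree
r2_02 = ld_r_squared(toy_haplotypes, 0, 2)  # SNP 2 is only partly correlated
print(r2_01, r2_02)
```

An r^2 of 1 means one SNP fully predicts the other, which is why a subset of SNPs can carry information about SNPs that were not genotyped.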

Haplotype Data in a Block

(Daly et al., 2001) Block 6 from Chromosome 5q31

LD structure

Phasing - haplotype inference

• Cost-effective genotyping technology gives genotypes and not haplotypes.

[Figure: haplotypes (mother chromosome, father chromosome) versus the unphased genotype, with the possible phases listed. Sequences shown in the figure: ATCCGAAGACGC, ATACGAAGCCGC, AGACGAATCCGC ….]
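To see why a genotype alone does not determine the haplotypes, the sketch below enumerates every haplotype pair consistent with one genotype. It assumes a 0/1/2 coding (0 and 1 for the two homozygous genotypes, 2 for a heterozygous site); the function name and this coding are assumptions made for the example.

```python
from itertools import product


def possible_phases(genotype):
    """List all unordered haplotype pairs consistent with a 0/1/2 genotype string."""
    flip = {"0": "1", "1": "0"}
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    if not het_sites:
        return [(genotype, genotype)]
    phases = []
    # Fix the first heterozygous site to allele 0 so each pair is listed once.
    for rest in product("01", repeat=len(het_sites) - 1):
        choice = dict(zip(het_sites, ("0",) + rest))
        h1 = "".join(choice.get(i, g) for i, g in enumerate(genotype))
        h2 = "".join(
            flip[choice[i]] if i in choice else g
            for i, g in enumerate(genotype)
        )
        phases.append((h1, h2))
    return phases


for pair in possible_phases("1202"):
    print(pair)
```

A genotype with k heterozygous sites has 2^(k-1) possible phases, so the ambiguity grows quickly with the number of SNPs.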

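Inferring Haplotypes From Trios

Genotypes (0 and 1 denote the two homozygous genotypes, 2 denotes a heterozygous genotype):

Parent 1    122112
Parent 2    210022
Child       120222

Haplotype pairs as they are resolved step by step (? = allele not yet determined):

          Parent 1           Parent 2           Child
Step 1    1??11? / 1??11?    ?100?? / ?100??    1?0??? / 1?0???
Step 2    1??11? / 1??11?    1100?? / 0100??    1?0??? / 1?0???
Step 3    10?11? / 11?11?    1100?? / 0100??    100??? / 110???
Step 4    10011? / 11111?    11000? / 01001?    10011? / 11000?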

Assumption: No recombination
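A minimal sketch of trio phasing under the no-recombination assumption follows. It assumes the coding 0 and 1 for homozygous genotypes and 2 for heterozygous ones, and it assumes the genotypes contain no Mendelian errors; the function name is hypothetical. At each site the allele a parent transmitted is read off from whichever of the three genotypes is homozygous.

```python
def phase_trio(parent1, parent2, child):
    """Resolve transmitted and untransmitted haplotypes of both parents.

    Genotypes are strings over {0, 1, 2} (2 = heterozygous, an assumed coding).
    Returns (p1_transmitted, p1_other, p2_transmitted, p2_other);
    '?' marks alleles that cannot be resolved from the trio.
    """
    flip = {"0": "1", "1": "0"}
    t1, u1, t2, u2 = [], [], [], []
    for g1, g2, gc in zip(parent1, parent2, child):
        # A homozygous parent can only transmit one allele.
        a1 = g1 if g1 != "2" else "?"
        a2 = g2 if g2 != "2" else "?"
        if gc != "2":
            a1 = a2 = gc          # homozygous child: both parents gave this allele
        elif a1 != "?":
            a2 = flip[a1]         # heterozygous child: the parents gave different alleles
        elif a2 != "?":
            a1 = flip[a2]
        t1.append(a1)
        t2.append(a2)
        # The untransmitted allele is the other one if the parent is heterozygous.
        u1.append(g1 if g1 != "2" else flip.get(a1, "?"))
        u2.append(g2 if g2 != "2" else flip.get(a2, "?"))
    return "".join(t1), "".join(u1), "".join(t2), "".join(u2)


result = phase_trio("122112", "210022", "120222")
print(result)
```

For the trio above, the last site stays unresolved because all three individuals are heterozygous there; the child's two haplotypes are the transmitted haplotypes of the two parents.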

Population Substructure

• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
  – Many association signals are going to be found
  – The vast majority of them are false. Why?

Different evolutionary forces: drift, selection, mutation, migration, population bottleneck.

Natural Selection

• Example: being lactose tolerant is advantageous in northern Europe, hence there is positive selection in the LCT gene

different allele frequencies in LCT

Genetic Drift

• Even without selection, the allele frequencies in the population are not fixed across time.

• Consider the following case:

– We assume Hardy-Weinberg Equilibrium (HWE), that is, individuals are mating randomly in the population.

– We assume a constant population size, no mutation, no selection

Genetic Drift: The Wright-Fisher Model

Generation 1: allele frequency 1/9
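A minimal simulation sketch of drift under the Wright-Fisher model: each allele copy in the next generation picks its parent uniformly at random from the current generation, with constant population size, no mutation and no selection. The function name, the seed and the number of generations are illustrative assumptions; the 9 allele copies and the starting frequency 1/9 follow the slide.

```python
import random


def wright_fisher(num_copies, initial_count, generations, seed=0):
    """Return the allele frequency in each generation, starting with the initial one."""
    rng = random.Random(seed)
    count = initial_count
    freqs = [count / num_copies]
    for _ in range(generations):
        freq = count / num_copies
        # Each copy independently inherits the allele with probability freq.
        count = sum(rng.random() < freq for _ in range(num_copies))
        freqs.append(count / num_copies)
    return freqs


trajectory = wright_fisher(num_copies=9, initial_count=1, generations=20)
print(trajectory)
```

Running this with different seeds shows the frequency wandering until the allele is either lost (frequency 0) or fixed (frequency 1), after which it no longer changes.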









[Figure: ancestral population → migration → genetic drift → different allele frequencies]

Population Substructure

• Imagine that all the cases are collected from Africa, and all the controls are from Europe.
  – Many association signals are going to be found
  – The vast majority of them are false

What can we do about it?

Jakobsson et al., Nature 451: 998-1003 (2008)

Principal Component Analysis

• Dimensionality reduction
• Based on linear algebra
• Intuition: find the ‘most important’ features of the data

Plotting the data on a one-dimensional line for which the ‘spread’ is maximized.


• In our case, we want to look at two dimensions at a time.

• The original data has many dimensions – each SNP corresponds to one dimension.
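The idea can be sketched in pure Python on a toy genotype matrix (rows are individuals, columns are SNPs, entries are allele counts 0/1/2). The data, the function names, and the use of power iteration with deflation in place of a library eigen-solver are all assumptions made to keep the example self-contained.

```python
import math


def center_columns(rows):
    """Subtract each SNP's mean so that every column has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def project(rows, v):
    """Coordinate of every individual along direction v."""
    return [sum(x * w for x, w in zip(row, v)) for row in rows]


def top_component(rows, iterations=200):
    """Power iteration on X^T X: the direction along which the spread is maximal."""
    dim = len(rows[0])
    v = [1.0 / (j + 1) for j in range(dim)]  # arbitrary non-zero start
    for _ in range(iterations):
        scores = project(rows, v)
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(genotypes):
    """Coordinates of every individual on the first two principal components."""
    x = center_columns(genotypes)
    pc1 = top_component(x)
    # Deflation: remove the PC1 part of the data, then find the next direction.
    residual = [[a - s * w for a, w in zip(row, pc1)]
                for row, s in zip(x, project(x, pc1))]
    pc2 = top_component(residual)
    return list(zip(project(x, pc1), project(x, pc2)))


# Toy data: the first three individuals come from one population,
# the last three from another with different allele frequencies.
toy_genotypes = [
    [2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1],
    [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 1, 2],
]
points = pca_2d(toy_genotypes)
for pc1_score, pc2_score in points:
    print(round(pc1_score, 2), round(pc2_score, 2))
```

On this toy input the first coordinate separates the two groups by sign, which is the behaviour the HapMap plots rely on at a much larger scale.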

HapMap Populations

[Figure: the 11 HapMap populations: ASW, CEU, CHB, CHD, GIH, JPT, LWK, MEX, MKK, TSI, YRI]

HapMap PCA 1-2

HapMap PCA 1-3

HapMap PCA 1,2,4

Ancestry Inference:
• To what extent can population structure be detected from SNP data?
• What can we learn from these inferences?

Novembre et al., 2008

Ancestry inference in recently admixed populations

[Figure: Puerto Rican Population (GALA study, E. Burchard). Bar chart of percent racial admixture (0% to 100%) for individual subjects 1-90; components: European, Native American, African.]

Recombination Events

Copy 1

Copy 2

child chromosome

Probability r_i for recombination in position i.
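As a rough sketch of this picture, the child chromosome can be generated as a mosaic of the two parental copies: start on one copy and, before each position i, switch to the other copy with probability r_i. The function name, the uniform choice of the starting copy, and the toy sequences are assumptions made for the example.

```python
import random


def child_chromosome(copy1, copy2, recomb_probs, rng=None):
    """Build a child chromosome from two parental copies.

    recomb_probs[i] is the probability of a recombination just before
    position i (the entry for position 0 is ignored).
    """
    if not (len(copy1) == len(copy2) == len(recomb_probs)):
        raise ValueError("copies and recomb_probs must have the same length")
    rng = rng or random.Random()
    copies = (copy1, copy2)
    current = rng.randrange(2)  # which copy the child starts on
    child = []
    for i, r in enumerate(recomb_probs):
        if i > 0 and rng.random() < r:
            current = 1 - current  # recombination: switch to the other copy
        child.append(copies[current][i])
    return "".join(child)


rng = random.Random(1)
no_recombination = child_chromosome("AAAA", "BBBB", [0.0] * 4, rng)
always_recombine = child_chromosome("AAAA", "BBBB", [1.0] * 4, rng)
print(no_recombination, always_recombine)
```

With all r_i equal to zero the child inherits one copy intact, which is the no-recombination assumption used for trio phasing.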

Recently Admixed Populations

After generation 1

After generation 2

After generation 10

Chromosome

W      Recombination indicators
Z      Ancestral states
X      Alleles
p, q   Allele frequencies
g      Generations
r      Recombination rate
α      Admixture fraction

Overall Accuracy

Applications:
• Population genetics (admixture events, recombination events, selection forces, migration patterns)
• Potential applications in personalized medicine
• Finding new associations (through admixture mapping)

Admixture Mapping
