Imputation 2

1

Imputation 2

Presenter: Ka-Kit Lam

2

Outline

• Big Picture and Motivation
• IMPUTE
• IMPUTE2
• Experiments
• Conclusion and Discussion
• Supplementary:
  – GWAS
  – Estimate on mutation rate

3

Big Picture and Motivation

4

Background

• Genome-wide association study (GWAS):
  – Identify common genetic factors that influence health/disease

5

Background

• Important to know the SNPs
• However, ...
  – Not all SNPs are genotyped for all individuals in the case-control study in GWAS.
• How can we guess the missing parts?

Individual 1: ACCCAATTACCAGTATTTA…
Individual 2: CCCCATTTACCACTATTTA…
Individual 3: ACCCATTTACCACTATTTA…
Individual 4: CCCCATTTACCAGTATTTA…

[Figure: question marks flag the missing positions.]

6

Information known

• Luckily, we now have references for human DNA:

• But, how can we use the reference genomes?

7

Main Question

• Objective:
  – Design algorithms
    • to impute the missing genotypes of the individuals being studied
  – Criteria for algorithms
    • Scalable
    • Accurate

8

Big Picture on Algorithm Design

[Diagram: input (SNPs in study, reference haplotype/genotype) → Algorithms → output (imputed genotype, associated confidence)]

• In theory, it makes sense:
  1. Scalability
  2. Accuracy
• In practice, it works:
  1. Experimental validation
  2. Application

9

IMPUTE

10

Notations and Setting

Reference haplotypes H (N haplotypes × L SNPs):

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

Genotypes in the study sample G (K individuals × L SNPs):

0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

(Rmk: genotype coding 0 = 00, 1 = 01, 2 = 11)

11

Formulation

• Observed genotype and missing genotype

• Classical inference problem:
  – A reasonable estimate:
  – Confidence:

12

Modeling (HMM): Relationship between (H, G)

• Assumptions:
  – Study individuals are independent
  – The copying process, in which each study haplotype is a mosaic of the reference haplotypes, is captured by a Hidden Markov Model
  – Mutations at different sites are conditionally independent given the copied haplotype

13


0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0

0 1 1 1 0 1 0 1 1 1 0 0 0 1 0

1 1 1 1 0 0 0 0 0 0 0 1 1 1 1


L

N

Study Individual:

0 2 2 2 0 0 2 2 0 0 0 1 0 2 1

14


0 1 1 1 0 0 1 1 0 0 0 1 0 1 0

0 1 1 1 0 1 0 1 1 1 0 0 0 1 0

1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

L

N

……

0 2 2 2 0 0 2 2 0 0 0 1 0 2 1

15

Modeling (Transition Probability)

• States
• Transition

• What is the intuition?

16

Modeling: Relationship between Transition Probability and Recombination

• Recombination process:
  – The more references, the longer the copy length
  – Copy length in our model depends on the genetic distance between SNPs

[Figure: a study individual compared against Ref panel 1 and Ref panel 2, with the annotation "More likely to have longer copy length here".]

18

Modeling (Transition Probability)

• States
• Transition
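The transition formula itself did not survive on this slide. As a minimal sketch, assuming the Li and Stephens style copying model that IMPUTE builds on, the haploid transition between two adjacent SNPs can be written as "stay on the same reference haplotype with probability exp(-rho/N), otherwise jump to a uniformly chosen one". Here `rho` stands for the scaled recombination parameter between the two SNPs (4·Ne·r in that model); the function and argument names are illustrative, not taken from the IMPUTE software.

```python
import numpy as np

def transition_matrix(n_ref: int, rho: float) -> np.ndarray:
    """Haploid copying transitions between two adjacent SNPs (sketch).

    n_ref: number of reference haplotypes N (assumed name).
    rho:   scaled recombination parameter between the two SNPs (assumed name).
    """
    stay = np.exp(-rho / n_ref)        # no recombination event: keep copying
    jump = (1.0 - stay) / n_ref        # recombination: pick any haplotype uniformly
    return stay * np.eye(n_ref) + jump * np.ones((n_ref, n_ref))

# For a fixed rho, a larger panel gives a larger self-transition probability,
# which matches the intuition "more references, longer copy length".
print(transition_matrix(2, 1.0)[0, 0], transition_matrix(10, 1.0)[0, 0])
```

For a diploid study individual the hidden state is a pair of reference haplotypes, and under this sketch the pair transition is the product of two such haploid transitions.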

19

Modeling (Emission Probability)

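• Emission probability
  – Define mutation rate: λ
  – Since mutation is assumed independent across sites, the probability of each observed genotype given the copied allele pair is:

Copied pair | Genotype 0 (00) | Genotype 1 (01) | Genotype 2 (11)
00          | (1-λ)^2         | 2λ(1-λ)         | λ^2
01          | λ(1-λ)          | λ^2 + (1-λ)^2   | λ(1-λ)
11          | λ^2             | 2λ(1-λ)         | (1-λ)^2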
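The emission table can be checked numerically. The sketch below builds the 3 × 3 matrix from the table for a given λ and confirms that every row is a probability distribution; the function name and the example value λ = 0.01 are assumptions made for illustration.

```python
import numpy as np

def emission_matrix(lam: float) -> np.ndarray:
    """Rows: copied allele pair (00, 01, 11). Columns: observed genotype (0, 1, 2)."""
    keep, flip = 1.0 - lam, lam   # per-allele probability of no mutation / mutation
    return np.array([
        [keep * keep, 2 * keep * flip,           flip * flip],
        [keep * flip, keep * keep + flip * flip, keep * flip],
        [flip * flip, 2 * keep * flip,           keep * keep],
    ])

E = emission_matrix(0.01)
print(E.sum(axis=1))  # each row sums to 1 up to floating-point rounding
```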

20

Extension (completely missing)

• Problem:
  – Missing genotype across all references and study samples. How to impute?
• What can we expect?
  – Generate information from no information?
  – We cannot expect to know the genotype
  – But we can guess the relationship between them
  – Our friend, population genetics, may help!

0 0 1 0 1 1 0 ? 1 1 0 0 0 0 0
0 0 1 0 1 1 0 ? 1 1 0 0 0 0 1
0 0 1 0 1 1 0 ? 1 1 0 0 0 1 1

21

Imputation on Reference

• Illustration

H(1) 1 1 1 0 0 1 ? 0 0 0 1 0 1 0
H(2) 1 1 1 0 1 0 ? 1 1 0 0 0 1 0
H(3) 1 1 1 0 0 0 ? 0 0 0 1 1 1 1
H(4) 1 1 1 1 0 0 ? 0 0 0 0 0 0 0
H(N) 1 1 1 0 1 1 ? 0 0 1 1 1 0 0

Values filled in for the missing site, top to bottom: 0, 0, 1, 0, 1

22

Imputation on Reference

Algorithm:
1. Randomly select an ordering
2. Sample the first mutation according to
3. Treat the previous haplotypes as references and impute
4. Repeat several times to get a stable output
5. Use the imputed reference to impute the study sample

23

Computational Complexity: Imputation

O(N^2 L) for each individual

25

Computational Complexity: Forward-Backward Algorithm

• Forward equations:

• Naïve application takes O(N^4)

26


• Q: How to compute the following in O(N^2)?

• A: (suggested in fastPhase)

27


• Finally, we have

• Similarly for the backward part

Cost of the individual terms:
  – O(N^2)
  – O(N) for each j, O(N^2) in total
  – O(N) for each i, O(N^2) in total
  – O(N^2) in total
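The equations on these slides were lost, so the following is only a sketch of the kind of factorisation the cost annotations point to. It assumes that the hidden state is an ordered pair (i, j) of reference haplotypes and that each haplotype either stays (weight `stay`) or jumps uniformly (weight `jump` per target), as in a Li and Stephens style model. All names are illustrative. The naive update sums over all N^2 predecessor pairs for each of the N^2 target pairs; the fast update reuses row sums, column sums, and the grand total.

```python
import numpy as np

def forward_step_naive(alpha, stay, jump, emit):
    """One forward update over pair states, O(N^4) work."""
    n = alpha.shape[0]
    t = stay * np.eye(n) + jump * np.ones((n, n))   # haploid transition matrix
    # pred[k, l] = sum_{i, j} alpha[i, j] * t[i, k] * t[j, l]
    pred = np.einsum("ij,ik,jl->kl", alpha, t, t)
    return emit * pred

def forward_step_fast(alpha, stay, jump, emit):
    """Same update in O(N^2) using marginal sums."""
    row = alpha.sum(axis=1)     # sum over j for each i: O(N) each, O(N^2) total
    col = alpha.sum(axis=0)     # sum over i for each j: O(N) each, O(N^2) total
    total = alpha.sum()         # O(N^2) total
    pred = (stay ** 2 * alpha
            + stay * jump * (row[:, None] + col[None, :])
            + jump ** 2 * total)
    return emit * pred

rng = np.random.default_rng(0)
n = 5
alpha = rng.random((n, n))
emit = rng.random((n, n))
stay = np.exp(-0.3 / n)
jump = (1 - stay) / n
print(np.allclose(forward_step_naive(alpha, stay, jump, emit),
                  forward_step_fast(alpha, stay, jump, emit)))
```

The backward recursion can be rearranged in the same way.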

28

Demo

./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt \
  -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000

29

Demo

30

IMPUTE2

31

Motivation

• Accuracy:
  – Not all information is used during imputation (e.g. other study individuals)
• Complexity:
  – Need to scale well if we incorporate all information (e.g. previously it is O(L N^2))
• New data type:
  – Diploid reference (1000 Genomes Project)
• Q: How to design algorithms to handle this?

32

Description of Setting(Scenario A)

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0

0 1 1 1 0 1 0 1 1 1 0 0 0 1 0

1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?

1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?

2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?


Top rows: reference haplotypes (Nhap haplotypes × L SNPs)
Bottom rows: genotypes in the inference panel (Ninf individuals × L SNPs)

(Rmk: genotype coding 0 = 00, 1 = 01, 2 = 11)
T, U: sets of SNP indices

33

2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

Description of Setting(Scenario B)

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0

0 1 1 1 0 1 0 1 1 1 0 0 0 1 0

1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

? ? ? ? ? 2 ? ? ? ? ? 0 ? ?

1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?


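Dimensions: every panel has L SNPs; Nhap haplotypes in the haploid reference panel, Ndip individuals in the diploid reference panel, Ninf individuals in the inference panel

(Rmk: genotype coding 0 = 00, 1 = 01, 2 = 11)
T, U1, U2: sets of SNP indices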

1

1

1

2

2

1

1

0

2

1

34

Algorithm for Scenario A

• Illustration:

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0

0 1 1 1 0 1 0 1 1 1 0 0 0 1 0

1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?

1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?

2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

35


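• Illustration (Burn in)

Reference haplotypes:
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

Study individuals, each shown as a pair of haplotypes at the typed SNPs:
0 ? ? ? ? ? 1 ? ? ? ? ? 0 ? ?
0 ? ? ? ? ? 1 ? ? ? ? ? 0 ? ?

1 ? ? ? ? ? 1 ? ? ? ? ? 0 ? ?
0 ? ? ? ? ? 0 ? ? ? ? ? 0 ? ?

1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 0 ? ?

1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
0 ? ? ? ? ? 0 ? ? ? ? ? 0 ? ?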

36


• Illustration (Phasing)

[Figure: the reference haplotypes and the current haplotype pairs of the other study individuals are kept; the haplotype pair of individual i is erased ("??") and re-sampled at the typed SNPs.]

Update i (genotype (1) (0) (1) at the typed SNPs):
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
0 ? ? ? ? ? 0 ? ? ? ? ? 0 ? ?

37


• Illustration (Imputing)

[Figure: with the phase of individual i fixed at the typed SNPs, its untyped SNPs are filled in from the reference haplotypes.]

Update i (genotype (1) (0) (1) at the typed SNPs):
1 1 0 1 0 0 0 0 0 0 0 1 1 1 1
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0

38

Phasing Step: Path Sampling

• How to sample a path?
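The answer shown on this slide is not recoverable from the text. One standard way to draw a hidden path from an HMM posterior is forward filtering followed by backward sampling; the sketch below shows that technique for a generic HMM and is not claimed to be the exact IMPUTE2 procedure. The array layout (`init`, `trans`, `emit`) and the function name are assumptions.

```python
import numpy as np

def sample_path(init, trans, emit, rng):
    """Draw one hidden path from the posterior of an HMM.

    init:  (S,) initial state probabilities
    trans: (L-1, S, S) transition matrices between adjacent sites
    emit:  (L, S) likelihood of the observed data at each site for each state
    """
    n_sites, n_states = emit.shape
    fwd = np.empty((n_sites, n_states))
    f = init * emit[0]
    fwd[0] = f / f.sum()                       # normalise to avoid underflow
    for l in range(1, n_sites):
        f = (fwd[l - 1] @ trans[l - 1]) * emit[l]
        fwd[l] = f / f.sum()

    path = np.empty(n_sites, dtype=int)
    path[-1] = rng.choice(n_states, p=fwd[-1])
    for l in range(n_sites - 2, -1, -1):
        # weight each state by its filtered probability and the chance of
        # moving from it to the state already sampled at site l + 1
        w = fwd[l] * trans[l][:, path[l + 1]]
        path[l] = rng.choice(n_states, p=w / w.sum())
    return path

rng = np.random.default_rng(0)
init = np.array([0.5, 0.5])
trans = np.full((3, 2, 2), 0.5)
emit = np.array([[0.9, 0.1], [0.2, 0.8], [0.2, 0.8], [0.9, 0.1]])
print(sample_path(init, trans, emit, rng))
```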

39

Imputation Step: Extract Posterior Probability

• After many rounds, we can get:
  – For each individual and for each missing site
  – Assuming independence in sampling the haplotype pair

Hap 1        | Hap 2        | Genotype
0     1      | 0     1      | 0      1      2
0.3   0.7    | 0.1   0.9    | 0.03   0.34   0.63
0.2   0.8    | 0.4   0.6    | 0.08   0.44   0.48
…     …      | …     …      | …      …      …

Then take the average.
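The numbers in the table follow from multiplying the per-haplotype allele probabilities under the independence assumption. A small sketch that reproduces both rows (the function name is an assumption):

```python
import numpy as np

def genotype_probs(p1_alt, p2_alt):
    """Combine per-haplotype probabilities of allele 1 into genotype probabilities.

    Returns an array whose last axis holds P(genotype = 0), P(1), P(2),
    assuming the two haplotypes are sampled independently.
    """
    p1 = np.asarray(p1_alt, dtype=float)
    p2 = np.asarray(p2_alt, dtype=float)
    g0 = (1 - p1) * (1 - p2)
    g1 = p1 * (1 - p2) + (1 - p1) * p2
    g2 = p1 * p2
    return np.stack([g0, g1, g2], axis=-1)

# Rows of the table: P(allele 1) is 0.7 and 0.8 on Hap 1, 0.9 and 0.6 on Hap 2.
print(genotype_probs([0.7, 0.8], [0.9, 0.6]))
```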

40

Algorithm for Scenario A: Complexity Analysis

• A) Burn-in phase
• B) MCMC iterations, m times:
  – For each individual i:
    • i) phase(i, T, hap+inf): O((Nhap + Ninf)^2 L_T)
    • ii) impute(i, T+U, hap): O(Nhap L_{T+U})
    • iii) record(posterior probability): O(L_{T+U})
• C) Average over the different MCMC runs to get the genotype and confidence
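The control flow of steps B and C can be sketched as a plain loop. This is only a skeleton: `phase` and `impute` are hypothetical callables supplied by the caller, standing in for the HMM path sampling and the posterior calculation. Burn-in is left out, on the assumption that the starting haplotypes have already been initialised.

```python
import numpy as np

def run_mcmc(individuals, n_iter, phase, impute):
    """Average the per-iteration posteriors of every individual.

    phase(i)        -> sampled haplotype pair of individual i at typed SNPs
    impute(i, haps) -> array of posterior probabilities at the SNPs to impute
    Both callables are placeholders for this sketch.
    """
    sums = {}
    for _ in range(n_iter):                      # B) m MCMC iterations
        for i in individuals:
            haps = phase(i)                      # i) phase at typed SNPs
            post = np.asarray(impute(i, haps), dtype=float)  # ii) impute
            sums[i] = sums.get(i, 0.0) + post    # iii) record
    return {i: total / n_iter for i, total in sums.items()}  # C) average

# Toy usage with stub callables that ignore the data.
result = run_mcmc(
    individuals=["ind1", "ind2"],
    n_iter=4,
    phase=lambda i: None,
    impute=lambda i, haps: [0.25, 0.75],
)
print(result["ind1"])
```

Phasing conditions on both panels and costs a quadratic term, while imputation conditions on the reference haplotypes only, which is why the two steps are kept as separate calls.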

41

Benefits of the Algorithm

• Faster:
  – Reducing the load in the imputation step
• More accurate:
  – Utilize the information available to guess

42

Algorithm for Scenario B

• Illustration:

Haploid reference panel (Nhap):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

Diploid reference panel (Ndip):
0 ? ? 2 ? ? 2 2 ? ? ? ? 0 ? ?
1 ? ? 2 ? ? 1 1 ? ? ? ? 0 ? ?

Inference panel (Ninf):
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

T, U1, U2: sets of SNP indices

43


• Illustration: (Burn in)

[Figure: the haploid reference panel (Nhap), the diploid reference panel (Ndip) and the inference panel (Ninf); every individual in the diploid reference and inference panels is shown with an initial haplotype pair at its typed SNPs. T, U1, U2: sets of SNP indices.]

44


• Illustration: (Phase T and U2 in diploid ref)

[Figure: individual i in the diploid reference panel is updated ("Update i"): its haplotype pair at the SNPs in T and U2 is erased ("??") and re-sampled. T, U1, U2: sets of SNP indices.]

45


• Illustration: (Impute U1 in diploid ref)

[Figure: for the updated individual i in the diploid reference panel, the SNPs in U1 are imputed, giving a complete haplotype pair. T, U1, U2: sets of SNP indices.]

46

• Illustration: (Phase T in inference panel)

[Figure: individual i in the inference panel is updated ("Update i"): its haplotype pair at the SNPs in T is erased ("??") and re-sampled. T, U1, U2: sets of SNP indices.]

47

• Illustration: (Impute U2 in inference panel)

[Figure: for the updated individual i in the inference panel, the SNPs in U2 are imputed; the SNPs in U1 are still missing. T, U1, U2: sets of SNP indices.]

48

• Illustration: (Impute U1 in inference panel)

[Figure: for the updated individual i in the inference panel, the SNPs in U1 are imputed, giving a complete haplotype pair. T, U1, U2: sets of SNP indices.]

49

Algorithm for Scenario B: Complexity Analysis

• A) Burn-in phase
• B) MCMC iterations, m times:
  – For each individual i in dip:
    • i) phase(i, T+U2, hap+dip): O((Nhap + Ndip)^2 L_{T+U2})
    • ii) impute(i, T+U1, hap): O(Nhap L_{T+U1})
    • iii) record(posterior probability): O(L_{T+U1})
  – For each individual i in inference:
    • i) phase(i, T, hap+dip+inf): O((Nhap + Ndip + Ninf)^2 L_T)
    • ii) impute(i, T+U2, hap+dip): O((Nhap + Ndip) L_{T+U2})
    • iii) impute(i, U1, hap): O(Nhap L_{U1})
    • iv) record(posterior probability): O(L_{T+U1+U2})
• C) Average over the different MCMC runs to get the genotype and confidence

50

Benefits of the Algorithm

• Able to handle new data type

• Faster and more accurate

51

Further Speeding Up

• Choose the k closest neighbours in phasing
• Need to compute Hamming distance
• O(k^2 L) for the HMM but O(NL) for the Hamming distance computation (better than O(N^2 L) in the previous HMM calculation)
• Choose the khap closest neighbours in imputation
• khap ≫ k is also fine (because the cost is O(k^2) in phasing but only O(khap) in imputation)
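The neighbour selection can be sketched with a vectorised Hamming distance. The panel below reuses the three reference haplotypes from the earlier illustrations; the target haplotype and the function name are made up for the example.

```python
import numpy as np

def k_nearest_haplotypes(target, panel, k):
    """Return indices and Hamming distances of the k panel rows closest to target.

    target: (L,) array of 0/1 alleles
    panel:  (N, L) array of 0/1 alleles
    Cost is O(N L) for the distances plus a sort over N values.
    """
    dist = (panel != target).sum(axis=1)        # Hamming distance to each row
    order = np.argsort(dist, kind="stable")     # ties keep panel order
    nearest = order[:k]
    return nearest, dist[nearest]

panel = np.array([
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
])
# Hypothetical current haplotype: the second reference with its first allele flipped.
target = np.array([1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0])

idx, dist = k_nearest_haplotypes(target, panel, k=2)
print(idx, dist)  # indices [1 0] with distances [1 6]
```

Only the selected rows would then enter the HMM, so the quadratic term is paid on k rather than on N.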

52

Comparison with Beagle

• Weakness of BEAGLE:
  – Full joint modeling of all individuals
  – Accuracy decreases when the population size or the number of SNPs increases in the experiments
  – Less accurate in rare SNPs than IMPUTE2
  – More memory efficient
• Strength of BEAGLE:
  – Faster
  – Better accommodates trios and duos

53

Demo

./impute2 \
  -m ./Example/example.chr22.map \
  -h ./Example/example.chr22.1kG.haps \
  -l ./Example/example.chr22.1kG.legend \
  -g ./Example/example.chr22.study.gens \
  -strand_g ./Example/example.chr22.study.strand \
  -int 20.4e6 20.5e6 \
  -Ne 20000 \
  -o ./Example/example.chr22.one.phased.impute2

54

Experiments

55

Experiment plans

• Evaluation of the performance of imputation:
  – Accuracy
  – Time and space complexity
  – Comparison with other methods
• Application of imputation:
  – Identification of associated SNPs in GWAS
• Optimizing performance:
  – Effect of multiple reference panels

56

Accuracy and Calibration

• Setting:
  – Mask the known genotypes
  – Impute using IMPUTE
  – Compare the called genotype with the ground truth
  – Calling threshold:
    • by genotype
    • by SNP
  – Measure % missing and % mismatch for different thresholds
  – Compare the estimated confidence with the experimental confidence
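The two metrics can be sketched as follows, assuming that a genotype is called only when its largest posterior probability reaches the threshold (the "by genotype" rule). The function name and the toy numbers are illustrative, not experimental results.

```python
import numpy as np

def missing_and_mismatch(probs, truth, threshold):
    """Return (% missing, % mismatch) for a calling threshold.

    probs: (n, 3) posterior probabilities of genotypes 0, 1, 2 at masked sites
    truth: (n,) masked true genotypes
    """
    probs = np.asarray(probs, dtype=float)
    truth = np.asarray(truth)
    best = probs.argmax(axis=1)                   # most probable genotype
    called = probs.max(axis=1) >= threshold       # confident enough to call
    pct_missing = 100.0 * (1.0 - called.mean())
    if called.any():
        pct_mismatch = 100.0 * (best[called] != truth[called]).mean()
    else:
        pct_mismatch = 0.0
    return pct_missing, pct_mismatch

probs = [[0.03, 0.34, 0.63],
         [0.08, 0.44, 0.48],
         [0.90, 0.10, 0.00],
         [0.20, 0.70, 0.10]]
truth = [2, 1, 0, 1]
for thr in (0.4, 0.6):
    print(thr, missing_and_mismatch(probs, truth, thr))
```

Raising the threshold trades more missing calls for fewer mismatches, which is the trade-off the accuracy plots trace out.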

57

Accuracy and Calibration

Message: IMPUTE is reasonably accurate and well calibrated

[Figure: plots with axes % missing and % mismatch.]

58

Comparison: Accuracy (in general and for rare alleles)

Message: IMPUTE2 is accurate, especially for rare alleles

The closer to the lower left, the better

59

Comparison: Algorithm Complexity (Time and Space Complexity)

Message: IMPUTE2 is not too bad in terms of time and space complexity

• Phasing step: shorter L
• Imputation step: linear in N
• Multiple MCMC iterations increase the running time

60

Application 1: Identification of associated SNPs

• Setting:
  – Use case and control sets to identify the gene associated with Type II Diabetes
  – Use filtered genotypes with MAF > 1%
  – Evaluate the P-value and plot it against the chromosome position to identify the causal gene
• Useful in
  1. Identifying SNPs to follow up
  2. Assessing strength of signal

61

Application 1: Identification of associated SNPs

Message: IMPUTE helps identify SNPs associated with the phenotype

Red: imputed SNPs
Black: typed SNPs

62

Application 2: Validation of missing data

• Setting:
  – Some of the genotypes collected are not very reliable
  – Use imputation to impute such a genotype by assuming it is missing
  – Call and compare to the original genotype

63

Application 2: Validation of missing data

Message: IMPUTE helps confirm confidence in the data

[Figure labels: AA, BB, AB?]

64

Effect of Reference Set

65

Effect of reference set

• Motivation:
  – Capture low-frequency variants by incorporating data among populations
  – Remain computationally efficient
• Setting:
  – Pearson correlation for accuracy
  – Varying khap
  – Adding more references

66

Effect of Reference Set

Message: More reference sets improve accuracy, and IMPUTE2 facilitates this

The improvement saturates once khap reaches a certain threshold

The improvement saturates once we have enough references

67

Summary

• IMPUTE, IMPUTE2 and their extensions
• They attempt to design algorithms for imputation based on
  – Population genetics model
  – HMM computation
• Extensive experiments suggest that IMPUTE2 is reasonably accurate and can make good use of the reference data sets available for GWAS.

68

Discussion

• Parameters in HMM:
  – Can they learn the parameters of the copying process from the study data through the EM algorithm?
• Completely missing SNPs:
  – Can they use a clustering algorithm in imputing completely missing data?
• Trios:
  – Can they use different panels to do the imputation?
• Speed:
  – Can they preprocess the reference to speed up the computation?
  – Can BEAGLE's idea of merging come into play in some part of the pre-HMM computation?

69

Supplementary : GWAS

70

Genetic Architecture

• Why are we interested in imputation?
  – For GWAS.
• Domain of interest:

71

Case-Control Study and Bayes Factor

Genotype | 0  | 1  | 2
Cases    | s0 | s1 | s2
Controls | r0 | r1 | r2

The prior distribution of theta is known

72

Supplementary: Reverse Engineering the Per-Site Mutation Probability

73

Review of Population Genetics

• Wright-Fisher model for coalescence:
  – 2M individuals
  – Generate the next generation by randomly choosing with replacement from the last generation and copying
• Infinite-sites model for mutation:
  – At every inheritance, there is a probability u of mutation, and a mutation occurs only at a site that has never mutated before.
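The resampling rule can be simulated directly. The sketch below records, for each generation going back in time, which parent every individual copied, and then follows two lineages until they share a parent. Function names and the population size in the usage lines are arbitrary choices made for illustration.

```python
import numpy as np

def wright_fisher_parents(n_haploid, n_generations, rng):
    """parents[g, i] is the parent (one generation further back) of individual i.

    Each individual picks its parent uniformly at random with replacement.
    """
    return rng.integers(0, n_haploid, size=(n_generations, n_haploid))

def generations_to_common_ancestor(parents, a, b):
    """Follow lineages a and b backwards; return generations until they coalesce."""
    for g in range(parents.shape[0]):
        a, b = parents[g, a], parents[g, b]
        if a == b:
            return g + 1
    return None   # no coalescence within the simulated generations

rng = np.random.default_rng(0)
parents = wright_fisher_parents(n_haploid=20, n_generations=500, rng=rng)
print(generations_to_common_ancestor(parents, 0, 1))
```

Averaged over many replicates, the coalescence time of two lineages in this model is about the population size in generations, which is the quantity the later slides relate to the mutation rate.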

74

Relationship btw Coalescent Theory and Imputation

• Our question:
  – Having a sample of N individuals as references
  – What is the mutation rate (per site) λ between the study sample and its nearest neighbor in the N references?

[Figure: the whole population (2M), containing the N references, the study sample, and its nearest neighbor in the references.]

75

Estimation of Mutation Rate λ

• Pr(no coalescence between the study and all references in last t generations)
• Average time to coalescence
• Thus, mutation rate is λ

[Figure: genealogy over time t linking the study sample to the N references, with points labeled A and B.]
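The formulas for these three bullets were lost. One way to fill in the reasoning is sketched below, under simplifying assumptions that are not stated on the slide: a Wright-Fisher population of 2M haplotypes, N ≪ 2M reference lineages that remain distinct over the time scale considered, and a per-generation, per-site mutation probability u.

```latex
\begin{align*}
\Pr(\text{no coalescence with any reference in the last } t \text{ generations})
  &\approx \left(1 - \frac{N}{2M}\right)^{t} \approx e^{-Nt/(2M)}, \\
\mathbb{E}[\text{time to coalescence}] &\approx \frac{2M}{N}, \\
\lambda &\propto u \cdot \frac{2M}{N}.
\end{align*}
```

The constant hidden in the last line depends on which branches of the genealogy are counted, so it is left unspecified here. The qualitative conclusion is that λ shrinks as the number of references N grows.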

76

Estimation of Mutation Rate λ

• Estimate u

• Estimate λ

[Figure: genealogy of the N references over time t, with the intervals t2, t3, t4 and λ marked.]

77

References

• Marchini et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet (2007) vol. 39 (7) pp. 906-13
• Howie et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet (2009) vol. 5 (6) pp. e1000529
• Howie et al. Genotype imputation with thousands of genomes. G3 (Bethesda) (2011) vol. 1 (6) pp. 457-70
• Marchini and Howie. Genotype imputation for genome-wide association studies. Nat Rev Genet (2010) vol. 11 (7) pp. 499-511
• R. Durrett. Probability Models for DNA Sequence Evolution. Springer, 2nd ed., 2008
• N. Li and M. Stephens. Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics, 165:2213–2233, 2003.

78

Thank you
