Imputation 2 Presenter: Ka-Kit Lam 1


Page 1: Imputation 2

1

Imputation 2

Presenter: Ka-Kit Lam

Page 2: Imputation 2

2

Outline

• Big Picture and Motivation
• IMPUTE
• IMPUTE2
• Experiments
• Conclusion and Discussion
• Supplementary:
  – GWAS
  – Estimation of mutation rate

Page 3: Imputation 2

3

Big Picture and Motivation

Page 4: Imputation 2

4

Background

• Genome-wide association study (GWAS):
  – Identify common genetic factors that influence health/disease

Page 5: Imputation 2

5

Background

• Important to know the SNPs
• However, ...
  – Not all SNPs are genotyped for all individuals in the case-control study in GWAS.
• How can we guess the missing parts?

Individual 1: ACCCAATTACCAGTATTTA…
Individual 2: CCCCATTTACCACTATTTA…
Individual 3: ACCCATTTACCACTATTTA…
Individual 4: CCCCATTTACCAGTATTTA…

[Figure: "?" marks on the original slide flag positions that were not genotyped.]

Page 6: Imputation 2

6

Information known

• Luckily, we now have references for human DNA:

• But, how can we use the reference genomes?

Page 7: Imputation 2

7

Main Question

• Objective:
  – Design algorithms
    • to impute the missing genotypes of the individuals being studied
  – Criteria for algorithms
    • Scalable
    • Accurate

Page 8: Imputation 2

8

Big Picture on Algorithm Design

[Diagram: Input (SNPs in study, reference haplotypes/genotypes) → Algorithms → Output (imputed genotypes, associated confidence)]

In theory, it makes sense: 1. Scalability 2. Accuracy

In practice, it works: 1. Experimental validation 2. Application

Page 9: Imputation 2

9

IMPUTE

Page 10: Imputation 2

10

Notations and Setting

Reference haplotypes (N haplotypes × L SNPs):

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

Genotypes in the study sample (K individuals × L SNPs; "?" = not typed):

0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

(Remark: genotype coding 0 = 00, 1 = 01, 2 = 11)

Page 11: Imputation 2

11

Formulation

• Observed genotype and missing genotype

• Classical inference problem:
  – A reasonable estimate:

– Confidence:

Page 12: Imputation 2

12

Modeling (HMM): Relationship between (H, G)

• Assumptions:
  – Study individuals are independent
  – The copying process, in which each haplotype is a mosaic of reference haplotypes, is captured by a Hidden Markov Model
  – Mutations at different sites are conditionally independent given the copied haplotype
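
The copying assumption can be illustrated with a small simulation. This is only a sketch: the function name sample_mosaic and the parameters switch_prob and mutation_prob are hypothetical, and a constant per-site switch probability stands in for the recombination-dependent transition used by the actual model.

```python
import numpy as np

def sample_mosaic(reference, switch_prob, mutation_prob, rng):
    """Sample one haplotype as a mosaic of the rows of `reference`.

    reference: (n_hap, n_sites) array of 0/1 alleles.
    switch_prob: assumed constant chance of jumping to a new reference row per site.
    mutation_prob: assumed per-site chance that the copied allele is flipped.
    """
    n_hap, n_sites = reference.shape
    state = rng.integers(n_hap)          # hidden state: which reference row is copied
    path = np.empty(n_sites, dtype=int)
    hap = np.empty(n_sites, dtype=int)
    for site in range(n_sites):
        if site > 0 and rng.random() < switch_prob:
            state = rng.integers(n_hap)  # "recombination": pick a new row uniformly
        path[site] = state
        allele = reference[state, site]
        if rng.random() < mutation_prob:
            allele = 1 - allele          # mutation, independent across sites
        hap[site] = allele
    return hap, path

# The three reference haplotypes shown on the slides.
reference = np.array([
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
])
hap, path = sample_mosaic(reference, 0.2, 0.01, np.random.default_rng(0))
```

With switch_prob and mutation_prob both set to zero, the sampled haplotype is an exact copy of a single reference row.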

Page 13: Imputation 2

13

Modeling (HMM): Relationship between (H, G)

Reference haplotypes (N × L):

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

Study individual (observed genotypes):

0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?

Study individual (full genotype row shown on the slide):

0 2 2 2 0 0 2 2 0 0 0 1 0 2 1

Page 14: Imputation 2

14

Modeling (HMM): Relationship between (H, G)

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0

0 1 1 1 0 1 0 1 1 1 0 0 0 1 0

1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

L

N

……

0 2 2 2 0 0 2 2 0 0 0 1 0 2 1

Page 15: Imputation 2

15

Modeling (Transition Probability)

• States
• Transition

• What is the intuition?

Page 16: Imputation 2

16

Modeling: Relationship between Transition Probability and Recombination

• Recombination Process:

Page 17: Imputation 2

17

Modeling: Relationship between Transition Probability and Recombination

• Recombination process:
  – The more reference haplotypes, the longer the copy length
  – Copy length in our model depends on the genetic distance between SNPs

[Figure: a study individual compared against reference panel 1 and reference panel 2; one panel is annotated "More likely to have longer copy length here".]

Page 18: Imputation 2

18

Modeling (Transition Probability)

• States
• Transition

Page 19: Imputation 2

19

Modeling (Emission Probability)

• Emission probability
  – Define mutation rate: λ
  – Since mutation is assumed independent across sites:

Copied pair | Genotype 0 (00) | Genotype 1 (01) | Genotype 2 (11)
00          | (1-λ)²          | 2λ(1-λ)         | λ²
01          | λ(1-λ)          | λ² + (1-λ)²     | λ(1-λ)
11          | λ²              | 2λ(1-λ)         | (1-λ)²
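
The emission table can be reproduced directly from the independence assumption. The sketch below is illustrative; the function name emission_row and its argument names are hypothetical, and lam plays the role of λ.

```python
def emission_row(n_alt_copied, lam):
    """Return [P(G=0), P(G=1), P(G=2)] given the copied haplotype pair.

    n_alt_copied: number of 1-alleles in the copied pair (0 for 00, 1 for 01, 2 for 11).
    lam: per-allele mutation probability; each allele flips independently.
    """
    stay, flip = 1.0 - lam, lam
    if n_alt_copied == 0:
        return [stay ** 2, 2 * stay * flip, flip ** 2]
    if n_alt_copied == 1:
        return [stay * flip, stay ** 2 + flip ** 2, stay * flip]
    if n_alt_copied == 2:
        return [flip ** 2, 2 * stay * flip, stay ** 2]
    raise ValueError("n_alt_copied must be 0, 1 or 2")

# Each row is a probability distribution over the observed genotype.
table = [emission_row(k, 0.1) for k in (0, 1, 2)]
```

Each row sums to one, which is a quick check that the table entries are consistent.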

Page 20: Imputation 2

20

Extension (completely missing)

• Problem:
  – A genotype is missing across all reference and study samples. How to impute?
• What can we expect?
  – Generate information from no information?
  – We cannot expect to know the genotype
  – But we can guess the relationship between them
  – Our friend, population genetics, may help!

0 0 1 0 1 1 0 ? 1 1 0 0 0 0 0
0 0 1 0 1 1 0 ? 1 1 0 0 0 0 1
0 0 1 0 1 1 0 ? 1 1 0 0 0 1 1

Page 21: Imputation 2

21

Imputation on Reference

• Illustration:

H(1) 1 1 1 0 0 1 ? 0 0 0 1 0 1 0
H(2) 1 1 1 0 1 0 ? 1 1 0 0 0 1 0
H(3) 1 1 1 0 0 0 ? 0 0 0 1 1 1 1
H(4) 1 1 1 1 0 0 ? 0 0 0 0 0 0 0
H(N) 1 1 1 0 1 1 ? 0 0 1 1 1 0 0

[Figure: the "?" column is filled in with the values 0, 0, 1, 0, 1.]

Page 22: Imputation 2

22

Imputation on Reference

Algorithm:
1. Randomly select an ordering
2. Sample the first mutation according to [distribution shown on the original slide]
3. Treat the previous haplotypes as references and impute
4. Repeat several times to get a stable output
5. Use the imputed reference to impute the study sample

Page 23: Imputation 2

23

Computational Complexity: Imputation

O(N²L) for each individual

Page 24: Imputation 2

24

Computational Complexity: Imputation

O(N²L) for each individual

Page 25: Imputation 2

25

Computational Complexity: Forward-Backward Algorithm

• Forward Equations:

• Naïve application takes O(N⁴)

Page 26: Imputation 2

26

Computational Complexity: Forward-Backward Algorithm

• Q: How to compute the following in O(N²)?

• A: (suggested in fastPHASE)

Page 27: Imputation 2

27

Computational Complexity: Forward-Backward Algorithm

• Finally, we have

[Equation annotations: O(N) for each j, O(N²) in total; O(N) for each i, O(N²) in total; O(N²) overall.]

• Similarly for the backward part
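
The reduction from O(N⁴) to O(N²) per site can be sketched as follows. The equations on the original slides are not recoverable, so this example assumes a Li and Stephens-style transition in which each of the two copied haplotypes independently stays put with probability 1 - rho or jumps to a uniformly chosen reference with probability rho. The names forward_step_naive, forward_step_fast, alpha, emit and rho are hypothetical.

```python
import numpy as np

def forward_step_naive(alpha, emit, rho):
    """One forward update over all N*N state pairs by explicit summation: O(N^4)."""
    n = alpha.shape[0]
    # Single-haplotype transition: stay with prob (1 - rho), else jump uniformly.
    trans = (1.0 - rho) * np.eye(n) + rho / n
    # new[k, l] = emit[k, l] * sum_{i, j} alpha[i, j] * trans[i, k] * trans[j, l]
    return emit * np.einsum("ij,ik,jl->kl", alpha, trans, trans)

def forward_step_fast(alpha, emit, rho):
    """Same update using row sums, column sums and the grand total: O(N^2)."""
    n = alpha.shape[0]
    stay, jump = 1.0 - rho, rho / n
    row = alpha.sum(axis=1)   # row[k] = sum_j alpha[k, j], O(N) for each k
    col = alpha.sum(axis=0)   # col[l] = sum_i alpha[i, l], O(N) for each l
    total = alpha.sum()       # O(N^2) once
    return emit * (stay ** 2 * alpha
                   + stay * jump * (row[:, None] + col[None, :])
                   + jump ** 2 * total)

rng = np.random.default_rng(0)
alpha = rng.random((4, 4))
emit = rng.random((4, 4))
fast = forward_step_fast(alpha, emit, 0.3)
naive = forward_step_naive(alpha, emit, 0.3)
```

The two functions return the same matrix; the fast version only precomputes the marginal sums once per site, which is where the O(N) and O(N²) annotations come from.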

Page 28: Imputation 2

28

Demo

./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000

Page 29: Imputation 2

29

Demo

Page 30: Imputation 2

30

IMPUTE2

Page 31: Imputation 2

31

Motivation

• Accuracy:
  – Not all information is used during imputation (e.g. other study individuals)
• Complexity:
  – Need to scale well if we incorporate all information (e.g. previously it was O(LN²))
• New data type:
  – Diploid reference (1000 Genomes Project)
• Q: How to design algorithms to handle this?

Page 32: Imputation 2

32

Description of Setting (Scenario A)

Reference haplotypes (Nhap × L):

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

Genotypes in the inference panel (Ninf × L):

0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

(Remark: genotype coding 0 = 00, 1 = 01, 2 = 11. T and U are sets of SNP indices, marked in the original figure.)

Page 33: Imputation 2

33

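Description of Setting (Scenario B)

Reference haplotypes (Nhap × L):

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

Diploid reference panel (Ndip × L):

? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?

[Figure: the diploid reference rows carry additional typed genotypes overlaid on the slide (1 1 1 2 2 1 1 0 2 1); their positions are not recoverable from the extracted text.]

Inference panel (Ninf × L):

2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

(Remark: genotype coding 0 = 00, 1 = 01, 2 = 11. T, U1 and U2 are sets of SNP indices, marked in the original figure.)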

Page 34: Imputation 2

34

Algorithm for Scenario A

• Illustration:

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0

0 1 1 1 0 1 0 1 1 1 0 0 0 1 0

1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?

1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?

2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

Page 35: Imputation 2

35

Algorithm for Scenario A

• Illustration (Burn in)

[Figure: each of the four study individuals is given an initial haplotype pair at its typed SNPs (00 11 00; 10 10 00; 11 00 10; 10 00 10), with untyped sites left as "?". The three reference haplotypes are shown alongside:]

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

Page 36: Imputation 2

36

Algorithm for Scenario A

• Illustration (Phasing)

[Figure: step "Update i". The haplotype pair of individual i is erased ("??") at its typed SNPs and resampled, given its genotype (1) (0) (1), from the reference haplotypes and the other individuals' current haplotype pairs. Sampled pair:]

1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
0 ? ? ? ? ? 0 ? ? ? ? ? 0 ? ?

Page 37: Imputation 2

37

Algorithm for Scenario A

• Illustration (Imputing)

[Figure: step "Update i". Given the phased pair of individual i (genotype (1) (0) (1) at the typed SNPs), the untyped sites are imputed from the reference haplotypes. Imputed pair:]

1 1 0 1 0 0 0 0 0 0 0 1 1 1 1
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0

Page 38: Imputation 2

38

Phasing Step: Path Sampling

• How to sample a path?

Page 39: Imputation 2

39

Imputation Step: Extract Posterior Probability

• After many rounds, we can get:
  – For each individual and for each missing site
  – Assuming independence in sampling the haplotype pair

Hap 1 (allele probabilities)
  0      1
  0.3    0.7
  0.2    0.8
  …      …

Hap 2 (allele probabilities)
  0      1
  0.1    0.9
  0.4    0.6
  …      …

Genotype (probabilities)
  0      1      2
  0.03   0.34   0.63
  0.08   0.44   0.48
  …      …      …

Then take the average
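
The genotype rows in the table follow from the two haplotype rows under the independence assumption. A minimal sketch, with hypothetical function names genotype_probs and average_probs:

```python
def genotype_probs(p1_alt, p2_alt):
    """Combine P(allele = 1) of the two haplotypes into P(genotype = 0, 1, 2),
    assuming the two haplotypes are sampled independently."""
    p0 = (1 - p1_alt) * (1 - p2_alt)
    p1 = p1_alt * (1 - p2_alt) + (1 - p1_alt) * p2_alt
    p2 = p1_alt * p2_alt
    return (p0, p1, p2)

def average_probs(samples):
    """Average a list of (p0, p1, p2) tuples, e.g. one tuple per MCMC round."""
    n = len(samples)
    return tuple(sum(s[g] for s in samples) / n for g in range(3))

# The two sites from the slide: Hap 1 has P(1) = 0.7 and 0.8, Hap 2 has 0.9 and 0.6.
site_a = genotype_probs(0.7, 0.9)   # approximately (0.03, 0.34, 0.63)
site_b = genotype_probs(0.8, 0.6)   # approximately (0.08, 0.44, 0.48)
```

Averaging such tuples over rounds, as in average_probs, gives the final posterior used for calling and for the confidence.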

Page 40: Imputation 2

40

Algorithm for Scenario A: Complexity Analysis

• A) Burn-in phase
• B) MCMC iterations, repeated m times:
  – For each individual i:
    • i) phase(i, T, hap+inf): O((Nhap + Ninf)² · L_T)
    • ii) impute(i, T+U, hap): O(Nhap · L_{T+U})
    • iii) record(posterior probability): O(L_{T+U})
• C) Average over different runs of MCMC to get the genotype and confidence
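
The control flow of steps A to C can be sketched as a loop. This is only a skeleton under stated assumptions: run_mcmc, phase and impute are hypothetical names, the phase and impute callables are supplied by the caller and are not IMPUTE2 functions, and burn-in is treated here as ordinary iterations whose output is discarded.

```python
def run_mcmc(individuals, n_burn_in, n_iter, phase, impute):
    """Alternate phasing and imputation, then average the recorded posteriors.

    phase(i, haplotypes) -> new haplotype pair for individual i, given everyone's
        current pairs (the quadratic-cost step).
    impute(i, pair) -> list of posterior probabilities for individual i
        (the linear-cost step).
    """
    haplotypes = {i: None for i in individuals}
    totals = {}
    for iteration in range(n_burn_in + n_iter):
        for i in individuals:
            haplotypes[i] = phase(i, haplotypes)     # step i) phase
            if iteration < n_burn_in:
                continue                             # burn-in: nothing is recorded
            probs = impute(i, haplotypes[i])         # step ii) impute
            acc = totals.setdefault(i, [0.0] * len(probs))
            for k, p in enumerate(probs):            # step iii) record
                acc[k] += p
    # step C) average over the recorded iterations
    return {i: [t / n_iter for t in acc] for i, acc in totals.items()}

# Toy stand-ins, only to show the calling convention.
result = run_mcmc(["a", "b"], 2, 4,
                  phase=lambda i, haps: (i, i),
                  impute=lambda i, pair: [0.25, 0.75])
```

Separating the two callables mirrors the complexity split on the slide: the expensive phasing uses only typed SNPs, while imputation over all SNPs stays linear in the number of reference haplotypes.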

Page 41: Imputation 2

41

Benefits of the Algorithm

• Faster:
  – Reducing the load in the imputation step
• More accurate:
  – Utilize the available information to guess

Page 42: Imputation 2

42

Algorithm for Scenario B

• Illustration:

Reference haplotypes (Nhap):

0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1

Diploid reference panel (Ndip):

0 ? ? 2 ? ? 2 2 ? ? ? ? 0 ? ?
1 ? ? 2 ? ? 1 1 ? ? ? ? 0 ? ?

Inference panel (Ninf):

2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?

(Legend in the original figure: T, U1, U2)

Page 43: Imputation 2

43

Algorithm for Scenario B

• Illustration (Burn in):

[Figure: below the three reference haplotypes (Nhap), every individual in the diploid reference panel (Ndip) and in the inference panel (Ninf) is given an initial haplotype pair at its typed SNPs (e.g. 00, 11, 10); untyped sites stay "?". Legend: T, U1, U2.]

Page 44: Imputation 2

44

Algorithm for Scenario B

• Illustration (Phase T and U2 in diploid ref):

[Figure: step "Update i". The haplotype pair of individual i in the diploid reference panel (Ndip) is erased ("??") and resampled at the SNPs in T and U2; the reference haplotypes (Nhap) and the inference panel (Ninf) are shown unchanged. Legend: T, U1, U2.]

Page 45: Imputation 2

45

Algorithm for Scenario B

• Illustration (Impute U1 in diploid ref):

[Figure: step "Update i". Individual i in the diploid reference panel (Ndip) now has both haplotypes filled in at every SNP after the SNPs in U1 are imputed from the reference haplotypes (Nhap); the inference panel (Ninf) is still typed only at T. Legend: T, U1, U2.]

Page 46: Imputation 2

46

Algorithm for Scenario B

• Illustration (Phase T in inference panel):

[Figure: step "Update i". The haplotype pair of individual i in the inference panel (Ninf) is erased ("??") and resampled at the SNPs in T; the reference haplotypes (Nhap) and the diploid reference panel (Ndip) are shown alongside. Legend: T, U1, U2.]

Page 47: Imputation 2

47

Algorithm for Scenario B

• Illustration (Impute U2 in inference panel):

[Figure: step "Update i". For individual i in the inference panel (Ninf), the alleles at the SNPs in U2 are filled in; the SNPs in U1 remain "??". The reference haplotypes (Nhap) and the diploid reference panel (Ndip) are shown alongside. Legend: T, U1, U2.]

Page 48: Imputation 2

48

Algorithm for Scenario B

• Illustration (Impute U1 in inference panel):

[Figure: step "Update i". For individual i in the inference panel (Ninf), the remaining SNPs in U1 are filled in, so both haplotypes are complete at every SNP. The reference haplotypes (Nhap) and the diploid reference panel (Ndip) are shown alongside. Legend: T, U1, U2.]

Page 49: Imputation 2

49

Algorithm for Scenario B: Complexity Analysis

• A) Burn-in phase
• B) MCMC iterations, repeated m times:
  – For each individual i in dip:
    • i) phase(i, T+U2, hap+dip): O((Nhap + Ninf)² · L_{T+U2})
    • ii) impute(i, T+U1, hap): O(Nhap · L_{T+U1})
    • iii) record(posterior probability): O(L_{T+U1})
  – For each individual i in inference:
    • i) phase(i, T, hap+dip+inf): O((Nhap + Ndip + Ninf)² · L_T)
    • ii) impute(i, T+U2, hap+dip): O(N_{hap+dip} · L_{T+U2})
    • iii) impute(i, U1, hap): O(Nhap · L_{U1})
    • iv) record(posterior probability): O(L_{T+U1+U2})
• C) Average over different runs of MCMC to get the genotype and confidence

Page 50: Imputation 2

50

Benefits of the Algorithm

• Able to handle new data type

• Faster and more accurate

Page 51: Imputation 2

51

Further Speeding Up

• Choose the k closest neighbours in phasing
  – Need to compute Hamming distances
  – O(k²L) for the HMM but O(NL) for the Hamming distance computation (better than O(N²L) in the previous HMM calculation)
• Choose the khap closest neighbours in imputation
• khap >> k is also fine (because the cost is quadratic, O(k²), in phasing but linear, O(khap), in imputation)
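
Selecting the closest neighbours by Hamming distance can be sketched in a few lines. The function name k_nearest_haplotypes is hypothetical, and the example uses fully observed 0/1 haplotypes for simplicity.

```python
import numpy as np

def k_nearest_haplotypes(target, candidates, k):
    """Return the indices of the k rows of `candidates` closest to `target`.

    Distance is the Hamming distance (number of differing sites).
    Cost is O(N * L) for N candidates of length L, plus a sort.
    """
    distances = (candidates != target).sum(axis=1)
    order = np.argsort(distances, kind="stable")   # ties keep the original order
    return order[:k]

candidates = np.array([
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
])
target = np.array([0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0])
nearest = k_nearest_haplotypes(target, candidates, 2)
```

Only the selected rows would then be used as HMM states, which is what turns the O(N²L) term into O(k²L).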

Page 52: Imputation 2

52

Comparison with Beagle

• Weakness of BEAGLE:
  – Full joint modeling of all individuals
  – Accuracy decreases when the population or the number of SNPs increases in the experiments
  – Less accurate for rare SNPs than IMPUTE2
  – More memory efficient
• Strength of BEAGLE:
  – Faster
  – Better accommodates trios and duos

Page 53: Imputation 2

53

Demo

./impute2 \
  -m ./Example/example.chr22.map \
  -h ./Example/example.chr22.1kG.haps \
  -l ./Example/example.chr22.1kG.legend \
  -g ./Example/example.chr22.study.gens \
  -strand_g ./Example/example.chr22.study.strand \
  -int 20.4e6 20.5e6 \
  -Ne 20000 \
  -o ./Example/example.chr22.one.phased.impute2

Page 54: Imputation 2

54

Experiments

Page 55: Imputation 2

55

Experiment plans

• Evaluation of the performance of imputation:
  – Accuracy
  – Time and space complexity
  – Comparison with other methods
• Application of imputation:
  – Identification of associated SNPs in GWAS
• Optimizing performance:
  – Effect of multiple reference panels

Page 56: Imputation 2

56

Accuracy and Calibration

• Setting:
  – Mask the known genotypes
  – Impute using IMPUTE
  – Compare the called genotype with the ground truth
  – Calling threshold:
    • by genotype
    • by SNP
  – Measure % missing and % mismatch for different thresholds
  – Compare the estimated confidence with the experimental confidence
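
The masking experiment can be scored with a short function. This is a sketch of calling by genotype only; the name call_and_score is hypothetical, and it assumes % mismatch is measured among the genotypes that were actually called.

```python
def call_and_score(posteriors, truth, threshold):
    """Call each masked genotype if its best posterior reaches `threshold`.

    posteriors: list of (p0, p1, p2) tuples, one per masked genotype.
    truth: list of true genotypes (0, 1 or 2), same order.
    Returns (% missing, % mismatch among called genotypes).
    """
    n_missing = n_called = n_mismatch = 0
    for probs, true_genotype in zip(posteriors, truth):
        best = max(range(3), key=lambda g: probs[g])
        if probs[best] < threshold:
            n_missing += 1            # not confident enough: left uncalled
            continue
        n_called += 1
        if best != true_genotype:
            n_mismatch += 1
    pct_missing = 100.0 * n_missing / len(truth)
    pct_mismatch = 100.0 * n_mismatch / n_called if n_called else 0.0
    return pct_missing, pct_mismatch

posteriors = [(0.03, 0.34, 0.63), (0.08, 0.44, 0.48), (0.9, 0.1, 0.0), (0.05, 0.95, 0.0)]
truth = [2, 1, 0, 0]
pct_missing, pct_mismatch = call_and_score(posteriors, truth, 0.6)
```

Sweeping the threshold and plotting % mismatch against % missing gives the kind of trade-off curve described for the accuracy comparison.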

Page 57: Imputation 2

57

Accuracy and Calibration

Message: IMPUTE is reasonably accurate and is well calibrated

%missing

%mismatch

Page 58: Imputation 2

58

Comparison: Accuracy (in general and rare allele)

Message: IMPUTE2 is accurate, especially for rare alleles

The closer to the lower left, the better

Page 59: Imputation 2

59

Comparison: Algorithm Complexity (Time and Space Complexity)

Message: IMPUTE2 is not too bad in terms of time and space complexity

• Phasing step: shorter L
• Imputation step: linear in N
• Multiple MCMC iterations increase time

Page 60: Imputation 2

60

Application 1: Identification of associated SNPs

• Setting:
  – Use case and control sets to identify the gene associated with Type II Diabetes
  – Use filtered genotypes with MAF > 1%
  – Evaluate the P-value and plot it against chromosome position to identify the causal gene
• Useful in:
  1. Identifying SNPs to follow up
  2. Assessing strength of signal

Page 61: Imputation 2

61

Application 1: Identification of associated SNPs

Message: IMPUTE helps identify SNPs associated with the phenotype

Red: imputed SNPs
Black: typed SNPs

Page 62: Imputation 2

62

Application 2: Validation of missing data

• Setting:
  – Some collected genotypes are not very reliable
  – Use imputation to impute such a genotype by assuming it is missing
  – Call and compare to the original genotype

Page 63: Imputation 2

63

Application 2: Validation of missing data

Message: IMPUTE helps confirm confidence in the data

AA

BB

AB?

Page 64: Imputation 2

64

Effect of Reference Set

Page 65: Imputation 2

65

Effect of reference set

• Motivation:
  – Capture low-frequency variants by incorporating data across populations
  – Remain computationally efficient
• Setting:
  – Pearson correlation for accuracy
  – Varying khap
  – Adding more references

Page 66: Imputation 2

66

Effect of Reference Set

Message: More reference sets improve accuracy, and IMPUTE2 facilitates this

Improvement saturates when khap reaches a certain threshold

Improvement saturates when we have enough references

Page 67: Imputation 2

67

Summary

• IMPUTE, IMPUTE2 and their extensions
• They attempt to design algorithms for imputation based on:
  – Population genetics model
  – HMM computation
• Extensive experiments suggest that IMPUTE2 is reasonably accurate and can make good use of the reference data sets available for GWAS.

Page 68: Imputation 2

68

Discussion

• Parameters in HMM:
  – Can they learn the parameters of the copying process from the study data through the EM algorithm?
• Completely missing SNPs:
  – Can they use a clustering algorithm in imputing completely missing data?
• Trios:
  – Can they use different panels to do the imputation?
• Speed:
  – Can they preprocess the reference to speed up the computation?
  – Can BEAGLE's idea of merging come into play in some part of the pre-HMM computation?

Page 69: Imputation 2

69

Supplementary : GWAS

Page 70: Imputation 2

70

Genetic Architecture

• Why are we interested in imputation?
  – For GWAS.

• Domain of interest:

Page 71: Imputation 2

71

Case-Control Study and Bayes Factor

Genotype | 0  | 1  | 2
Cases    | s0 | s1 | s2
Controls | r0 | r1 | r2

The prior distribution of theta is known

Page 72: Imputation 2

72

Supplementary: Reverse Engineering the Per-Site Mutation Probability

Page 73: Imputation 2

73

Review of Population Genetics

• Wright-Fisher model for coalescence:
  – 2M individuals
  – Generate the next generation by randomly choosing, with replacement, from the last generation and copying
• Infinite-sites model for mutation:
  – At every inheritance, there is a probability u of mutation, and a mutation occurs only at a distinct site that has never mutated before in history.

Page 74: Imputation 2

74

Relationship between Coalescent Theory and Imputation

• Our question:
  – Given a sample of N individuals as references
  – What is the mutation rate (per site) λ between the study sample and its nearest neighbor in the N references?

[Figure: the whole population (2M), containing the N references, the study sample, and its nearest neighbor among the references.]

Page 75: Imputation 2

75

Estimation of Mutation Rate λ

• Pr(no coalescence between the study and all references in the last t generations)
• Average time to coalescence
• Thus, the mutation rate is λ

[Figure: genealogy over time t linking the study sample and the N references; labels A and B.]

Page 76: Imputation 2

76

Estimation of Mutation Rate λ

• Estimate u
• Estimate λ

[Figure: genealogy of the N references over time t, with times t2, t3, t4 and the rate λ marked.]

Page 77: Imputation 2

77

References

• Marchini et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet (2007) vol. 39 (7) pp. 906-13
• Howie et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet (2009) vol. 5 (6) pp. e1000529
• Howie et al. Genotype imputation with thousands of genomes. G3 (Bethesda) (2011) vol. 1 (6) pp. 457-70
• Marchini and Howie. Genotype imputation for genome-wide association studies. Nat Rev Genet (2010) vol. 11 (7) pp. 499-511
• R. Durrett. Probability Models for DNA Sequence Evolution. Springer, 2nd ed., 2008
• N. Li and M. Stephens. Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics, 165:2213–2233, 2003.

Page 78: Imputation 2

78

Thank you