50
Scalable Algorithms for Analysis of Genomic Diversity Data Bogdan Paşaniuc Department of Computer Science & Engineering University of Connecticut

Scalable Algorithms for Analysis of Genomic Diversity Data

  • Upload
    eris

  • View
    35

  • Download
    0

Embed Size (px)

DESCRIPTION

Scalable Algorithms for Analysis of Genomic Diversity Data. Bogdan Paşaniuc. Department of Computer Science & Engineering University of Connecticut. Single Nucleotide Polymorphisms. Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs) - PowerPoint PPT Presentation

Citation preview

Page 1: Scalable Algorithms for Analysis of Genomic Diversity Data

Scalable Algorithms for Analysis of Genomic Diversity Data

Bogdan Paşaniuc

Department of Computer Science & EngineeringUniversity of Connecticut

Page 2: Scalable Algorithms for Analysis of Genomic Diversity Data

Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)

High density in the human genome: 1x107 out of 3109 base pairs

Vast majority bi-allelic 0/1 encoding

Single Nucleotide Polymorphisms

… ataggtccCtatttcgcgcCgtatacacgggActata …… ataggtccGtatttcgcgcCgtatacacgggTctata …… ataggtccCtatttcgcgcCgtatacacgggTctata …

Page 3: Scalable Algorithms for Analysis of Genomic Diversity Data

Haplotypes and Genotypes

Haplotype: description of SNP alleles on a chromosome 0/1 vector: 0 for major allele, 1 for minor

Diploids: two homologous copies of each autosomal chromosome One inherited from mother and one from father

Genotype: description of alleles on both chromosomes 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor)

allele; 2 - the chromosomes contain different alleles

021200210011000110001100010

+ two haplotypes per individual

genotype

Page 4: Scalable Algorithms for Analysis of Genomic Diversity Data

Introduction

Haplotype data exact DNA sequence function

Haplotypes increased power of association

Directly determining haplotype data is expensive and time consuming

Cost effective high-throughput technologies to determine genotype data

Need for computational methods for inferring haplotypes from genotype data: genotype phasing problem

Page 5: Scalable Algorithms for Analysis of Genomic Diversity Data

Outline

Background on genomic diversity The genotype phasing problem Hidden Markov Model of Haplotype

Diversity Genotype Imputation DNA Barcoding Conclusions

Page 6: Scalable Algorithms for Analysis of Genomic Diversity Data

Genotype Phasing

For a genotype with k 2’s there are 2k-1 possible pairs of haplotypes explaining it

g: 0010212 ?

h1:0010111

h2:0010010

h3:0010011

h4:0010110

Computational approaches to genotype phasing Statistical methods: PHASE, Phamily, PL, GERBIL … Combinatorial methods: Parsimony, HAP, 2SNP, ENT …

Page 7: Scalable Algorithms for Analysis of Genomic Diversity Data

Minimum Entropy Genotype Phasing

Phasing – function f that assigns to each genotype g a pair of haplotypes (h,h’) that explains g

Coverage of h in f – number of times h appears in the image of f Entropy of a phasing:

)||2

),cov(log(

||2

),cov()(

0),cov(: G

fh

G

fhfEntropy

fhh

Minimum Entropy Genotype Phasing [Halperin&Karp 04]: Given a set of genotypes, find a phasing with minimum entropy

Page 8: Scalable Algorithms for Analysis of Genomic Diversity Data

ENT Algorithm

InitializationStart with random phasing

Iterative improvement stepWhile there exists a genotype whose re-phasing decreases the entropy, find the genotype that yields the highest decrease in entropy and re-phase it

Min Entropy Objective is uninformative for long genotypes each haplotype compatible with 1 genotype all haplotypes have

coverage of 1 entropy of all phasings = -log(1/2G)

Page 9: Scalable Algorithms for Analysis of Genomic Diversity Data

Overlapping Window approach

Entropy is computed over short windows of size l+f l “locked” SNPs previously phased f “free” SNPs are currently phased

locked free

…4321

g1

gn

Only phasings consistent with the l locked SNPs are considered

Page 10: Scalable Algorithms for Analysis of Genomic Diversity Data

Effect of Window Size

Page 11: Scalable Algorithms for Analysis of Genomic Diversity Data

Experimental setup (1)

International HapMap Project, Phase II datasets 3.7 million SNP loci 3 populations:

CEU, YRI: 30 trios JPT+CHB: 90 unrelated individuals

Reference haplotypes obtained using PHASE Accuracy

Relative Genotype Error (RGE): percentage of missing genotypes inferred differently as reference method

Relative Switching Error (RSE): number of switches needed to convert inferred haplotype pairs into the reference haplotype pairs

Page 12: Scalable Algorithms for Analysis of Genomic Diversity Data

Experimental setup (2)

Compared algorithms ENT 2SNP [Brinza&Zelikovsky 05] Pure Parsimony Trio Phasing (ILP)

[Brinza et al. 05] PHASE [Stephens et al 01] HAP [Halperin&Eskin 04] FastPhase [Scheet&Stephens 06]

Page 13: Scalable Algorithms for Analysis of Genomic Diversity Data

Results on HapMap Phase II Panels

Averages over the 22 chromosomes Runtime:

ENT few hours PHASE months of CPU time on cluster of 238

nodes

Page 14: Scalable Algorithms for Analysis of Genomic Diversity Data

Results on [Orzack et al 03] dataset

[Orzack et al. 03] 80 unrelated genotypes over 9 SNPs Haplotypes determined experimentally

Ranking of algorithms remains the same Slight underestimation of true error rate

Page 15: Scalable Algorithms for Analysis of Genomic Diversity Data

Effect of pedigree information

Page 16: Scalable Algorithms for Analysis of Genomic Diversity Data

Outline

Background on genomic diversity The genotype phasing problem Hidden Markov Model of Haplotype

Diversity Genotype Imputation DNA Barcoding Conclusions

Page 17: Scalable Algorithms for Analysis of Genomic Diversity Data

Founder Haplotypes

Haplotypes in the current population arose from small number of founder haplotypes by mutation and recombination events

Obtained using HaploVisual www.cs.helsinki.fi/u/prastas/haplovisual/

Page 18: Scalable Algorithms for Analysis of Genomic Diversity Data

HMM Model

Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]

Models the ancestral haplotype population Paths with high transition probability “founder”

haplotypes Transitions from one founder to other founder

recombination events Emissions mutation events

Page 19: Scalable Algorithms for Analysis of Genomic Diversity Data

HMM Training

Previous works use EM training of HMM based on unrelated genotype data

2-step procedure: 1. Infer haplotypes using ENT

Uses all available pedigree information

2. Baum-Welch training on inferred haplotypes Maximizes the likelihood of the haplotypes

Page 20: Scalable Algorithms for Analysis of Genomic Diversity Data

Maximum Probability Genotype Phasing

Phase G as pair (h1,h2) = argmax P(h1)P(h2) Maximum phasing probability:

How hard is to compute maximum phasing probability in the HMM? Conjectured to be NP-hard [Rastas et al 07]

Theorem Cannot approximate P(G) within O(n1/2 -), unless

ZPP=NP, where n is the number of SNP loci

)()( MAX)( 21 hPhPGP

Page 21: Scalable Algorithms for Analysis of Genomic Diversity Data

Complexity of Computing Maximum Phasing Probability

Reduction from Max Clique

Transitions1,1/2 Initial transition 2deg(v)+1(2)/α All haps prob 1/α

Page 22: Scalable Algorithms for Analysis of Genomic Diversity Data

Complexity of Computing Maximum Phasing Probability

H representing clique of size k will be emitted along k paths P(H) = k/α

By construction H’ (complement of H) can be emitted along second block

G = 22…22 P(G)=max(P(H))2

G has a clique of size k or more iff P(G) ≥ (k/α)2

Maximum probability genotype phasing is NP-hard

Page 23: Scalable Algorithms for Analysis of Genomic Diversity Data

Heuristic Decoding Algorithms

Viterbi Decoding Maximum probability of emitting a haplotype pair that

explain G along two HMM paths Efficiently computed using Viterbi’s algorithm

Posterior Decoding For each SNP choose the states that are most likely at

that locus given the genotype G Find most likely emissions at each SNP to explain G Efficiently computed using forward and backward

algorithm

Sampling from the HMM posterior distribution generate pairs of haplotypes that explain G conditional

on the haplotype distribution represented by the HMM Combine the sample into a single phasing

Page 24: Scalable Algorithms for Analysis of Genomic Diversity Data

Greedy Likelihood Decoding

Uses forward values computed by forward algorithm fh(i,q) = the total probability of emitting the first i alleles

of the haplotype h and ending up at state q at level i. P(H|M)= ∑fh(n,q)

Constructs (h, h’) with (x,y) at SNP i, s.t. the probability of the phasing up to locus i, given the already determined phasing for the first i, is maximized

2 variants: left-to-right or right-to-left

),(),()'|'P( )|P( ']1...1[]1...1[]1...1[]1...1[ qifqifhyhhxhii Qqh

Qqhiiii

Page 25: Scalable Algorithms for Analysis of Genomic Diversity Data

Combined Greedy Likelihood decoding

Left to right phasing Right to left phasing

Combined phasing

SNP i

P(Comb. phasing at SNP i) = ∑ fh(i,q)bh(i,q) x ∑ fh’(i,q)bh’(i,q)

SNP i that gives best improvement found in O(Kn) time given forward and backward values for the 4 haplotypes

hh’

hh’

Page 26: Scalable Algorithms for Analysis of Genomic Diversity Data

Tweaking a Phasing by Local Switching

New phasing obtained by switching at SNP i

P (new phasing) = ∑ fh(i,q) bh(i,q) x ∑ fh’(i,q) bh’(i,q)

SNP i that gives best improvement found in O(Kn) time given forward and backward values for the 2 haplotypes

Iterative 1-OPT procedure

While there exists a SNP that improves the likelihood of the phasing obtained by switching at that SNP, find the SNP that yields the highest increase and perform switching

SNP i

Page 27: Scalable Algorithms for Analysis of Genomic Diversity Data

Experimental Setup

ADHD dataset Chromosome X genotype data from the Genetic Association Information

Network (GAIN) study on Attention Deficit Hyperactivity Disorder (ADHD) 958 parent-child trios from the International Multi-site ADHD Genetics

(IMAGE) project Phased the children as unrelated on a 50 SNP window

Decoding alg.   TweakingViterbi 11.814 11.571

Posterior 26.736 11.940HMM Sampling 15.323 11.826Greedy left to

right 12.154 11.693Greedy right to

left 13.283 12.057Greedy

combined 11.838 11.510Random phasing 50.559 14.764

Method   TweakingENT 13.513 11.705

fastPHASE 12.035 11.231PHASE v

2.1 10.393 11.2192SNP 14.497 11.729

BEAGLE r=1 11.862 11.705

BEAGLE r=4 10.442 11.304

Page 28: Scalable Algorithms for Analysis of Genomic Diversity Data

Outline

Background on genomic diversity The genotype phasing problem Hidden Markov Model of Haplotype

Diversity Genotype Imputation DNA Barcoding Conclusions

Page 29: Scalable Algorithms for Analysis of Genomic Diversity Data

Genome-wide case-control association studies

Preferred method for finding the genetic basis of human diseases

1. Large number of markers (SNPs) typed in cases and controls

2. Statistical test of association disease-correlated locus

Disease causal SNPs unlikely to be typed directly

Limited coverage of current genotyping platforms

Vast number of SNPs present across the human genome

Page 30: Scalable Algorithms for Analysis of Genomic Diversity Data

Genotype Imputation

Imputation of genotypes at un-typed SNP loci Powerful technique for increasing the power of

association studies Typed markers in conjunction with catalogs of

SNP variation (e.g. HapMap) predictors for SNP not present on the array

Challenge: Optimally combining the multi-locus information from current + multi-locus variation from HapMap

Page 31: Scalable Algorithms for Analysis of Genomic Diversity Data

HMM Based Genotype Imputation

1. Integrate the HapMap variation information into the HMM• Train HMM using the haplotypes from the panel

related to the studied population (e.g. CEU panel: Utah residents with ancestry from northern and western Europe)

2. Compute probabilities of missing genotypes given the typed genotype data

• gi is imputed as x, where )|,(argmax }2,1,0{ MxgGPx ix

)|(

)|,(),|(

MGP

MxgGPMGxgP i

i

Page 32: Scalable Algorithms for Analysis of Genomic Diversity Data

Related Problems

Missing Data Recovery Fill in the genotypes uncalled by the

genotyping algorithm

Genotype Error Detection and Correction If gi is present, then the increase in

likelihood obtained by replacing gi with x is:

)|(

)|(

MGP

MGP xgi

Page 33: Scalable Algorithms for Analysis of Genomic Diversity Data

Likelihood Computation

P(G|M) = probability with which M emits any two haplotypes that explain G along any pair of paths.

Computed in O(nK3) by a 2-path extension of the forward algorithm followed by a factor K speed-up [Rastas07]

),(),(),;1(),;(),;( 2'21

'1

),(

'2

'12121

21

'2

'1

qqqqqqjfqqjEqqjfjQqq

Page 34: Scalable Algorithms for Analysis of Genomic Diversity Data

Experimental Setup

WTCCC Dataset Genotype data of the 1958 birth cohort from the

The Welcome Trust Consortium genome-association study

1,444 individuals from this cohort were typed using both the Affymetrix 500k platform and a custom Illumina 15k platform

Affymetrix data + CEU HapMap haplotypes used to impute genotypes at the SNP loci present of the Illumina chip and not on the Affymetrix chip

The actual Illumina genotypes were then used to estimate the imputation accuracy

Page 35: Scalable Algorithms for Analysis of Genomic Diversity Data

Results

Estimates of the allele 0 frequencies based on Imputation vs. Illumina 15k

Page 36: Scalable Algorithms for Analysis of Genomic Diversity Data

Results

Accuracy and missing data rate for imputed genotypes for different thresholds.

Dashed line = missing data rateSolid line = discordance rate

Page 37: Scalable Algorithms for Analysis of Genomic Diversity Data

Effect of Errors and Missing Data

Added additional 1% genotyping errors and 1% missing genotypes

TP Rate = correctly flagged errors out of total errors insertedFP Rate = incorrectly flagged genotype out of total correct genotypes Error Correction Accuracy = correctly recovered out of flagged ones

  Error Detection Error CorrectionMissing Data

Recovery Imputatio

n

 TP

Rate(%)FP

Rate(%) Accuracy(%) Error Rate(%)Error

Rate(%)EDC+MDR+IM

P 69.46 0.20 97.16 7.62 6.49MDR+IMP - - - 7.72 6.63

IMP - - - - 6.64

Page 38: Scalable Algorithms for Analysis of Genomic Diversity Data

Outline

Background on genomic diversity The genotype phasing problem Hidden Markov Model of Haplotype

Diversity Genotype Imputation DNA Barcoding Conclusions

Page 39: Scalable Algorithms for Analysis of Genomic Diversity Data

DNA barcoding

Recently(2003) proposed by taxonomists as a tool for rapid species identification

Use short DNA region as “fingerprint” for species

Region of choice: cytochrome c oxidase subunit 1

mitochondrial gene ("COI", 648 base pairs long).

Key assumption: Existence of “barcoding gap”

Inter-species variability >> than intra-species

variability

Page 40: Scalable Algorithms for Analysis of Genomic Diversity Data

BOLD: The Barcode of Life Data Systems [Ratnasingham&Hebert07]

http://www.barcodinglife.org Currently: 38,539 species, 388,582

barcodes

Page 41: Scalable Algorithms for Analysis of Genomic Diversity Data

DNA barcoding challenges

Efficient algorithms for species identification Millions of species

Meaningful confidence measures BOLD identification system showed to have

unclear confidence measures [Ekrem et al.07]:

New species discovery Sample size optimization

#barcodes per species required Barcode length Barcode quality

Number of regions required

Page 42: Scalable Algorithms for Analysis of Genomic Diversity Data

Species identification problem

Several methods proposed for assigning specimens to species

TaxI (Steinke et al.05), Likelihood ratio test (Matz&Nielsen06), BOLD-IDS(Ratnasingham&Hebert 07)…

No direct comparisons on standardized benchmarks This work:

Direct comparison of methods from three main classes

Distance-based, tree-based, and statistical model-based

Explore the effect of repository size #barcodes/species, #species

Given repository containing barcodes from known species and a new barcode find its species

Page 43: Scalable Algorithms for Analysis of Genomic Diversity Data

Methods

Distance-based Hamming distance, Aminoacid Similarity,

Convex Score similarity, Tri-nucleotide frequency distance, Combined method

Tree-based Exemplar NJ [Meyer&Paulay05] Profile NJ [Muller et al 04] Phylogenetic transversal

Statistical model-based Likelihood ratio test [Matz&Nielsen06] PWMs Inhomogeneous Markov Chains

Page 44: Scalable Algorithms for Analysis of Genomic Diversity Data

Inhomogeneous Markov Chain (IMC)

Takes into account dependencies between consecutive loci

start

A

C

T

G

A

C

T

G

A

C

T

G

A

C

T

G

locus 1 locus 2 locus 3 locus 4

1t 2t 3t

),()()|( 1

1

11

ii

n

i

i xxtxMxP

Page 45: Scalable Algorithms for Analysis of Genomic Diversity Data

Comparison of representative methods

  ACG Birds Bats Guyana Fish Australia CowriesMIN-HD 98.81% 97.59% 100.00% 99.30% 88.80%

IMC 95.27% 97.23% 100.00% 99.58% 89.83%Phylo 93.29% 92.33% 98.55% 99.30% 81.00%

Leave one out experiment

Hesperidia of the ACG 1 [Hajibabaei M. et al, 05]: 4267 barcodes, 561 species Birds of North America [Kerr K.C.R. et al, 07]: 2589 barcodes, 656 species Bats of Guyana [Clare E.L. et al, 06]: 840 barcodes, 96 species Fishes of Australia Container Part [Ward et. al, 05]: 754 barcodes, 211 species Cowries [Meyer and Paulay, 05]: 2036 barcodes, 263 species

Page 46: Scalable Algorithms for Analysis of Genomic Diversity Data

Accuracy vs Species size

00.10.20.30.40.50.60.70.80.9

1

0 50 100 1500

0.10.20.30.40.5

0.60.70.80.9

1

0 50 100 150

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 50 100 150

MIN-HD IMC

Phylo

Page 47: Scalable Algorithms for Analysis of Genomic Diversity Data

Accuracy vs. #Species

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

300 600 900 1200 1500

MIN-HD

IMC

Phylo*

Page 48: Scalable Algorithms for Analysis of Genomic Diversity Data

Conclusions

Highly scalable method for genotype phasing Several orders of magnitude faster than current

methods Phasing accuracy close to the best methods Exploits all pedigree information available

HMM model of haplotype diversity Hardness result for genotype phasing Improved decoding algorithms for phasing Imputation of genotypes at un-typed SNPs

DNA-barcoding Introduced new methods for species identification Comprehensive comparison to existing methods

Page 49: Scalable Algorithms for Analysis of Genomic Diversity Data

Acknowledgments

Prof. Ion Mandoiu

Profs. Sanguthevar Rajasekaran, Alex Russell

Sasha Gusev (Entropy phasing, DNA barcoding)

Justin Kennedy (HMM Imputation and Error detection)

James Lindsay, Sotiris Kentros (DNA barcoding)

Page 50: Scalable Algorithms for Analysis of Genomic Diversity Data

References

Genotype phasing: B. Pasaniuc and I.I. Mandoiu. Highly scalable genotype phasing by entropy minimization. In

Proc. 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 3482-3486, 2006.

A. Gusev, I.I. Mandoiu, and B. Pasaniuc. Highly scalable genotype phasing by entropy minimization. IEEE/ACM Trans. on Computational Biology and Bioinformatics 5, pp. 252-261, 2008.

HMM model, genotype imputation and error detection: J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype error detection using hidden Markov

models of haplotype diversity. In Proc. 7th Workshop on Algorithms in Bioinformatics(WABI07) LNBI, pp 73-84, 2007.

J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. GEDI: Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity. (in preparation).

DNA-barcoding: B. Pasaniuc, S. Kentros and I.I. Mandoiu. DNA Barcode Data Analysis: Boosting

Assignment Accuracy by Combining Distance- and Character-Based Classifiers, The DNA Barcode Data Analysis Initiative (DBDAI): Developing Tools for a New Generation of Biodiversity Data Workshop, 2006.

B. Pasaniuc, S. Kentros and I.I. Mandoiu. Model-based species identification using DNA barcodes, 39th Symposium on the Interface: Computing Science and Statistics, 2007.

B. Pasaniuc, A. Gusev, S. Kentros, J. Lindsay and I.I. Mandoiu. A Comparison of Algorithms for Species Identification Based on DNA Barcodes. 2nd International Barcode of Life Conference, Academia Sinica, Taipei, Taiwan, Sept. 17-21, 2007