UNIVERSITY OF MILANO – BICOCCA School of Science PhD in Computer Science Haplotype Assembly: Algorithms, Models and Complexity Supervisor: Prof. Paola Bonizzoni Tutor: Prof. Alberto Leporati Student: Simone Zaccaria Cycle XXIX PhD Presentation June 2015

Haplotype Assembly · 2015. 7. 3. · PRIN10-11 meeting, November 28–29, 2015, Politecnico di Milano, Milan, Italy •HAPCOL: Faster and memory-efficient Haplotype Assembly for



UNIVERSITY OF MILANO – BICOCCA

School of Science

PhD in Computer Science

Haplotype Assembly: Algorithms, Models and Complexity

Supervisor: Prof. Paola Bonizzoni

Tutor: Prof. Alberto Leporati

Student: Simone Zaccaria

Cycle XXIX

PhD Presentation, June 2015


Diploid Genomes

• Diploid genomes: two sets of chromosomes, one from the father and one from the mother


Haplotypes

• Haplotype = a copy of one chromosome

• Represented as a binary vector (0/1 for major/minor allele)

1001...101

0011...110


SNPs

• Haplotype = a copy of one chromosome

• Represented as a binary vector (0/1 for major/minor allele)

• Haplotype differences: Single Nucleotide Polymorphisms (SNPs)

SNPs

1001...101

0011...110


Haplotype Assembly

• Haplotype Assembly (reconstruction of the two copies of the chromosomes) is crucial for many applications (Browning and Browning, Nature Rev. Genet., 2011):

Analyzing the relationship between genetic variations and disease susceptibility (Tewhey et al., Nature Rev. Genet., 2011)

Detecting novel and recurrent mutations or inferring points of recombination (Kong, A. et al., Nature Genet., 2008)

Genomic medicine (Glusman et al., Genome Medicine, 2014)

AAC..TGC

TAG..TAG


Sequencing data

True complete haplotypes:

1001...101

0011...110

=> sequencing => Collection of fragments => mapping => Aligned fragments


Reconstruct haplotypes

????...???

????...???

• The fragments are bipartitioned in order to reconstruct the two haplotypes

• Sequencing and mapping errors turn the reconstruction into an optimization problem: Minimum Error Correction (MEC) is one of the most prominent formulations


Fragment Matrix

• Fragment matrix, on n fragments and m SNP positions:

Each row corresponds to a fragment

Each column corresponds to a genome position

1 0 - 1 1 1 - 0 1 -

- - 0 0 1 - 1 0 1 -

- 1 0 0 0 - - 1 - 1

- - 0 1 0 - 1 1 - 0

- - - 0 0 0 1 1 1 1

Rows: fragments; columns: genome positions
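As an illustration of the fragment matrix above, the following is a minimal Python sketch that parses the rows shown on the slide and counts, for each genome position, how many fragments cover it. The names `parse_fragment_matrix` and `column_coverage` are hypothetical helpers introduced here, and counting only non-hole entries as coverage is an assumption of this sketch.

```python
from typing import List, Optional

Row = List[Optional[int]]


def parse_fragment_matrix(text: str) -> List[Row]:
    """Parse whitespace-separated rows of '0', '1' and '-' (hole)."""
    matrix = []
    for line in text.strip().splitlines():
        # A hole ('-') is stored as None, alleles as the integers 0 and 1.
        matrix.append([None if tok == "-" else int(tok) for tok in line.split()])
    return matrix


def column_coverage(matrix: List[Row]) -> List[int]:
    """Number of fragments with a non-hole entry at each position (assumed definition)."""
    return [sum(entry is not None for entry in column) for column in zip(*matrix)]


SLIDE_MATRIX = """
1 0 - 1 1 1 - 0 1 -
- - 0 0 1 - 1 0 1 -
- 1 0 0 0 - - 1 - 1
- - 0 1 0 - 1 1 - 0
- - - 0 0 0 1 1 1 1
"""

fragments = parse_fragment_matrix(SLIDE_MATRIX)
coverage = column_coverage(fragments)
print(len(fragments), "fragments,", len(coverage), "positions")
print("coverage per position:", coverage)
```

Storing holes as `None` keeps them distinct from the allele 0, which matters as soon as conflicts between fragments are counted.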


Minimum Error Correction (MEC) problem

fragments

1 0 - 1 1 1 - 0 1 -

- - 0 0 1 - 1 0 1 -

- 1 0 0 0 - - 1 - 1

- - 0 1 0 - 1 1 - 0

- - - 0 0 0 1 1 1 1

Genome positions

• Input: Fragment matrix

• Output: Minimum number of error corrections (flips of matrix entries) that allow the fragments to be bipartitioned so that neither part contains an internal conflict (conflict-free)


Minimum Error Correction (MEC) problem

fragments

1 0 - 1 1 1 - 0 1 -

- - 0 0 1 - 1 0 1 -

- 1 0 0 0 - - 1 - 1

- - 0 1 0 - 1 1 - 0

- - - 0 0 0 1 1 1 1

Genome positions

Conflicts

• Input: Fragment matrix

• Output: Minimum number of error corrections (flips of matrix entries) that allow the fragments to be bipartitioned so that neither part contains an internal conflict (conflict-free)


Minimum Error Correction (MEC) problem

Corrected fragment matrix:

1 0 - 1 1 1 - 0 1 -

- - 0 1 1 - 1 0 1 -

- 1 0 0 0 - - 1 - 1

- - 0 0 0 - 1 1 - 1

- - - 0 0 0 1 1 1 1

Reconstructed haplotypes:

1 0 0 1 1 1 1 0 1 -

- 1 0 0 0 0 1 1 1 1

• Input: Fragment matrix

• Output: Minimum number of error corrections (flips of matrix entries) that allow the fragments to be bipartitioned so that neither part contains an internal conflict (conflict-free)
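To make the MEC objective concrete, here is a minimal brute-force sketch in Python. It is not HapCol or any of the algorithms discussed in this presentation: it simply tries every bipartition of the fragments, which is only feasible for toy inputs. The function names are hypothetical. Within one part, a column is made conflict-free by flipping its minority allele, so its cost is the smaller of the number of 0s and the number of 1s.

```python
from itertools import product

# Fragment matrix from the slides; None marks a hole ('-').
_ = None
MATRIX = [
    [1, 0, _, 1, 1, 1, _, 0, 1, _],
    [_, _, 0, 0, 1, _, 1, 0, 1, _],
    [_, 1, 0, 0, 0, _, _, 1, _, 1],
    [_, _, 0, 1, 0, _, 1, 1, _, 0],
    [_, _, _, 0, 0, 0, 1, 1, 1, 1],
]


def part_cost(rows):
    """Corrections needed to make one part conflict-free, column by column."""
    cost = 0
    for column in zip(*rows):
        zeros = sum(1 for x in column if x == 0)
        ones = sum(1 for x in column if x == 1)
        cost += min(zeros, ones)  # flip the minority allele, holes are ignored
    return cost


def mec_brute_force(matrix):
    """Return (minimum corrections, labels) over all bipartitions of the rows."""
    n = len(matrix)
    if n == 0:
        return 0, ()
    best = None
    # Fragment 0 is fixed in part 0 so mirrored bipartitions are not repeated.
    for rest in product((0, 1), repeat=n - 1):
        labels = (0,) + rest
        part0 = [row for row, lab in zip(matrix, labels) if lab == 0]
        part1 = [row for row, lab in zip(matrix, labels) if lab == 1]
        cost = part_cost(part0) + part_cost(part1)
        if best is None or cost < best[0]:
            best = (cost, labels)
    return best


min_corrections, labels = mec_brute_force(MATRIX)
print(min_corrections, labels)  # minimum is 3 for this matrix
```

The minimum of 3 agrees with the three entries that differ between the original and the corrected matrix. The optimal bipartition is not unique for this input, so the labels returned by the sketch need not match the grouping drawn on the slide.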


Contributions

1. Theoretical contribution:

study of computational complexity, approximability, and fixed-parameter tractability of MEC

2. Algorithmic contribution:

novel promising algorithm, HapCol, that has been implemented and experimentally compared with current state-of-the-art approaches


Variants: Holes and Gaps

MEC (holes, gaps)

0 0 - - 1 1 - - - -

- 0 - - 0 - 0 - - -

- 1 1 - - 1 0 - - -

Gapless MEC (holes, no gaps)

0 0 1 0 1 1 - - - -

- 0 0 1 0 1 0 - - -

- - - - 1 0 1 0 1 0

Binary MEC (no holes, no gaps)

0 0 1 0 1 1 0 1 0 1

0 0 0 1 0 1 0 1 0 1

0 0 1 1 1 0 1 0 1 0

(Cilibrasi et al., Algorithmica, 2007)
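The difference between the three variants can be sketched as a small Python check. It assumes, as the example rows suggest, that a hole is any '-' entry and a gap is a run of holes lying between two non-hole entries of the same fragment. The function names `parse` and `classify_fragment_matrix` are hypothetical.

```python
def parse(rows):
    """Turn strings such as '0 0 - 1' into lists with None for holes."""
    return [[None if tok == "-" else int(tok) for tok in row.split()] for row in rows]


def classify_fragment_matrix(matrix):
    """Return the most restrictive variant that the matrix fits."""
    has_hole = False
    for row in matrix:
        covered = [j for j, x in enumerate(row) if x is not None]
        if len(covered) < len(row):
            has_hole = True
        # A gap exists if the covered positions are not one contiguous run.
        if covered and covered[-1] - covered[0] + 1 != len(covered):
            return "MEC"
    return "Gapless MEC" if has_hole else "Binary MEC"


with_gaps = parse(["0 0 - - 1 1 - - - -", "- 0 - - 0 - 0 - - -", "- 1 1 - - 1 0 - - -"])
gapless = parse(["0 0 1 0 1 1 - - - -", "- 0 0 1 0 1 0 - - -", "- - - - 1 0 1 0 1 0"])
binary = parse(["0 0 1 0 1 1 0 1 0 1", "0 0 0 1 0 1 0 1 0 1", "0 0 1 1 1 0 1 0 1 0"])

for name, matrix in (("with_gaps", with_gaps), ("gapless", gapless), ("binary", binary)):
    print(name, "->", classify_fragment_matrix(matrix))
```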


• FPT by coverage (Patterson et al., RECOMB, 2014)

State of the art

APX-hard

?

PTAS

NP APX FPT by

(Cilibrasi et al., Algorithmica, 2007)

Gapless MEC (holes, no gaps)

Binary MEC (no holes, no gaps)

MEC (holes, gaps)

NP-hard

NP-hard

?

• With restrictive assumptions (He et al., Bioinformatics, 2013)

Other FPT


APX-hard

not in APX

log-apx

Contributions

?

NP APX FPT by

Gapless MEC (holes, no gaps)

Binary MEC (no holes, no gaps)

MEC (holes, gaps)

NP-hard

NP-hard

?

With restrictive assumptions (He et al., Bioinformatics, 2013)

Without any assumption

• FPT by coverage (Patterson et al., RECOMB, 2014)

• By total number of corrections

PTAS

(More direct) 2-apx algo.

Other FPT



State of the Art

Tool: RefHap (Duitama et al., NAR, 2012)
Approach: Heuristic
PROS:
• Good trade-off accuracy/performance
CONS:
• No guarantee of optimality
• Requires previous SNP calling

Tool: ProbHap (Kuleshov, Bioinformatics, 2014)
Approach: Probabilistic
PROS:
• Good accuracy
CONS:
• Computationally demanding
• Requires previous SNP calling

Tool: WhatsHap (Patterson et al., RECOMB, 2014)
Approach: Exact (FPT)
PROS:
• Designed for long reads and quality scores
• Outperforms other exact approaches
CONS:
• Instances with limited size


Perspectives

• Novel features of future-generation sequencing technologies (Roberts, Genome Biol, 2013):

Very long fragments (> 10 kbp) that span several SNPs => improve accuracy

Uniform distribution of sequencing errors => better error models

Deal with larger instances => no need for previous SNP calling


HapCol

• A novel k-constrained variant of MEC (k-cMEC), motivated by uniformly distributed errors:

Bounds the maximum number of corrections => limits the space of solutions

Requires fewer resources and deals with larger instances

• A novel algorithm, HapCol:

Manages long fragments => polynomial dependence on fragment length

Exploits fragment quality scores => improved accuracy

• The implementation of HapCol (C++) is freely available under GPL at http://hapcol.algolab.eu
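Why bounding the number of corrections shrinks the space of solutions can be illustrated with a short count. The sketch below is not HapCol's algorithm. It assumes that the bound k applies to each SNP column separately and that a candidate correction is a set of at most k of the fragments covering that column; the slide states neither detail, and the function name is hypothetical.

```python
from math import comb


def bounded_correction_sets(coverage: int, k: int) -> int:
    """Number of ways to pick at most k fragments to correct in one column."""
    return sum(comb(coverage, i) for i in range(min(k, coverage) + 1))


coverage = 20  # fragments covering one column (illustrative value)
for k in (1, 2, 3):
    bounded = bounded_correction_sets(coverage, k)
    unbounded = 2 ** coverage  # every subset of the covering fragments
    print(f"k={k}: {bounded} candidate sets instead of {unbounded}")
```

With a coverage of 20 and k = 3 the count drops from 1048576 subsets to 1351, which is the kind of reduction the bound is meant to give under these assumptions.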


Experimental Comparison

• Comparison with RefHap, ProbHap, and WhatsHap

• Metrics:

on performance: time and memory

on accuracy: phased positions and switch error rate (Browning and Browning, Nature Rev. Genet., 2011)

Example of 1 switch error: true haplotypes 0000 / 1111, reconstructed haplotypes 0011 / 1100

Real Dataset:
• Standard benchmark
• Low size
• HapMap sample NA12878

Simulated Datasets:
• Medium-High size
• Follow the characteristics of future-generation sequencing technologies
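The switch error metric can be sketched in a few lines of Python. The sketch assumes that both haplotypes are given over the same heterozygous SNP positions, so that the second haplotype of each pair is the complement of the first and only one string per pair is needed. Dividing by the number of adjacent SNP pairs to obtain a rate is also an assumption, since the slide does not give the denominator. Function names are hypothetical.

```python
def switch_errors(true_hap: str, predicted_hap: str) -> int:
    """Count the points where the prediction switches between the two true haplotypes."""
    if len(true_hap) != len(predicted_hap):
        raise ValueError("haplotypes must have the same length")
    # True where the prediction follows true_hap, False where it follows its complement.
    in_phase = [t == p for t, p in zip(true_hap, predicted_hap)]
    return sum(1 for prev, cur in zip(in_phase, in_phase[1:]) if prev != cur)


def switch_error_rate(true_hap: str, predicted_hap: str) -> float:
    """Switch errors divided by the number of adjacent SNP pairs (assumed denominator)."""
    pairs = len(true_hap) - 1
    return switch_errors(true_hap, predicted_hap) / pairs if pairs > 0 else 0.0


# Example in the spirit of the slide: 0000 / 1111 reconstructed as 0011 / 1100.
slide_example = switch_errors("0000", "0011")
print(slide_example, switch_error_rate("0000", "0011"))
```

A prediction that is the exact complement of the true haplotype counts zero switch errors, because the two haplotypes of a pair are interchangeable.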


Results

• Improved accuracy and number of phased positions on real data, with good performance

• Significant improvement of performance on the collection of simulated data

• Deals with larger instances and improves accuracy


Future Directions

• Study new variants:

Extend MEC to structured populations, such as trios => improve accuracy

Genomes with more copies (plants, fish, tumors, …) => k-ploid MEC

• Attempt to schedule a period abroad at Brown University (Providence, USA) with Prof. Ben Raphael:

Study other problems related to sequencing data (identifying mutations, tumor composition, …)


Publications

• Bonizzoni, P., Dondi, R., Klau, GW., Pirola*, Y., Pisanti, N., and Zaccaria*, S., On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem, CPM 2015, in press.

• Bonizzoni, P., Dondi, R., Klau, GW., Pirola*, Y., Pisanti, N., and Zaccaria*, S., HapCol: Accurate and Memory-efficient Haplotype Assembly from Long Reads, Bioinformatics, in review (to be presented at HitSeq 2015).


Talks

• On the inversion-indel distance
PRIN10-11 meeting, November 28–29, 2015, Politecnico di Milano, Milan, Italy

• HAPCOL: Faster and memory-efficient Haplotype Assembly for Future-Generation-Reads
BITS 2015, Twelfth Annual Meeting of the Bioinformatics Italian Society, June 3-5, 2015, Milan, Italy

• On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem
June 16, 2015, Università di Milano-Bicocca, Milan, Italy

• On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem
CPM 2015, 26th Annual Symposium on Combinatorial Pattern Matching, June 29-July 1, 2015, Ischia Island, Italy [to be held]


PhD Courses and Schools

Courses:

1. Paradigms and Approaches to Computer Security with Prof. Ferretti, organized by University of Milano-Bicocca in February 2014.

2. Cluster Analysis with Prof. Messina, organized by University of Milano-Bicocca in May 2014.

3. Operational Research: Advanced algorithms on graphs with Prof. Malucelli, organized by Politecnico di Milano in April 2014.

4. Parallel computing using MPI and OpenMP with Prof. Breveglieri and Cremonesi, organized by CINECA, in collaboration with Politecnico di Milano, in June 2014.

5. English Course with Dr. Weekes, organized by University of Milano-Bicocca, still underway until the end of 2014.

Schools:

1. School of Science, about Project Management, Intellectual property, HR Management, organized by University of Milano-Bicocca in May 2014.

2. Summer School on Advanced Approximation Algorithms with Prof. Grandoni, organized by Institute of Theoretical Computer Science, ETH Zürich, in June 2014.

3. International School on Graph Theory, Algorithms and Applications with Prof. Raffaele Cerulli, Prof. Andrew Goldberg, Prof. Giuseppe F. Italiano and Prof. Robert E. Tarjan as directors, organized by Ettore Majorana Foundation and Centre for Scientific Culture in Erice, in September 2014.


Teaching

• DCJ and genome rearrangements: models and algorithms
4 hours, Bioinformatics course, master degree, October 2014, University of Milano-Bicocca

• Laboratory of programming in Java
40 hours, bachelor degree, September 2014 – January 2015

• Haplotype Assembly
4 hours, Bioinformatics course, master degree, May 2015, University of Milano-Bicocca


Thanks, questions?

[email protected]

Personal page:

http://algolab.eu/simone-zaccaria


Variant Motivations

Binary MEC (no holes, no gaps)
• Unrealistic for Haplotype Assembly
• Interesting from a mathematical point of view: variant of the well-known Hamming 2-Median Clustering Problem

Gapless MEC (holes, no gaps)
• Represents single-end fragments without gaps

MEC (holes, gaps)
• Represents any instance of Haplotype Assembly
• Fits both paired-end and single-end fragments

(Cilibrasi et al., Algorithmica, 2007)