UNIVERSITY OF MILANO – BICOCCA
School of Science
PhD in Computer Science
Haplotype Assembly: Algorithms, Models and Complexity
Supervisor: Prof. Paola Bonizzoni
Tutor: Prof. Alberto Leporati
Student: Simone Zaccaria
Cycle XXIX
PhD Presentation, June 2015
Diploid Genomes
• Diploid genomes: two sets of chromosomes, one from the father and one from the mother
Haplotypes
• Haplotype = a copy of one chromosome
• Represented as a binary vector (0/1 for major/minor allele)
[Figure: the two haplotypes drawn as binary vectors]
SNPs
• Haplotype differences: Single Nucleotide Polymorphisms (SNPs)
[Figure: the two haplotype vectors with the SNP positions marked]
Haplotype Assembly
• Haplotype Assembly (reconstruction of the two copies of each chromosome) is crucial for many applications (Browning and Browning, Nature Rev. Genet., 2011):
Analyzing the relationship between genetic variations and disease susceptibility (Tewhey et al., Nature Rev. Genet., 2011)
Detecting novel and recurrent mutations or inferring points of recombination (Kong, A. et al., Nature Genet., 2008)
Genomic medicine (Glusman et al., Genome Medicine, 2014)
Sequencing data
[Figure: true complete haplotypes (1001...101 and 0011...110) -> sequencing -> collection of fragments -> mapping -> aligned fragments -> reconstruct haplotypes (????...??? and ????...???)]
• The fragments are partitioned into two sets (a bipartition), one per haplotype, in order to reconstruct the two haplotypes
• Sequencing and mapping errors turn the reconstruction into an optimization problem: Minimum Error Correction (MEC) is one of the most prominent formulations
Fragment Matrix
• Fragment matrix, on n fragments and m SNP positions:
Each row corresponds to a fragment
Each column corresponds to a genome position
1 0 - 1 1 1 - 0 1 -
- - 0 0 1 - 1 0 1 -
- 1 0 0 0 - - 1 - 1
- - 0 1 0 - 1 1 - 0
- - - 0 0 0 1 1 1 1
(rows: fragments; columns: genome positions)
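As a minimal sketch of how the fragment matrix above could be held in memory, the snippet below parses the slide's rows, reading each `-` as a position the fragment does not cover (stored as `None`). The names `parse_fragment_matrix`, `covered_positions` and `SLIDE_MATRIX` are hypothetical and are not taken from HapCol or any other tool.

```python
from typing import List, Optional

# One fragment: an allele (0 or 1) per SNP position, None where the row shows '-'.
Fragment = List[Optional[int]]


def parse_fragment_matrix(text: str) -> List[Fragment]:
    """Parse whitespace-separated rows of 0, 1 and '-' into a list of fragments."""
    matrix = [
        [None if token == "-" else int(token) for token in line.split()]
        for line in text.strip().splitlines()
    ]
    # All fragments are laid out over the same m SNP positions.
    if len({len(row) for row in matrix}) > 1:
        raise ValueError("rows have different lengths")
    return matrix


def covered_positions(fragment: Fragment) -> List[int]:
    """0-based SNP positions at which the fragment carries an allele."""
    return [j for j, allele in enumerate(fragment) if allele is not None]


SLIDE_MATRIX = """
1 0 - 1 1 1 - 0 1 -
- - 0 0 1 - 1 0 1 -
- 1 0 0 0 - - 1 - 1
- - 0 1 0 - 1 1 - 0
- - - 0 0 0 1 1 1 1
"""

matrix = parse_fragment_matrix(SLIDE_MATRIX)
n, m = len(matrix), len(matrix[0])
print(n, m)                          # 5 10
print(covered_positions(matrix[0]))  # [0, 1, 3, 4, 5, 7, 8]
```

Here n = 5 fragments and m = 10 SNP positions, matching the matrix on the slide.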
Minimum Error Correction (MEC) problem
1 0 - 1 1 1 - 0 1 -
- - 0 0 1 - 1 0 1 -
- 1 0 0 0 - - 1 - 1
- - 0 1 0 - 1 1 - 0
- - - 0 0 0 1 1 1 1
(rows: fragments; columns: genome positions)
• Input: Fragment matrix
• Output: Minimum number of error corrections that make it possible to bipartition the fragments so that neither part contains an internal conflict (conflict free)
Conflict-free solution for the fragment matrix, obtained with 3 corrections (fragment 2 at position 4; fragment 4 at positions 4 and 10):
1 0 - 1 1 1 - 0 1 -
- - 0 1 1 - 1 0 1 -
- 1 0 0 0 - - 1 - 1
- - 0 0 0 - 1 1 - 1
- - - 0 0 0 1 1 1 1
Resulting haplotypes, one per part of the bipartition:
1 0 0 1 1 1 1 0 1 -
- 1 0 0 0 0 1 1 1 1
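With only 5 fragments, the MEC value of the example matrix can be checked by exhaustive search. The sketch below tries every bipartition and, within each part, counts the minority allele of every column as the entries to correct. It is a brute-force illustration only (exponential in the number of fragments) and not the method of any tool discussed here; all function names are hypothetical.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Fragment = List[Optional[int]]  # 0/1 per SNP position, None for '-'


def parse(rows: Sequence[str]) -> List[Fragment]:
    return [[None if t == "-" else int(t) for t in row.split()] for row in rows]


def group_cost(rows: Sequence[Fragment]) -> int:
    """Corrections that make one part conflict free: flip the minority allele per column."""
    cost = 0
    for column in zip(*rows):
        ones = sum(1 for v in column if v == 1)
        zeros = sum(1 for v in column if v == 0)
        cost += min(ones, zeros)
    return cost


def partition_cost(matrix: List[Fragment], assignment: Sequence[int]) -> int:
    """Total corrections for a given bipartition (assignment[i] is 0 or 1)."""
    return sum(
        group_cost([row for row, side in zip(matrix, assignment) if side == part])
        for part in (0, 1)
    )


def mec_brute_force(matrix: List[Fragment]) -> Tuple[int, Tuple[int, ...]]:
    """Minimum number of corrections over all bipartitions; toy sizes only."""
    if not matrix:
        return 0, ()
    best_cost, best_assignment = None, None
    # Fragment 0 is fixed in part 0 so mirrored bipartitions are not repeated.
    for rest in product((0, 1), repeat=len(matrix) - 1):
        assignment = (0,) + rest
        cost = partition_cost(matrix, assignment)
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment


FRAGMENTS = parse([
    "1 0 - 1 1 1 - 0 1 -",
    "- - 0 0 1 - 1 0 1 -",
    "- 1 0 0 0 - - 1 - 1",
    "- - 0 1 0 - 1 1 - 0",
    "- - - 0 0 0 1 1 1 1",
])

cost, assignment = mec_brute_force(FRAGMENTS)
print(cost)                                        # 3
print(partition_cost(FRAGMENTS, (0, 0, 1, 1, 1)))  # 3
```

The bipartition {fragments 1, 2} versus {fragments 3, 4, 5} used on the slide costs 3 corrections, which equals the optimum found by the search. An optimal bipartition need not be unique, so the returned assignment may differ from the slide's.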
Contributions
1. Theoretical contribution:
study of computational complexity, approximability, and fixed-parameter tractability of MEC
2. Algorithmic contribution:
novel promising algorithm, HapCol, that has been implemented and experimentally compared with current state-of-the-art approaches
Variants: Holes and Gaps
(Cilibrasi et al., Algorithmica, 2007)
MEC (holes, gaps):
0 0 - - 1 1 - - - -
- 0 - - 0 - 0 - - -
- 1 1 - - 1 0 - - -
Gapless MEC (holes, no gaps):
0 0 1 0 1 1 - - - -
- 0 0 1 0 1 0 - - -
- - - - 1 0 1 0 1 0
Binary MEC (no holes, no gaps):
0 0 1 0 1 1 0 1 0 1
0 0 0 1 0 1 0 1 0 1
0 0 1 1 1 0 1 0 1 0
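The distinction between the three variants can be stated as a small check on the fragment matrix. The sketch below treats a hole as any `-` entry and a gap as a run of holes with alleles on both sides within the same row; this reading of the slide, and the function names, are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

Fragment = List[Optional[int]]  # 0/1 per SNP position, None for '-'


def parse(rows: Sequence[str]) -> List[Fragment]:
    return [[None if t == "-" else int(t) for t in row.split()] for row in rows]


def has_holes(matrix: List[Fragment]) -> bool:
    """True if at least one entry is '-'."""
    return any(v is None for row in matrix for v in row)


def has_gaps(matrix: List[Fragment]) -> bool:
    """True if some row has holes strictly between two of its alleles."""
    for row in matrix:
        defined = [j for j, v in enumerate(row) if v is not None]
        if defined and defined[-1] - defined[0] + 1 != len(defined):
            return True
    return False


def classify(matrix: List[Fragment]) -> str:
    if not has_holes(matrix):
        return "Binary MEC"
    if not has_gaps(matrix):
        return "Gapless MEC"
    return "MEC"


general = parse(["0 0 - - 1 1 - - - -", "- 0 - - 0 - 0 - - -", "- 1 1 - - 1 0 - - -"])
gapless = parse(["0 0 1 0 1 1 - - - -", "- 0 0 1 0 1 0 - - -", "- - - - 1 0 1 0 1 0"])
binary = parse(["0 0 1 0 1 1 0 1 0 1", "0 0 0 1 0 1 0 1 0 1", "0 0 1 1 1 0 1 0 1 0"])

print(classify(general))  # MEC
print(classify(gapless))  # Gapless MEC
print(classify(binary))   # Binary MEC
```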
State of the art
Complexity and approximability (Cilibrasi et al., Algorithmica, 2007):
• MEC (holes, gaps): NP-hard; APX-hard
• Gapless MEC (holes, no gaps): NP-hard; approximability open (?)
• Binary MEC (no holes, no gaps): NP-hardness open (?); PTAS
Fixed-parameter tractability:
• FPT by coverage (Patterson et al., RECOMB, 2014)
• Other FPT: with restrictive assumptions (He et al., Bioinformatics, 2013)
Contributions
Complexity and approximability:
• MEC (holes, gaps): NP-hard; APX-hard, not in APX, log-apx
• Gapless MEC (holes, no gaps): NP-hard; approximability open (?)
• Binary MEC (no holes, no gaps): NP-hardness open (?); PTAS, (more direct) 2-apx algo.
Fixed-parameter tractability:
• FPT by coverage (Patterson et al., RECOMB, 2014)
• Other FPT: without any assumption, where He et al. (Bioinformatics, 2013) needed restrictive assumptions
• Other FPT: by total number of corrections
State of the Art
RefHap (Duitama et al., NAR, 2012)
• Approach: heuristic
• Pros: good accuracy/performance trade-off
• Cons: no guarantee of optimality; requires previous SNP calling
ProbHap (Kuleshov, Bioinformatics, 2014)
• Approach: probabilistic
• Pros: good accuracy
• Cons: computationally demanding; requires previous SNP calling
WhatsHap (Patterson et al., RECOMB, 2014)
• Approach: exact (FPT)
• Pros: designed for long reads and quality scores; outperforms other exact approaches
• Cons: limited to instances of small size
Perspectives
• Novel features of future-generation sequencing technologies (Roberts, Genome Biol, 2013):
Very long fragments (> 10 kbp) that span several SNPs => improve accuracy
Uniform distribution of sequencing errors => better error models
Deal with larger instances => no need for previous SNP calling
HapCol
• A novel k-constrained variant of MEC (k-cMEC) for uniformly distributed errors:
Bound on the maximum number of corrections => limit on the space of solutions
Fewer resources needed, so larger instances can be handled
• A novel algorithm, HapCol:
Manages long fragments => polynomial dependence on fragment length
Exploits fragment quality scores => improved accuracy
• The HapCol implementation (C++) is freely available under the GPL at http://hapcol.algolab.eu
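To give a feel for why bounding the number of corrections limits the space of solutions, the sketch below counts candidate correction sets for a single column. It assumes the bound k applies per column (SNP position) and that the column is covered by c fragments; this per-column reading, the example values, and the name `correction_sets` are assumptions for illustration, not details stated on the slide.

```python
from math import comb


def correction_sets(coverage: int, k: int) -> int:
    """Number of ways to choose at most k of the `coverage` entries in a column to correct."""
    return sum(comb(coverage, i) for i in range(min(k, coverage) + 1))


coverage, k = 20, 3                   # hypothetical values
print(correction_sets(coverage, k))   # 1351
print(2 ** coverage)                  # 1048576
```

Without a bound every subset of the 20 entries is a candidate, whereas with k = 3 only 1 + 20 + 190 + 1140 = 1351 subsets remain.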
Experimental Comparison
• Comparison with RefHap, ProbHap, and WhatsHap
• Metrics:
on performance: time and memory
on accuracy: phased positions and switch error rate (Browning and Browning, Nature Rev. Genet., 2011)
[Figure: true haplotypes 0000 / 1111 compared with assembled haplotypes 0011 / 1100, which contain 1 switch error]
Real Dataset
• Standard benchmark
• Small size
• HapMap sample NA12878
Simulated Datasets
• Medium to large size
• Follow the characteristics of future-generation sequencing technologies
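The switch error metric listed above can be sketched as follows: walk along the assembled haplotype and count how often its agreement with the true haplotype flips between consecutive positions. The sketch assumes every position is heterozygous and phased, so the second haplotype is the complement of the first and comparing one pair suffices. Normalising by the number of consecutive position pairs is one common convention and is an assumption here, as are the function names.

```python
def switch_errors(true_hap: str, assembled_hap: str) -> int:
    """Count flips of phase between consecutive positions."""
    if len(true_hap) != len(assembled_hap):
        raise ValueError("haplotypes must have the same length")
    # agrees[i] is True where the assembled haplotype matches the true one.
    agrees = [t == a for t, a in zip(true_hap, assembled_hap)]
    return sum(1 for prev, cur in zip(agrees, agrees[1:]) if prev != cur)


def switch_error_rate(true_hap: str, assembled_hap: str) -> float:
    """Switch errors divided by the number of consecutive position pairs."""
    pairs = len(true_hap) - 1
    return switch_errors(true_hap, assembled_hap) / pairs if pairs > 0 else 0.0


print(switch_errors("0000", "0011"))  # 1
print(switch_errors("0000", "1111"))  # 0
```

The first call corresponds to a true haplotype 0000 assembled as 0011, which has one switch. The second shows that an assembly matching the complementary haplotype throughout has no switch errors.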
Results
• Improved accuracy and number of phased positions on real data with good performance
• Significant improvement of performance on the collection of simulated data
• Deals with larger instances and improves accuracy
Future Directions
• Study new variants:
Extend MEC to structured populations, such as trios => improve accuracy
Genomes with more copies (plants, fish, tumors, …) => k-ploid MEC
• Planned research period abroad at Brown University (Providence, USA) with Prof. Ben Raphael:
Study other problems related to sequencing data (identifying mutations, tumor composition, …)
Publications
• Bonizzoni, P., Dondi, R., Klau, G. W., Pirola*, Y., Pisanti, N., and Zaccaria*, S., On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem, CPM 2015, in press.
• Bonizzoni, P., Dondi, R., Klau, G. W., Pirola*, Y., Pisanti, N., and Zaccaria*, S., HapCol: Accurate and Memory-efficient Haplotype Assembly from Long Reads, Bioinformatics, in review (to be presented at HiTSeq 2015).
Talks
• On the inversion-indel distance. PRIN10-11 meeting, November 28–29, 2015, Politecnico di Milano, Milan, Italy
• HAPCOL: Faster and memory-efficient Haplotype Assembly for Future-Generation-Reads. BITS 2015, Twelfth Annual Meeting of the Bioinformatics Italian Society, June 3-5, 2015, Milan, Italy
• On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem. June 16, 2015, Università di Milano-Bicocca, Milan, Italy
• On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem. CPM 2015, 26th Annual Symposium on Combinatorial Pattern Matching, June 29-July 1, 2015, Ischia Island, Italy [to be held]
PhD Courses and Schools
Courses:
1. Paradigms and Approaches to Computer Security with Prof. Ferretti, organized by University of Milano-Bicocca in February 2014.
2. Cluster Analysis with Prof. Messina, organized by University of Milano-Bicocca in May 2014.
3. Operational Research: Advanced algorithms on graphs with Prof. Malucelli, organized by Politecnico di Milano in April 2014.
4. Parallel computing using MPI and OpenMP with Profs. Breveglieri and Cremonesi, organized by CINECA, in collaboration with Politecnico di Milano, in June 2014.
5. English Course with Dr. Weekes, organized by University of Milano-Bicocca, ongoing until the end of 2014.
Schools:
1. School of Science, on project management, intellectual property, and HR management, organized by University of Milano-Bicocca in May 2014.
2. Summer School on Advanced Approximation Algorithms with Prof. Grandoni, organized by Institute of Theoretical Computer Science, ETH Zürich, in June 2014.
3. International School on Graph Theory, Algorithms and Applications with Prof. Raffaele Cerulli, Prof. Andrew Goldberg, Prof. Giuseppe F. Italiano and Prof. Robert E. Tarjan as directors, organized by Ettore Majorana Foundation and Centre for Scientific Culture in Erice, in September 2014.
Teaching
• DCJ and genome rearrangements: models and algorithms. 4 hours, Bioinformatics course, master's degree, October 2014, University of Milano-Bicocca
• Laboratory of programming in Java. 40 hours, bachelor's degree, September 2014 – January 2015
• Haplotype Assembly. 4 hours, Bioinformatics course, master's degree, May 2015, University of Milano-Bicocca
Thanks, questions?
Personal page:
http://algolab.eu/simone-zaccaria
Variant Motivations
(Cilibrasi et al., Algorithmica, 2007)
Binary MEC (no holes, no gaps)
• Unrealistic for haplotype assembly
• Interesting from a mathematical point of view: a variant of the well-known Hamming 2-Median Clustering Problem
Gapless MEC (holes, no gaps)
• Represents single-end fragments without gaps
MEC (holes, gaps)
• Represents any haplotype assembly instance
• Fits both paired-end and single-end fragments