Comparative Genomics and the de Bruijn graphs
Ilia Minkin
Pennsylvania State University
16th September 2016
What is comparative genomics?
"The collection of all research activities that derive biological insights by comparing genomic features." [1]
Why do it?
• Learn evolution
• Learn function
[1] "Comparative Genomics", Xuhua Xia
Learn Evolution
Learn Function
A genomic sequence itself does not show its functions
How to find function?
• Compare with sequences of known function
• Conserved sequences are likely to be important
How to compare genomes?
What is an Alignment?
Organisms inherit genomes, but with "errors":
[Figure: the ancestor genome diverging into Genome A and Genome B]
Which characters did A and B inherit from their ancestor?
What is an Alignment?
Alignments are written down as a table:
ACTG-TGA
ACTACTGA
Blue letters are matches; yellow are mismatches; dashes are indels.
This is a global alignment.
The Global Alignment
ACTG-TGA
ACTACTGA
For two strings A and B:
• Place them under each other
• Insert dashes into A and B so that |A| = |B|
• Penalize dashes and mismatches
• Which alignment gives the least penalty?
• Complexity: O(|A||B|)
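The steps above can be sketched as a small dynamic program. This is a minimal illustration, not the method of any particular tool: the function name and the unit penalties for mismatches and dashes are assumptions chosen for the example.

```python
def global_alignment(a, b, mismatch=1, gap=1):
    """Return (penalty, row_a, row_b) for a least-penalty global alignment.

    The penalties are illustrative assumptions; the table has
    (len(a) + 1) * (len(b) + 1) cells, hence the O(|A||B|) cost.
    """
    n, m = len(a), len(b)
    # cost[i][j] = least penalty for aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + gap,
                             cost[i][j - 1] + gap)
    # Walk back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


penalty, row_a, row_b = global_alignment("ACTGTGA", "ACTACTGA")
print(penalty)
print(row_a)
print(row_b)
```

With unit penalties the two example strings align with a total penalty of 2 (one mismatch and one dash). Several alignments can tie for that penalty, so the rows printed may differ from the table on the slide.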
The Local Alignment
For large sequences the global alignment does not work:
GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA
• Apart from indels and mismatches there could be rearrangements
• Rearrangements change the order of whole blocks
• Similar subsequences can be interleaved with something else
The Local Alignment
GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA
The problem: for A and B, find their most similar subsequences and their alignment:
ACTG-TGA
ACTACTGA
Complexity: O(|A||B|)
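A minimal sketch of the local alignment problem, using the classic Smith-Waterman recurrence. The scoring values (+1 for a match, -1 for a mismatch or a dash) and the function name are assumptions chosen for illustration; the slides do not fix a scoring scheme.

```python
def local_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, row_a, row_b) for a best-scoring local alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score of an alignment ending at a[i-1], b[j-1];
    # a cell never drops below 0, which lets an alignment start anywhere.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Walk back from the best cell until the score falls to zero.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


best, row_a, row_b = local_alignment("GAACTGTGATTAGGACGT", "ATTTGGGACTACTGAGTA")
print(best)
print(row_a)
print(row_b)
```

The table is filled cell by cell, so the running time is again proportional to |A||B|. Which pair of substrings is reported depends on the scoring values, so the output need not match the alignment drawn on the slide.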
An Example
We can generalize to many genomes:
GAACTGTGATTATGCTCA
ATTTGGGACTACTGAGTA
ATCTTGAGATAGCTGAAA
Alignments:
ACTG-TGA
ACTACTGA
A-TGCTCA
Multiple Local Alignment
Issues:
• Some subsequences can be present in some genomes and absent in others
• Genomes can have duplications
• Multiple sequence alignment is NP-hard
→ We need some heuristics
Another Approach
Another way to find common subsequences is to build a graph from genomes
In such a graph homologous subsequences will collapse into non-branching paths while unique ones will form disjoint paths
The Linear Representation
Two genomes:
Solution: a Graph Representation
What we want to see:
Genomes as a Railroad
Why de Bruijn graph?
A simple object.
Demonstrated utility in:
• Assembly
• Read mapping
• Synteny identification
The de Bruijn Graph
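k = 2
Input strings: TGACGTC and TGACTTC
[Figure: each string drawn as a path of its 2-mers, TG → GA → AC → CG → GT → TC and TG → GA → AC → CT → TT → TC]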
The de Bruijn Graph
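[Figure: the de Bruijn graph of TGACGTC and TGACTTC for k = 2. Vertices TG, GA, AC and TC are shared; the two paths diverge after AC (AC → CG → GT → TC versus AC → CT → TT → TC)]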
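The construction can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name is hypothetical, only the forward strand is considered (reverse complements are ignored), and the graph is kept as a plain dictionary of successor lists.

```python
from collections import defaultdict


def de_bruijn_graph(strings, k):
    """Map each k-mer to the sorted list of k-mers that follow it.

    Every (k+1)-mer of the input is an edge from its length-k prefix
    to its length-k suffix; identical k-mers share one vertex.
    """
    successors = defaultdict(set)
    for s in strings:
        for i in range(len(s) - k):
            edge = s[i:i + k + 1]
            successors[edge[:k]].add(edge[1:])
    return {v: sorted(ws) for v, ws in successors.items()}


graph = de_bruijn_graph(["TGACGTC", "TGACTTC"], k=2)
for vertex, following in graph.items():
    print(vertex, "->", following)
# AC is the only vertex with two successors: CG and CT.
```

Vertices without outgoing edges (here TC) do not appear as keys; they show up only in the successor lists.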
The de Bruijn graph
In the de Bruijn graph identical substrings of length at least k + 1 are collapsed into non-branching paths
We can use this to find homologous blocks.
We developed a tool "Sibelia" that finds such blocks in many bacterial genomes and handles repeats.
But we can do more.
Alignment to a Graph
It is common to have an unassembled genome
Reads are then aligned to a very similar reference genome:
Alignment to a Graph
Issues:
• More than one reference?
• Repeats within genomes?
Solution: align reads to a graph!
Alignment to a Graph
In the future, genome graphs will encode information about a population
Aligning reads to a graph has many advantages:
• Efficient alignment to many genomes
• Reusing information about variants
• Handling of repeats
The de Bruijn graph is a feasible model for a graph reference.
Issue: the graph can be too large.
Compaction
After compaction:
[Figure: the compacted graph with vertices TG, AC, TC and edges TGAC (TG → AC), ACGTC and ACTTC (both AC → TC)]
The Challenge
Construct the compacted graph from many large genomes, bypassing traversal of the ordinary graph.
Earlier work based on suffix arrays/trees: Sibelia & SplitMEM handled > 60 E. coli genomes.
A recent advance: 7 human genomes in 15 hours using 100 GB of RAM, with a BWT-based algorithm (Baier et al., 2015; Beller et al., 2014).
Junctions
A vertex v is a junction if:
• v has ≥ 2 distinct outgoing or incoming edges
• v is the first or the last k-mer of an input string
Facts:
• Junctions = vertices of the compacted graph
• Compaction = finding positions of junctions
Observations
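[Figure: the compacted graph with vertices TG, AC, TC and edges TGAC, ACGTC, ACTTC]
k-mers of TGACGTC in order: TG GA AC CG GT TC
Junctions among them, in order: TG → AC → TC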
The Observation
The observation only works when we have complete genomes.
Once we know junctions, construction of the edges is simple.
We can simply traverse input strings and record junctions in the order they appear.
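A minimal sketch of that traversal, assuming the set of junctions is already known. The function name is hypothetical, and each edge of the compacted graph is represented simply by the substring it spells.

```python
def compacted_edges(strings, k, junctions):
    """Collect the edge labels of the compacted graph.

    Each input string is scanned once; whenever two consecutive
    junctions are met, the substring spanning them becomes an edge.
    """
    edges = set()
    for s in strings:
        start = None  # position of the most recent junction k-mer
        for i in range(len(s) - k + 1):
            if s[i:i + k] in junctions:
                if start is not None:
                    edges.add(s[start:i + k])
                start = i
    return edges


edges = compacted_edges(["TGACGTC", "TGACTTC"], 2, {"TG", "AC", "TC"})
print(sorted(edges))  # ['ACGTC', 'ACTTC', 'TGAC']
```

For the two example strings with junctions TG, AC and TC, the scan yields the three edges TGAC, ACGTC and ACTTC.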
How to identify junctions?
The Naive Algorithm
A naive way:
• Store all (k + 1)-mers (edges) in a hash table
• Consider each vertex one by one
• Query all possible edges from the table
• If more than one edge is found, mark the vertex as a junction
Problem: the hash table can be too large.
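The naive procedure can be sketched with a Python set standing in for the hash table. The function name and the DNA alphabet default are assumptions; the first and last k-mers of each string are added as junctions to follow the definition given earlier in the talk, and only the forward strand is considered.

```python
def naive_junctions(strings, k, alphabet="ACGT"):
    """Find junctions by storing every (k+1)-mer in a hash table."""
    edges = set()     # all (k+1)-mers: this is the table that can grow too large
    vertices = set()  # all k-mers
    junctions = set()
    for s in strings:
        if len(s) < k:
            continue
        junctions.add(s[:k])   # first k-mer of the string
        junctions.add(s[-k:])  # last k-mer of the string
        for i in range(len(s) - k + 1):
            vertices.add(s[i:i + k])
        for i in range(len(s) - k):
            edges.add(s[i:i + k + 1])
    for v in vertices:
        # Query every possible outgoing and incoming edge of v.
        out_degree = sum((v + c) in edges for c in alphabet)
        in_degree = sum((c + v) in edges for c in alphabet)
        if out_degree > 1 or in_degree > 1:
            junctions.add(v)
    return junctions


print(sorted(naive_junctions(["TGACGTC", "TGACTTC"], 2)))  # ['AC', 'TC', 'TG']
```

The set of edges holds one entry per distinct (k + 1)-mer, which is what makes this approach memory-hungry on large genomes.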
An Example
Hash table = { GA → AC }
[Figure: vertex GA and its four possible successors AA, AG, AC, AT; only the edge GA → AC is found in the table]
What is the Bloom filter?
A probabilistic data structure representing a set
Properties:
• Occupies fixed space
• May generate false positives on queries
• False positive rate is low
Example: Bloom Filter = { GA → AC }
Is GA → AC in the set? Yes.
Is GA → AT in the set? Probably no, but a false positive "yes" is possible.
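A minimal Bloom filter sketch to make the properties concrete. The class name, the parameters and the use of SHA-256 with a seed prefix as the hash family are all assumptions made for illustration; production implementations typically use faster hash functions.

```python
import hashlib


class BloomFilter:
    """A fixed-size bit array plus several hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # space is fixed up front

    def _positions(self, item):
        # Derive one bit position per hash function from a seeded digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True for every added item; may also be True for an item never
        # added, if all of its positions were set by other items.
        return all((self.bits[p // 8] >> (p % 8)) & 1
                   for p in self._positions(item))


bloom = BloomFilter(num_bits=1024, num_hashes=3)
bloom.add("GAC")  # the edge GA -> AC written as a 3-mer
print("GAC" in bloom)  # True: added items are always found
print("GAT" in bloom)  # usually False, but a false positive is possible
```

A query can never miss an item that was added, so the only kind of error is a false positive.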
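An Example
Bloom Filter = { GA → AC, GA → AT }
[Figure: vertex GA and its four possible successors AA, AG, AC, AT; the filter reports the edges GA → AC and GA → AT]
The purple edge (GA → AT) is a false positive.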
The Two Pass Algorithm
How to eliminate false positives?
Two-pass algorithm:
1. Use the Bloom filter to identify junction candidates
2. Use the hash table, but store only edges that touch candidates
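The two passes can be sketched as follows. This is a simplified, single-threaded illustration and not the actual implementation: all names are hypothetical, the Bloom filter uses one byte per bit and SHA-256 based hashing for brevity, and only the forward strand is considered.

```python
import hashlib


def _positions(item, num_bits, num_hashes=2):
    # Bit positions of an item in the (illustrative) Bloom filter.
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        yield int.from_bytes(digest[:8], "little") % num_bits


def _kmers(s, n):
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def _branching(v, has_edge, alphabet):
    out_degree = sum(has_edge(v + c) for c in alphabet)
    in_degree = sum(has_edge(c + v) for c in alphabet)
    return out_degree > 1 or in_degree > 1


def two_pass_junctions(strings, k, num_bits=1 << 16, alphabet="ACGT"):
    ends = {e for s in strings if len(s) >= k for e in (s[:k], s[-k:])}

    # Pass 1: put every (k+1)-mer into the Bloom filter, then use the
    # filter to pick junction candidates (false positives may slip in).
    bits = bytearray(num_bits)  # one byte per bit, for simplicity
    for s in strings:
        for edge in _kmers(s, k + 1):
            for p in _positions(edge, num_bits):
                bits[p] = 1

    def in_filter(edge):
        return all(bits[p] for p in _positions(edge, num_bits))

    candidates = set(ends)
    for s in strings:
        for v in _kmers(s, k):
            if _branching(v, in_filter, alphabet):
                candidates.add(v)

    # Pass 2: an exact hash table, restricted to edges touching a candidate.
    table = set()
    for s in strings:
        for edge in _kmers(s, k + 1):
            if edge[:k] in candidates or edge[1:] in candidates:
                table.add(edge)

    # Re-check each candidate against exact data to drop false positives.
    return {v for v in candidates
            if v in ends or _branching(v, table.__contains__, alphabet)}


print(sorted(two_pass_junctions(["TGACGTC", "TGACTTC"], 2)))  # ['AC', 'TC', 'TG']
```

Because the filter never misses an inserted edge, every true junction survives the first pass, and the second pass sees all real edges incident to each candidate. The result is therefore exact even when the filter is deliberately tiny; a smaller filter only makes the second-pass table larger.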
An Example: the First Step
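Here are the edges stored in the Bloom filter; purple ones are false positives:
[Figure: the graph on vertices TG, GA, AC, CG, GT, CT, TT, TC, plus vertices AT and CC that are reached only through false-positive edges]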
Junction candidates: GA & AC
An Example: the Second Step
Edges stored in the hash table. We kept only edges touching junction candidates:
Junction: AC
Results
Datasets:
• 7 humans: 5 versions of the reference + 2 haplotypes of NA12878 from 1000 Genomes
• 93 simulated humans (FIGG)
• 8 primates available in the UCSC Genome Browser
Results
Running time in minutes, with memory usage in GB in parentheses.
# genomes        BWT-based, 1 thread   TwoPaCo, 1 thread   TwoPaCo, 15 threads
Humans
7, k = 25        867 (100.30)          436 (4.40)          63 (4.84)
7, k = 100       807 (46.02)           317 (8.42)          57 (8.75)
43+7, k = 25     -                     -                   705 (69.77)
43+7, k = 100    -                     -                   927 (70.21)
93+7, k = 25     -                     -                   1383 (77.42)
Primates
8, k = 25        -                     914 (34.36)         111 (34.36)
8, k = 100       -                     756 (56.06)         101 (61.68)
Conclusion & Future Work
Advantages of the algorithm:
• Fast
• Small memory footprint
• Can handle large inputs
Drawbacks:
• Less applicable for large k
Take-home message: it is easy to construct the compacted de Bruijn graph for complete genomes.
Conclusion & Future Work
Can potentially facilitate:
• Visualization
• Synteny mining (Sibelia)
• Structural variations analysis
• ...
Acknowledgments
Personal:
• Daniel Lemire
• GFA format working group
Funding, NSF awards:
• DBI-1356529
• CCF-1439057
• IIS-1453527
• IIS-1421908
Thank you for your attention!
Twitter: @IliaMinkin
Input Size vs. Performance
Parallel Scalability
Splitting
Table 1: The minimal number of rounds it takes to compress the graph without exceeding a given memory threshold.
Memory threshold   Used memory   Bloom filter size   Running time   Rounds
10                 8.62          8.59                259            1
8                  6.73          4.29                434            3
6                  5.98          4.29                539            4
4                  3.51          2.14                665            6