GENOME-SCALE DISK-BASED SUFFIX TREE INDEXING
Phoophakdee and Zaki
OUTLINE
  Suffix Tree introduction
  Application in Bioinformatics
  Trellis
  Trellis performance
  Conclusion
EXAMPLE SUFFIX TREE
  Sequence: ACGACG$
  What are suffix links?
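The slide's figure is not reproduced here, so the following is a minimal sketch of what the tree for ACGACG$ contains. It finds the branching (internal) nodes by brute force and pairs each with its suffix link, using the standard property that the node with path label xα links to the node with path label α. The function names are illustrative and the quadratic enumeration is only for this toy input, not a construction algorithm.

```python
from collections import defaultdict

def internal_nodes(text):
    """Path labels of the branching (internal, non-root) suffix tree nodes.

    A substring labels an internal node when it is followed by at least
    two distinct characters somewhere in the text.
    """
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i, n):
            # text[i:j] is followed by the character text[j]
            followers[text[i:j]].add(text[j])
    return sorted(label for label, nxt in followers.items()
                  if label and len(nxt) > 1)

def suffix_links(text):
    """Map each internal node label to its suffix link target ('' is the root)."""
    return {label: label[1:] for label in internal_nodes(text)}

if __name__ == "__main__":
    for source, target in suffix_links("ACGACG$").items():
        print(source, "->", target or "root")
```

For ACGACG$ the internal nodes are ACG, CG and G, and the links form the chain ACG -> CG -> G -> root.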
SUFFIX TREE RUNTIME
  Time complexity
    Construction of suffix tree: O(n) time and space, where n is the size of the text being searched
    Substring search: O(m) time, where m is the size of the substring/search pattern
  Comparison with the Knuth-Morris-Pratt and Boyer-Moore algorithms
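To illustrate why a substring query costs O(m), here is a small sketch that walks a pattern down from the root, one character per step. It assumes an uncompressed suffix trie built naively in O(n^2) time and space, which is a stand-in for a real linear-time suffix tree construction; build_suffix_trie and contains are hypothetical helper names.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts (quadratic, for illustration only)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Follow at most len(pattern) edges, independent of the text length."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

if __name__ == "__main__":
    trie = build_suffix_trie("ACGACG$")
    print(contains(trie, "GAC"))  # pattern occurs in the text
    print(contains(trie, "GG"))   # pattern does not occur
```

Once the index exists, each query touches only the pattern's characters, which is the point of paying the construction cost up front.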
APPLICATION IN BIOINFORMATICS
  Database search
  Exact matching
  Approximate matching*
  Longest common substring
  Genome alignment*
  Structural motifs*
  Tandem repeats*
  Sequence comparison
PROBLEMS WITH GENOME-SCALE SUFFIX TREES
  Efficient O(n) suffix tree construction algorithms, e.g. Ukkonen's algorithm
    Tree must fit entirely in main memory
  Genomes are very large
    Human genome is 3 Gbp (0.75 GB)
    Data structure is no longer able to fit in memory
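The 0.75 GB figure is consistent with packing each base into 2 bits, which is enough for the four-letter DNA alphabet. The slide does not state the encoding, so the 2-bit packing below is an assumption, and packed_size_bytes is an illustrative name.

```python
def packed_size_bytes(n_bases, bits_per_base=2):
    """Bytes needed to store n_bases symbols at bits_per_base bits each.

    Assumption: 2 bits per base (A, C, G, T), rounded up to whole bytes.
    """
    total_bits = n_bases * bits_per_base
    return (total_bits + 7) // 8

if __name__ == "__main__":
    size = packed_size_bytes(3_000_000_000)
    print(size / 1e9, "GB")  # size of the raw sequence only, not the tree
```

This covers only the input sequence; the suffix tree built over it needs many bytes per base, which is why it no longer fits in main memory.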
WHAT TRELLIS SOLVES
  Prevents data skew in prefix partitioning
    Bad data skew with prefix partitioning leads to prefix partitions that may not fit into memory
    Skew comes from the non-uniform distribution of the alphabet (DNA)
  Efficient disk-based implementation
    Functions under low memory constraints
    Efficient disk I/O usage
  Able to recover suffix links
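The skew problem can be seen by counting how many suffixes fall under each fixed-length prefix. This is a minimal sketch on a made-up toy sequence; prefix_counts is an illustrative helper and not part of Trellis.

```python
from collections import Counter

def prefix_counts(text, k):
    """Count the suffixes of text that start with each length-k prefix."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

if __name__ == "__main__":
    # Toy input with a non-uniform base distribution (hypothetical data).
    counts = prefix_counts("AAAAAAACGT", 1)
    for prefix, freq in counts.most_common():
        print(prefix, freq)
```

With fixed-length prefixes, the partition for A here holds 7 of the 10 suffixes while the others hold 1 each, so one partition can outgrow memory even when the average partition is small.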
TRELLIS STEPS
  Prefix Creation Phase
  Partitioning Phase
  Merging Phase
  Suffix Link Recovery Phase (Optional)
TRELLIS OVERVIEW
MERGING PHASE
THRESHOLD (t)
  Determines partition of sequence
    Suffix subtree fits into memory during partitioning phase
  Determines cutoff for prefix set inclusion
    Recombined prefixed suffix subtree will fit entirely into memory during merging phase
  Allows input string and two sets of internal nodes to fit entirely into memory during suffix link recovery phase
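One way to read "cutoff for prefix set inclusion" is that a prefix is accepted once the number of suffixes starting with it is at most t, and is otherwise extended by one character. The sketch below follows that reading under stated assumptions: frequency means the count of suffixes with that prefix, the text ends in a unique terminator such as $, and the whole text fits in memory. It is a simplification for illustration, not the authors' implementation, and variable_length_prefixes is a hypothetical name.

```python
from collections import Counter

def variable_length_prefixes(text, t):
    """Return {prefix: frequency} with every frequency <= t.

    Assumes text ends with a unique terminator, so every suffix eventually
    becomes unique and the loop terminates for any t >= 1.
    """
    if t < 1:
        raise ValueError("t must be at least 1")
    chosen = {}
    pending = {""}  # prefixes that are still too frequent
    length = 0
    while pending:
        length += 1
        # Count one-character extensions of the too-frequent prefixes only.
        counts = Counter(
            text[i:i + length]
            for i in range(len(text) - length + 1)
            if text[i:i + length - 1] in pending
        )
        pending = set()
        for prefix, freq in counts.items():
            if freq <= t:
                chosen[prefix] = freq
            else:
                pending.add(prefix)
    return chosen

if __name__ == "__main__":
    print(variable_length_prefixes("ACGACG$", 2))
    print(variable_length_prefixes("ACGACG$", 1))
```

A smaller t yields more, longer prefixes, each covering fewer suffixes; every suffix is still covered by exactly one chosen prefix.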
PERFORMANCE
  O(n^2) time and O(n) space (where n is sequence length)
  Comparison to TDD
    Currently the only other algorithm that scales up to genome level
    Same time complexity
    Does not calculate suffix links
SUFFIX TREE CONSTRUCTION
QUERY TIMES
CONCLUSION
  Efficient disk-based suffix tree generation that works well with limited memory
  Suffix links are recoverable
  Future work
    Extend to larger alphabets
    Buffer input sequence
    Parallelize partitioning and merging