CONCURRENT BIOINFORMATICS SOFTWARE FOR DISCOVERING GENOME-WIDE PATTERNS AND WORD-BASED GENOMIC SIGNATURES Jens Lichtenberg, Kyle Kurz, Xiaoyu Liang, Rami Al-Ouran, Lev Neiman, Lee Nau, Joshua Welch, Edwin Jacox, Thomas Bitterman, Klaus Ecker, Laura Elnitski, Frank Drews, Stephen Lee, Lonnie Welch

Lichtenberg bosc2010 wordseeker

The WordSeeker Tool

• Enumeration: Suffix Tree and Suffix Array, Radix Tree
• Scoring
• Clustering: Sequence Clustering, Word Clustering
• Conservation Analysis: PhastCons Score Extraction
• Location Distributions
• Sequence Coverage: Min set of words necessary to cover all sequences
• Module Discovery: Enumerative
• Ranger Markup: Basic Functional Elements
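
The sequence coverage component on this slide looks for a minimum set of words that covers all sequences. Finding a true minimum is an instance of the NP-hard set cover problem, so a greedy heuristic is a common approximation. The slides do not say which method WordSeeker uses; the sketch below is only an illustration, and the function name `greedy_word_cover` and the toy data are assumptions.

```python
def greedy_word_cover(word_to_seqs):
    """Greedy set-cover heuristic (an assumption, not WordSeeker's documented method).

    word_to_seqs maps each word to the set of sequence ids that contain it.
    Returns a list of words whose sequence sets together cover every sequence
    that appears in the input.
    """
    uncovered = set().union(*word_to_seqs.values()) if word_to_seqs else set()
    chosen = []
    while uncovered:
        # Pick the word that covers the most still-uncovered sequences.
        # Iterating over sorted words makes tie-breaking deterministic.
        best = max(sorted(word_to_seqs),
                   key=lambda w: len(word_to_seqs[w] & uncovered))
        chosen.append(best)
        uncovered -= word_to_seqs[best]
    return chosen


# Hypothetical toy input: three words and the sequences they occur in.
example = {"ACG": {0, 1, 2}, "TTA": {2, 3}, "GGC": {3}}
cover = greedy_word_cover(example)
```

The greedy choice does not guarantee the smallest possible set, only a reasonably small one.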

Software Properties

• Google Code repository: http://code.google.com/p/word-seeker/
• GNU General Public License v3
• Doxygen code generator (internal documentation)
• SVN for command-line access: http://word-seeker.googlecode.com/svn/trunk

Requirements

• G++ compiler version 4.1* or higher
• OpenMP headers
• MPI environment (distributed version)
• For visualizations and other post-processing steps:
    • Perl 5.8.8, TFBS (http://tfbs.genereg.net/), Set::Scalar, LWP::Simple, Parallel::ForkManager, GD::Graph::bars, Algorithm::Cluster, Bio::SeqIO (all available through CPAN)
    • Gnuplot version 4.2 or higher

Need for a Scalable Approach

Word Enumeration Module

Represents a set of biological input sequences based on some data structure

Keeps track of words, word counts, sequence counts, and word locations

Need to keep the data persistent in memory
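
As a rough illustration of what the enumeration module keeps track of, the following minimal sketch uses a plain dictionary rather than the tree or array structures compared on the Enumeration Approaches slide. The function name `enumerate_words` and the record layout are assumptions made for this example, not WordSeeker's actual interface.

```python
from collections import defaultdict


def enumerate_words(sequences, word_length):
    """Index every word of the given length in a list of sequences.

    For each word this records the total count, the set of sequences
    that contain it, and every (sequence id, position) location.
    """
    index = defaultdict(lambda: {"count": 0, "sequences": set(), "locations": []})
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - word_length + 1):
            word = seq[pos:pos + word_length]
            entry = index[word]
            entry["count"] += 1
            entry["sequences"].add(seq_id)
            entry["locations"].append((seq_id, pos))
    return dict(index)


# Hypothetical toy input: two short sequences, words of length 2.
index = enumerate_words(["ACGAC", "GACG"], 2)
```

A dictionary like this grows with the number of distinct words, which is one reason the choice of data structure matters for genome-scale input.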

Word Scoring Module

• Determines statistical scores for each word
• Frequent lookups for words and substrings of words
• Example: a Markov order m model requires lookups for all substrings of up to length m for all words
• Keep space complexity low
• Keep time complexity for lookups low
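
To show why scoring needs so many substring lookups, here is a minimal sketch of a textbook estimator for the expected count of a word under a Markov model. The slides do not give WordSeeker's scoring formula, so the estimator, the naive counting helper, and the names `count_overlapping` and `expected_count` are all assumptions for illustration. For a word w of length L and order k, the estimate multiplies the counts of all (k+1)-length substrings of w and divides by the counts of the inner k-length substrings.

```python
def count_overlapping(text, word):
    """Count occurrences of word in text, allowing overlaps (naive scan)."""
    return sum(1 for i in range(len(text) - len(word) + 1)
               if text.startswith(word, i))


def expected_count(text, word, order):
    """Expected count of word under a Markov model of the given order.

    Textbook estimator (an assumption, not taken from the slides):
    product of counts of the (order+1)-mers of the word, divided by the
    product of counts of its inner order-mers.
    """
    length = len(word)
    if not 1 <= order <= length - 2:
        raise ValueError("order must satisfy 1 <= order <= len(word) - 2")
    numerator = 1.0
    for i in range(length - order):
        numerator *= count_overlapping(text, word[i:i + order + 1])
    denominator = 1.0
    for i in range(1, length - order):
        denominator *= count_overlapping(text, word[i:i + order])
    return numerator / denominator if denominator else 0.0


# Hypothetical toy input: N(AC) * N(CG) / N(C) = 3 * 3 / 3.
score = expected_count("ACGACGACG", "ACG", 1)
```

Every call touches several substrings of the word, so a scoring pass over all words multiplies the lookup cost of the underlying data structure.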

Enumeration Approaches

Total number of nucleotides in the input sequences: n

Word length: m

Radix Tree

• Time/space complexity: O(nm), dependent on m
• Time complexity for lookups: O(m), fast lookup

Suffix Tree

• Time/space complexity: O(n), independent of m, with a significant constant factor for space complexity
• Time complexity for lookups: O(m), fast lookup

Suffix Array

• Time/space complexity: O(n), with a much lower constant factor for space complexity
• Time complexity for lookups: O(log n), more expensive for lookups
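
The logarithmic lookup cost listed for the suffix array comes from binary search over the sorted suffixes. The sketch below illustrates that idea only: it builds the array by naive sorting, which is far slower than the linear-time constructions the complexity figures assume, and the function names are hypothetical. Each of the O(log n) comparisons inspects up to m characters of the query word.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (not linear time)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, word):
    """Count occurrences of word using two binary searches over the suffix array."""
    m = len(word)
    # Lower bound: first suffix whose m-character prefix is >= word.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < word:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > word.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= word:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


# Hypothetical toy input.
text = "ACGTACGA"
sa = build_suffix_array(text)
hits = count_occurrences(text, sa, "ACG")
```

All suffixes that start with the query word sit next to each other in the sorted array, so the count is simply the width of that range.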

Parallelization

Distributed Solution

• Tasks executed on different nodes
• Distributed memory

Multi-core Solution

• Tasks executed on different cores
• Shared memory

Parallel Software Properties

Shared Memory: OpenMP parallelization

• Simple, portable directives that compile even on unsupported architectures
• Simple loops are run in parallel on multiple processors

Distributed Memory: MPI parallelization

• Hardware optimizations and support for Fortran, C/C++, Perl
• Each node is provided a subset of the data to process
• "Smart" division of tasks is key
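
One way to picture a "smart" division of tasks is to split the words by their leading characters and balance the resulting groups across nodes. The Future Work slide mentions prefixes, but the slides do not describe the actual scheme, so the greedy balancing below and the names `prefix_loads` and `assign_prefixes` are assumptions for illustration. The sketch only computes the assignment; it does not run anything in parallel.

```python
from itertools import product


def prefix_loads(sequences, prefix_len):
    """Estimate the work per prefix as the number of positions starting with it."""
    loads = {"".join(p): 0 for p in product("ACGT", repeat=prefix_len)}
    for seq in sequences:
        for i in range(len(seq) - prefix_len + 1):
            prefix = seq[i:i + prefix_len]
            if prefix in loads:  # skips prefixes with ambiguous bases such as N
                loads[prefix] += 1
    return loads


def assign_prefixes(loads, num_nodes):
    """Greedy balancing: give the heaviest remaining prefix to the lightest node."""
    nodes = [{"prefixes": [], "load": 0} for _ in range(num_nodes)]
    for prefix, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(nodes, key=lambda node: node["load"])
        target["prefixes"].append(prefix)
        target["load"] += load
    return nodes


# Hypothetical toy input: one A-rich sequence split across two nodes.
loads = prefix_loads(["AAAAACGT"], 1)
plan = assign_prefixes(loads, 2)
```

In this toy case the single heavy prefix "A" occupies one node while the three light prefixes share the other, which shows why longer prefixes give the balancer more room to even out the load.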

Results

Analyzed the Arabidopsis thaliana genome

• All segments and the full genome
• Multiple word lengths (1-20)
• Searched top words against AGRIS (repository of known elements in A. thaliana)

Characterized the framework

• Speedup and runtime analysis
• Radix Trie and Suffix Tree

Memory Requirements for Arabidopsis thaliana

[Figure: Size Requirements of Arabidopsis thaliana Segments. x-axis: word length (1 to 20); y-axis: size (GB), 0 to 25. Series: Intronic Regions, 5' UTR, 3' UTR, Coding Sequences, Core Promoters, Proximal Promoters, Distal Promoters, Full Genome.]

Conducted at the Ohio Supercomputer Center

Execution Times for Arabidopsis thaliana

[Figure: Time Requirements of Arabidopsis thaliana Segments. x-axis: word length (1 to 20); y-axis: time (s), 0 to 30000. Series: Intronic Regions, 5' UTR, 3' UTR, Coding Sequences, Core Promoters, Proximal Promoters, Distal Promoters, Full Genome.]

Analyzing the Parallel System

Speedup, efficiency and timing using A. thaliana core promoter sequences.
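
For reference, speedup and efficiency are conventionally defined as the serial runtime divided by the parallel runtime, and the speedup divided by the number of workers. The sketch below uses those standard definitions; the timings in the example are invented for illustration and are not measurements from the talk.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup: how many times faster the parallel run is than the serial run."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, workers):
    """Efficiency: speedup divided by the number of cores or nodes used."""
    return speedup(serial_seconds, parallel_seconds) / workers


# Hypothetical timings: 100 s on one core, 25 s on eight cores.
example_speedup = speedup(100.0, 25.0)
example_efficiency = efficiency(100.0, 25.0, 8)
```

An efficiency of 1.0 would mean perfectly linear scaling; values below that indicate overhead from communication, load imbalance, or serial sections.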

Shared and Distributed Memory Speedup

[Figure panels: Radix Trie, Suffix Tree]

Shared and Distributed Memory Efficiency

[Figure panels: Radix Trie, Suffix Tree]

Shared and Distributed Memory Performance

[Figure panels: Radix Trie, Suffix Tree]

Scoring Speedup Contribution

[Figure: four panels plotting speedup (y-axis) against core counts 1->2, 1->4 and 1->8 (x-axis, Cores) for word lengths 5nt, 10nt, 20nt, 50nt and 100nt. Runtime panels: Radix Tree Runtime Speedups, Suffix Tree Runtime Speedups. Scoring panels: Radix Tree Scoring Speedups, Suffix Tree Scoring Speedups.]

Results: Pushing the limits

Summary

Parallel

• Shared memory on single nodes
• Distributed memory on 5 nodes

High-throughput

• Full genomes analyzed in under 5 hours
• Long word lengths
    • Genomes approaching 20
    • Smaller files often 100 or greater

Powerful analysis

• Detailed statistics
• Degeneracy via clustering
• Additional post-processing (scatter plots, logos, etc.)

Future Work

Post-processing

• Word distributions
• Sequence clustering
• GBrowse visualization

Further parallelization

• Within a node
• Greater distributed abstraction (more prefixes)

QUESTIONS?