BioFilter
An architecture for parallel deployment and dynamic chaining of standalone bioinformatics tools.
Master’s Thesis
Avinash Kewalramani
Outline
• Motivation.
• Abstract.
• Software Patterns and BioFilter.
• BioFilter architecture.
• Performance results.
• Future work.
• Conclusion.
Motivation
• Most bioinformatics tools are processor intensive.
• Few are capable of running on multi-processor machines.
• Almost none are capable of running on Beowulf clusters.
• Bioinformatics analysis is mostly based on multiple tools with their input and output chained together.
• Human intervention is needed to set up this chaining.
• BioFilter addresses both needs:
  – Easy and fast deployment of such tools on the cluster, providing a large gain in performance.
  – Creation of dynamic pipelines of different analysis tools as required.
Abstract
• Uses the “Pipes and Filters” pattern to set up dynamic pipelines.
• Uses the “Client-Dispatcher-Server” pattern for parallelization.
• BioFilter is not tool-specific; it is a plug-and-play environment.
• Original tools are not altered, so output is consistent with single-CPU machine runs.
• Independent of cluster hardware, operating system and queuing system.
• The operating system on the cluster should support multiprocessing via forks.
• BioFilter is implemented in object-oriented Perl.
• The cluster should have a shared disk system and Perl installed.
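Typical Annotation Pipelines
[Figure: annotation pipeline diagrams. Labels in the figure: Genome, Glimmer, tRNAscan, Blast, BlastParser, Prodom, Blocks, Prosite, Psort, Coils, Seg; and a second pipeline with FastaRecordsFile, Blast, BlastParser.]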
Software Patterns and BioFilter
• Patterns are based on a three-part schema
  – Context
  – Problem
  – Solution
• Pattern categories
  – Architectural
  – Design
Pipes and Filters pattern
• Context: Processing of data streams
• Problem
  – Specifications and Forces
    • Avoid monolithic system design.
    • Classes that perform a single transformation or analysis provide greater flexibility and reuse.
    • The system can be built by different people on the team.
    • Such classes can be combined dynamically to form pipelines.
    • A common interface provides transparent use.
Pipes and Filters pattern
• Solution
  – Divide the task into several sequential processing steps, with the data flowing through the system connecting these steps.
  – Components
    • Filters – Active/Passive
    • Pipes – Queues/method calls
    • Data Source – Provider
    • Data Sink – Consumer
Pipes and Filters pattern
[Figure: Static structure of a Pull Filter for Blast Pipeline]
Pipes and Filters pattern
[Figure: Dynamic structure of a Pull Filter for Blast Pipeline]
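The pull-filter idea can be sketched in a few lines. This is a minimal illustration in Python, not the BioFilter code (which is written in object-oriented Perl). The class names FastaSource, MinLengthFilter and UpperCaseFilter and the function collect are hypothetical. Each passive filter pulls records from its upstream neighbour only when the sink asks for data, and the pipe is a plain method call.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


class FastaSource:
    """Data source (provider): yields records parsed from FASTA text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __iter__(self) -> Iterator[Record]:
        header, chunks = None, []
        for line in self.text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)


class MinLengthFilter:
    """Passive pull filter: passes on records with at least min_len residues."""

    def __init__(self, upstream, min_len: int) -> None:
        self.upstream = upstream
        self.min_len = min_len

    def __iter__(self) -> Iterator[Record]:
        for header, seq in self.upstream:  # pull from the previous stage
            if len(seq) >= self.min_len:
                yield header, seq


class UpperCaseFilter:
    """Passive pull filter: a single transformation, upper-casing sequences."""

    def __init__(self, upstream) -> None:
        self.upstream = upstream

    def __iter__(self) -> Iterator[Record]:
        for header, seq in self.upstream:
            yield header, seq.upper()


def collect(pipeline) -> List[Record]:
    """Data sink (consumer): drives the pipeline by pulling every record."""
    return list(pipeline)
```

Because every stage exposes the same iteration interface, stages can be recombined at run time, for example `collect(UpperCaseFilter(MinLengthFilter(FastaSource(text), 5)))`.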
Client-Dispatcher-Server Pattern
• Context: A software system integrating a set of distributed servers, with the servers running locally or distributed over a network.
• Problem
  – Specification and Forces
    • A software system that uses servers distributed over a network must provide a means of communication between them.
    • Functional code should be separated from communication code.
    • Service providers should be location independent.
Client-Dispatcher-Server Pattern
• Solution
  – Components
    • Client
      – Carries out domain-specific tasks.
      – Requests the dispatcher to set up a communication channel with the server.
    • Server
      – Provides a set of operations to the clients.
      – Registers with the dispatcher.
    • Dispatcher
      – Provides location transparency for the servers.
Client-Dispatcher-Server pattern
[Figure: Static structure for Blast Filter]
Client-Dispatcher-Server pattern
[Figure: Dynamic structure for Blast Filter]
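The three roles can be illustrated with an in-process sketch. It is written in Python for illustration only and does not reproduce the BioFilter implementation: the classes Dispatcher, Server and Client and their methods are hypothetical, and a real deployment would hand back a network channel rather than a direct object reference. The point shown is that the client names a service and never learns where the server lives.

```python
from typing import Callable, Dict


class Dispatcher:
    """Keeps the name-to-server registry, giving clients location transparency."""

    def __init__(self) -> None:
        self._registry: Dict[str, "Server"] = {}

    def register(self, name: str, server: "Server") -> None:
        self._registry[name] = server

    def connect(self, name: str) -> "Server":
        """Return a channel to the named server (here simply the object)."""
        try:
            return self._registry[name]
        except KeyError:
            raise LookupError(f"no server registered for {name!r}") from None


class Server:
    """Provides one operation and registers itself with the dispatcher."""

    def __init__(self, name: str, operation: Callable[[str], str],
                 dispatcher: Dispatcher) -> None:
        self._operation = operation
        dispatcher.register(name, self)

    def run(self, payload: str) -> str:
        return self._operation(payload)


class Client:
    """Carries out the domain task; asks the dispatcher for a server."""

    def __init__(self, dispatcher: Dispatcher) -> None:
        self._dispatcher = dispatcher

    def request(self, service: str, payload: str) -> str:
        server = self._dispatcher.connect(service)
        return server.run(payload)
```

Swapping a server for another implementation only requires registering the new one under the same name; client code stays unchanged.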
BioFilter Architecture
Framework
• Splitting of the input data set and the database
  – The database splitting model is cumbersome.
  – A monolithic shared-disk database with input data splitting is easier.
  – Capacity computing model
    • Exclusivity of partitions of the input data set.
    • Capability to process one input partition at a time.
  – Synchronization of results is an issue.
• Queuing System
  – Interaction with the queuing system is part of the architecture.
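Input data splitting into exclusive partitions can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the thesis code: the function names read_fasta_records and partition are hypothetical, and round-robin assignment is just one simple way to guarantee that every record lands in exactly one partition.

```python
from typing import List


def read_fasta_records(text: str) -> List[str]:
    """Return each FASTA record (header plus sequence lines) as one string."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def partition(records: List[str], n_parts: int) -> List[List[str]]:
    """Assign records round-robin to n_parts exclusive partitions."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts: List[List[str]] = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return parts
```

Each partition can then be handed to one server, which runs the unmodified tool against the shared database; the per-partition results still have to be merged afterwards, which is the synchronization issue noted above.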
BioFilter Architecture
[Figure: Static structure, Blast Pipeline]
BioFilter Architecture
[Figure: Dynamic structure, Blast Pipeline]
BioFilter Architecture
• Crash Recovery
  – TCP/IP-based socket communication is blocking.
  – Timed sockets overcome blocking.
  – The wait time is set before waiting begins, but it is computed dynamically from the computational intensity of the job submitted in the communication between components.
• Interaction with nodes outside the cluster is possible by starting the server manually or by using rsh utilities.
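A timed socket with a job-dependent timeout might look like the sketch below. It is an illustration in Python using the standard socket module, not the BioFilter implementation. The function names and the linear timeout rule (a base wait plus a per-residue allowance) are assumptions made for the example; the slides do not give the actual formula.

```python
import socket
from typing import Optional


def compute_timeout(n_residues: int, base_seconds: float = 5.0,
                    seconds_per_residue: float = 0.01) -> float:
    """Hypothetical rule: heavier jobs are given a longer wait."""
    return base_seconds + seconds_per_residue * n_residues


def recv_with_timeout(sock: socket.socket, timeout: float,
                      bufsize: int = 4096) -> Optional[bytes]:
    """Receive data, or return None if nothing arrives within the timeout.

    An empty bytes object still means the peer closed the connection.
    """
    sock.settimeout(timeout)  # turns the blocking recv into a timed one
    try:
        return sock.recv(bufsize)
    except socket.timeout:
        return None
```

A caller that gets None can treat the peer as crashed and resubmit the partition elsewhere instead of blocking forever.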
Filter Variants
• Simple Filters
• Split Filters
• Join Filters
[Figure: example pipeline. Labels in the figure: FastaRecordsFile (Concrete Source), BuildImmFilter, Fasta2TblFilter, OrfFilter, TranslationFilter, GlimmerFilter; filter types marked Split, Simple, Join, Simple, Split.]
Performance Results
• 240-node cluster; each node is a 1200 MHz Pentium III processor.
• 20 GB local disk and memory ranging from 1 to 2 GB.
• Shared disk via a NetApp disk server.
• Speedup ratio for x nodes = (time to run the job on the initial number of nodes) / (time to run the job on x nodes)
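The speedup ratio defined above can be computed as in the following Python sketch. The timings are hypothetical placeholders chosen to show the arithmetic; they are not measurements from the benchmarks.

```python
def speedup_ratio(time_initial: float, time_x: float) -> float:
    """Time on the initial number of nodes divided by time on x nodes."""
    if time_initial <= 0 or time_x <= 0:
        raise ValueError("run times must be positive")
    return time_initial / time_x


# Hypothetical wall-clock times in seconds, keyed by number of nodes.
timings = {1: 1200.0, 10: 150.0, 100: 40.0}
initial_nodes = min(timings)
speedups = {nodes: speedup_ratio(timings[initial_nodes], t)
            for nodes, t in timings.items()}
```

With these made-up numbers, 10 nodes give a ratio of 8 and 100 nodes give a ratio of 30, which is the kind of less-than-proportional growth the ratio is meant to expose.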
Performance Results
Benchmark Test 1
• 1000 protein sequences from bacteria (average length 287 AA) searched with Blast against the NR database.
• Number of servers ranged from 1 to 200.
• Non-linear increase in performance
  – Shared database bottleneck
  – Single dispatcher
Performance Results
Benchmark Test 2 – Data Scalability
Performance Results
Benchmark Test 3 – It’s not just a Blast accelerator
Performance Results
Benchmark Test 3
Future Work
• Tests are required to analyse and cure the performance bottleneck.
• Forking the pipeline for added parallelism requires a queue-type data structure.
Conclusion
Advantages
• Eliminates intermediate files and provides filter reuse and rapid prototyping of pipelines through filter recombination and exchange.
• Provides an environment for easy deployment of embarrassingly parallel tools on Beowulf clusters.
• Gains in both development time and run-time performance.
• Encapsulates the queuing system information in the implementation.
• Supports interaction with nodes outside the cluster.
• Provides exchangeability of servers, location and migration transparency, reconfiguration of servers and fault tolerance.
Conclusion
Disadvantages
• Speed increase is not proportional to the number of nodes.
• Absence of a GUI-based interface.
• Architecture not tested with tools where the input data cannot be split.
• Error handling is very difficult.
• Pipeline disaster recovery is difficult.
• Client-Dispatcher-Server pattern makes the architecture slower.
Thanks and Possibilities
• Advisors
  – Dr Sun Kim (Indiana University, School of Informatics)
  – Thomas Brettin (Los Alamos National Labs)
• Clients or Open Source
  – California Digital
  – Rocketcalc
  – Microway
  – Western Scientific
References
• S. Salzberg, A. Delcher, S. Kasif, and O. White, "Microbial gene identification using interpolated Markov models", Nucleic Acids Research 26:2 (1998), 544-548.
• A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg, "Improved microbial gene identification with GLIMMER", Nucleic Acids Research 27:23, 4636-4641.
• T.M. Lowe and S.R. Eddy, "tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence", Nucl. Acids Res. 25 (1997), 955-964.
• S.F. Altschul et al., "Basic local alignment search tool", J Mol Bio. 215 (1990), 403-410.
• K. Nakai and M. Kanehisa, "Expert system for predicting protein localization sites in gram-negative bacteria", Proteins 11(2) (1991), 95-110.
• A. Lupas, M. Van Dyke, and J. Stock, "Predicting Coiled Coils from Protein Sequences", Science 252:1162-1164.
• J.C. Wootton and S. Federhen, "Statistics of local complexity in amino acid sequences and sequence databases", Computers in Chemistry 17 (1993), 149-163.
• J.C. Wootton and S. Federhen, "Analysis of compositionally biased regions in sequence databases", Methods in Enzymology 266 (1996), 554-571.
• F. Servant, C. Bru, S. Carrère, E. Courcelle, J. Gouzy, D. Peyruc, and D. Kahn, "ProDom: Automated clustering of homologous domains", Briefings in Bioinformatics 3(3) (2002), 246-251.
• J.G. Henikoff, E.A. Greene, S. Pietrokovski, and S. Henikoff, "Increased coverage of protein families with the blocks database servers", Nucl. Acids Res. 28 (2000), 228-230.
• S. Henikoff, J.G. Henikoff, and S. Pietrokovski, "Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations", Bioinformatics 15(6) (1999), 471-479.
• C.J. Sigrist, L. Cerutti, N. Hulo, A. Gattiker, L. Falquet, M. Pagni, A. Bairoch, and P. Bucher, "PROSITE: a documented database using patterns and profiles as motif descriptors", Brief Bioinform. 3 (2002), 265-274.
• Mark Grand, "Patterns in Java, Volume 1", Second Edition, Wiley Publication Inc, 2002.
• Frank Buschmann, Regine Meunier, Hans Rohnert, Peter Sommerlad, and Michael Stal, "Pattern Oriented Software Architecture Volume 1: A System of Patterns", Chichester, England: John Wiley and Sons, 1996.