BioFilter: An architecture for parallel deployment and dynamic chaining of standalone bioinformatics tools. Master’s Thesis, Avinash Kewalramani


Outline

• Motivation.

• Abstract.

• Software Patterns and BioFilter.

• BioFilter architecture.

• Performance results.

• Future work.

• Conclusion.

Motivation

• Most bioinformatics tools are processor intensive.
• Few can run on multi-processor machines.
• Almost none can run on Beowulf clusters.
• Bioinformatics analysis mostly relies on multiple tools with their input/output chained together.
• Setting up this chaining requires human intervention.
• BioFilter provides the two missing capabilities:
  – Easy and fast deployment of such tools on the cluster, providing a large gain in performance.
  – Creation of dynamic pipelines of different analysis tools as required.

Abstract

• Uses the “Pipes and Filters” pattern to set up dynamic pipelines.
• Uses the “Client-Dispatcher-Server” pattern for parallelization.
• BioFilter is not tool-specific; it is a plug-and-play environment.
• The original tools are not altered, so output is consistent with runs on a single-CPU machine.
• Independent of cluster hardware, operating system, and queuing system.
• The operating system on the cluster must support multiprocessing via fork.
• BioFilter is implemented in object-oriented Perl.
• The cluster must have a shared disk system and Perl installed.

Typical Annotation Pipelines

[Figure: typical annotation pipelines. Labels in the diagram: FastaRecordsFile, Blast, BlastParser, Genome, Glimmer, tRNAscan, Psort, Coils, Seg, Prodom, Blocks, Prosite.]

Software Patterns and BioFilter

• Patterns are based on a three-part schema
  – Context
  – Problem
  – Solution
• Pattern categories
  – Architectural
  – Design

Pipes and Filters pattern

• Context: processing of data streams
• Problem
  – Specifications and forces
    • Avoid a monolithic system design.
    • Classes that each perform a single transformation or analysis provide greater flexibility and reuse.
    • The system can be built by different people on the team.
    • Such classes can be combined dynamically to form pipelines.
    • A common interface provides transparent use.

Pipes and Filters pattern

• Solution
  – Divide the task into several sequential processing steps, with the data flowing through the system connecting these steps.
  – Components
    • Filters: active or passive
    • Pipes: queues or method calls
    • Data source: provider
    • Data sink: consumer
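The components listed above can be sketched as a small pull pipeline. This is a minimal illustration in Python, not the thesis's Perl implementation; the class names `FastaSource`, `Filter`, `UppercaseFilter`, and `LengthFilter` and the function `run_pipeline` are hypothetical and chosen only for this example. Here the pipes are plain method calls (iteration), and each filter is passive: it fetches a record from upstream only when its own consumer asks for one.

```python
from typing import Iterable, Iterator, Tuple


class FastaSource:
    """Data source (provider): yields (header, sequence) records from FASTA text."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __iter__(self) -> Iterator[Tuple[str, str]]:
        header, chunks = None, []
        for line in self.text.splitlines():
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)


class Filter:
    """Passive (pull) filter: reads from upstream only when iterated."""

    def __init__(self, upstream: Iterable) -> None:
        self.upstream = upstream

    def transform(self, record):
        raise NotImplementedError

    def __iter__(self):
        for record in self.upstream:
            yield self.transform(record)


class UppercaseFilter(Filter):
    """Single transformation: normalise the sequence to upper case."""

    def transform(self, record):
        header, sequence = record
        return header, sequence.upper()


class LengthFilter(Filter):
    """Single analysis: replace the sequence with its length."""

    def transform(self, record):
        header, sequence = record
        return header, len(sequence)


def run_pipeline(fasta_text: str):
    """Data sink (consumer): pulls every record through the chained filters."""
    pipeline = LengthFilter(UppercaseFilter(FastaSource(fasta_text)))
    return list(pipeline)
```

Because every filter exposes the same interface, filters can be recombined or exchanged without touching the source or the sink, which is the property the pattern is chosen for.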

Pipes and Filters pattern

[Figure: static structure of a pull filter for the Blast pipeline]

Pipes and Filters pattern

[Figure: dynamic structure of a pull filter for the Blast pipeline]

Client-Dispatcher-Server Pattern

• Context: A software system integrating a set of distributed servers, with the servers running locally or distributed over a network.

• Problem
  – Specification and forces
    • A software system that uses servers distributed over a network must provide a means of communication between them.
    • Functional code should be separated from communication code.
    • Service providers should be location independent.

Client-Dispatcher-Server Pattern

• Solution
  – Components
    • Client
      – Carries out domain-specific tasks.
      – Requests the dispatcher to set up a communication channel with the server.
    • Server
      – Provides a set of operations to the clients.
      – Registers with the dispatcher.
    • Dispatcher
      – Provides location transparency for the servers.
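The three roles above can be sketched in a few lines. This is a simplified, in-process illustration in Python (no network code), not the BioFilter implementation; the names `Dispatcher`, `Server`, `Client`, `register`, `establish_channel`, and `run_service` are assumptions made for the example. The point it shows is that the client names a service and never a location, so a server can move without any change to client code.

```python
class Dispatcher:
    """Maps service names to registered servers (location transparency)."""

    def __init__(self):
        self._registry = {}

    def register(self, service_name, server):
        # A later registration under the same name replaces the earlier one.
        self._registry[service_name] = server

    def establish_channel(self, service_name):
        try:
            return self._registry[service_name]
        except KeyError:
            raise LookupError(f"no server registered for {service_name!r}")


class Server:
    """Provides an operation to clients and registers with the dispatcher."""

    def __init__(self, service_name, location):
        self.service_name = service_name
        self.location = location

    def register_with(self, dispatcher):
        dispatcher.register(self.service_name, self)

    def run_service(self, request):
        return f"{self.service_name}@{self.location} handled {request}"


class Client:
    """Carries out domain tasks; asks the dispatcher for a channel by name."""

    def __init__(self, dispatcher):
        self.dispatcher = dispatcher

    def do_task(self, service_name, request):
        server = self.dispatcher.establish_channel(service_name)
        return server.run_service(request)
```

In a real deployment the object returned by `establish_channel` would be a socket or other connection rather than the server object itself; the direct reference is used here only to keep the sketch self-contained.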

Client-Dispatcher-Server pattern

[Figure: static structure for the Blast filter]

Client-Dispatcher-Server pattern

[Figure: dynamic structure for the Blast filter]

BioFilter Architecture: Framework

• Splitting of the input data set and the database
  – The database-splitting model is cumbersome.
  – A monolithic shared-disk database with input data splitting is easier.
  – Capacity computing model
    • Partitions of the input data set are exclusive.
    • Each server can process one input partition at a time.
  – Synchronization of results is an issue.
• Queuing system
  – Interaction with the queuing system is part of the architecture.
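Input splitting and result synchronization can be illustrated with a short Python sketch. It assumes (this is not stated in the slides) that each partition carries an index and that results are reassembled by sorting on that index; the function names `split_records` and `merge_results` are hypothetical.

```python
def split_records(records, n_parts):
    """Split records into at most n_parts contiguous, non-overlapping partitions.

    Returns a list of (partition_index, records) pairs; empty partitions are
    omitted when there are fewer records than requested parts.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    size, extra = divmod(len(records), n_parts)
    parts, start = [], 0
    for index in range(n_parts):
        end = start + size + (1 if index < extra else 0)
        if end > start:
            parts.append((index, records[start:end]))
        start = end
    return parts


def merge_results(indexed_results):
    """Reassemble per-partition results in the original input order.

    Servers may finish in any order, so results are sorted by partition index.
    """
    merged = []
    for _, result in sorted(indexed_results, key=lambda pair: pair[0]):
        merged.extend(result)
    return merged
```

For example, five records split two ways give partitions of three and two records. If the second partition finishes first, `merge_results` still returns the output in input order, matching what a single-CPU run would produce.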

BioFilter Architecture

[Figure: static structure of the Blast pipeline]

BioFilter Architecture

[Figure: dynamic structure of the Blast pipeline]

BioFilter Architecture

• Crash recovery
  – TCP/IP socket communication is blocking.
  – Timed sockets overcome blocking.
  – The wait time is fixed before each exchange between components, but it is computed dynamically from the computational intensity of the job submitted in that exchange.
• Interaction with nodes outside the cluster is possible by starting the server manually or through rsh utilities.
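The timed-socket idea can be sketched with Python's standard `socket` module. The slides do not give the formula used to derive the wait time, so `compute_timeout` below (a base wait plus a per-residue allowance) is purely an assumed placeholder, and both function names are hypothetical.

```python
import socket


def compute_timeout(n_residues, base_seconds=5.0, seconds_per_residue=0.01):
    """Assumed heuristic: larger jobs are allowed a longer wait before giving up."""
    return base_seconds + seconds_per_residue * n_residues


def receive_or_give_up(sock, timeout_seconds, max_bytes=4096):
    """Wait for a reply for at most timeout_seconds.

    Returns the received bytes, or None if the peer did not answer in time,
    so the caller can treat the peer as crashed instead of blocking forever.
    """
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(max_bytes)
    except socket.timeout:
        return None
```

A dispatcher that gets `None` back could, for instance, resubmit the partition to another server; that recovery policy is a design choice outside this sketch.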

Filter Variants

• Simple filters.
• Split filters.
• Join filters.

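[Figure: a pipeline assembled from filter variants. Components in the diagram: FastaRecordsFile (concrete source), BuildImmFilter, Fasta2TblFilter, OrfFilter, TranslationFilter, GlimmerFilter. Variant labels in the diagram: Simple, Split, Join.]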
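The slides name the three filter variants without defining them. The Python sketch below assumes the usual reading: a simple filter has one input and one output, a split filter fans one input stream out to several consumers, and a join filter combines several input streams into one. The function names `simple`, `split`, and `join` are hypothetical, and the code uses `itertools.tee` and `zip` from the standard library rather than anything from BioFilter.

```python
import itertools


def simple(stream, transform):
    """Simple filter: one input stream, one output stream."""
    return (transform(item) for item in stream)


def split(stream, n_outputs):
    """Split filter: one input stream, n independent copies for downstream filters."""
    return itertools.tee(stream, n_outputs)


def join(*streams):
    """Join filter: several input streams combined item by item into one stream."""
    return zip(*streams)
```

As a usage example, a source of sequences can be split in two, each branch passed through a different simple filter (say `len` and `str.upper`), and the branches joined again into one stream of `(length, uppercase_sequence)` pairs.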

Performance Results

• 240-node cluster; each node has a 1200 MHz Pentium III processor.
• 20 GB local disk and memory ranging from 1 to 2 GB.
• Shared disk via a NetApp disk server.
• Speedup ratio for x nodes = (time to run the job on the initial number of nodes) / (time to run the job on x nodes).
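The speedup ratio defined above is a single division. The Python sketch below uses the hypothetical function name `speedup_ratio`, and the timings in the usage note are invented for illustration, not measurements from the thesis.

```python
def speedup_ratio(baseline_seconds, seconds_on_x_nodes):
    """Speedup for x nodes: baseline run time divided by run time on x nodes."""
    if seconds_on_x_nodes <= 0:
        raise ValueError("run time must be positive")
    return baseline_seconds / seconds_on_x_nodes
```

With made-up timings of 1200 seconds on the initial number of nodes and 300 seconds on x nodes, `speedup_ratio(1200, 300)` gives 4.0. A ratio that grows more slowly than the node count is what the slides later describe as a non-linear increase in performance.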

Performance Results: Benchmark Test 1

• 1000 bacterial protein sequences (average length 287 AA) run through Blast against the NR database.
• Number of servers ranging from 1 to 200.
• Non-linear increase in performance
  – Shared database bottleneck
  – Single dispatcher

Performance Results: Benchmark Test 2, data scalability

Performance Results: Benchmark Test 3, it’s not just a Blast accelerator

Performance Results: Benchmark Test 3

Future Work

• Tests are required to analyse and cure the performance bottleneck.
• Forking the pipeline for added parallelism requires a queue-type data structure.

Conclusion: Advantages

• Eliminates intermediate files and provides filter reuse and rapid prototyping of pipelines through filter recombination and exchange.
• Provides an environment for easy deployment of embarrassingly parallel tools on Beowulf clusters.
• Gains in both development time and run-time performance.
• Encapsulates the queuing system information in the implementation.
• Allows interaction with nodes outside the cluster.
• Provides exchangeability of servers, location and migration transparency, reconfiguration of servers, and fault tolerance.

Conclusion: Disadvantages

• The speed increase is not proportional to the number of nodes.
• There is no GUI-based interface.
• The architecture has not been tested with tools whose input data cannot be split.
• Error handling is very difficult.
• Pipeline disaster recovery is difficult.
• The Client-Dispatcher-Server pattern makes the architecture slower.

Thanks and Possibilities

• Advisors
  – Dr. Sun Kim (Indiana University School of Informatics)
  – Thomas Brettin (Los Alamos National Laboratory)
• Clients or open source
  – California Digital
  – Rocketcalc
  – Microway
  – Western Scientific

References

• S. Salzberg, A. Delcher, S. Kasif, and O. White. "Microbial gene identification using interpolated Markov models." Nucleic Acids Research 26(2):544-548 (1998).
• A.L. Delcher, D. Harmon, S. Kasif, O. White, and S.L. Salzberg. "Improved microbial gene identification with GLIMMER." Nucleic Acids Research 27(23):4636-4641.
• T.M. Lowe and S.R. Eddy. "tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence." Nucleic Acids Research 25:955-964 (1997).
• S.F. Altschul et al. "Basic local alignment search tool." J. Mol. Biol. 215:403-410 (1990).
• K. Nakai and M. Kanehisa. "Expert system for predicting protein localization sites in gram-negative bacteria." Proteins 11(2):95-110 (1991).
• A. Lupas, M. Van Dyke, and J. Stock. "Predicting Coiled Coils from Protein Sequences." Science 252:1162-1164.
• J.C. Wootton and S. Federhen. "Statistics of local complexity in amino acid sequences and sequence databases." Computers in Chemistry 17:149-163 (1993).
• J.C. Wootton and S. Federhen. "Analysis of compositionally biased regions in sequence databases." Methods in Enzymology 266:554-571 (1996).
• F. Servant, C. Bru, S. Carrère, E. Courcelle, J. Gouzy, D. Peyruc, and D. Kahn. "ProDom: Automated clustering of homologous domains." Briefings in Bioinformatics 3(3):246-251 (2002).
• J.G. Henikoff, E.A. Greene, S. Pietrokovski, and S. Henikoff. "Increased coverage of protein families with the blocks database servers." Nucleic Acids Research 28:228-230 (2000).
• S. Henikoff, J.G. Henikoff, and S. Pietrokovski. "Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations." Bioinformatics 15(6):471-479 (1999).
• C.J. Sigrist, L. Cerutti, N. Hulo, A. Gattiker, L. Falquet, M. Pagni, A. Bairoch, and P. Bucher. "PROSITE: a documented database using patterns and profiles as motif descriptors." Briefings in Bioinformatics 3:265-274 (2002).
• Mark Grand. "Patterns in Java, Volume 1." Second Edition. Wiley, 2002.
• Frank Buschmann, Regine Meunier, Hans Rohnert, Peter Sommerlad, and Michael Stal. "Pattern-Oriented Software Architecture, Volume 1: A System of Patterns." Chichester, England: John Wiley and Sons, 1996.
