
Cloud Technologies for Bioinformatics Applications

Jaliya Ekanayake, Student Member, IEEE, Thilina Gunarathne, Student Member, IEEE, and Judy Qiu

Abstract—Executing a large number of independent jobs, or jobs comprising a large number of tasks that perform minimal intertask communication, is a common requirement in many domains. Various technologies ranging from classic job schedulers to the latest cloud technologies such as MapReduce can be used to execute these "many tasks" in parallel. In this paper, we present our experience in applying two cloud technologies, Apache Hadoop and Microsoft DryadLINQ, to two bioinformatics applications with the above characteristics. The applications are a pairwise Alu sequence alignment application and an Expressed Sequence Tag (EST) sequence assembly program. First, we compare the performance of these cloud technologies using the above applications and also compare them with a traditional MPI implementation in one application. Next, we analyze the effect of inhomogeneous data on the scheduling mechanisms of the cloud technologies. Finally, we present a comparison of the performance of the cloud technologies under virtual and nonvirtual hardware platforms.

Index Terms—Distributed programming, parallel systems, performance, programming paradigms.


1 INTRODUCTION

THERE is an increasing interest in approaches to data analysis in scientific computing as essentially every field is seeing an exponential increase in the size of the data deluge. The data sizes imply that parallelism is essential to process the information in a timely fashion. This is generating justified interest in new runtimes and programming models that, unlike traditional parallel models (such as MPI), directly address the data-specific issues. Experience has shown that the initial (and often most time-consuming) parts of data analysis are naturally data parallel and the processing can be made independent with perhaps some collective (reduction) operation. This structure has motivated the important MapReduce [1] paradigm and many follow-on extensions. Here, we examine two technologies (Microsoft Dryad/DryadLINQ [2], [3] and Apache Hadoop [4]) on two different bioinformatics applications (EST [5], [6] and Alu clustering [7], [8]), and for the latter we also present details of an MPI implementation. Dryad is an implementation of extended MapReduce from Microsoft. All the applications are (as is often so in biology) "doubly data parallel" (or "all pairs" [9]) as the basic computational unit is replicated over all pairs of data items from the same (in our cases) or different data sets. In the EST example, each parallel task executes the CAP3 program on an input data file independently of others and there is no "reduction" or "aggregation" necessary at the end of the computation. On the other hand, in the Alu case, a global aggregation is necessary at the end of the independent computations to produce the resulting dissimilarity matrix. In this paper, we evaluate different technologies, showing that they give comparable performance despite their different programming models.

1.1 Cloud and Cloud Technologies

Cloud and cloud technologies are two broad categories of technologies related to the general notion of Cloud Computing. By "Cloud," we refer to a collection of infrastructure services such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), etc., provided by various organizations, where virtualization usually plays a key role. By "Cloud Technologies," we refer to various cloud runtimes such as Hadoop, Dryad, and other MapReduce frameworks, and also to storage and communication frameworks such as the Hadoop Distributed File System (HDFS), Amazon S3, etc. Although we name them cloud technologies, runtimes such as Hadoop and Dryad can be run on local resources as well.

1.2 Relevance of MapReduce to Many-Task Computing

The MapReduce programming model comprises two computation steps (map/reduce) and an intermediate data shuffling step, which can be used to implement many parallel applications. The three steps collectively provide functionality beyond the requirements of simple many-task computations (MTC). However, this does not hinder the usability of MapReduce for MTC applications. For example, when only a map operation is used, the MapReduce programming model reduces to a "map-only" version that is an ideal match for MTC applications. Furthermore, with the capability of adding a "reduction" operation, MapReduce can be used to collect or merge the results of the embarrassingly parallel phase of some MTC applications as well. Our research on applying MapReduce to these computations was motivated by these observations.
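As an illustration of the "map-only" reduction of the model, the following sketch shows a Hadoop job that disables the reduce stage by setting the number of reduce tasks to zero. The class and path names are hypothetical, the per-record work is a placeholder, and the sketch (written against the newer Hadoop MapReduce API) is not taken from the paper's implementations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical map-only job: each map task processes one input record
// independently; no shuffle or reduce stage is executed.
public class MapOnlyJob {

  public static class TaskMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      // Place the independent per-task computation here.
      context.write(new Text(key.toString()), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only-mtc");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(TaskMapper.class);
    job.setNumReduceTasks(0);           // "map-only": skip shuffle and reduce
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}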


The authors are with the School of Informatics and Computing and the Pervasive Technology Institute, Indiana University, Bloomington, IN 47408. E-mail: {jekanaya, tgunarat}@cs.indiana.edu, [email protected].

Manuscript received 5 Jan. 2010; revised 3 July 2010; accepted 20 Aug. 2010; published online 18 Oct. 2010. Recommended for acceptance by I. Raicu, I.T. Foster, and Y. Zhao. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDSSI-2010-01-0021. Digital Object Identifier no. 10.1109/TPDS.2010.178.



In Section 2, we give a brief introduction to the two cloud technologies we used, while the applications EST and Alu sequencing are discussed in Section 3. Section 4 presents some performance results. Conclusions are given in Section 7 after a discussion of the different programming models in Section 5 and related work in Section 6.

2 CLOUD TECHNOLOGIES

2.1 Dryad/DryadLINQ

Dryad is a distributed execution engine for coarse-grain data parallel applications. It combines the MapReduce programming style with data-flow graphs to solve computation tasks. Dryad considers a computation as a directed acyclic graph (DAG) where the vertices represent computation tasks and the edges act as communication channels over which the data flow from one vertex to another. The data are stored in (or partitioned to) local disks via Windows shared directories and metadata files, and Dryad schedules the execution of vertices depending on the data locality. (Note: the academic release of Dryad only exposes the DryadLINQ API for programmers [3], [10]. Therefore, all our implementations are written using DryadLINQ, although it uses Dryad as the underlying runtime.) Dryad also stores the output of vertices on local disks, and the other vertices which depend on these results access them via the shared directories. This enables Dryad to reexecute failed vertices, a step which improves the fault tolerance of the programming model. The current academic release of DryadLINQ runs only on the Windows HPC Server 2008 operating system.

2.2 Apache Hadoop

Apache Hadoop has an architecture similar to Google's MapReduce [1] runtime. Hadoop accesses data via HDFS [4], which maps all the local disks of the compute nodes to a single file system hierarchy, allowing the data to be dispersed across all the data/computing nodes. HDFS also replicates the data on multiple nodes so that failures of nodes containing a portion of the data will not affect the computations which use that data. Hadoop schedules the MapReduce computation tasks depending on the data locality, improving the overall I/O bandwidth. The outputs of the map tasks are stored on local disks until the reduce tasks access them (pull) via HTTP connections. Although this approach simplifies the fault handling mechanism in Hadoop, it adds a significant communication overhead to the intermediate data transfers, especially for applications that produce small intermediate results frequently. Apache Hadoop runs only on Linux operating systems.

3 APPLICATIONS

3.1 Alu Sequence Classification

The Alu clustering problem [8] is one of the most challenging problems for sequence clustering because Alus represent the largest repeat family in the human genome. There are about one million copies of Alu sequences in the human genome, most of whose insertions can be found in other primates and only a small fraction (~7,000) of which is human-specific. This indicates that the classification of Alu repeats can be deduced solely from the one million human Alu elements. Alu clustering can be viewed as a classical case study for the capacity of computational infrastructures because it is not only of great intrinsic biological interest, but also a problem of a scale that will remain as the upper limit of many other clustering problems in bioinformatics for the next few years, such as the automated protein family classification for millions of proteins. In our previous works, we have examined Alu samples of 35,339 and 50,000 sequences using the pipeline of Fig. 1.

3.1.1 Complete Alu Application

This application uses two highly parallel traditional MPI applications, i.e., the Multidimensional Scaling (MDS) and Pairwise (PW) Clustering algorithms described in Fox et al. [7]. The latter identifies relatively isolated sequence families, as shown in the example depicted in Fig. 2. MDS allows visualization by mapping the high-dimensional sequence data to a lower dimension (in our case, 3). MDS finds the best set of 3D vectors x(i) such that a weighted least-squares sum of the difference between the sequence dissimilarity D(i,j) and the Euclidean distance |x(i) - x(j)| is minimized. This has a computational complexity of O(N²) to find 3N unknowns for N sequences.

The PWClustering algorithm is an efficient MPI parallelization of a robust Expectation Maximization (EM) method using annealing (deterministic, not Monte Carlo) originally developed by Ken Rose, Fox [14], [15], and others [16].


Fig. 1. Pipeline for analysis of sequence data.

Fig. 2. Display of Alu clusters from MDS and clustering calculation from 35,339 sequences using SW-G distances. The clusters corresponding to younger families AluYa, AluYb are particularly tight.


This method improves over other clustering approaches, such as Kmeans, which are sensitive to false minima. The original clustering work was based on a vector space model (like Kmeans) where a cluster is defined by a vector as its center. However, in a major advance 10 years ago [16], it was shown how one could use a vector-free approach and operate with just the distances D(i,j). This method is clearly the most natural for problems like Alu sequence analysis, where current global sequence alignment approaches (over all N sequences) are problematic, but D(i,j) can be precisely calculated for each pair of sequences. PWClustering also has a time complexity of O(N²), and in practice we find that all three steps (calculate D(i,j), MDS, and PWClustering) take comparable times (a few hours for 50,000 sequences on 768 cores), although searching for a large number of clusters and refining the MDS can increase their execution time significantly. We have presented performance results for MDS and PWClustering elsewhere [7], [12], and for large data sets the efficiencies are high (showing sometimes superlinear speedup). In the rest of the paper, we only discuss the initial dissimilarity computation.

As shown in Fig. 1, the initial step of the pipeline is to compute the pairwise dissimilarities between each pair of genes in the data set. Since the comparison happens within the same set of genes, effectively we only need to perform half of the computations. However, the latter two stages (Clustering and MDS) require the output of the first stage to be in a format appropriate for the later MPI-based data mining stages. The MDS and PWClustering algorithms require a particular parallel decomposition where each of N processes (MPI processes and threads) holds 1/N of the sequences and, for the subset {i} of sequences for which it is responsible, stores in memory D({i},j) for all sequences j. This implies that often we need D(i,j) and D(j,i) (which are equal) stored in different processors/disks. We designed our initial calculation of D(i,j) so that we only calculated the independent set, but the data were stored so that the later MPI jobs could access the data needed.

3.1.2 Smith Waterman Dissimilarities

We identified samples of the human and Chimpanzee Alu gene sequences using Repeatmasker [13] with Repbase Update [14]. We used an open source implementation, named NAligner [11], of the Smith-Waterman-Gotoh algorithm SW-G [15], [16], modified to ensure low start-up effects by having each thread process a large number (above a few hundred) of sequence calculations at a time. The memory bandwidth needed was reduced by storing data items in as few bytes as possible. In the following sections, we discuss the efficient usage of MapReduce, DryadLINQ, and MPI for the initial phase of calculating the distances D(i,j) for each pair of sequences.

3.1.3 Parallelizing SW-G as a Many-Task Computation

In this section, we discuss two approaches to partitioning the workload of the SW-G application as a many-task computation. Both approaches use block-based coarse-grain task decomposition to minimize task scheduling overheads, as individual comparisons take significantly less time.

3.1.4 First Approach: Considering the Upper Triangular (UT) Blocks

In this approach, the overall pairwise computation is decomposed into tasks by considering the output matrix. To clarify our algorithm, let's consider an example where N gene sequences produce a pairwise distance matrix of size N×N. We decompose the computation by considering the resultant matrix and group the overall computation into a block matrix of size D×D, where D is a multiple (>2) of the available computation nodes and d = N/D. Due to the symmetry of the distances D(i,j) and D(j,i), we only calculate the distances in the blocks of the upper triangle of the block matrix, as shown in Fig. 3a. Diagonal blocks are specially handled and calculated as full subblocks. As the number of diagonal blocks is D and the total number of blocks is D(D+1)/2, there is no significant compute overhead added. Once the block boundaries are defined, the upper triangular blocks are processed separately, resembling an MTC application. Each UT block produces two outputs: 1) the pairwise distances corresponding to the genes within the block's boundary, and 2) the transpose of the same to represent the corresponding uncalculated block in the lower triangle. The diagonal blocks do not produce duplicates, as the output and its transpose are the same.
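To make this first decomposition concrete, the following sketch enumerates the upper triangular block indices of a D×D block matrix and marks which blocks must also emit a transposed copy. The class and method names are hypothetical; this is only a sketch of the decomposition described above, not the paper's implementation.

import java.util.ArrayList;
import java.util.List;

// Sketch of the upper-triangular (UT) block decomposition.
// Each block (row, col) with row <= col is one task; non-diagonal blocks
// additionally emit their transpose for the matching lower-triangle block.
public class UpperTriangularBlocks {

  public static final class Block {
    final int row, col;
    final boolean emitTranspose;
    Block(int row, int col, boolean emitTranspose) {
      this.row = row;
      this.col = col;
      this.emitTranspose = emitTranspose;
    }
    @Override public String toString() {
      return "(" + row + "," + col + (emitTranspose ? ",+T)" : ")");
    }
  }

  // Enumerate the D(D+1)/2 upper-triangular blocks of a D x D block matrix.
  public static List<Block> enumerate(int D) {
    List<Block> tasks = new ArrayList<>();
    for (int row = 0; row < D; row++) {
      for (int col = row; col < D; col++) {
        tasks.add(new Block(row, col, row != col)); // diagonal: no transpose
      }
    }
    return tasks;
  }

  public static void main(String[] args) {
    System.out.println(enumerate(4)); // 4*5/2 = 10 block tasks
  }
}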

3.1.5 Second Approach: Considering the Symmetry of Blocks

A load-balanced task partitioning strategy for all-pairs problems can be devised using the following rule, which we use to identify the blocks that need to be computed, as shown in Fig. 3b.


Fig. 3. Two task decomposition approaches considering the final output as a block matrix: (a) using only the upper triangle of the output matrix, (b) using the symmetry of the calculations.


In addition, similar to the first partitioning approach, all the blocks on the diagonal need to be computed due to the usage of blocks. Except for the diagonal blocks, all the other computation blocks produce distances for themselves and their transposes.

When α ≥ β, we calculate D(α,β) only if α + β is even; when α < β, we calculate D(α,β) only if α + β is odd.

Irrespective of the adopted decomposition approach, the pairwise SW-G calculations yield the same amount of computation, and both approaches can be executed as an MTC application. Next, we discuss how we map these algorithms to the different runtimes.
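The parity rule above can be expressed as a small predicate over block coordinates. The following sketch (hypothetical helper names, assuming 0-based block indices α and β) only illustrates the rule; it is not code from the paper.

// Sketch of the symmetry-based block selection rule (second approach).
// A block (alpha, beta) of the D x D block matrix is computed locally if the
// rule selects it; the matching transposed block is then emitted as output.
public class SymmetricBlockRule {

  // Returns true if block D(alpha, beta) should be computed by this task.
  public static boolean shouldCompute(int alpha, int beta) {
    if (alpha == beta) {
      return true;                       // diagonal blocks are always computed
    }
    boolean evenSum = (alpha + beta) % 2 == 0;
    return (alpha >= beta) ? evenSum : !evenSum;
  }

  public static void main(String[] args) {
    int D = 4;
    int computed = 0;
    for (int a = 0; a < D; a++) {
      for (int b = 0; b < D; b++) {
        if (shouldCompute(a, b)) {
          computed++;
        }
      }
    }
    // Exactly one of each off-diagonal pair plus all D diagonal blocks is
    // selected: D*(D-1)/2 + D = D*(D+1)/2 block computations.
    System.out.println("Computed blocks: " + computed); // prints 10 for D = 4
  }
}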

3.1.6 DryadLINQ Implementation

We used DryadLINQ's "Apply" operation to execute a function that calculates the (N/D) × (N/D) distances in each block. The function internally uses the NAligner (the C# version of JAligner [20]) library to calculate the SW-G dissimilarity for each pair of genes. After computing the distances in each block, the function calculates the transpose of the result matrix, which corresponds to a block in the lower triangle, and writes both matrices into two output files in the local file system. The names of these files and their block numbers are communicated back to the main program. The main program sorts the files based on their block numbers and performs another "Apply" operation to combine the files corresponding to rows in the block matrix, as shown in Fig. 4. The first step of this computation dominates the overall running time of the application and, with the algorithm explained, it clearly resembles the characteristics of a "many-task" problem.

3.1.7 Hadoop Implementation

We developed an Apache Hadoop version of the pairwise distance calculation program based on the JAligner [20] program, the Java counterpart of the NAligner library used in the Dryad implementation. Similar to the other implementations, the computation is partitioned into blocks based on the resultant distance matrix, and each block is computed as a map task. The block size (D) can be specified via an argument to the program. The block size needs to be specified such that there will be many more map tasks than the map task capacity of the system, so that Apache Hadoop scheduling will happen as a pipeline of map tasks, resulting in global load balancing inside the application. The input data are distributed to the worker nodes through the Hadoop distributed cache, which makes them available on the local disk of each compute node. Hadoop uses the block-symmetry-based second partitioning approach described above to identify the blocks to compute.

Fig. 5 depicts the runtime behavior of the Hadoop SW-G program. In the given example, the map task capacity of the system is "k" and the number of blocks is "N." The solid black lines represent the starting state, where "k" map tasks corresponding to "k" computation blocks get scheduled on the compute nodes. The dashed black lines represent the state at time t1, when two map tasks, m2 and m6, have completed and two map tasks from the pipeline get scheduled into the placeholders emptied by the completed map tasks. The gray dotted lines represent the future.

Map tasks use custom Hadoop Writable objects as the map task output values to store the calculated pairwise distance matrices for the respective blocks. In addition, nondiagonal map tasks output the transpose of the distance matrix of the block as a separate output value. Hadoop uses local files and HTTP transfers to move the map task output key-value pairs to the reduce tasks.

The outputs of the map tasks are collected by the reduce tasks. Since the reduce tasks start collecting the map task outputs as soon as the first map task completes, and continue to do so while other map tasks are executing, the data transfers from the map tasks to the reduce tasks do not present a significant performance overhead to the program, due to the overlapping of computation with I/O. The program currently creates a single reduce task per row block, resulting in a total of (number of sequences/block size) reduce tasks.


Fig. 4. DryadLINQ implementation of SW-G pairwise distance calculation.

Fig. 5. Hadoop implementation of the SW-G pairwise distance calculation application.


Each reduce task accumulates the output distances for a row block and writes the collected output to a single file in HDFS. This results in N output files corresponding to the row blocks, similar to the output we produce in the Dryad version.
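The following skeleton illustrates the map/reduce structure described above: a mapper emits a block (and, for nondiagonal blocks, its transpose) keyed by row block index, and a reducer concatenates one row block. All class names are hypothetical, the SW-G distance kernel is elided, and this is only a sketch of the structure, not the code shown in Fig. 15.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the SW-G block mapper/reducer structure (names hypothetical).
public class SwgBlockSketch {

  // Input value: "rowBlock colBlock"; output key: row block index,
  // output value: serialized distances for that block.
  public static class BlockMapper
      extends Mapper<Object, Text, IntWritable, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().trim().split("\\s+");
      int row = Integer.parseInt(parts[0]);
      int col = Integer.parseInt(parts[1]);

      // Placeholder for the SW-G kernel over the sequences of this block.
      String distances = "distances(" + row + "," + col + ")";

      // Emit the block under its row block index ...
      context.write(new IntWritable(row), new Text(distances));
      if (row != col) {
        // ... and the transposed copy under the other row block index.
        context.write(new IntWritable(col), new Text("T:" + distances));
      }
    }
  }

  // One reduce task per row block: concatenate the blocks of that row.
  public static class RowBlockReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable rowBlock, Iterable<Text> blocks,
                          Context context)
        throws IOException, InterruptedException {
      StringBuilder row = new StringBuilder();
      for (Text block : blocks) {
        row.append(block.toString()).append(' ');
      }
      context.write(rowBlock, new Text(row.toString().trim()));
    }
  }
}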

3.1.8 MPI Implementation

The MPI version of SW-G calculates pairwise distances using a set of either single- or multithreaded processes. For N gene sequences, we need to compute half of the values (in the lower triangular matrix), which is a total of M = N(N − 1)/2 distances. At a high level, computation tasks are evenly divided among P processes and executed in parallel; the computation workload per process is M/P. At a low level, each computation task can be further divided into subgroups and run in T concurrent threads. Our implementation is designed for flexible use of shared memory multicore systems and distributed memory clusters (tight to medium-tight coupled communication technologies such as threading and MPI). We provide options for any combination of threads versus processes versus nodes, but in earlier papers [7], [12], we have shown that threading is much slower than MPI for this class of problems.

We have implemented the above blocked algorithm in MPI as well. Points are divided into blocks such that each processor is responsible for all blocks in a simple decomposition illustrated in Fig. 6. This also illustrates the initial computation where, to respect symmetry, we calculate half the D(α,β) using the same criterion used in the Hadoop implementation. This approach can be applied to points or blocks. In our implementation, we applied it to blocks of points of size (N/P) × (N/P), where we use P MPI processes. Note that we get better load balancing than the "Space Filling" algorithm, as each processor samples all values of α. This computation step must be followed by a communication step, illustrated in Fig. 6, which gives full strips in each process. The latter can be straightforwardly written out as properly ordered file(s).

3.1.9 Discussion

Although the above three implementations use almost the same algorithm for the SW-G problem, depending on the runtime used we were forced to use some runtime-specific features when implementing them. For example, in Hadoop, the data are read from its distributed file system HDFS, while we used local disks in DryadLINQ and for MPI. Similarly, the different runtimes follow different task scheduling approaches. In MPI, the tasks are statically scheduled to the available CPU cores. DryadLINQ schedules tasks to the computation nodes statically, but internal to the nodes, the tasks are scheduled to different CPU cores dynamically. Hadoop, on the other hand, schedules tasks to CPU cores dynamically. These differences contributed to different runtime performances. However, our motivation in this paper is to demonstrate the applicability of these runtimes to two biology applications that represent the class of "many-task" computations and to discuss the performance characteristics one may get in using them.

3.2 CAP3 Application

3.2.1 EST and Its Software CAP3

An Expressed Sequence Tag (EST) corresponds to messenger RNA (mRNA) transcribed from the genes residing on chromosomes. Each individual EST sequence represents a fragment of mRNA, and EST assembly aims to reconstruct full-length mRNA sequences for each expressed gene. Because ESTs correspond to the gene regions of a genome, EST sequencing has become a standard practice for gene discovery, especially for the genomes of many organisms that may be too complex for whole-genome sequencing. EST assembly is addressed by the software CAP3, a DNA sequence assembly program developed by Huang and Madan [17]. CAP3 performs several major assembly steps, including computation of overlaps, construction of contigs, construction of multiple sequence alignments, and generation of consensus sequences, for a given set of gene sequences. The program reads a collection of gene sequences from an input file (FASTA file format) and writes its output to several output files, as well as to the standard output.

CAP3 is often required to process large numbers of FASTA-formatted input files, which can be processed independently, making it an embarrassingly parallel application requiring no interprocess communication. We have implemented a parallel version of CAP3 using Hadoop and DryadLINQ. In both implementations, we adopt the following algorithm to parallelize CAP3.

1. Distribute the input files to the storage used by the runtime. In Hadoop, the files are distributed to HDFS, and in DryadLINQ, the files are distributed to the individual shared directories of the computation nodes.

2. Instruct each parallel process (in Hadoop, these are the map tasks and in DryadLINQ, these are the vertices) to execute the CAP3 program on each input file.

This application resembles a common parallelization requirement, where an executable, a script, or a function in a special framework such as Matlab or R needs to be applied to a collection of data items.


Fig. 6. MPI implementation of the SW-G pairwise distance calculation application.


We can develop DryadLINQ and Hadoop applications similar to the CAP3 implementations for all these use cases. One can use MPI to implement a parallel version of CAP3; however, then the application needs to handle the initial file distribution and the final results collection manually. Since CAP3 is executed as a stand-alone application by each parallel process in MPI, even with the above extra work it would not yield any performance gains. Therefore, we did not implement CAP3 using MPI but rather focused on Hadoop and DryadLINQ.

3.2.2 DryadLINQ Implementation

As discussed in Section 3.2.1, CAP3 is a stand-alone executable that processes a single file containing DNA sequences. To implement a parallel CAP3 application using DryadLINQ, we adopted the following approach: 1) the input files are partitioned among the nodes of the cluster so that each node of the cluster stores roughly the same number of input files; 2) a "data-partition" (a text file for this application) is created on each node containing the names of the input files available on that node; and 3) a DryadLINQ "partitioned-file" (a metadata file understood by DryadLINQ) is created to point to the individual data-partitions located on the nodes of the cluster. These three steps enable DryadLINQ programs to execute queries against all the input data files. Once this is done, the DryadLINQ program which performs the parallel CAP3 execution is just a single-line program containing a "Select" query, which selects each input file name from the list of file names and executes a user-defined function on it. In our case, the user-defined function calls the CAP3 program, passing the input file name as a program argument. The function also captures the standard output of the CAP3 program and saves it to a file. Then, it moves all the output files generated by CAP3 to a predefined location.

3.2.3 Hadoop Implementation

Parallel CAP3 sequence assembly fits as a "map-only" application for the MapReduce model. The Hadoop application is implemented by writing map tasks which execute the CAP3 program as a separate process on a given input FASTA file. Since the CAP3 application is implemented in C, we do not have the luxury of using the Hadoop file system directly from the program. Hence, the individual data files are downloaded to the nodes using a custom Hadoop InputFormat and a custom RecordReader, while preserving data locality; otherwise the data would need to be stored in a shared file system across the nodes.
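As a rough illustration of this "map-only" pattern, the sketch below shows a map task that launches an external assembler binary on a locally staged input file. The binary path, class names, and file handling are hypothetical and simplified relative to the custom InputFormat/RecordReader approach described above.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a "map-only" task that runs a stand-alone assembler (e.g., CAP3)
// on one input file per map call. Paths and names are hypothetical.
public class Cap3MapperSketch extends Mapper<Object, Text, Text, Text> {

  private static final String ASSEMBLER = "/opt/cap3/cap3"; // hypothetical path

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // The record value is assumed to be the local path of one FASTA file
    // that has already been staged to this node.
    String inputFile = value.toString().trim();

    // Launch the assembler as a separate process and wait for it to finish.
    ProcessBuilder pb = new ProcessBuilder(ASSEMBLER, inputFile);
    pb.redirectErrorStream(true);
    pb.redirectOutput(new java.io.File(inputFile + ".stdout"));
    Process process = pb.start();
    int exitCode = process.waitFor();

    // Record the per-file status; there is no reduce stage in a map-only job.
    context.write(new Text(inputFile), new Text("exit=" + exitCode));
  }
}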

4 PERFORMANCE ANALYSIS

In this section, we study the performance of the SW-G and CAP3 applications under increasing homogeneous workloads, inhomogeneous workloads with different standard deviations, and in cloud-like virtual environments. A 32-node IBM iDataPlex cluster, with each node having two quad-core Intel Xeon processors (a total of eight cores per node) and 32 GB of memory, was used for the performance analysis under the following operating conditions: 1) Microsoft Windows HPC Server 2008, Service Pack 1, 64-bit; 2) Red Hat Enterprise Linux Server release 5.3, 64-bit, on bare metal; and 3) Red Hat Enterprise Linux Server release 5.3, 64-bit, on the Xen hypervisor (version 3.0.3).

In our performance measurements, we focus on DryadLINQ and Hadoop and the various performance issues that one would encounter in using these runtimes. The two runtimes do not provide identical interfaces and therefore we had to use different features depending on the runtime. For example, in DryadLINQ, the files are accessed from shared directories, while Hadoop uses HDFS. Further, Hadoop only runs on Linux, while DryadLINQ requires Windows HPC Server 2008. Apart from the above, we could not use DryadLINQ for the study of the performance implications of virtualization simply because Windows HPC Server 2008 is not yet available for Xen. However, our motivation in this research is to develop and deploy applications using the best strategy for each runtime and to analyze their performance to see what benefits one can gain from using these cloud technologies.

4.1 Scalability of Different Implementations

4.1.1 SW-G

We compared the scalability of the Dryad, Hadoop, and MPI implementations of the Alu SW-G distance calculation with increasing data size using data sets of 10,000 to 40,000 sequences. These data sets correspond to 100 million to 1.6 billion total resultant sequence distances. The actual number of distance calculations performed by the applications is about half of these numbers due to the optimizations mentioned in the implementation section. Data sets were generated by taking a 10,000-sequence random sample from a real data set and replicating it two to four times. The Dryad and MPI results were adjusted to counter the performance difference of the kernel programs for a fair comparison with the Hadoop implementation. NAligner on Windows performs on average approximately 0.78 times slower than JAligner on Linux on the hardware we used for the performance analysis.

The results of this experiment are given in Fig. 7.


Fig. 7. Scalability of Smith Waterman pairwise distance calculation applications.


The time per actual calculation is computed by dividing the total time to calculate pairwise distances for a set of sequences by the actual number of comparisons performed by the application. According to Fig. 7, all three implementations perform and scale satisfactorily for this application, with the Hadoop implementation showing the best scaling. As expected, the total running times scale proportionally to the square of the number of sequences. The Hadoop and Dryad applications perform and scale competitively with the MPI application.

We notice that the performance of the Hadoop implementation improves with the increase of the data set size, while the Dryad performance degrades a bit. The Hadoop improvements can be attributed to the diminishing of the framework overheads, while the Dryad degradation can be attributed to memory management issues in the Windows and Dryad environment.

4.1.2 CAP3

We analyzed the scalability of the DryadLINQ and Hadoop implementations of the CAP3 application with increasing data set size using homogeneous data sets. We prepared the data sets by replicating a single FASTA file to represent a uniform workload across the application. The selected FASTA sequence file contained 458 sequences.

The results are shown in Fig. 8. The primary vertical axis (left) shows the total time versus the number of files. The secondary axis (right) shows the time taken per file (total time/number of files) against the number of files. Both the DryadLINQ and Hadoop implementations show good scaling for the CAP3 application, although Dryad's scaling is not as smooth as the Hadoop scaling curve. The stand-alone CAP3 application used as the kernel for these applications performs better in the Windows environment than in the Linux environment, which must contribute to Hadoop being slower than Dryad.

4.2 Inhomogeneous Data Analysis

New-generation parallel data processing frameworks such as Hadoop and DryadLINQ are designed to perform optimally when a given job can be divided into a set of equally time-consuming subtasks. Most of the data sets we encounter in the real world, however, are inhomogeneous in nature, making it hard for the data analyzing programs to efficiently break down the problems into equal subtasks. At the same time, we noticed that Hadoop and DryadLINQ exhibit different performance behaviors for some of our real data sets. It should be noted that Hadoop and Dryad use different task scheduling techniques, where Hadoop uses global queue-based scheduling and Dryad uses static scheduling. These observations motivated us to study the effects of data inhomogeneity in the applications implemented using these frameworks.

4.2.1 SW-G Pairwise Distance Calculation

The inhomogeneity of data applies to gene sequence sets too, where individual sequence lengths and contents vary among each other. In this section, we study the effect of inhomogeneous gene sequence lengths on the performance of our pairwise distance calculation applications.

SWG(A,B) = O(mn).

The time complexity of aligning and obtaining distances for two genome sequences A and B with lengths m and n, respectively, using the Smith-Waterman-Gotoh algorithm is approximately proportional to the product of the lengths of the two sequences, O(mn). All the above-described distributed implementations of the Smith-Waterman similarity calculation rely on block decomposition to break down the larger problem space into subproblems that can be solved using the distributed components. Each block is assigned two subsets of sequences, and the Smith-Waterman pairwise distance calculation needs to be performed for all the possible sequence pairs among the two subsets. According to the above-mentioned time complexity of the Smith-Waterman kernel used by these distributed components, the execution time for a particular execution block depends on the lengths of the sequences assigned to that block.

Parallel execution frameworks like Dryad and Hadoop work optimally when the work is equally partitioned among the tasks. Depending on the scheduling strategy of the framework, blocks with different execution times can have an adverse effect on the performance of the applications, unless proper load balancing measures have been taken in the task partitioning steps. For example, in Dryad, vertices are scheduled at the node level, making it possible for a node to have blocks with varying execution times. In this case, if a single block inside a vertex takes longer than the other blocks to execute, then the entire node must wait until the large task completes, which utilizes only a fraction of the node's resources.

For the inhomogeneous data study, we decided to use controlled inhomogeneous input sequence sets with the same average length and varying standard deviation of lengths. It is hard to generate such controlled input data sets using real sequence data, as we do not have control over the lengths of real sequences. At the same time, we note that the execution time of the Smith-Waterman pairwise distance calculation depends mainly on the lengths of the sequences and not on their actual contents. This property of the computation makes it possible for us to ignore the contents of the sequences and focus only on the sequence lengths.


Fig. 8. Scalability of Cap3 applications.


This made it possible for us to use randomly generated gene sequence sets for this experiment. The gene sequence sets were randomly generated for a given mean sequence length (400) with varying standard deviations, following a normal distribution of the sequence lengths. Each sequence set contained 10,000 sequences, leading to 100 million pairwise distance calculations. We performed two studies using such inhomogeneous data sets. In the first study, the sequences with varying lengths were randomly distributed in the data sets. In the second study, the sequences with varying lengths were distributed using a skewed distribution, where the sequences in a set were arranged in ascending order of sequence length.
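A sketch of how such a controlled data set could be produced is shown below: the generator draws sequence lengths from a normal distribution with the given mean and standard deviation and fills each sequence with random nucleotides. The class and parameter names are hypothetical, and this is not the generator used for the experiments.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of a controlled inhomogeneous data set generator: sequence lengths
// follow a normal distribution (mean, stdDev); contents are random bases.
public class RandomSequenceSetGenerator {

  private static final char[] BASES = {'A', 'C', 'G', 'T'};

  public static List<String> generate(int count, double meanLength,
                                       double stdDev, long seed) {
    Random random = new Random(seed);
    List<String> sequences = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      // Draw a length from N(meanLength, stdDev^2), clamped to at least 1.
      int length = (int) Math.max(1,
          Math.round(meanLength + stdDev * random.nextGaussian()));
      StringBuilder sb = new StringBuilder(length);
      for (int j = 0; j < length; j++) {
        sb.append(BASES[random.nextInt(BASES.length)]);
      }
      sequences.add(sb.toString());
    }
    return sequences;
  }

  public static void main(String[] args) {
    // e.g., 10,000 sequences with mean length 400 and standard deviation 100.
    List<String> set = generate(10_000, 400, 100, 42L);
    System.out.println("Generated " + set.size() + " sequences");
  }
}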

Fig. 9 presents the execution times taken by the two implementations for the randomly distributed inhomogeneous data sets with the same mean length, while Fig. 10 presents the execution times for the skewed distributed inhomogeneous data sets. The Dryad results depict the Dryad performance adjusted for the performance difference between the NAligner and JAligner kernel programs. As we notice from Fig. 9, both implementations perform satisfactorily for the randomly distributed inhomogeneous data, without showing significant performance degradation with the increase of the standard deviation. This behavior can be attributed to the fact that the sequences with varying lengths are randomly distributed across a data set, effectively providing natural load balancing of the execution times of the sequence blocks.

For the skewed distributed inhomogeneous data, we notice a clear performance degradation in the Dryad implementation. Once again, the Hadoop implementation performs consistently without significant performance degradation, even though it does not perform as well as with its randomly distributed counterpart. The Hadoop implementation's consistent performance can be attributed to the global pipeline scheduling of the map tasks. In the Hadoop Smith-Waterman implementation, each block decomposition gets assigned to a single map task. The Hadoop framework allows the administrator to specify the number of map tasks that can be run on a particular compute node. The Hadoop global scheduler schedules the map tasks directly onto those placeholders at a much finer granularity than Dryad, as and when individual map tasks finish. This allows the Hadoop implementation to perform natural global load balancing. In this case, it might even be advantageous to have varying task execution times to iron out the effect of any trailing map tasks toward the end of the computation. The Dryad implementation preallocates all the tasks to the compute nodes and does not perform any dynamic scheduling across the nodes. This makes a node which gets a larger work chunk take considerably longer than a node which gets a smaller work chunk, causing the node with the smaller work chunk to idle while the other nodes finish.

4.2.2 CAP3

Unlike the Smith-Waterman-Gotoh (SW-G) implementations, the CAP3 program execution time does not directly depend on the file size or the size of the sequences, as it depends mainly on the content of the sequences. This made it hard for us to artificially generate inhomogeneous data sets for the CAP3 program, forcing us to use real data. When generating the data sets, we first calculated the stand-alone CAP3 execution time for each of the files in our data set. Then, based on those timings, we created data sets that have approximately similar mean times while the standard deviation of the stand-alone running times differs in each data set. We performed the performance testing for randomly distributed as well as skewed distributed (sorted according to individual file running time) data sets, similar to the SW-G inhomogeneity study. The speedup is obtained by dividing the sum of the sequential running times of the files in the data set by the parallel implementation running time.
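Written out, with t_i denoting the stand-alone CAP3 running time of file i and T_par the parallel running time (notation introduced here only for clarity), the reported speedup for a data set of k files is (t_1 + t_2 + ... + t_k) / T_par.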

Figs. 11 and 12 depict the CAP3 inhomogeneous performance results for the Hadoop and Dryad implementations. The Hadoop implementation shows satisfactory scaling for both the randomly distributed and the skewed distributed data sets, while the Dryad implementation shows satisfactory scaling only for the randomly distributed data set. Once again, we notice that the Dryad implementation does not perform well for the skewed distributed inhomogeneous data due to its static, nonglobal scheduling.


Fig. 9. Performance of SW-G pairwise distance calculation application for randomly distributed inhomogeneous data with "400" mean sequence length.

Fig. 10. Performance of SW-G pairwise distance calculation application for skewed distributed inhomogeneous data with "400" mean sequence length.


4.2.3 Discussion on Inhomogeneity

Many real-world data sets and problems are inhomogeneous in nature, making it difficult to divide those computations into equally balanced computational parts. At the same time, the inhomogeneity of most problems is randomly distributed, providing natural load balancing inside the subtasks of a computation. We observed that the scheduling mechanisms employed by both Hadoop (dynamic) and DryadLINQ (static) perform well when randomly distributed inhomogeneous data are used. Also, in the above study, we observed that, given there are sufficient map tasks, the global queue-based dynamic scheduling strategy adopted by Hadoop provides load balancing even in extreme scenarios such as skewed distributed inhomogeneous data sets. The static partition-based scheduling strategy of DryadLINQ does not have the ability to load balance such extreme scenarios. It is possible, however, for application developers to randomize such data sets before using them with DryadLINQ, which allows such applications to achieve the natural load balancing of randomly distributed inhomogeneous data sets described above.

4.3 Performance in the Cloud

With the popularity of computing clouds, data processing frameworks like Hadoop, MapReduce, and DryadLINQ are becoming popular as cloud parallel frameworks. We measured the performance and virtualization overhead of several MPI applications on virtual environments in an earlier study [24]. Here, we present extended performance results of using the Apache Hadoop implementations of SW-G and CAP3 in a cloud environment by comparing Hadoop on Linux with Hadoop on Linux on a Xen [26] paravirtualized environment.

While Youseff et al. [27] suggest that VMs impose very little overhead on MPI applications, our previous study indicated that the VM overheads depend mainly on the communication patterns of the applications. Specifically, the set of applications that is sensitive to latencies (lower communication-to-computation ratio and a large number of smaller messages) experienced higher overheads in virtual environments. Walker [28] presents benchmark results of HPC application performance on Amazon EC2, compared with a similar bare metal local cluster, where he noticed 40 to 1,000 percent performance degradations on EC2. But since one cannot have complete control of and knowledge about the EC2 infrastructure, there exist too many unknowns to directly compare those results with the above-mentioned results.

4.3.1 SW-G Pairwise Distance Calculation

Fig. 13 presents the virtualization overhead of the Hadoop SW-G application, comparing the performance of the application on Linux on bare metal and on Linux on Xen virtual machines. The data set used is the same replicated 10,000 real-sequence data set used for the scalability study in Section 4.1.1. The number of blocks is kept constant across the tests, resulting in larger blocks for larger data sets. According to the results, the performance degradation for the Hadoop SW-G application in the virtual environment ranges from 25 to 15 percent. We notice that the performance degradation is reduced as the problem size increases.

In the Xen paravirtualization architecture, each guest OS (running in domU) performs its I/O transfers through Xen (dom0). This process adds start-up costs to I/O, as it involves overheads such as communication with dom0 and scheduling of I/O operations in dom0. The Xen architecture uses shared memory buffers to transfer data between domUs and dom0, thus reducing the operational overheads when performing the actual I/O. We can notice the same behavior in the Xen memory management, where page table operations need to go through Xen.


Fig. 11. Performance of Cap3 application for randomly distributed inhomogeneous data.

Fig. 12. Performance of Cap3 application for skewed distributed inhomogeneous data.

Fig. 13. Virtualization overhead of Hadoop SW-G on Xen virtual machines.


Simple memory accesses, in contrast, can be performed by the guest OSs without Xen involvement. According to the above points, performing a few coarser grained I/O and memory operations incurs relatively lower overheads than doing the same work using many finer grained operations. We conclude that this is the likely reason behind the decrease of the performance degradation with the increase of data size, as larger data sizes increase the granularity of the computational blocks.

4.3.2 CAP3

Fig. 14 presents the virtualization overhead of the Hadoop CAP3 application. We used the same scalability data set from Section 4.1.2 for this analysis. The performance degradation in this application remains close to 20 percent for all the data sets. The CAP3 application does not show the decrease of VM overhead with the increase of problem size that we noticed in the SW-G application. Unlike in SW-G, the I/O and memory behavior of the CAP3 program does not change with the data set size, as irrespective of the data set size the granularity of the processing (a single file) remains the same. Hence, the VM overheads do not change even with the increase of workload.

4.3.3 Fluctuations in Results

Since many of the experiments performed to obtain the results presented in this paper are long running, we did not have the luxury of performing each experiment multiple times to calculate errors for each single experiment, due to resource and time limitations. To compensate for that, we ran a single representative experiment from each application 20 times to find the systematic errors and fluctuations of the applications, which are presented in Table 1. According to Table 1, the systematic errors are very minimal for the experiments we performed. For the SW-G application, we performed pairwise alignment on 15,000 sequences, resulting in approximately 114 million actual sequence alignments. For the CAP3 application, we performed CAP3 sequence assembly on 1,024 FASTA files. It should be noted that the systematic timing errors due to minor system fluctuations do not have a significant effect on the overall running time of a single experiment, due to the long-running nature of the experiments. All the other experimental results presented in the paper were obtained from at least two runs, and the average time was taken.

5 COMPARISON OF PROGRAMMING MODELS

The category of "many-task computing" also belongs to the pleasingly parallel applications, in which the parallel tasks in a computation perform minimal intertask communication. From our perspective, independent jobs, or jobs with collections of almost independent tasks, represent the "many-tasks" domain. Apart from the EST and pairwise distance computations we have described, applications such as parametric sweeps, converting documents to different formats, and brute-force searches in cryptography are all examples in this category of applications. In this section, we compare the programming styles of the three parallel runtimes that we have evaluated. The motivation behind this comparison is to show the reader the different styles of programming supported by these runtimes and the programming models one needs to use with each of them. This is by no means an exhaustive study of programming techniques.

One can adopt a wide range of parallelization techniques to perform most of the many-task computations in parallel. Thread libraries on multicore CPUs, MPI, classic job schedulers in cloud/cluster/Grid systems, independent "maps" in MapReduce, and independent "vertices" in DryadLINQ are all such techniques. However, factors such as the granularity of the independent tasks and the data/compute intensiveness of the tasks determine the applicability of these technologies to the problem at hand. For example, in CAP3, each task performs a highly compute-intensive operation on a single input file, and typically the input files are quite small in size (compared to highly data-intensive applications), which makes all the aforementioned techniques applicable to CAP3. On the other hand, the pairwise distance calculation application, which requires a reduction (or combine) operation, can easily be implemented using the MapReduce programming model.

To clarify the benefits of the cloud technologies, we consider the following pseudocode segments representing the Hadoop, DryadLINQ, and MPI implementations of the pairwise distance (Smith Waterman dissimilarities) calculation application (discussed in Section 3.1.2).


Fig. 14. Virtualization overhead of Hadoop Cap3 on Xen virtual machines.

TABLE 1. Systematic Error of SWG and Cap3 Applications


In the Hadoop implementation, the map task calculates the Smith Waterman distances for a given block of sequences while the reduce task combines these blocks to produce row blocks of the final matrix. The MapReduce programming model executes map and reduce tasks in parallel (in two separate stages) and handles the intermediate data transfers between them without any user intervention. The input data, intermediate data, and output data are all stored in and accessed from HDFS, making the entire application highly robust.

The DryadLINQ implementation resembles a more query-style implementation, in which the parallelism is completely abstracted by the LINQ operations, which are performed in parallel by the underlying Dryad runtime. The "PerformAlignments" function has a capability similar to a map task and the "PerformMerge" function has a capability similar to the reduce task in MapReduce. DryadLINQ produces a DAG for this program in which both the "SelectMany" and the "Apply" operations are performed as collections of parallel vertices (tasks). The DAG is then executed by the underlying Dryad runtime. The DryadLINQ implementation uses Windows shared directories to read input data, store and transfer intermediate results, and store the final outputs. With replicated data partitions, DryadLINQ can also support fault tolerance for the computation.

Input parameters of the MPI application include a FASTA file, the thread count, the processes per node, and the node count. In the data decomposition phase, subsets of sequences are identified as described in Section 3.1.5. Corresponding to the MPI runtime architecture in Fig. 6b, we use the MPI.NET [18] API to assign parallel tasks to processes (the default is one per core). One can use fewer processes per node and multithreading in each MPI process (and this is the most efficient way to implement parallelism for MDS and PWClustering), but we do not present those results here. Each compute node has a copy of the input file, and output results are written directly to a local disk. In this MPI example, there is a Barrier call followed by a scatter communication at the end of computing each block. This is not the most efficient algorithm if the compute times per block are unequal, but it was adopted to synchronize the communication step.

The code segments (Fig. 15, Hadoop MapReduce; Fig. 16, DryadLINQ; and Fig. 17, MPI) show clearly how higher level parallel runtimes such as Hadoop and DryadLINQ have abstracted the parallel programming aspects away from the users. Although the MPI algorithm we used for the SW-G computation uses only one MPI communication construct (MPI_Scatter), in typical MPI applications the programmer needs to explicitly use various communication constructs to build the MPI communication patterns.


Fig. 15. Code segment showing the MapReduce implementation of pairwise distance calculation using Hadoop.

Fig. 16. Code segment showing the DryadLINQ implementation of pairwise distance calculation.


The low-level communication constructs in MPI support parallel algorithms with a variety of communication topologies. However, developing such applications requires a great amount of programming skill as well.

On the other hand, high-level runtimes provide limited communication topologies, such as map-only or map followed by reduce in MapReduce and DAG-based execution flows in DryadLINQ, making them easier to program. Added support for handling data and for quality of services such as fault tolerance makes them more favorable for developing parallel applications with simple communication topologies. Many-task computing is an ideal match for these parallel runtimes.

There are some important differences, such as MPI being oriented toward memory-to-memory operations whereas Hadoop and DryadLINQ are file-oriented. This difference makes these new technologies far more robust and flexible. However, the file orientation implies that there is much greater overhead in the new technologies. This is not a problem for the initial stages of data analysis, where file I/O is separated by a long processing phase. However, as discussed in [12], this feature means that one cannot efficiently execute on MapReduce traditional MPI programs that iteratively switch between "map" and "communication" activities. We have shown that an extended MapReduce programming model named Twister [19], [7] can support both classic MPI and MapReduce operations. Twister has a larger overhead than good MPI implementations, but this overhead decreases to zero as one runs larger and larger problems.

6 RELATED WORK

There have been several papers discussing data analysis using a variety of cloud and more traditional cluster/Grid technologies, with the Chicago paper [20] influential in posing the broad importance of this type of problem. The Notre Dame all-pairs system [21] clearly identified the "doubly data parallel" structure seen in all of our applications. We discuss, in the Alu case, the linking of an initial doubly data parallel phase to more traditional "singly data parallel" MPI applications. BLAST is a well-known doubly data parallel problem and has been discussed in several papers [22], [23]. The Swarm project [6] successfully uses traditional distributed clustering scheduling to address the EST and CAP3 problem. Note that approaches like Condor can have significant start-up times dominating performance. For basic operations [24], we find that Hadoop and Dryad get similar performance on bioinformatics, particle physics, and well-known kernels. Wilde et al. [25] have emphasized the value of scripting to control these (task parallel) problems, and here DryadLINQ offers some capabilities that we exploited. We note that most previous work has used Linux-based systems and technologies. Our work shows that Windows HPC Server-based systems can also be very effective.

7 CONCLUSIONS

We have studied two data analysis problems with three different technologies. They have been examined on machines with up to 768 cores, with the results presented here run on 256-core clusters. The applications each start with a "doubly data-parallel" (all pairs) phase that can be implemented in MapReduce, MPI, or using cloud resources on demand. The flexibility of clouds and MapReduce suggests they will become the preferred approaches. We showed how one can support an application (Alu) requiring a detailed output structure to allow follow-on iterative MPI computations. The applications differed in the heterogeneity of the initial data sets, but in each case good performance is observed, with the new cloud MapReduce technologies competitive with MPI performance. The simple structure of the data/compute flow and the minimal intertask communication requirements of these "pleasingly parallel" applications enabled them to be implemented using a wide variety of technologies.


Fig. 17. Code segment showing the MPI implementation of pairwise distance calculation.


The support for handling large data sets, the concept of moving computation to data, and the better quality of services provided by the cloud technologies simplify the implementation of some problems over traditional systems. We find that the different programming constructs available in these runtimes, such as independent "maps" in MapReduce and the "homomorphic Apply" in DryadLINQ, are suitable for implementing applications of the type we examine. In the Alu case, we show that DryadLINQ and Hadoop can be programmed to prepare data for use in later parallel MPI/threaded applications used for further analysis. We performed tests using identical hardware for Hadoop on Linux, Hadoop on Linux on virtual machines, and DryadLINQ on HPCS on Windows. These show that DryadLINQ and Hadoop get similar performance and that virtual machines give overheads of around 20 percent. We also noted that support for inhomogeneous data is important and that Hadoop currently performs better than DryadLINQ unless one takes steps to load balance the data before the static scheduling used by DryadLINQ. We compared the ease of programming for MPI, DryadLINQ, and Hadoop. The MapReduce cases offer a higher level interface, and the user needs less explicit control of the parallelism. The DryadLINQ framework offers significant support for database access, but our examples do not exploit this.

ACKNOWLEDGMENTS

The authors wish to thank their collaborators from Biology whose help was essential. In particular, the Alu work is with Haixu Tang and Mina Rho from Bioinformatics at Indiana University, and the EST work is with Qunfeng Dong from the Center for Genomics and Bioinformatics at Indiana University. The authors appreciate all SALSA group members, especially Dr. Geoffrey Fox, Scott Beason, and Stephen Tak-Lon Wu, for their contributions. The authors would like to thank Microsoft for their collaboration and support. Tony Hey, Roger Barga, Dennis Gannon, and Christophe Poulain played key roles in providing technical support.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, pp. 107-113, 2008.

[2] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks," Proc. European Conf. Computer Systems, Mar. 2007.

[3] Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. Gunda, and J. Currey, "DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language," Proc. Symp. Operating System Design and Implementation (OSDI), 2008.

[4] Apache Hadoop, http://hadoop.apache.org/core/, 2010.

[5] X. Huang and A. Madan, "CAP3: A DNA Sequence Assembly Program," Genome Research, vol. 9, no. 9, pp. 868-877, 1999.

[6] S.L. Pallickara, M. Pierce, Q. Dong, and C. Kong, "Enabling Large Scale Scientific Computations for Expressed Sequence Tag Sequencing over Grid and Cloud Computing Clusters," Proc. Eighth Int'l Conf. Parallel Processing and Applied Math. (PPAM), 2009.

[7] G. Fox, S.H. Bae, J. Ekanayake, X. Qiu, and H. Yuan, "Parallel Data Mining from Multicore to Cloudy Grids," Proc. High Performance Computing and Grids Workshop (HPC '08), http://grids.ucs.indiana.edu/ptliupages/publications/CetraroWriteupJan09_v12.pdf, 2008.

[8] M.A. Batzer and P.L. Deininger, "Alu Repeats and Human Genomic Diversity," Nature Rev. Genetics, vol. 3, no. 5, pp. 370-379, 2002.

[9] C. Moretti, H. Bui, K. Hollingsworth, B. Rich, P. Flynn, and D. Thain, "All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids," IEEE Trans. Parallel and Distributed Systems, vol. 21, no. 1, pp. 33-46, Jan. 2010, DOI 10.1109/TPDS.2009.49.

[10] J. Ekanayake, A.S. Balkir, T. Gunarathne, G. Fox, C. Poulain, N. Araujo, and R. Barga, "DryadLINQ for Scientific Analyses," Proc. Fifth IEEE Int'l Conf. e-Science, 2009.

[11] Source Code. Smith Waterman Software, http://jaligner.sourceforge.net/, 2010.

[12] G. Fox, X. Qiu, S. Beason, J.Y. Choi, M. Rho, H. Tang, N. Devadasan, and G. Liu, "Biomedical Case Studies in Data Intensive Computing," Proc. First Int'l Conf. Cloud Computing (CloudCom '09), Keynote talk, 2009.

[13] A.F.A. Smit, R. Hubley, and P. Green, “Repeatmasker,” http://www.repeatmasker.org, 2004.

[14] J. Jurka, "Repbase Update: A Database and an Electronic Journal of Repetitive Elements," Trends Genetics, vol. 9, pp. 418-420, 2000.

[15] O. Gotoh, "An Improved Algorithm for Matching Biological Sequences," J. Molecular Biology, vol. 162, pp. 705-708, 1982.

[16] T.F. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," J. Molecular Biology, vol. 147, pp. 195-197, 1981.

[17] X. Huang and A. Madan, "CAP3: A DNA Sequence Assembly Program," Genome Research, vol. 9, no. 9, pp. 868-877, 1999.

[18] MPI.Net: High-Performance C# Library for Message Passing,http://www.osl.iu.edu/research/mpi.net/, 2010.

[19] J. Ekanayake and S. Pallickara, "MapReduce for Data Intensive Scientific Analysis," Proc. Fourth IEEE Int'l Conf. eScience, pp. 277-284, 2008.

[20] I. Raicu, I.T. Foster, and Y. Zhao, "Many-Task Computing for Grids and Supercomputers," Proc. IEEE Workshop Many-Task Computing on Grids and Supercomputers (MTAGS), pp. 1-11, 2008.

[21] C. Moretti, H. Bui, K. Hollingsworth, B. Rich, P. Flynn, and D. Thain, "All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids," IEEE Trans. Parallel and Distributed Systems, vol. 21, no. 1, pp. 33-46, Jan. 2010, DOI 10.1109/TPDS.2009.49.

[22] M.C. Schatz, "CloudBurst: Highly Sensitive Read Mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363-1369, 2009, doi:10.1093/bioinformatics/btp236.

[23] K. Keahey, R. Figueiredo, J. Fortes, T. Freeman, and M. Tsugawa, "Science Clouds: Early Experiences in Cloud Computing for Scientific Applications," Proc. Workshop Cloud Computing and Applications (CCA), 2008.

[24] J. Ekanayake, X. Qiu, T. Gunarathne, S. Beason, and G. Fox, "High Performance Parallel Computing with Clouds and Cloud Technologies," Cloud Computing and Software Services: Theory and Techniques, CRC Press, http://grids.ucs.indiana.edu/ptliupages/publications/cloud_handbook_final-with-diagrams.pdf, 2010.

[25] M. Wilde, I. Raicu, A. Espinosa, Z. Zhang, B. Clifford, M. Hategan, S. Kenny, K. Iskra, P. Beckman, and I. Foster, "Extreme-Scale Scripting: Opportunities for Large Task Parallel Applications on Petascale Computers," J. Physics: Conf. Series, vol. 180, 2009, DOI: 10.1088/1742-6596/180/1/012046.

[26] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the Art of Virtualization," Proc. ACM Symp. Operating System Principles, 2003.

[27] L. Youseff and R. Wolski, "Evaluating the Performance Impact of Xen on MPI and Process Execution for HPC Systems," Proc. First Int'l Workshop Virtualization Technology in Distributed Computing, 2006.

[28] E. Walker, "Benchmarking Amazon EC2 for High-Performance Scientific Computing," http://www.usenix.org/publications/login/2008-10/openpdfs/walker.pdf, 2008.

[29] J. Qiu, J. Ekanayake, T. Gunarathne, J.Y. Choi, S.-H. Bae, Y. Ruan, S. Ekanayake, S. Wu, S. Beason, G. Fox, M. Rho, and H. Tang, "Data Intensive Computing for Bioinformatics," to be published in Data Intensive Distributed Computing, IGI Publishers, http://grids.ucs.indiana.edu/ptliupages/publications/DataIntensiveComputing_BookChapter.pdf, 2010.



Jaliya Ekanayake is a PhD candidate at the School of Informatics and Computing of Indiana University, Bloomington. His research advisor is Prof. Geoffrey Fox. His PhD research focuses on architecture and performance of runtime environments for data-intensive scalable computing. He is a student member of the IEEE.

Thilina Gunarathne is a PhD candidate at the School of Informatics and Computing of Indiana University, Bloomington, advised by Professor Geoffrey Fox. He works in the Community Grids Lab as a research assistant. His research interests include parallel programming architectures and their composition. He is a student member of the IEEE and the IEEE Computer Society.

Judy Qiu is an assistant professor at the School of Informatics and Computing of Indiana University, Bloomington. She leads the SALSA group in data-intensive and multicore systems.

