Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis
Thesis Defense
Ashish Nagavaram, Graduate student
Computer Science and Engineering
Advisor: Dr. Gagan Agrawal
Committee: Dr. Rajiv Ramnath, Dr. Michael Freitas
Introduction
Cloud computing
• Resources on demand
• Pay-as-you-go
• Elasticity
Resource allocation on the cloud
• Dynamic resource allocation
Motivation
Use the elasticity of the cloud for executing scientific applications
• Avoid over-provisioning and under-provisioning
• Avoid wasting resources
No generalized scientific workflow exists for executing an application dynamically
Allocate resources during execution
Meet time constraints by using more resources
Background-MassMatrix
Developed by Dr. Hua Xu and Dr. Michael Freitas at Ohio State University
A database search program for rapid characterization of proteins and peptides
• Supports multiple data formats such as .mgf, .mzXML and raw data
• The input database is in .fasta or .BAS format
MassMatrix Application Flow
[Flowchart]
Inputs: theoretical protein database, MS/MS data input file
• Digest the sequence
• Has the sequence been searched before?
  yes: do not add it to the final result
  no: full scan search for finding matching peptides
• Clear insignificant peptides
• Statistical analysis to generate results
Output: results
Contributions (1/2)
Providing a framework for parallelization of the MassMatrix application
Creating a dynamic workflow
• Resources are allocated adaptively
• QOS is achieved by parameter prediction
• Gives the user control through a benefit function
Contributions (2/2)
Allows the user to specify the time constraint within which the application should complete
“A cloud-based Dynamic Workflow for Mass spectrometry Data Analysis” - Ashish Nagavaram, Gagan Agrawal, Michael Freitas, Gaurang Mehta 7th IEEE Conference on E-Science, Dec 2011
Outline
Introduction
Motivation
Background
Parallelization of MassMatrix
Adaptive Resource allocation
Experimental Results
Parameter Prediction
Conclusion
Parallel MassMatrix
Parallelize the full-scan search phase
• Takes the longest time to execute
• The rest of the phases are sequential
A split-merge approach is followed
• The user can specify the number of splits
• Splits are made based on specific tags
• The index is embedded in the file-split name
• Other options were also considered
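The split step can be sketched in Python as follows. This is a minimal sketch, not the thesis code: it assumes the "specific tags" are the BEGIN IONS / END IONS markers that delimit each spectrum in an .mgf file, and the function names and the `_split<index>` file-name pattern are hypothetical.

```python
from pathlib import Path


def read_spectra(lines):
    """Group .mgf lines into spectra delimited by BEGIN IONS / END IONS.

    Lines outside a BEGIN IONS ... END IONS block are ignored in this sketch.
    """
    spectra, current = [], None
    for line in lines:
        tag = line.strip()
        if tag == "BEGIN IONS":
            current = [line]
        elif current is not None:
            current.append(line)
            if tag == "END IONS":
                spectra.append(current)
                current = None
    return spectra


def split_spectra(spectra, num_splits):
    """Divide the spectra into num_splits contiguous, near-equal chunks."""
    base, extra = divmod(len(spectra), num_splits)
    chunks, start = [], 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        chunks.append(spectra[start:start + size])
        start += size
    return chunks


def write_splits(input_path, num_splits, out_dir):
    """Write one file per split; the split index is embedded in the file name."""
    input_path, out_dir = Path(input_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with input_path.open() as handle:
        spectra = read_spectra(handle)
    paths = []
    for index, chunk in enumerate(split_spectra(spectra, num_splits), start=1):
        # Hypothetical naming scheme, e.g. data_split1.mgf
        path = out_dir / f"{input_path.stem}_split{index}{input_path.suffix}"
        with path.open("w") as out:
            for spectrum in chunk:
                out.writelines(spectrum)
        paths.append(path)
    return paths
```

Splitting on whole spectra keeps every record intact, and the index in the file name lets a later merge step process the splits in their original order.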
Parallel MassMatrix (contd.)
Only the input file is split
• Splitting the database leads to redundant results
• Splitting both the input and the database has the same problem
The intermediate files are written to disk
• Pointers are serialized
• Written as comma-separated values
A Python script keeps polling the job queue to check whether the parallel phase has completed
• Suspends the sequential phase until then
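A polling loop of this kind might look like the sketch below. It is illustrative only: `count_unfinished_jobs` is a hypothetical callable standing in for whatever queries the scheduler's queue, and no real scheduler API is shown.

```python
import time


def wait_for_parallel_phase(count_unfinished_jobs, poll_interval=5.0, timeout=None,
                            sleep=time.sleep, clock=time.monotonic):
    """Block until the job queue reports no unfinished parallel jobs.

    count_unfinished_jobs is a hypothetical callable returning how many
    split jobs are still queued or running; how it queries the scheduler
    is deliberately left out of this sketch.
    Returns the number of polls performed.
    """
    start = clock()
    polls = 0
    while True:
        polls += 1
        if count_unfinished_jobs() == 0:
            return polls
        if timeout is not None and clock() - start >= timeout:
            raise TimeoutError("parallel phase did not finish in time")
        sleep(poll_interval)


def run_workflow(count_unfinished_jobs, sequential_phase, **poll_options):
    """Suspend the sequential phase until the parallel phase has completed."""
    wait_for_parallel_phase(count_unfinished_jobs, **poll_options)
    return sequential_phase()
```

Passing `sleep` and `clock` as parameters is a design choice of the sketch so the loop can be exercised without real waiting.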
Parallel MassMatrix (Contd.)
The intermediate files are read back in and re-indexed while merging
The merging process is complicated
• Complex data structures (matrix of matrices)
• Have to get inside each data structure to maximize them
• Intermediate files are indexed among each other
• While re-indexing, both the local and the global index are maintained
• The data structures are also re-numbered while merging
Parallel MassMatrix (contd.)
Intermediate files are merged in order of the split they process
Unnecessary intermediate files are not loaded back
• Saves memory
• Helps in case of large data files
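The ordered merge with local and global indices can be illustrated with a much-simplified sketch. It assumes each intermediate file is a flat CSV with one record per row and a `_split<index>` marker in its name; both are assumptions for illustration, and the real MassMatrix structures (matrix of matrices) are far richer.

```python
import csv
import re
from pathlib import Path


def split_index(path):
    """Extract the split index embedded in a file name such as data_split3.csv."""
    match = re.search(r"_split(\d+)", Path(path).name)
    if match is None:
        raise ValueError(f"no split index in {path}")
    return int(match.group(1))


def merge_intermediate_files(paths):
    """Yield (global_index, split, local_index, fields) for every record.

    Files are visited in split order and read one at a time, so only the
    current file's rows are held in memory.
    """
    global_index = 0
    for path in sorted(paths, key=split_index):
        with open(path, newline="") as handle:
            for local_index, fields in enumerate(csv.reader(handle)):
                yield global_index, split_index(path), local_index, fields
                global_index += 1
```

Sorting on the numeric index rather than the file name keeps split 10 after split 2, and keeping the local index next to the global one lets a record be traced back to the split that produced it.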
MassMatrix Flow (Parallel)
[Figure: flow diagram with the elements Configuration File, Input File, Input Database, Python Script, split1, split2, ..., splitN (one massmatrix instance per split), Merge, and Sequential phase]
Experimental results (Parallelization)
Experimental setup:
8-core Intel Xeon node with 6 GB of DDR400 RAM
The theoretical database used was 20 MB
• A .fasta format database is used
The code was run on 6 different datasets
• Each had 50,000 records on average
• All are in .mgf format
Experiments were run for 1, 2, 4 and 8 splits
• Run on a single node with 8 cores
Experimental results (Parallelization)
Execution times when datasets are run for 1, 2, 4 and 8 splits
Experimental results (Parallelization)
Execution times for datasets when run on 1, 2, 4 and 8 cores
Background (Pegasus)
Used to help create the adaptive version of MassMatrix
• A software system for managing workflows
• Manages resources on local, grid and cloud platforms
• Provides APIs to create workflows
Creates a DAG to represent dependencies
• The DAG has an edge between two nodes if there is a dependency between them
Creates a plan for the execution of the application
• Executes the application according to this plan
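To make the DAG-and-plan idea concrete, the sketch below builds a dependency graph for a split-merge workflow and derives an execution order. It does not use the Pegasus API; it relies only on `graphlib` from the Python standard library (3.9+), and the job names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_workflow_dag(num_splits):
    """Map each job to the set of jobs it depends on."""
    search_jobs = [f"massmatrix_{i}" for i in range(1, num_splits + 1)]
    dag = {"split": set()}
    for job in search_jobs:
        dag[job] = {"split"}          # each search needs its split to exist
    dag["merge"] = set(search_jobs)   # merge waits for every search job
    dag["sequential_phase"] = {"merge"}
    return dag


def execution_plan(dag):
    """Return the jobs grouped into waves that can run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For three splits the plan has four waves: the split job, the three search jobs together, the merge, and finally the sequential phase.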
Background (Condor)
Uses Wrangler to start nodes in the cloud
• New nodes are added to the cluster automatically
• Uses Amazon private and public keys to identify the user
• Configuration is specified in an XML file
Condor is the job scheduler used
• Developed at the University of Wisconsin
• Jobs are stored in a queue
• Jobs are submitted from the queue to the cluster in FIFO order
• Provides fault tolerance through checkpointing
The Pegasus workflow
Pegasus workflow for the MassMatrix application
Parallel Pegasus workflow
Pegasus workflow for parallel version of MassMatrix application
Adaptive Resource Allocation
An approach for dynamic resource allocation
• The decision is based on the rate of execution
• Calculates the number of additional resources needed to meet the time constraint
Initial assumption: the input is divided into equal splits
The decision is made on the basis of the execution time of the initial N splits
Adaptive Resource Allocation (Contd.)
The code is initially run with N resources
In our case we used N = 4
Let Tper_split be the execution time of a single split and Tconstraint be the user-specified time constraint
Then we can say that
Ttime_constraint = Tconstraint – ( 2 × Tper_split ) (1)
Adaptive Resource Allocation (Contd.)
Another N splits must have already started execution
• Hence we do not consider them in the calculation
Hence if we use N resources the predicted execution time is
Texecution_pred = Tper_split × ( {split_count} - 2 × N ) (2)
Adaptive Resource Allocation (Contd.)
Based on equations (1) and (2) we can calculate the number of nodes needed as
Nodesrequired = ( ( Texecution_pred / Ttime_constraint ) – 1 ) × N
Nodesrequired is the number of additional nodes that need to be spawned
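The calculation can be worked through in a short Python sketch. It transcribes equations (1) and (2) as they appear on the slides and assumes the node count takes the form Nodesrequired = (Texecution_pred / Ttime_constraint – 1) × N. Rounding up to whole nodes and clamping at zero are additions of this sketch, and the function and argument names are hypothetical.

```python
import math


def additional_nodes(t_constraint, t_per_split, split_count, n):
    """Estimate how many extra nodes to spawn, following the slide equations."""
    # Equation (1): time budget left after two split-times have been accounted for.
    t_time_constraint = t_constraint - 2 * t_per_split
    if t_time_constraint <= 0:
        raise ValueError("time constraint cannot be met: no time remains")
    # Equation (2), transcribed as written on the slide.
    t_execution_pred = t_per_split * (split_count - 2 * n)
    # Assumed form of the final formula: (T_execution_pred / T_time_constraint - 1) * N.
    nodes = (t_execution_pred / t_time_constraint - 1) * n
    # Rounding up and clamping at zero are additions of this sketch.
    return max(0, math.ceil(nodes))
```

For example, with N = 4, Tper_split = 10, split_count = 16 and Tconstraint = 50, equation (1) gives 30, equation (2) gives 80, and the sketch requests 7 additional nodes. With Tconstraint = 120 it requests none.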
Adaptive Algorithm
Algorithm showing the steps involved in calculating the additional resources needed to meet the time constraint
Experimental Goals
To evaluate the efficiency of our system with different datasets
The framework is effective
• Calculates the additional nodes required
• Meets the time constraints
• Tested for different time constraints
Experimental results (Adaptive)
Experimental setup:
Cloud infrastructure: Amazon EC2
Submit host to submit jobs to the cloud
Pegasus version 3.0.2
Condor job scheduler version 7.5.6
Results for 2 datasets and different time constraints
Experimental Results (contd.)
Results obtained when the algorithm is run for different time constraints on dataset1
Experimental Results (contd.)
Results obtained for dataset2 when run with the same time constraints
Benefit function and Parameter prediction (QOS)
Motivation: Provide Quality of service
• Tradeoff between execution time and quality of results
• Quality depends on the parameter values
• Provide a way for the user to control the quality of results
• Quality is defined as an equation in terms of the parameters
The user has the flexibility to decide which parameter has more importance
Makes predictions such that the execution time is as close as possible to the time constraint
Benefit function and Parameter prediction (QOS)
Benefit function: an equation made of some or all parameters of the application
• We use this equation to set the parameter importance
• This is the minimal set of equations needed to obtain the required quality
The goal is to maximize this benefit function within the user-specified time constraint
• Calculated for different parameter combinations
The decision is made using tables constructed from data of previous executions
• Hash tables are used
Benefit function and Parameter prediction (QOS)
Tables contain parameter combination to execution time mappings and vice versa
Multiple datasets can be used for prediction
• Parameters are mapped to the average execution time
• Reduces the error percentage
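The table-based prediction can be sketched with Python dictionaries, which are hash tables. This is a sketch under assumptions rather than the thesis implementation: the parameter tuples, the benefit function passed in, and the tie-breaking rule are all hypothetical, and only the parameters-to-time direction of the mapping is shown.

```python
from statistics import mean


def build_time_table(runs):
    """Map each parameter combination to its average execution time.

    runs is an iterable of (params, seconds) pairs gathered from previous
    executions, possibly over several datasets; params is a tuple.
    """
    samples = {}
    for params, seconds in runs:
        samples.setdefault(params, []).append(seconds)
    return {params: mean(times) for params, times in samples.items()}


def predict_parameters(time_table, benefit, time_constraint):
    """Pick the combination with the highest benefit that fits the constraint.

    Ties on benefit are broken in favour of the execution time closest to
    the constraint. Returns None if no combination fits.
    """
    feasible = [(p, t) for p, t in time_table.items() if t <= time_constraint]
    if not feasible:
        return None
    best, _ = max(feasible, key=lambda item: (benefit(item[0]), item[1]))
    return best
```

A benefit function such as `lambda p: 2 * p[0] + 0.1 * p[1]` would express that the first parameter matters more than the second; the weights are purely illustrative.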
Parameter prediction process
Experimental Results
Experiments conducted on a Linux desktop machine with 2 cores and 1 GB of memory
The tables are populated using two datasets data1.mgf and data2.mgf
The parameter combinations are predicted for two other datasets data3.mgf and data4.mgf
Experimental Results
Parameter prediction results when run for different benefit functions and constraints
Experimental Results
Parameter Prediction results for a different Benefit Function
Conclusion
Presented a framework for dynamic execution of scientific workflows
A user-specified time constraint can be used to drive the allocation of resources
Effective dynamic allocation
Maximizing the benefit function
• Parameter prediction within this value
• Provides quality results based on user requirements
Thank you