Using SWARM service to run a Grid based EST Sequence Assembly Karthik Narayan Primary Advisor : Dr. Geoffrey Fox 1


Using SWARM service to run a Grid based EST Sequence Assembly

Karthik Narayan

Primary Advisor : Dr. Geoffrey Fox


Outline

Objective
EST Sequence Assembly
The Problem
SWARM
Tools
Results
Future Work


Objective

Use the SWARM service and leverage the High Performance clusters for EST Sequence Assembly.


EST Sequence Assembly

ESTs are a collection of random cDNA sequences, sequenced from a cDNA library.

The ESTs are clustered and assembled to form contigs.

The contigs are then used to identify potential unknown genes by running BLAST against a database of known proteins.


The Problem

The input is typically large, on the order of 1 million sequences.
Memory intensive
Time consuming
Involves multiple programs


SWARM

A high-level job scheduling Web service framework, developed by the Pervasive Technology Institute, Indiana University.

Can submit millions of jobs to several high performance clusters and monitor their status.

Extensible, lightweight, and easily installable on a desktop or small server.


Tools

Task                        Tool
Cleaning sequence reads     Repeat Masker
Clustering sequence reads   PaCE
Assembling reads            CAP3
Similarity search           BLAST


Repeat Masker

Developed by the Institute for Systems Biology.
Screens sequences for interspersed repeats and low-complexity regions.
Sequence comparisons are done by cross_match.
The input is split into buckets.
Post-processing step
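The bucket-splitting step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes the input is FASTA-formatted text, and the names `parse_fasta`, `split_into_buckets`, and `bucket_size` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record type: (header without '>', full sequence).
Record = Tuple[str, str]


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_into_buckets(records: Iterable[Record], bucket_size: int) -> Iterator[List[Record]]:
    """Group records into consecutive buckets of at most bucket_size records."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    bucket: List[Record] = []
    for record in records:
        bucket.append(record)
        if len(bucket) == bucket_size:
            yield bucket
            bucket = []
    if bucket:
        yield bucket  # final, possibly smaller bucket


example = """>est1
ACGT
ACGT
>est2
GGCC
>est3
TTAA
""".splitlines()

buckets = list(split_into_buckets(parse_fasta(example), bucket_size=2))
print([len(b) for b in buckets])  # [2, 1]
```

Each bucket could then be written to its own file and submitted as a separate job. The bucket size used in the actual pipeline is not stated here.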


CAP3

Developed by the Department of Computer Science, Michigan Technological University.

CAP3 is very memory intensive and cannot be run on small servers.


PaCE

Developed by the Department of Computer Science, Iowa State University.
Clusters ESTs on parallel computers.
Post-processing step


CAP3

Since the clustering step is already done, the load for CAP3 is considerably smaller, but not trivial.

No. of Sequences   No. of Clusters by PaCE
10000              974
20000              2412
150000             12544


PaCE Clusters

[Figure: PaCE Clusters for 150K ESTs; axes: No. of Clusters and No. of Sequences]


CAP3

Sort the input files, and submit the CAP3 jobs both ways.


CAP3

Set a threshold, and submit the files with fewer sequences than the threshold to the local machine and the others to the Grid.

[Figure: Grid Jobs vs. Local Jobs]
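The threshold rule for CAP3 jobs can be sketched as below. This is only an illustration under stated assumptions: the function `route_jobs`, the file names, and the threshold value of 100 are hypothetical, and the real pipeline submits the Grid jobs through the Swarm service rather than returning lists.

```python
from typing import Dict, List, Tuple


def route_jobs(cluster_sizes: Dict[str, int], threshold: int) -> Tuple[List[str], List[str]]:
    """Split cluster files into (local, grid) lists by sequence count.

    Files with fewer sequences than the threshold go to the local machine;
    all others go to the Grid.
    """
    local: List[str] = []
    grid: List[str] = []
    # Sorting by size keeps each list ordered from smallest to largest file.
    for name, n_seqs in sorted(cluster_sizes.items(), key=lambda kv: kv[1]):
        if n_seqs < threshold:
            local.append(name)
        else:
            grid.append(name)
    return local, grid


# Hypothetical cluster files mapped to their sequence counts.
sizes = {
    "cluster_a.fa": 3,
    "cluster_b.fa": 250,
    "cluster_c.fa": 40,
    "cluster_d.fa": 1200,
}

local_jobs, grid_jobs = route_jobs(sizes, threshold=100)
print(local_jobs)  # ['cluster_a.fa', 'cluster_c.fa']
print(grid_jobs)   # ['cluster_b.fa', 'cluster_d.fa']
```

Small clusters finish quickly and would not justify the queueing overhead of a cluster submission, which is presumably the reason for keeping them local.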


CAP3

CAP3 job distribution after clustering of clusters for 2 million sequences

[Figure: CAP3 job distribution, Before Clustering vs. After Clustering]
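The slides do not say how clusters are grouped into CAP3 jobs. One plausible reading of "clustering of clusters" is that many small PaCE clusters are packed into a single job. The sketch below shows a simple greedy packing under that assumption. The function `batch_clusters` and the cap `max_seqs_per_job` are hypothetical and are not taken from the pipeline.

```python
from typing import List


def batch_clusters(cluster_sizes: List[int], max_seqs_per_job: int) -> List[List[int]]:
    """Greedily pack clusters (given by sequence count) into jobs.

    Clusters are visited from largest to smallest. A new job is started
    whenever adding the next cluster would exceed max_seqs_per_job.
    A cluster larger than the cap still gets a job of its own.
    """
    jobs: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in sorted(cluster_sizes, reverse=True):
        if current and current_total + size > max_seqs_per_job:
            jobs.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        jobs.append(current)
    return jobs


# Hypothetical cluster sizes: a few large clusters and many tiny ones.
sizes = [500, 3, 2, 2, 120, 40, 7, 1]
jobs = batch_clusters(sizes, max_seqs_per_job=100)
print(len(sizes), "clusters ->", len(jobs), "jobs")  # 8 clusters -> 3 jobs
print(jobs)  # [[500], [120], [40, 7, 3, 2, 2, 1]]
```

Grouping of this kind reduces the number of submissions without changing the work CAP3 has to do per cluster.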


BLAST

NCBI BLAST for homology search.
The input is split into buckets.
When complete, update the status of the pipeline in the database, zip the output files, and email them to the user.


Workflow

Log in and select the programs to run from the list of available programs.


Workflow

Enter the parameters for the selected programs.


Workflow

Upload the required files, if any.
The job is then submitted to the Swarm service and a status message is displayed.

An email is sent to the user once the job is completed.


Results

Assembly results for 2 million sequences

No. of Sequences          2000000
Runtime for PaCE          01:22 hours
No. of Clusters by PaCE   75460
No. of jobs for CAP3      4073
Runtime for CAP3          25:44 hours
Total Runtime             27:06 hours


Results

Runtime for the entire pipeline for 2 million sequences

Program         No. of Jobs   Run time (hh:mm)
Repeat Masker   1000          11:56
PaCE            1             01:22
CAP3            4073          25:44
BLAST           893           49:00
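As a worked check of the runtimes, the sketch below adds the per-program figures. It assumes the values are in hh:mm and that the stages run one after another. The end-to-end total it prints is derived here and is not a figure reported on the slides. The helper names `to_minutes` and `to_hhmm` are hypothetical.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'hh:mm' duration to whole minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes: int) -> str:
    """Convert whole minutes back to an 'hh:mm' duration."""
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


# Per-program run times as listed for the 2 million sequence run.
runtimes = {
    "Repeat Masker": "11:56",
    "PaCE": "01:22",
    "CAP3": "25:44",
    "BLAST": "49:00",
}

# Assembly only (PaCE + CAP3); this matches the reported total of 27:06 hours.
assembly = to_minutes(runtimes["PaCE"]) + to_minutes(runtimes["CAP3"])
print(to_hhmm(assembly))  # 27:06

# All four stages, assuming they run strictly in sequence.
pipeline = sum(to_minutes(t) for t in runtimes.values())
print(to_hhmm(pipeline))  # 88:02
```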


Validation

The assembly results for Daphnia pulex, assembled using Swarm, were compared to the assembly results of EST Piper.

A comparison of the BLAST results, for hits greater than an e-value of 2, is as follows:

No.   Name                          EST Piper   Swarm
1     Number of contigs             17465       20803
2     Number of hits                13216       15747
3     No. of unique top hit genes   9221        10329


Validation

The number of genes identified by both assemblies was 7045. That is, Swarm predicted 76.4% of the genes predicted by the assembly using EST Piper.

There were 3284 genes identified by Swarm but not EST Piper.
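The validation percentages follow directly from the unique top-hit gene counts (9221 for EST Piper, 10329 for Swarm) and the 7045 genes in common. The short sketch below reproduces the arithmetic. The variable names are illustrative only, and the EST Piper-only count is derived here rather than reported on the slides.

```python
# Unique top-hit gene counts from the comparison table.
est_piper_genes = 9221
swarm_genes = 10329
common_genes = 7045  # identified by both assemblies

# Share of EST Piper's genes that Swarm also found.
shared_fraction = common_genes / est_piper_genes
print(f"{shared_fraction:.1%}")  # 76.4%

# Genes found by only one of the two assemblies.
swarm_only = swarm_genes - common_genes
est_piper_only = est_piper_genes - common_genes
print(swarm_only)      # 3284
print(est_piper_only)  # 2176
```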


Future Work

Implement assembly programs like MIRA for next-gen sequences.
Try different job scheduling strategies.
Use cloud computing resources.