
Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System


Mohamed Abouelhoda's talk at ISCB-Asia on Next Generation Workflow Systems on the Cloud: The Tavaxy System. December 19th 2012


Page 1: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Tavaxy: Integrating Taverna and Galaxy with Cloud Computing Support

Mohamed Abouelhoda

Nile University

Egypt

1

Page 2: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Workflows in Bioinformatics

Example: Finding Homologous Sequences

[Figure: workflow diagram with the nodes Get Protein Sequence, BLASTp, Get Id & sequences, Clustalw, and T-Coffee]

2

Page 3: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Implementing Scientific Workflows

Method 1: Write a Python/Perl/Shell script

Advantages
• Reliable and efficient
• Comprehensive programming capabilities (conditionals, loops, etc.)

Disadvantages
• Requires programming skills, especially with HPC resources
• Scripts are workflow-specific
• Costly to create, debug, and modify
• Requires installing and managing tools
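A minimal sketch of what a script for Method 1 might look like, assuming the homology example from the earlier slide. The helper names and any file or database names passed in are hypothetical. The only tool options used are the BLAST+ `blastp` options `-query`, `-db`, `-out`, and `-outfmt`. The runner is injectable so the sketch can be exercised without the tools installed.

```python
import subprocess
from typing import Callable, List, Optional

Command = List[str]


def blastp_command(query: str, db: str, out: str) -> Command:
    """Build a BLAST+ blastp call that writes tabular hits (-outfmt 6)."""
    return ["blastp", "-query", query, "-db", db, "-out", out, "-outfmt", "6"]


def hit_ids(tabular_lines: List[str]) -> List[str]:
    """Collect unique subject IDs (second column), keeping first-seen order."""
    seen: List[str] = []
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1 and fields[1] not in seen:
            seen.append(fields[1])
    return seen


def run_step(cmd: Command,
             runner: Optional[Callable[[Command], object]] = None) -> None:
    """Run one workflow step; by default shell out and raise on a non-zero exit."""
    if runner is None:
        subprocess.run(cmd, check=True)
    else:
        runner(cmd)
```

Even this small fragment shows the disadvantages listed above: the steps are hard-wired to one workflow, and the tools must already be installed on the machine that runs the script.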

3

Page 4: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Implementing Scientific Workflows

Method 2: Use of Workflow Systems

http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems

Most popular in Bioinformatics

4

• Kepler
• Triana
• WildFire
• GenePattern
• Pegasus
• Taverna
• Galaxy
• and many more

Page 5: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

Workflow Environment

Tool Library

Canvas for workflow editing

5

Page 6: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

Create new workflow and give it a name

6

Page 7: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

Click to create input node

7

Page 8: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

Input node created; it will read input from the user account

8

Page 9: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

Choose BLAST tool and click to create it

The BLAST node is a custom one added to the system

9

Page 10: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

1. BLAST node created

2. Window opened to set BLAST parameters

10

Page 11: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

Connect output of input node as input to BLAST

11

Page 12: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

Complete composing workflow

12

Page 13: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

Start running the workflow

13

Page 14: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Galaxy

The screen shows parts of the output files and provides links to them

14

Page 15: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Benefits of Workflow Systems

• Intuitive abstract means for describing computational experiments

• Requires no programming expertise

• Easy to modify

• Hide execution details: invocation and scheduling

• Direct use of parallel architectures

Accelerates computation

15

Page 16: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Benefits of Workflow Systems

• Come with a library of tools or access directories (tool accessibility)

• Store experiment details: the tools used, their parameters, and the data used (reproducibility)

• Share workflows with analysis history, including intermediate results (transparency)

Formalize computation

16

Mesirov. Science, 2010. B. Giardine, et al. Genome Research, 2005.

Page 17: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Implementing Scientific Workflows

Method 2: Use of Workflow Systems

http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems

Most popular in Bioinformatics

17

• Kepler
• Triana
• WildFire
• GenePattern
• Pegasus
• Taverna
• Galaxy
• and many more

Page 18: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Example: Homology Workflow

[Figure: workflow diagram with the nodes Read Protein Sequence, BLASTp, Get Id & sequences, Clustalw, and T-Coffee]

18

Page 19: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Open the workflow editor; it is a desktop application

19

Page 20: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

• Choose a web service and copy its URL if it is not in the service directory
• Here we will use BLAST at EBI

20

Page 21: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Paste the service URL so that Taverna collects information about it (i.e., methods and parameters)

21

Page 22: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Service information retrieved

22

Page 23: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Create node for the BLAST service

23

Page 24: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System


Composing a Workflow using Taverna

Create node to receive input

24

Page 25: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Create nodes to prepare input and parameters in XML

25

Page 26: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Create node to receive the Job ID from the service provider

26

Page 27: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Input data and parameters are merged in an XML file

27

Page 28: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Create a sub-workflow to check the status of the asynchronous service

28

Page 29: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Rename it "Poll status"

29

Page 30: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Put in place after job submission

30

Page 31: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Use the get-result method of the web service to retrieve the data when ready

31

Page 32: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

The Get_Result node is created

32

Page 33: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Start putting Get_Result in place in the workflow

33

Page 34: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Specify output type (BLAST XML)

34

Page 35: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

Start retrieval when check status is done

Shortcut to sub-workflow

35
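The submit, poll, and retrieve steps built on the preceding slides can be sketched in code as follows. This is only an illustration under stated assumptions: the `AsyncService` interface, its method names, and the status strings are hypothetical placeholders and do not describe the API of the EBI service or of Taverna.

```python
import time
from typing import Any, Callable, Protocol


class AsyncService(Protocol):
    """Hypothetical client interface; not the API of any real service."""

    def submit(self, payload: str) -> str: ...
    def status(self, job_id: str) -> str: ...
    def result(self, job_id: str) -> Any: ...


def run_async_job(service: AsyncService, payload: str,
                  poll_seconds: float = 5.0, max_polls: int = 120,
                  sleep: Callable[[float], None] = time.sleep) -> Any:
    """Submit a job, poll its status, and fetch the result when it is ready."""
    job_id = service.submit(payload)           # the service returns a job ID
    for _ in range(max_polls):
        state = service.status(job_id)         # the "Poll status" step
        if state == "FINISHED":                # assumed status string
            return service.result(job_id)      # the "Get_Result" step
        if state in ("ERROR", "FAILURE"):      # assumed status strings
            raise RuntimeError(f"job {job_id} ended in state {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

In Taverna this logic is drawn as nodes and a sub-workflow rather than written as a loop, which is why the composition takes many editing steps.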

Page 36: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Composing a Workflow using Taverna

The rest of the workflow

[Figure: workflow diagram with the nodes Read Protein Sequence, BLASTp, Get Id & sequences, Clustalw, and T-Coffee]

36

Page 37: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Taverna vs. Galaxy

                     Taverna             Galaxy
Purpose              General             Bioinformatics
Interface            Desktop             Web interface
Usability            More difficult      Easy
Engine               Control flow        Data flow
Control constructs   Yes                 No
Jobs                 Web service         Local invocation
Programs             Service directory   Library of tools
Use of local HPC     Threads only        Threads/Cluster

D. Hull, et al. Nucleic Acids Research, 2006. B. Giardine, et al. Genome Research, 2005. T. Oinn, et al. Bioinformatics, 2004.

Page 38: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Communities

Taverna: myExperiment
Galaxy: Galaxy Pages

What if we could make use of both repositories, and what if we could have the advantages of both systems?

Page 39: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Integrating Taverna and Galaxy Workflows

[Figure: Taverna and Galaxy]

Page 40: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Integrating Taverna and Galaxy Workflows

[Figure: Tavaxy]

Page 41: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Tavaxy

• A standalone workflow system based on workflow patterns

• Integrates both Taverna and Galaxy workflows and improves their performance

• Easy to use and includes a large library of software tools

• Exploits high performance computing resources (parallel infrastructure) with all details being hidden

• Runs on local or cloud computing infrastructure

• Optimized to handle large datasets

41

Page 42: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Tavaxy Environment

42

Page 43: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Tavaxy Nodes (Processing Units)

Node types: input node, tool node, pattern nodes, sub-workflow nodes

New in Tavaxy and not in Galaxy
• If-else conditional
• Iteration
• Data merge
• Data select
• Pattern nodes, sub-workflow nodes
• Remote execution

New in Tavaxy and not in Taverna
• Easier-to-use interface
• Direct use of local HPC

Page 44: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Tools in Tavaxy

220 tools organized in the following categories:

• EMBOSS: Package of sequence analysis tools

• SAMtools: Package of scripts and programs to handle NGS data

• Fastx: Package for manipulating FASTA files

• Galaxy tools: A set of data utilities and tools developed by the Galaxy team

• Taverna tools: A set of web-service-based tools from the Taverna collection. This section also includes a set of data manipulation utilities developed by the Taverna team.

• Tavaxy tools: This section includes extra utilities and tools for sequence analysis and genome comparison developed or added by the Tavaxy team.

• Cloud utilities: A set of utilities for data manipulation and for configuring computing infrastructure on the Amazon cloud.

44

Page 45: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

NGS Mapping Tools and Databases in Tavaxy

Tools
• MAQ
• BFAST
• BWA
• Bowtie
• …

Databases
• Reference human genome hg18 (build 36)
• Indexed genome for MAQ, Bowtie
• NCBI nucleotide/protein DBs

45

Page 46: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Data Patterns of Tavaxy

To facilitate execution of data-intensive tasks in a cloud computing cluster

46

Page 47: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Demonstration 1: Importing a Taverna Workflow

47

Page 48: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

The Tavaxy Project

Single environment for integrating Galaxy and Taverna workflows.

Taverna or Galaxy workflows can be
o imported: opened and re-drawn in the Tavaxy environment
o re-designed: workflow nodes can be deleted or re-ordered
o enhanced: new nodes and sub-workflows can be added
o optimized: Tavaxy constructs can be used to exploit parallelization, and remote Taverna calls can be replaced with local tools
o and executed in Tavaxy on local (HPC) infrastructure or on the cloud

48

Page 49: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Importing Taverna Workflow

49

Taverna Workflow

Page 50: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

50

Taverna Workflow

Page 51: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

51

Imported Taverna Workflow

Page 52: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

53

Optimized Imported Taverna Workflow

Page 53: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Idea of Integration

Workflow Patterns

Bag of techniques
- Taverna is used as a secondary engine to execute Taverna sub-workflows
- Patterns are used as special nodes in the data-flow-oriented engine
- Control constructs are simulated over the data-flow digraph representing the workflow
- Sub-workflows are used to enable the iteration pattern
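To make the idea of a pattern as a special node more concrete, here is a minimal sketch of an if-else pattern simulated on top of a plain data-flow evaluation. It is an illustration only, not Tavaxy's implementation: the `SKIP` marker, the `if_else` and `run` helpers, and the port names are assumptions made for this example. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

SKIP = object()  # marks a branch that the if-else pattern did not select


def if_else(predicate):
    """Pattern node: route the input to the 'then' port or the 'else' port."""
    def node(value):
        taken = predicate(value)
        return {"then": value if taken else SKIP,
                "else": SKIP if taken else value}
    return node


def run(nodes, edges, source_value):
    """nodes: name -> callable; edges: name -> (upstream name, port or None)."""
    deps = {name: ({edges[name][0]} if name in edges else set())
            for name in nodes}
    results = {}
    for name in TopologicalSorter(deps).static_order():
        if name not in edges:                  # a source node gets the input
            results[name] = nodes[name](source_value)
            continue
        upstream, port = edges[name]
        value = results[upstream]
        if value is not SKIP and port is not None:
            value = value[port]                # pick one output port
        # A node whose input was not produced is skipped, not executed.
        results[name] = SKIP if value is SKIP else nodes[name](value)
    return results


nodes = {
    "input": lambda seq: seq,
    "check": if_else(lambda seq: len(seq) > 5),
    "long": lambda seq: "long:" + seq,
    "short": lambda seq: "short:" + seq,
}
edges = {"check": ("input", None),
         "long": ("check", "then"),
         "short": ("check", "else")}
```

The engine itself stays purely data-driven. The conditional behavior comes only from the pattern node withholding data from one of its outgoing branches.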

Page 54: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

High Performance Cloud Computing Support

55

Page 55: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Three Modes for Supporting Cloud

• Whole system instantiation

• Delegating execution of a sub-workflow to the cloud

• Delegating execution of one tool to the cloud

56

Page 56: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Use Case Scenario: Whole System Instantiation

[Figure: cloud platform with master and worker nodes, S3 storage, and instance disks (EBS); labeled "Include Tavaxy tools and DBs"]

1. Create master node
2. Master node creates worker nodes
3. Instance disks are mounted
4. Connection to S3 is established
5. Submit data, command line (and executable) and execute

Page 57: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Use Case Scenario: Sub-workflow and Tool Delegation

[Figure: cloud platform with master and worker nodes, S3 storage, and instance disks (EBS)]

1. Create master node
2. Master node creates worker nodes
3. Instance disks are mounted
4. Connection to S3 is established
5. Submit data, command line (and executable) and execute

Page 58: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Whole System Instantiation

59

Page 59: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Whole System Instantiation

60

Page 60: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Sub-workflow/Tool Instantiation

61

Page 61: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Exploiting Cloud Computing

Two steps:
1. Enter AWS credentials
2. Define your cluster

62

Page 62: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Supporting Task Parallelism

I. Parallelism due to branching

• Branching in the DAG implies independent execution
• Independent jobs can run in parallel
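The point about branching can be sketched as a small scheduler: all nodes whose inputs are complete are submitted together, so independent branches run concurrently. This is a hedged illustration, not Tavaxy's scheduler. The `run_dag` helper and the stand-in tasks are hypothetical, and the sketch waits for each ready batch to finish before looking for newly ready nodes. It needs Python 3.9+ for `graphlib`.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps, max_workers=4):
    """tasks: name -> callable(upstream results); deps: name -> set of names."""
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # nodes whose inputs are all complete
            futures = {
                name: pool.submit(
                    tasks[name], {d: results[d] for d in deps.get(name, ())})
                for name in ready
            }
            for name, future in futures.items():
                results[name] = future.result()  # wait for this batch
                sorter.done(name)
    return results


# Stand-in tasks: two alignment branches that both depend on one search step.
tasks = {
    "blast": lambda up: ["s1", "s2"],
    "clustalw": lambda up: "clustalw(" + ",".join(up["blast"]) + ")",
    "tcoffee": lambda up: "tcoffee(" + ",".join(up["blast"]) + ")",
}
deps = {"clustalw": {"blast"}, "tcoffee": {"blast"}}
```

Here `clustalw` and `tcoffee` become ready at the same time once `blast` completes, so they are submitted to the pool together.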

Page 63: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Tavaxy Data Patterns

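Supporting Data Intensive Tasks

Data parallelism
• For an input given as a list, node A can process the list items in parallel

Single list: for the input $[x_1, \ldots, x_n]$, node A outputs $a = [A(x_1), \ldots, A(x_n)]$. If node B follows node A, it outputs $b = [B(A(x_1)), \ldots, B(A(x_n))]$.

Multiple lists: for the inputs $[x_1, \ldots, x_n]$ and $[y_1, \ldots, y_n]$, node A outputs $a = [A((x_1, y_1)), \ldots, A((x_n, y_n))]$.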
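The list patterns on this slide can be mirrored with ordinary list operations. The following is a minimal sketch under assumptions: the helper names `map_list`, `map_lists`, and `chain` are hypothetical, and a thread pool stands in for whatever parallel infrastructure actually processes the list items.

```python
from concurrent.futures import ThreadPoolExecutor


def map_list(node, xs, max_workers=4):
    """Single list: return [A(x1), ..., A(xn)], processing items in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(node, xs))  # pool.map keeps the input order


def map_lists(node, xs, ys, max_workers=4):
    """Multiple lists: return [A((x1, y1)), ..., A((xn, yn))]."""
    if len(xs) != len(ys):
        raise ValueError("input lists must have the same length")
    return map_list(node, list(zip(xs, ys)), max_workers)


def chain(node_a, node_b, xs):
    """Node B after node A: return [B(A(x1)), ..., B(A(xn))]."""
    return map_list(node_b, map_list(node_a, xs))
```

For example, `map_list(len, ["AC", "GTT"])` applies `len` to each item independently, which is what allows the items to be distributed over workers.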

Page 64: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Personalized Medicine Workflow

65

Page 65: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Personalized Medicine Workflow

Individual Whole Genome Mapping and Variant Annotation Tonellato-Wall

66

Page 66: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Workflow Sketch

Alignment

67

Variant analysis

Page 67: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Personalized Medicine Workflow on Tavaxy

Read Mapping

Variant Calling

Formatting

SNP/Disease Database Queries

68

Page 68: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Exploiting Parallelization and Locality of Data

Use of list data pattern

Reads are defined as a list, which implies parallel processing of blocks of reads

Fork pattern

Parallel task execution (parallel DB queries)

Databases are hosted locally

69

Page 69: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Read-Mapping with Crossbow

• Crossbow (based on EMR/Hadoop) is used to map a set of human reads to the human genome.

• With this cluster, we mounted EBS volumes including the reference human genome files (each including one chromosome).

• Datasets:
– Illumina reads of around 13 Gbp (47 GB) from the African genome
– The human genome version hg18, build 36

70

Page 70: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Read-Mapping with Crossbow

71

Page 71: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Demonstration 3: Metagenomics Workflow

72

Page 72: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

NGS-based Metagenomics Study

73

[Figure: Galaxy workflow and optimized Tavaxy workflow]

Page 73: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

NGS-based Metagenomics Study

74

MegaBLAST run on the cloud

Page 74: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

NGS-based Metagenomics Study

• The MegaBLAST program is used to annotate a set of sequences coming from a metagenomics experiment.

• With this cluster, we mounted EBS volumes including the NCBI NT database.

• Datasets: Windshield dataset composed of two collections of 454 FLX reads:
– Trip A: 138575 (25.3 Mbp), 104283 (18.8 Mbp)
– Trip B: 151000 (30.2 Mbp), 79460 (12.7 Mbp)

75

Page 75: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Bioinformatics Experiment (2)

• We used the windshield dataset which is composed of two collections of 454 FLX reads.

• These reads came from the DNA of the organic matter on the windshield of a moving vehicle that visited two geographic locations (trips A and B).

• For each trip (A or B), there are two subsets, one for the left and one for the right part of the windshield.

• The numbers of reads are 66575 (12.3 Mbp), 71000 (14.2 Mbp), 104283 (18.8 Mbp), and 79460 (12.7 Mbp) for trips A Left, B Left, A Right, and B Right, respectively.

• The queries were queued in parallel, and a fetch strategy was used.

76

Page 76: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Conclusions and Future Work

• Use of workflow systems provides flexibility and efficiency

• High performance and cloud computing resources are exploited with technical details being hidden

• Future work includes
– Further performance optimization for execution on local and cloud infrastructures
– Supporting multiple cloud providers
– Handling larger data sizes in a multi-user environment

77

Page 77: Mohamed Abouelhoda: Next Generation Workflow Systems on the Cloud: The Tavaxy System

Thanks for your attention

78