Metadata Meets Semantic Workflows

Yolanda Gil, PhD (gil@isi.edu)
Information Sciences Institute and Department of Computer Science
University of Southern California
http://www.isi.edu/~gil

February 4, 2010

With Ewa Deelman, Jihie Kim, Varun Ratanakar, Christian Fritz, Paul Groth, Gonzalo Florez, Pedro Gonzalez, Joshua Moody

Outline

Brief introduction to computational workflows

Brief overview of semantic workflows
• The Wings/Pegasus workflow system

Five benefits of semantic workflows
• Reproducibility
• Validation
• Metadata generation
• Data discovery
• Workflow discovery

Scientific Data Analysis

Complex processes involving a variety of algorithms/software

NSF Workshop on Challenges of Scientific Workflows [Gil et al, IEEE Computer 2007]

Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science:
• Reproducibility, key to the scientific method, is threatened
• Exponential growth in compute, sensors, data storage, and networks, BUT the growth of science is not similarly exponential

What is missing:
• Perceived importance of capturing and sharing process in accelerating the pace of scientific advances
• Process (method/protocol) is increasingly complex and highly distributed

Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself

Workflows need to be first class citizens in science CyberInfrastructure
• Enable reproducibility
• Accelerate scientific progress by automating processes

Interdisciplinary and intradisciplinary research challenges

Report available at http://www.isi.edu/nsf-workflows06

Benefits of Workflow Systems [Taylor et al 07]

Managing execution
• Dependencies among steps
• Failure recovery

Managing distributed computation
• Move data when needed

Managing large data sets
• Efficiency, reliability

Security and access control
• Remote job submission

Provenance recording
• Low-cost high-fidelity reproducibility

Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] (see also [Maechlin et al 05] [Deelman et al 06])

Input data: a site and an earthquake forecast model
• thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
• ~110,000 rupture variations to be simulated for that site

High-level template combines 11 application codes

8048 application nodes in the workflow instance generated by Wings

Provenance records kept for 100,000 workflow data products
• Generated more than 2M triples of metadata

24,135 nodes in the executable workflow generated by Pegasus, including:
• data stage-in jobs, data stage-out jobs, data registration jobs

Executed on the USC HPCC cluster (1820 nodes w/ dual processors), but only < 144 available
• Including MPI jobs, each runs on hundreds of processors for 25-33 hours
• Runtime was 1.9 CPU years

Semantic Workflows in WINGS

Workflow templates
• Dataflow diagram
• Each constituent (node, link, component, dataset) has a corresponding variable

Semantic properties
• Constraints on workflow variables

(TestData dcdom:isDiscrete false)
(TrainingData dcdom:isDiscrete false)

Semantic Constraints as Metadata Properties

Constraints on reusable template (shown below)

Constraints on current user request (shown above)

[modelerInput_not_equal_to_classifierInput:
  (:modelerInput wflow:hasDataBinding ?ds1)
  (:classifierInput wflow:hasDataBinding ?ds2)
  equal(?ds1, ?ds2)
  (?t rdf:type wflow:WorkflowTemplate)
  -> (?t wflow:isInvalid "true"^^xsd:boolean)]
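The template constraint above can be read as a simple validity check: if the modeler input and the classifier input are bound to the same dataset, the template is flagged as invalid. The following is a minimal Python sketch of that reading only; the `WorkflowTemplate` class, its fields, and the file names are hypothetical and are not part of Wings.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowTemplate:
    # Hypothetical model: maps a workflow variable name to its bound dataset.
    data_bindings: dict = field(default_factory=dict)
    is_invalid: bool = False


def check_modeler_input_differs(template):
    """Flag the template as invalid when both inputs share one dataset."""
    ds1 = template.data_bindings.get("modelerInput")
    ds2 = template.data_bindings.get("classifierInput")
    # Mirrors equal(?ds1, ?ds2) -> (?t wflow:isInvalid true)
    if ds1 is not None and ds1 == ds2:
        template.is_invalid = True
    return template.is_invalid


same = WorkflowTemplate({"modelerInput": "a.arff", "classifierInput": "a.arff"})
different = WorkflowTemplate({"modelerInput": "a.arff", "classifierInput": "b.arff"})
print(check_modeler_input_differs(same), check_modeler_input_differs(different))
```

In the rule itself the check is declarative and evaluated by a reasoner; the function form here is only meant to make the condition and its consequence explicit.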

Why Semantic Workflows: 1) Easily Replicate Previously Published Results

A catalog of carefully crafted workflows of select state-of-the-art methods to cover a wide range of common analyses
• Many implementations of same algorithm, some proprietary
• Same implementation but new versions and bug fixes

Semantic workflows abstract from software implementation
• Representing abstract classes of software components
– Instances are the implemented codes
– Workflow steps refer to component classes
• Representing abstract kinds of data (e.g., exclude format)

Semantic reasoning needed to specialize workflow
• To map the abstract workflow into an execution-ready workflow
• To insert lower level steps (e.g., data transformations)

The Importance of Reproducibility

Difficulties in Replication

Some software is proprietary

Effort must be invested in data conversions

Software installation

Managing new versions

Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming]

Work with Christopher Mason from Cornell University

CNV Detection

Variant Discovery from Resequencing

Transmission Disequilibrium Test (TDT)

Association Tests

Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]

[Workflow diagram; dataset sizes labeled 10 MB, 2.4 GB, 152 MB, 32 MB]

Running time: 20.5 hrs

Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]

Observations [Gil et al, forthcoming]

Effort involved in reproducing results is minor
• 30 seconds to set up a workflow

A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses
• Our workflows were independently developed and used “as is”

Semantic representations abstract the analysis method from the software that implements it
• Our workflows used different analytic tools than the original studies

Semantic constraints can be added to workflows to avoid analysis errors
• Our workflow removes duplicate individuals that would cause problems in the association analysis

Why Semantic Workflows: 2) Ensure Correct Use of State-of-the-Art Methods

Analytic software and methods are well documented but all is text (papers, manuals, etc.)
• Time consuming, hard to spot interdependencies, no validation

Semantic workflows can check constraints and guide users
• Representing requirements of software components
– Constraints on input data
– Constraints on parameter settings given properties of input data
• Representing metadata properties of datasets

Semantic reasoning needed:
• To check constraints of each workflow step
• To propagate constraints across the workflow

User’s Difficulties: Choosing Parameters

How do I set up the workflow parameters?

Association Test

Max individuals per cluster (“mc”) and merge distance p-value constraint (“ppc”)

Max Population

If Affymetrix data, set cutoff (“miss”) to 94%; if Illumina, 98%

Wings Workflow System Assists Users to Set Up Parameters Based on Characteristics of Datasets

PEDFile Data:

• genotype95.ped

• hapmap1.ped

• test.ped

Data Catalog

Component Catalog

[MissingnessPerIndividual1:
  (?c rdf:type pcdom:Create_Binary_PEDFile_Class)
  (?c pc:hasInput ?idv1) (?idv1 pc:hasArgumentID "PEDFile")
  (?c pc:hasInput ?idv2) (?idv2 pc:hasArgumentID "MissingnessPerIndividual")
  (?idv1 dcdom:hasGenotypingRate ?v1)
  equal(?v1, "0.95"^^xsd:float)
  -> (?idv2 pc:hasValue "0.06"^^xsd:float)]
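The component catalog rule above sets a parameter from a metadata property of the input dataset: when the PED file has a genotyping rate of 0.95, the `MissingnessPerIndividual` argument is set to 0.06. A minimal Python sketch of the same lookup follows. The `DATA_CATALOG` dictionary and the metadata attached to each file are assumptions made for illustration; only the file names and the 0.95 to 0.06 mapping come from the slide.

```python
# Hypothetical data catalog: file name -> metadata properties.
# The genotyping rate attached to each file is assumed for illustration.
DATA_CATALOG = {
    "genotype95.ped": {"hasGenotypingRate": 0.95},
    "hapmap1.ped": {},
    "test.ped": {},
}


def suggest_missingness_per_individual(ped_metadata):
    """Return a suggested parameter value, or None if no rule applies."""
    rate = ped_metadata.get("hasGenotypingRate")
    # Mirrors: equal(?v1, 0.95) -> (?idv2 pc:hasValue 0.06)
    if rate == 0.95:
        return 0.06
    return None


for name, metadata in DATA_CATALOG.items():
    print(name, suggest_missingness_per_individual(metadata))
```

Returning `None` when no rule fires leaves the parameter for the user to set, which matches the idea of assisting rather than replacing the user's choice.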

Why Semantic Workflows: 3) Automatic Generation of Metadata

Metadata annotations are tedious and involved
• Often not done, an obstacle to sharing and to reuse

Semantic workflows can automate the generation of metadata for analysis data products
• Representing expected characteristics of output dataset for each software component given the input metadata
• Representing metadata properties of input datasets

Semantic reasoning needed:
• To propagate metadata for each workflow step
• To propagate metadata across the workflow

Wings Metadata Generation: An Example in a Seismic Hazard Workflow [Kim et al 06; Gil et al 07]

[Figure: metadata of the inputs and output of the seismogram generation step]

RVM
• 127_6.rvm: source_id: 127, rupture_id: 6

Rupture_variation
• 127_6.txt.variation-s0000-h0000: source_id: 127, rupture_id: 6, slip_realization_#: 0, hypo_center_#: 1
• 127_6.txt.variation-s0000-h0001: source_id: 127, rupture_id: 6, slip_realization_#: 0, hypo_center_#: 1

SGT
• FD_SGT/PAS_1/A/SGT161: site_name: PAS, tensor_direction: 1, time_period: A, xyz_volumn_id: 161

Seismogram
• Seismogram_PAS_127_6.grm: site_name: PAS, source_id: 127, rupture_id: 6
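The example above suggests how output metadata can be derived from input metadata: the seismogram takes `site_name` from the SGT input and `source_id` and `rupture_id` from the rupture variation input. The Python sketch below illustrates that derivation under stated assumptions. The function names are hypothetical, and the file naming pattern is inferred from the single file name shown on the slide rather than from any documented Wings convention.

```python
def seismogram_metadata(sgt_meta, variation_meta):
    """Derive output metadata from the two input datasets' metadata."""
    return {
        "site_name": sgt_meta["site_name"],
        "source_id": variation_meta["source_id"],
        "rupture_id": variation_meta["rupture_id"],
    }


def seismogram_filename(meta):
    # Naming pattern assumed from the one example on the slide.
    return "Seismogram_{site_name}_{source_id}_{rupture_id}.grm".format(**meta)


sgt = {"site_name": "PAS", "tensor_direction": 1, "time_period": "A"}
variation = {"source_id": 127, "rupture_id": 6, "slip_realization": 0}

meta = seismogram_metadata(sgt, variation)
print(seismogram_filename(meta), meta)
```

Because the output metadata is computed from the inputs alone, it can be generated before the step executes, which is what allows descriptions of data products to be created for every node of a large workflow instance.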

Wings Workflows for Accuracy/Quality Tradeoffs in Biomedical Image Analysis [Kumar et al 09]

PIQ: Pixel Intensity Quantification (from National Center for Microscopy and Imaging Research [Chow et al 06])
• Terabyte-sized out-of-core image data
• Need to minimize execution time while preserving highest output quality
• Some operations are parallelizable, others must operate on entire images

For efficiency, image decomposed (layers, tiles, and chunks) but quality is affected

From a workflow template, Wings can automatically generate descriptions of each individual piece of the image to manage the computations over each one

Why Semantic Workflows: 4) Discovery of Relevant Data

Need a dataset of updated common (known) loci to annotate findings, where can I find one?

Why Semantic Workflows: 5) Retrieval of Workflows

Hard to find workflows for the type of analysis a user wants
• Semantic information is not provided when creating the workflow
– e.g., when a user adds a NaiveBayesModeler, they wouldn’t be expected to define that its output would be a NaiveBayesModel, or a Bayes Model (superclass), or not human readable
• However, retrieval queries are often based on metadata properties of data
– e.g., “Find workflows that can normalize data which is continuous and has missing values [<- constraints on inputs] to create a decision tree model [<- constraint on intermediate data products]”

Semantic representations are needed
• For workflow constituents
– Metadata properties of input, intermediate and final data products
– Metadata properties of workflow and component function
• For user queries
– Express workflow sketches containing partial data descriptions (constraints)

Reasoning capabilities
• Automatic creation of metadata for expected workflow data products
• Workflow matching to queries (exact and partial)
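Exact and partial matching of workflows to a query can be illustrated with a small Python sketch. This is not the Wings matching algorithm: it assumes each workflow is described by a plain dictionary of metadata properties per role ("input", "product"), and every workflow name and property name below is hypothetical.

```python
def constraint_hits(workflow, query):
    """Count how many query constraints the workflow's metadata satisfies."""
    hits = total = 0
    for role, constraints in query.items():
        props = workflow.get(role, {})
        for prop, value in constraints.items():
            total += 1
            if props.get(prop) == value:
                hits += 1
    return hits, total


def retrieve(catalog, query):
    """Return (name, 'exact' | 'partial') pairs, exact matches first."""
    results = []
    for name, workflow in catalog.items():
        hits, total = constraint_hits(workflow, query)
        if hits > 0:
            results.append((name, "exact" if hits == total else "partial"))
    return sorted(results, key=lambda r: (r[1] != "exact", r[0]))


CATALOG = {
    "normalize_then_tree": {
        "input": {"isContinuous": True, "hasMissingValues": True},
        "product": {"type": "DecisionTreeModel"},
    },
    "normalize_then_bayes": {
        "input": {"isContinuous": True, "hasMissingValues": True},
        "product": {"type": "NaiveBayesModel"},
    },
    "discrete_only": {
        "input": {"isContinuous": False, "hasMissingValues": False},
        "product": {"type": "NaiveBayesModel"},
    },
}

QUERY = {
    "input": {"isContinuous": True, "hasMissingValues": True},
    "product": {"type": "DecisionTreeModel"},
}

print(retrieve(CATALOG, QUERY))
```

The sketch compares property values for equality only; matching a `NaiveBayesModel` against a query for its superclass would additionally need class subsumption reasoning, which is omitted here.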

User’s Difficulties: Choosing an Analysis

What type of analysis is appropriate for my data?

CNV Detection

Variant Discovery from Resequencing

Transmission Disequilibrium Test (TDT)

Association Test

TDT analysis requires no less than 100 families

Variant discovery is used for genomic data from the same individual

Association tests are best for large datasets that are not within a family

User’s Difficulties: Choosing a Workflow

What workflow is appropriate for my goals?

Transmission Disequilibrium Test (TDT)

Association Test

Applies population stratification to remove outliers

Assumes outliers have been removed

Uses structured association

Uses a standard test

Incorporates parental phenotype information

Uses CMH association

An Algorithm for Semantic Enrichment of Workflow Templates [Gil et al K-CAP 09]

?Model5 dcdom:isDiscrete true
?Model6 dcdom:isDiscrete true
?Model7 dcdom:isDiscrete true
?TestData dcdom:isDiscrete true
?Dataset4 dcdom:isDiscrete true
?Dataset3 dcdom:isDiscrete true
?TrainingData dcdom:isDiscrete true

Problem Addressed: Semantic information is not provided when creating the workflow, but retrieval queries use it

Key idea: Constraints can be available in a component catalog and propagated through the workflow

Phase 1: Goal Regression
• Starting from final products, traverse workflow backwards
• For each node, query component catalog for metadata constraints on inputs

Phase 2: Forward Projection
• Starting from input datasets, traverse workflow forwards
• For each node, query component catalog for metadata constraints on outputs
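The two phases can be sketched in Python as a backward pass followed by a forward pass over a topologically ordered list of nodes. This is a simplified illustration, not the published algorithm: the component catalog is modeled as plain functions, the component names `Sampler` and `DiscreteModeler` are hypothetical, and only the `isDiscrete` property is propagated.

```python
def copy_is_discrete(sources, targets, props):
    """Copy isDiscrete from the first source that has it to every target."""
    for s in sources:
        if "isDiscrete" in props.get(s, {}):
            return {t: {"isDiscrete": props[s]["isDiscrete"]} for t in targets}
    return {}


# Hypothetical component catalog: each entry answers what a component
# requires of its inputs ("backward") and implies for its outputs ("forward").
CATALOG = {
    "Sampler": {
        "backward": lambda ins, outs, p: copy_is_discrete(outs, ins, p),
        "forward": lambda ins, outs, p: copy_is_discrete(ins, outs, p),
    },
    "DiscreteModeler": {
        # Assumed requirement: this modeler only accepts discrete data.
        "backward": lambda ins, outs, p: {i: {"isDiscrete": True} for i in ins},
        "forward": lambda ins, outs, p: copy_is_discrete(ins, outs, p),
    },
}


def enrich(nodes, catalog, constraints):
    """nodes: topologically ordered (component, inputs, outputs) triples."""
    props = {var: dict(p) for var, p in constraints.items()}

    def merge(updates):
        for var, p in updates.items():
            props.setdefault(var, {}).update(p)

    # Phase 1: goal regression, from final products back to inputs.
    for component, ins, outs in reversed(nodes):
        merge(catalog[component]["backward"](ins, outs, props))
    # Phase 2: forward projection, from input datasets to final products.
    for component, ins, outs in nodes:
        merge(catalog[component]["forward"](ins, outs, props))
    return props


NODES = [
    ("Sampler", ["Dataset3"], ["TrainingData"]),
    ("DiscreteModeler", ["TrainingData"], ["Model5"]),
]
print(enrich(NODES, CATALOG, {}))
```

In this toy run the modeler's input requirement is regressed back to `Dataset3`, then projected forward to `Model5`, so all three variables end up with `isDiscrete` set to true even though no constraint was stated when the workflow was created.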

Conclusions: Benefits of Semantic Workflows [Gil JSP-09]

Execution management: Automation of workflow execution

Managing distributed computation

Managing large data sets

Security and access control

Provenance recording

Low-cost high-fidelity reproducibility

Semantics and reasoning:

“Conceptual” reproducibility

User assistance to explore analysis “design space”

Validation of analyses

Automated generation of metadata

Workflow retrieval and discovery