UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems

Artem Chebotko

Joint work with E. De Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly

Department of Computer Science, University of Texas - Pan American

6th IEEE International Workshop on Scientific Workflows, June 24, 2012


Provenance in eScience

Metadata that captures the history of an experiment
  Problem diagnosis
  Result interpretation
  Experiment reproducibility

Scientific Workflow Community

Provenance Challenges
  2006: understanding and sharing information about provenance representations and capabilities
  2006: interoperability of different provenance systems
  2009: evaluating various aspects of OPM
  2010: showcasing OPM in the context of novel applications

Open Provenance Model

W3C Provenance Working Group

UTPB – University of Texas Provenance Benchmark

SWFMS and Provenance

Taverna, Kepler, View, VisTrails, Pegasus, Swift, Galaxy, Triana, OPMProv, Karma, RDFProv, etc.

Support provenance collection

Use proprietary or third-party systems to manage provenance

Differ in provenance models, provenance vocabularies, inference support, and query languages

Provenance Management Requirements

Non-functional
  Data storage and querying efficiency and scalability
  Inference soundness and completeness

Functional
  Support of a particular provenance model, provenance vocabulary, query type, inference feature, visualization, and analysis

No standard way to evaluate provenance systems with respect to these requirements

Provenance System Benchmarking Challenges

Well-documented and easy-to-understand datasets

Provenance data in a range of sizes

Provenance data with predefined inferred results that are known to be correct and complete

Test queries

Performance metrics

Result interpretation

Existing empirical studies of provenance systems use ad-hoc benchmarks or benchmarks developed in other research domains (see the paper for details)


Our Contributions

University of Texas Provenance Benchmark (UTPB), http://faculty.utpa.edu/chebotkoa/utpb/

Focus on scalability and inference

Flexible data generator

27 provenance templates
  3 virtual workflows
  3 workflow execution scenarios
  3 provenance vocabularies

27 test queries in 11 categories

5 performance metrics

Talk Outline

University of Texas Provenance Benchmark
  UTPB Architecture
  Provenance Templates
  Provenance Generation
  UTPB Queries
  Performance Metrics
  Interpretation of Benchmark Results

Summary and Future Work

UTPB Architecture


Provenance Templates

A provenance template is a document that serializes the provenance of one workflow execution according to a particular provenance model and provenance vocabulary.

Provenance templates make the benchmark extensible and thus adaptable to the changing requirements of the field.

UTPB currently supports:
  1 provenance model (OPM)
  3 virtual workflows
  3 provenance vocabularies (OPMV, OPMO, OPMX)
  3 workflow execution scenarios
  1 x 3 x 3 x 3 = 27 provenance templates
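The template count is a plain Cartesian product of the four dimensions. A minimal sketch of that arithmetic follows; the workflow and vocabulary names come from the slides, while the short scenario labels and the function name `enumerate_templates` are illustrative assumptions, not part of UTPB.

```python
from itertools import product

# Dimension values. Workflow and vocabulary names are taken from the slides;
# the scenario labels are shortened, hypothetical identifiers.
MODELS = ["OPM"]
WORKFLOWS = ["Database Experiment", "Jeans Manufacturing", "French Press Coffee"]
VOCABULARIES = ["OPMV", "OPMO", "OPMX"]
SCENARIOS = ["success", "error", "success-with-inferences"]


def enumerate_templates():
    """Return one (model, workflow, vocabulary, scenario) tuple per template."""
    return list(product(MODELS, WORKFLOWS, VOCABULARIES, SCENARIOS))


templates = enumerate_templates()
print(len(templates))  # 1 x 3 x 3 x 3 combinations
```

Adding a second provenance model or a fourth workflow only means extending one of the lists, which is the sense in which templates keep the benchmark extensible.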


Virtual Workflow 1: Database Experiment

Processes: 7, Artifacts: 14, Accounts: 2, Agents: 1

Virtual Workflow 2: Jeans Manufacturing

Processes: 13, Artifacts: 18, Accounts: 3, Agents: 2

Several processes use and generate the same artifacts and are “executed” in parallel

Virtual Workflow 3: French Press Coffee

Processes: 15, Artifacts: 15, Accounts: 4, Agents: 0

Several branches with multiple processes are “executed” in parallel

Several processes trigger each other without the record of using or generating artifacts

Provenance Vocabularies

Almost every existing scientific workflow management system defines its own proprietary model for provenance.

Each model is serialized in some format, such as RDF, XML, or relational data, according to one or more predefined vocabularies or schemas.

Open Provenance Model (OPM): a layer of interoperability
  OPM Vocabulary (OPMV)
  OPM Ontology (OPMO)
  OPM XML Schema (OPMX)

Workflow Execution Scenarios

Successful execution

Incomplete execution with an error

Successful execution with materialized provenance inferences
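To illustrate what "materialized inferences" can mean, the sketch below computes multi-step derivations as the transitive closure of direct derivation edges, which is one standard OPM inference. The slides do not say which inferences UTPB materializes, so treat this as an assumption; the function name `transitive_closure` is hypothetical, and the artifact names are borrowed from the Jeans Manufacturing workflow with their suffixes dropped.

```python
def transitive_closure(edges):
    """Return every (effect, cause) pair reachable through one or more direct edges."""
    causes = {}
    for effect, cause in edges:
        causes.setdefault(effect, set()).add(cause)

    closure = set()
    for start in causes:
        stack = list(causes[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            # Follow the causes of this cause, if it has any.
            stack.extend(causes.get(node, ()))
    return closure


# Direct wasDerivedFrom edges as (effect, cause) pairs.
direct = {
    ("denimParts", "denim"),
    ("rawJeans", "denimParts"),
    ("washedJeans", "rawJeans"),
}

# Edges that only exist after inference.
inferred = transitive_closure(direct) - direct
print(sorted(inferred))
```

A store loaded with the third scenario would already contain the inferred edges, so queries over them need no reasoning at query time.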


Provenance Generation


OPMV

    # Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
    @prefix opmv: <http://purl.org/net/opmv/ns#> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix utpb: <http://cs.panam.edu/utpb#> .

    utpb:account_black_C0_T0 rdf:type <http://www.w3.org/2004/03/trix/rdfg-1/Graph> .
    utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact .
    utpb:denim_C0_T0 rdfs:label "blue" .
    utpb:andrey_C0_T0 rdf:type opmv:Agent .
    utpb:cutDenim_C0_T0 opmv:used utpb:cuttingMachine_C0_T0, utpb:cuttingPattern_C0_T0, utpb:denim_C0_T0 .
    utpb:denimParts_C0_T0 opmv:wasGeneratedBy utpb:cutDenim_C0_T0 .

    # Default graph
    <http://cs.panam.edu/utpb#opmGraph_C0_T0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .

OPMO

    # Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
    @prefix opmo: <http://openprovenance.org/model/opmo#> .
    @prefix opmv: <http://purl.org/net/opmv/ns#> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix utpb: <http://cs.panam.edu/utpb#> .

    utpb:account_black_C0_T0 rdf:type opmo:Account .
    utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact .
    utpb:propertyDenim_C0_T0 opmo:key utpb:keyDenimType_C0_T0 ;
        opmo:value "blue" .
    utpb:andrey_C0_T0 rdf:type opmv:Agent .
    utpb:used1_C0_T0 rdf:type opmo:Used ;
        opmo:effect utpb:cutDenim_C0_T0 ;
        opmo:cause utpb:cuttingMachine_C0_T0 ;
        opmo:role utpb:roleMachine_C0_T0 ;
        opmo:pname utpb:_used1 ;
        opmo:account utpb:account_black_C0_T0 .
    utpb:wgb1_C0_T0 rdf:type opmo:WasGeneratedBy ;
        opmo:effect utpb:cutDenim_C0_T0 ;
        opmo:cause utpb:denimParts_C0_T0 ;
        opmo:role utpb:roleDenim_C0_T0 ;
        opmo:pname utpb:_wgb1 ;
        opmo:account utpb:account_black_C0_T0 .

    # Default graph
    <http://cs.panam.edu/utpb#opmGraph_C0_T0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Thing> .
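The two RDF vocabularies differ mainly in how they express a dependency: OPMV uses a direct property between two resources, while OPMO reifies the edge as its own node with a cause, an effect, a role, and an account. The sketch below shows how a reified edge relates to the direct form. The `Edge` type, the `OPMO_TO_OPMV` table, and the `flatten` function are hypothetical helpers for illustration, not part of UTPB or of either vocabulary.

```python
from collections import namedtuple

# A reified OPMO-style edge: a node with a type, an effect, and a cause.
Edge = namedtuple("Edge", ["edge_id", "edge_type", "effect", "cause"])

# Illustrative mapping from OPMO edge classes to OPMV properties.
OPMO_TO_OPMV = {
    "opmo:Used": "opmv:used",
    "opmo:WasGeneratedBy": "opmv:wasGeneratedBy",
    "opmo:WasDerivedFrom": "opmv:wasDerivedFrom",
}


def flatten(edge):
    """Turn one reified edge into a direct (subject, predicate, object) triple."""
    return (edge.effect, OPMO_TO_OPMV[edge.edge_type], edge.cause)


# The "used1" edge from the OPMO sample.
used1 = Edge(
    edge_id="utpb:used1_C0_T0",
    edge_type="opmo:Used",
    effect="utpb:cutDenim_C0_T0",
    cause="utpb:cuttingMachine_C0_T0",
)
print(flatten(used1))
```

The flattened triple matches the `opmv:used` statement in the OPMV sample. The role and account attached to the OPMO edge have no place in the direct form, which is one reason the same query is written differently for each vocabulary.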


OPMX

    <utpb xmlns="http://openprovenance.org/model/opmx#"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <dictionary>
        <opmGraph id="opmGraph_C0_T0">
      </dictionary>
      <opmGraph id="opmGraph_C0_T0">
        <accounts>
          <account id="account_black"/>
        </accounts>
        <artifacts>
          <artifact id="cuttingMachine">
            <account ref="account_black"/>
            <annotation>
              <property key="value"> <value>laser</value></property>
              <property key="label"> <value>Cutting machine</value></property>
            </annotation>
          </artifact>
        </artifacts>
        <agents>
          <agent id="andrey"><account ref="account_black"/></agent>
        </agents>
        <dependencies>
          <used id="used1">
            <effect ref="cutDenim"/>
            <role id="roleMachine1" value="machine"/>
            <cause ref="cuttingMachine"/>
            <account ref="account_black"/>
          </used>

UTPB Queries

27 Queries

11 Categories
  Graphs
  Dependencies
  Artifacts
  Processes
  Accounts
  Agents
  Roles
  Values
  Cross-Graph Queries
  Inferences
  Application-Specific


Sample query, by type and format

English:

    Find all artifact derivation dependencies in a particular provenance graph

SPARQL (OPMV):

    PREFIX opmv: <http://purl.org/net/opmv/ns#>
    PREFIX utpb: <http://cs.panam.edu/utpb#>
    SELECT ?causeArtifact ?effectArtifact
    FROM NAMED <http://cs.panam.edu/utpb#opmGraph_C0_T0>
    WHERE {
      GRAPH utpb:opmGraph_C0_T0 {
        ?effectArtifact opmv:wasDerivedFrom ?causeArtifact .
      }
    }

SPARQL (OPMO):

    PREFIX opmo: <http://openprovenance.org/model/opmo#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX utpb: <http://cs.panam.edu/utpb#>
    SELECT ?causeArtifact ?effectArtifact
    FROM NAMED <http://cs.panam.edu/utpb#opmGraph_C0_T0>
    WHERE {
      GRAPH utpb:opmGraph_C0_T0 {
        ?wdf rdf:type opmo:WasDerivedFrom .
        ?wdf opmo:cause ?causeArtifact .
        ?wdf opmo:effect ?effectArtifact .
      }
    }

XQuery (OPMX):

    declare default element namespace "http://openprovenance.org/model/opmx#";
    <result> {
      for $wdf in /utpb/opmGraph[@id="opmGraph_C0_T0"]/dependencies/wasDerivedFrom
      return <wasDerivedFrom>{$wdf/effect}{$wdf/cause}</wasDerivedFrom>
    } </result>
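For readers without an XQuery engine at hand, the OPMX query can be approximated with Python's standard `xml.etree.ElementTree`. This is only a sketch of the same selection logic: the `SAMPLE` document is hand-written in the shape of the OPMX excerpt rather than real UTPB output, and the function name `derivations` is hypothetical.

```python
import xml.etree.ElementTree as ET

OPMX_NS = "http://openprovenance.org/model/opmx#"

# A tiny hand-written document shaped like the OPMX sample; not generated by UTPB.
SAMPLE = """
<utpb xmlns="http://openprovenance.org/model/opmx#">
  <opmGraph id="opmGraph_C0_T0">
    <dependencies>
      <wasDerivedFrom>
        <effect ref="denimParts_C0_T0"/>
        <cause ref="denim_C0_T0"/>
      </wasDerivedFrom>
      <wasDerivedFrom>
        <effect ref="rawJeans_C0_T0"/>
        <cause ref="denimParts_C0_T0"/>
      </wasDerivedFrom>
    </dependencies>
  </opmGraph>
</utpb>
"""


def derivations(xml_text, graph_id):
    """Return (effect, cause) ref pairs for every wasDerivedFrom edge in one graph."""
    ns = {"o": OPMX_NS}
    root = ET.fromstring(xml_text)
    pairs = []
    for graph in root.findall("o:opmGraph", ns):
        # Equivalent of the [@id="..."] predicate in the XQuery path.
        if graph.get("id") != graph_id:
            continue
        for wdf in graph.findall("o:dependencies/o:wasDerivedFrom", ns):
            effect = wdf.find("o:effect", ns).get("ref")
            cause = wdf.find("o:cause", ns).get("ref")
            pairs.append((effect, cause))
    return pairs


print(derivations(SAMPLE, "opmGraph_C0_T0"))
```

The namespace map plays the role of the `declare default element namespace` line in the XQuery version.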


Sample query result, OPMV:

    effectArtifact               causeArtifact
    ---------------------------------------------------
    utpb:denimParts_C0_T0        utpb:denim_C0_T0
    utpb:rawJeans_C0_T0          utpb:denimParts_C0_T0
    utpb:rawJeans_C0_T0          utpb:sewingThread_C0_T0
    utpb:washedJeans_C0_T0       utpb:rawJeans_C0_T0
    utpb:inspectedJeans_C0_T0    utpb:washedJeans_C0_T0
    utpb:buttonedJeans_C0_T0     utpb:inspectedJeans_C0_T0
    utpb:buttonedJeans_C0_T0     utpb:buttons_C0_T0
    utpb:qualityJeans_C0_T0      utpb:buttonedJeans_C0_T0
    utpb:jeans_C0_T0             utpb:qualityJeans_C0_T0
    utpb:jeans_C0_T0             utpb:labels_C0_T0
    utpb:inspectedJeans_C0_T0    utpb:washedJeans_C0_T0
    utpb:qualityJeans_C0_T0      utpb:buttonedJeans_C0_T0

Sample query result, OPMO:

    effectArtifact               causeArtifact
    ---------------------------------------------------
    utpb:denimParts_C0_T0        utpb:denim_C0_T0
    utpb:rawJeans_C0_T0          utpb:denimParts_C0_T0
    utpb:rawJeans_C0_T0          utpb:sewingThread_C0_T0
    utpb:washedJeans_C0_T0       utpb:rawJeans_C0_T0
    utpb:inspectedJeans_C0_T0    utpb:washedJeans_C0_T0
    utpb:buttonedJeans_C0_T0     utpb:inspectedJeans_C0_T0
    utpb:buttonedJeans_C0_T0     utpb:buttons_C0_T0
    utpb:qualityJeans_C0_T0      utpb:buttonedJeans_C0_T0
    utpb:jeans_C0_T0             utpb:qualityJeans_C0_T0
    utpb:jeans_C0_T0             utpb:labels_C0_T0
    utpb:inspectedJeans_C0_T0    utpb:washedJeans_C0_T0
    utpb:qualityJeans_C0_T0      utpb:buttonedJeans_C0_T0

Sample query result, OPMX:

    <result xmlns="http://openprovenance.org/model/opmx#"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <wasDerivedFrom>
        <effect ref="denimParts_C0_T0"/>
        <cause ref="denim_C0_T0"/>
      </wasDerivedFrom>
      <wasDerivedFrom>
        <effect ref="rawJeans_C0_T0"/>
        <cause ref="denimParts_C0_T0"/>
      </wasDerivedFrom>
      <wasDerivedFrom>
        <effect ref="rawJeans_C0_T0"/>
        <cause ref="sewingThread_C0_T0"/>
      </wasDerivedFrom>
      <wasDerivedFrom>
        <effect ref="washedJeans_C0_T0"/>
        <cause ref="rawJeans_C0_T0"/>
      </wasDerivedFrom>
      …
      <wasDerivedFrom>
        <effect ref="inspectedJeans_C0_T0"/>
        <cause ref="washedJeans_C0_T0"/>
      </wasDerivedFrom>
      <wasDerivedFrom>
        <effect ref="qualityJeans_C0_T0"/>
        <cause ref="buttonedJeans_C0_T0"/>
      </wasDerivedFrom>
    </result>

Performance Metrics

Data loading time

Repository size

Query response time

Query soundness

Query completeness
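The slides name soundness and completeness without defining them. One common reading, assumed here, is that soundness is the fraction of returned answers that are correct and completeness is the fraction of expected answers that were returned. The sketch below computes both against a known-correct answer set; the function names and the sample data are illustrative only.

```python
def soundness(returned, expected):
    """Fraction of returned answers that appear in the expected (correct) set."""
    returned, expected = set(returned), set(expected)
    if not returned:
        return 1.0  # nothing returned, so nothing incorrect was returned
    return len(returned & expected) / len(returned)


def completeness(returned, expected):
    """Fraction of expected answers that were actually returned."""
    returned, expected = set(returned), set(expected)
    if not expected:
        return 1.0  # nothing was expected, so nothing is missing
    return len(returned & expected) / len(expected)


# Hypothetical (effect, cause) answers for a derivation query.
expected = {("b", "a"), ("c", "b"), ("d", "c"), ("e", "d")}
returned = {("b", "a"), ("c", "b"), ("x", "y")}

print(soundness(returned, expected))
print(completeness(returned, expected))
```

Both measures need predefined results that are known to be correct and complete, which is what the benchmark's third execution scenario is meant to supply.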


Interpretation of Benchmark Results


Comparison across datasets of varying sizes

Comparison using a fixed dataset

Comparison across data serialized with different vocabularies (e.g., OPMV vs. OPMO)

Comparison across data managed using different technologies (e.g., RDF vs. XML)

Comparison across data of different provenance models (e.g., OPM vs. PROV-DM) – in the future


Summary and Future Work


UTPB: A first formal benchmark for scientific workflow provenance management systems

Extensible with new provenance templates

Flexible data generation

Large selection of test queries

Well-defined performance metrics

Future work

Benchmarking existing systems using UTPB

Extending UTPB (functional requirements, PROV-DM, new metrics – query expressiveness)


THANK YOU! Questions?


UTPB website: http://faculty.utpa.edu/chebotkoa/utpb/

My contact information: Artem Chebotko, Department of Computer Science, University of Texas – Pan American, [email protected], http://www.cs.panam.edu/~artem